In statistics, we often assume that our data will follow some kind of trend and that most values will fall within a "normal" range. This is why we look at measures of center, such as the mean, median, and mode.

A measure of center is a way to describe where the center of a set of data is. However, not all measures describe the center in the same way and some measures are heavily affected by extreme data values, or outliers.

Drag the blue point to a position that is either much larger or smaller than other data points.

What do you notice about the mean as you move the position of the blue point to be much larger than the data?

What do you notice about the mean as you move the position of the blue point to be much smaller than the data?

What do you notice about the median as you move the position of the blue point to be much larger than the data?

What do you notice about the median as you move the position of the blue point to be much smaller than the data?

Outliers are data points that lie far outside the majority of a data set and can significantly affect the measures of center (mean, median, and mode) as well as the range.

- The mean is most affected by outliers. Extreme data values cause the mean to increase or decrease significantly.
- The median is less affected by outliers because it depends only on the middle position of the ordered data. Adding or removing an outlier shifts that position slightly, but how extreme the outlier is does not matter.
- The mode is least affected by an outlier. Because an outlier lies far away from the rest of the data, it is unlikely to change the mode, which is the value that appears most often in the set.
- The range is strongly affected by outliers because an outlier greatly increases the distance between the largest and smallest data values.
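To see these effects with numbers, here is a minimal Python sketch. The data set and the outlier value of 40 are hypothetical, chosen only for illustration, and the helper name `summarize` is our own.

```python
from statistics import mean, median, mode

def summarize(data):
    """Return the mean, median, mode, and range of a list of numbers."""
    return {
        "mean": mean(data),
        "median": median(data),
        "mode": mode(data),
        "range": max(data) - min(data),
    }

# Hypothetical data, chosen only for illustration.
data = [4, 5, 5, 6, 7]
with_outlier = data + [40]  # 40 lies far above the other values

before = summarize(data)
after = summarize(with_outlier)

# Print each measure before and after the outlier is added.
for name in before:
    print(f"{name}: {before[name]} -> {after[name]}")
```

In this sketch the mean roughly doubles and the range grows from 3 to 36, while the median only moves from 5 to 5.5 and the mode stays at 5.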

Analysts often choose to remove outliers to better understand the trends in the majority of the data, which can give a more accurate picture of the patterns within a data set.

Outliers are important to identify because they point to unusual pieces of data that may require further investigation. For example, if we had data on the temperature in a volcanic lake and we saw a high outlier, it would be worth investigating whether the sensor is faulty or whether we need to prepare a nearby town for evacuation.

The following data sets show some examples of outliers:

5 is the outlier

15 is the outlier

47 is the outlier

15 is the outlier

Identify the outlier in the data set:

63,\, 67,\, 71,\, 76,\, 111

Worked Solution

The stem and leaf plot shows the number of hours worked per week by a group of people. Identify the outlier(s).

Worked Solution

Sarah celebrated her 13th birthday at a bowling alley. She invited 20 friends, and they played a game of bowling. The scores for the game were: 12,\, 17,\, 23,\, 31,\, 35,\, 42,\, 45,\, 49,\, 49,\, 49,\, 49,\, 53,\, 56,\, 65,\, 69,\, 75,\, 75,\, 83,\, 83,\, 300

a

State the mean and median of the data.

Worked Solution

b

Identify the outlier.

Worked Solution

c

State the mean and median of the data without the outlier.

Worked Solution
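One way to check the answers to this question is with a short Python sketch. It assumes that 300 is the value treated as the outlier, since it lies far above every other score. The variable names are our own.

```python
from statistics import mean, median

# Bowling scores from the question.
scores = [12, 17, 23, 31, 35, 42, 45, 49, 49, 49,
          49, 53, 56, 65, 69, 75, 75, 83, 83, 300]

# Assumption: 300 is the outlier, so we drop it for part (c).
without_outlier = [s for s in scores if s != 300]

mean_all = mean(scores)
median_all = median(scores)
mean_trimmed = mean(without_outlier)
median_trimmed = median(without_outlier)

print("With outlier:    mean =", mean_all, " median =", median_all)
print("Without outlier: mean =", round(mean_trimmed, 2), " median =", median_trimmed)
```

With all 20 scores the mean is 63 and the median is 49. Without the score of 300 the mean falls to about 50.53, but the median is still 49, which shows how much more sensitive the mean is to a single extreme value.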

The data set 6,\,8,\,10,\,10,\,12 has measures of:

Mean = 9.2

Median = 10

Mode = 10

Range = 6

Suppose we add the number 20 to the data set. Predict how the addition of this outlier will affect the mean, median, mode, and range of the new data set.

a

Will the mean be higher, lower, or remain the same? Explain.

Worked Solution

b

Will the median be higher, lower, or remain the same? Explain.

Worked Solution

c

Will the mode be higher, lower, or remain the same? Explain.

Worked Solution

d

Will the range be higher, lower, or remain the same? Explain.

Worked Solution
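The predictions in this question can be checked with a small Python sketch that computes each measure directly from its definition. The function name `describe` is our own, and the mode calculation assumes the data set has a single mode, which is true here.

```python
from collections import Counter

def describe(values):
    """Compute mean, median, mode, and range directly from their definitions."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    # Median: the middle value, or the average of the two middle values.
    med = ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2
    # Mode: the most frequent value (assumes a single mode).
    most_common_value = Counter(ordered).most_common(1)[0][0]
    return {
        "mean": sum(ordered) / n,
        "median": med,
        "mode": most_common_value,
        "range": ordered[-1] - ordered[0],
    }

original = [6, 8, 10, 10, 12]
extended = original + [20]  # add the outlier from the question

stats_before = describe(original)
stats_after = describe(extended)
for name in stats_before:
    print(name, stats_before[name], "->", stats_after[name])
```

Adding 20 raises the mean from 9.2 to 11 and the range from 6 to 14, while the median and mode both stay at 10.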

Idea summary

An **outlier** is a data point that varies significantly from the rest of the data. An outlier will be a value that is either significantly larger or smaller than other observations.

Removing outliers will have the following effects on the summary statistics:

A really low outlier | A really high outlier |
---|---|
The range will decrease | The range will decrease |
The median might increase | The median might decrease |
The mean will increase | The mean will decrease |
The mode will not change | The mode will not change |