In statistics, we tend to assume that our data will follow some kind of trend and that most values will fall within a "normal" range. This is why we look at measures of center, such as the mean, median, and mode.
A measure of center is a way to describe where the center of a set of data is. However, not all measures describe the center in the same way, and some measures are heavily affected by extreme data values, or outliers.
Drag the blue point to a position that is either much larger or smaller than other data points.
What do you notice about the mean as you move the position of the blue point to be much larger than the data?
What do you notice about the mean as you move the position of the blue point to be much smaller than the data?
What do you notice about the median as you move the position of the blue point to be much larger than the data?
What do you notice about the median as you move the position of the blue point to be much smaller than the data?
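The same exploration can be tried numerically. The following is a minimal sketch, assuming a small made-up data set: the values in `fixed_points` and the helper name `summarize_with_point` are illustrative and are not taken from the applet. It places the movable point at several positions and prints the mean and median each time.

```python
from statistics import mean, median

# Hypothetical fixed data points (illustrative only, not from the applet).
fixed_points = [4, 5, 6, 7, 8]

def summarize_with_point(blue_point):
    """Return (mean, median) of the fixed points plus one movable point."""
    data = fixed_points + [blue_point]
    return mean(data), median(data)

# Try a central position, then positions far above and far below the data.
for position in [6, 20, 100, -100]:
    m, med = summarize_with_point(position)
    print(f"blue point = {position:>5}: mean = {m:.2f}, median = {med}")
```

In this sketch the median is 6.5 whether the point sits at 20 or at 100, while the mean keeps moving toward the extreme value.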
Outliers are data points that lie far outside the majority of a data set. They can significantly affect the mean and the range, while the median and mode usually change little or not at all.
Often, analysts choose to remove outliers to better understand the trends in the majority of the data, providing a more accurate picture of the patterns within a data set.
Outliers are important to identify because they point to unusual pieces of data that may require further investigation. For example, if we had data on the temperature in a volcanic lake and we had a high outlier, it would be worth investigating whether the sensor is faulty or whether we need to prepare a nearby town for evacuation.
The following data sets show some examples of outliers:
Identify the outlier in the data set:
63,\, 67,\, 71,\, 76,\, 111
The stem and leaf plot shows the number of hours worked per week by a group of people. Identify the outlier(s).
Sarah celebrated her 13th birthday at a bowling alley. She invited 20 friends, and they played a game of bowling. The scores for the game were: 12,\, 17,\, 23,\, 31,\, 35,\, 42,\, 45,\, 49,\, 49,\, 49,\, 49,\, 53,\, 56,\, 65,\, 69,\, 75,\, 75,\, 83,\, 83,\, 300
State the mean and median of the data.
Identify the outlier.
State the mean and median of the data without the outlier.
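The answers to these parts can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; it assumes the outlier is the single largest score, and the variable names are illustrative.

```python
from statistics import mean, median

# Bowling scores from the question above.
scores = [12, 17, 23, 31, 35, 42, 45, 49, 49, 49,
          49, 53, 56, 65, 69, 75, 75, 83, 83, 300]

# Assumption: the outlier is taken to be the single largest score.
outlier = max(scores)
trimmed = [s for s in scores if s != outlier]

print(f"With outlier:    mean = {mean(scores):.2f}, median = {median(scores)}")
print(f"Without outlier: mean = {mean(trimmed):.2f}, median = {median(trimmed)}")
```

Removing the score of 300 lowers the mean from 63 to about 50.5, while the median stays at 49.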
The data set 6,\,8,\,10,\,10,\,12 has measures of:
Mean = 9.2
Median = 10
Mode = 10
Range = 6
Suppose we add the number 20 to the data set. Predict how the addition of this outlier will affect the mean, median, mode, and range of the new data set.
Will the mean be higher, lower, or remain the same? Explain.
Will the median be higher, lower, or remain the same? Explain.
Will the mode be higher, lower, or remain the same? Explain.
Will the range be higher, lower, or remain the same? Explain.
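After making the predictions, they can be checked with a short script. This is a minimal sketch, assuming Python 3.8 or later for `statistics.multimode`; the helper name `summary` is illustrative.

```python
from statistics import mean, median, multimode

def summary(data):
    """Return the mean, median, mode(s), and range of a list of numbers."""
    return {
        "mean": mean(data),
        "median": median(data),
        "modes": multimode(data),
        "range": max(data) - min(data),
    }

original = [6, 8, 10, 10, 12]
with_outlier = original + [20]  # add the outlier from the question

before = summary(original)
after = summary(with_outlier)

# Print each measure before and after the outlier is added.
for name in before:
    print(f"{name:>6}: {before[name]} -> {after[name]}")
```

Returning a dictionary keeps the four measures together, so the same helper can be reused to compare any data set before and after a value is added or removed.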
An outlier is a data point that varies significantly from the rest of the data: its value is either much larger or much smaller than the other observations.
Removing outliers will have the following effects on the summary statistics:
| Removing a really low outlier | Removing a really high outlier |
|---|---|
| The range will decrease | The range will decrease |
| The median might increase | The median might decrease |
| The mean will increase | The mean will decrease |
| The mode will not change | The mode will not change |
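One way to see why the mean moves in the direction given in the table is to write the new mean in terms of the old one. The sketch below assumes a data set of n values with mean \bar{x} from which a single outlier x_o is removed; these symbols are introduced here for illustration only.

```latex
\bar{x}_{\text{new}} = \frac{n\bar{x} - x_o}{n - 1} = \bar{x} + \frac{\bar{x} - x_o}{n - 1}
```

If the outlier is really low, then x_o < \bar{x}, so the second term is positive and the mean increases. If the outlier is really high, the second term is negative and the mean decreases. For the bowling scores, n = 20, \bar{x} = 63 and x_o = 300, which gives 63 + (63 - 300)/19 \approx 50.5.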