An outlier is a data point that differs significantly from the body of the data: a value that is either much larger or much smaller than the other observations. In this lesson, we will visually identify outliers and examine their impact on the measures of centre. Outliers are important to identify because they point to unusual data that may require further investigation. For example, if we had data on the temperature of a volcanic lake and found a high outlier, it would be worth investigating whether the sensor is faulty or whether we need to prepare a nearby town for evacuation.
Consider the dot plot below. We would call $9$ an outlier as it is well above the body of the data.
Identify the outlier(s) in the data set $\left\{73,77,81,86,131\right\}$.
Once we identify an outlier, we should investigate its underlying cause. If the outlier is simply a mistake, it should be removed from the data. Mistakes often occur when recording or transferring data by hand, or when a survey respondent does not take the questionnaire seriously. If the outlier is not a mistake, it should not be removed from the data set: although unusual, it is representative of possible outcomes. For example, you would not remove a very tall student's height from the data for a class just because it was unusual for the class.
When data contains an outlier we should be aware of its impact on any calculations we make. Let's look at the effect that outliers have on the three measures of centre - mean, median and mode:
Measure of centre | Effect of outlier |
---|---|
Mean | The mean will be significantly affected by the inclusion of an outlier: a high outlier pulls the mean up and a low outlier pulls it down. |
Median | The median is the middle value of a data set, so the inclusion of an outlier will not generally have a significant impact on the median unless there is a large gap in the centre of the data. |
Mode | The mode is the most frequent value. As an outlier is an unusual value, it will not be the mode, so the mode is generally unaffected by an outlier. |
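The effects summarised above can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the data set and the choice of $20$ as the outlier are hypothetical values picked for illustration and do not come from the lesson.

```python
from statistics import mean, median, multimode

# Hypothetical data set (an assumption for illustration):
# 20 sits well above the body of the data.
data = [2, 3, 3, 4, 5, 20]
outlier = 20

# Same data with the outlier removed.
without_outlier = [x for x in data if x != outlier]


def summarise(values):
    """Return the three measures of centre for a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": multimode(values),  # a list, since data can have several modes
    }


with_stats = summarise(data)
without_stats = summarise(without_outlier)

for name in ("mean", "median", "mode"):
    print(f"{name}: with outlier = {with_stats[name]}, "
          f"without outlier = {without_stats[name]}")
```

For this data, removing the high outlier drops the mean from about $6.17$ to $3.4$, shifts the median only slightly from $3.5$ to $3$, and leaves the mode at $3$.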
A set of data has a mean of $x$. The outlier is removed and the mean rises. The outlier must have had:
a value, but we cannot tell if it was larger or smaller
a value smaller than the values that remain
a value larger than the values that remain
Consider the following set of data:
$37,46,35,56,56,35,125,36,48,56$
Fill in this table of summary statistics.
Mean | $\editable{}$ |
---|---|
Median | $\editable{}$ |
Mode | $\editable{}$ |
Which data value is an outlier?
Fill in this table of summary statistics after removing the outlier.
Mean | $\editable{}$ |
---|---|
Median | $\editable{}$ |
Mode | $\editable{}$ |
Let $A$ be the original data set and $B$ be the data set without the outlier.
Fill in this table using the symbols $>$, $<$ and $=$ to compare the statistics before and after removing the outlier.
Statistic | With outlier ($A$) compared to without outlier ($B$) |
---|---|
Mean | $A\editable{}B$ |
Median | $A\editable{}B$ |
Mode | $A\editable{}B$ |