An outlier is a data point that differs significantly from the body of the data: a value that is either much larger or much smaller than the other observations. In this lesson, we will identify outliers visually and examine their impact on the measures of centre. Outliers are important to identify because they point to unusual data that may require further investigation. For example, if we had data on the temperature of a volcanic lake and found a high outlier, it would be worth investigating whether the sensor is faulty or whether we need to prepare a nearby town for evacuation.
Consider the dot plot below. We would call $9$ an outlier as it is well above the body of the data.
Identify the outlier(s) in the data set $\left\{73,77,81,86,131\right\}$.
To determine whether a data value is an outlier, we can use a rule based on the interquartile range (IQR). This rule calculates the upper and lower fences of a box plot. The fences are the upper and lower boundaries of the data, and any score that lies outside the fences is classified as an outlier.
A data point is classified as an outlier if it lies above the upper fence or below the lower fence. These are calculated as follows:
Lower fence $=$ Lower quartile $-1.5\times$ Interquartile range
Upper fence $=$ Upper quartile $+1.5\times$ Interquartile range
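The two fence formulas can be sketched as a short Python function. This is only an illustration: the function name `fences`, the parameter `k` and the sample quartile values are assumptions made for this example, not part of the lesson.

```python
def fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) for the given quartiles."""
    iqr = q3 - q1              # interquartile range
    lower = q1 - k * iqr       # lower fence
    upper = q3 + k * iqr       # upper fence
    return lower, upper


# Hypothetical quartiles, chosen only to illustrate the calculation.
lower, upper = fences(10, 20)
print(lower, upper)  # -5.0 35.0
```

Any score below `lower` or above `upper` would be classified as an outlier under this rule.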
Using the five-number summary and the upper and lower fences, we can construct a box plot and identify any outliers.
The above diagram shows the construction of the box plots and how the upper and lower fences are constructed. Any data points outside of these are outliers.
Consider the data set below:
$1,1,3,21,7,9,10,6,11$
Think: The five number summary consists of the minimum and maximum scores, the lower and upper quartiles, and the median. To find these values, we should first sort the data set to be in order.
Do: Ordering the data set, we have:
$1,1,3,6,7,9,10,11,21$
Now we can clearly see that the minimum is $1$ and the maximum is $21$.
This data set has an odd number of scores, so the median is the middle score, $7$.
To find the quartiles, we take the parts of the list on either side of the median. The lower part is $1,1,3,6$, and the upper part is $9,10,11,21$. Each part has an even number of scores, so its median is the mean of its two middle scores. From this, we find that $Q1=\frac{1+3}{2}=2$, and $Q3=\frac{10+11}{2}=10.5$.
So our five number summary is:
Minimum | $1$ |
Q1 | $2$ |
Median | $7$ |
Q3 | $10.5$ |
Maximum | $21$ |
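The steps used to build this summary can be sketched in Python. The sketch assumes the same method as the worked example: each quartile is the median of the scores on one side of the median, and the median itself is left out of both halves when the number of scores is odd. The function name `five_number_summary` is an assumption made for illustration.

```python
from statistics import median


def five_number_summary(data):
    """Five-number summary, with quartiles found as medians of each half."""
    values = sorted(data)                  # order the scores first
    n = len(values)                        # assumes at least two scores
    half = n // 2
    lower_half = values[:half]             # scores below the median position
    # For an odd count, skip the middle score; for an even count, split evenly.
    upper_half = values[half + 1:] if n % 2 else values[half:]
    return {
        "min": values[0],
        "q1": median(lower_half),
        "median": median(values),
        "q3": median(upper_half),
        "max": values[-1],
    }


summary = five_number_summary([1, 1, 3, 21, 7, 9, 10, 6, 11])
print(summary)
```

Statistical libraries often use other quartile conventions by default, so their quartiles may differ slightly from the values found by this method.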
We can now check for outliers by calculating the lower and upper fences. For this set, the $IQR$ is $Q3-Q1=10.5-2=8.5$.
Lower fence | $=$ | $Q1-1.5\times IQR$ |
 | $=$ | $2-1.5\times8.5$ |
 | $=$ | $-10.75$ |
Upper fence | $=$ | $Q3+1.5\times IQR$ |
 | $=$ | $10.5+1.5\times8.5$ |
 | $=$ | $23.25$ |
Every score in the set lies between $-10.75$ and $23.25$, so there are no outliers in this data set.
Reflect: Notice that although $21$ is quite a bit larger than the other scores in the set, it is not far enough away to be considered an outlier. We should always check the upper and lower fence values to determine if a score is an outlier or not.
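As a quick check of this worked example, the fence test can be sketched in Python. The quartile values are taken from the five-number summary above, and the variable names are assumptions made for illustration.

```python
data = [1, 1, 3, 21, 7, 9, 10, 6, 11]
q1, q3 = 2, 10.5                   # quartiles from the five-number summary

iqr = q3 - q1                      # 8.5
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the scores that fall outside the fences.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)  # -10.75 23.25 []
```

The empty list agrees with the conclusion that $21$ is not an outlier, since it lies below the upper fence.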
Think: We have already constructed a five-number summary for this set, which we can use to draw the box plot.
Do:
Consider the dot plot below.
Determine the median, lower quartile score and the upper quartile score.
Median $=$ $\editable{}$
Lower quartile $=$ $\editable{}$
Upper quartile $=$ $\editable{}$
Hence, calculate the interquartile range.
Calculate $1.5\times IQR$, where $IQR$ is the interquartile range.
An outlier is a score that is more than $1.5\times IQR$ above the upper quartile or below the lower quartile. State the outlier.
Once an outlier is identified, its underlying cause should be investigated. If the outlier is simply a mistake, it should be removed from the data. Such mistakes often occur when data is recorded or transferred by hand, or when a survey respondent does not take the questionnaire seriously. If the outlier is not a mistake, it should not be removed from the data set: although it is unusual, it is representative of possible outcomes. For example, you would not remove a very tall student's height from the data for a class just because it was unusual for the class.
When data contains an outlier, we should be aware of its impact on any calculations we make. Let's look at the effect that outliers have on the three measures of centre: mean, median and mode.
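Measure of centre | Effect of outlier |
---|---|
Mean | The mean will be significantly affected by the inclusion of an outlier: a high outlier increases the mean, and a low outlier decreases it. |
Median | The median is the middle value of a data set, so the inclusion of an outlier will not generally have a significant impact on the median unless there is a large gap in the centre of the data. Thus, generally, a high outlier leaves the median unchanged or increases it slightly, and a low outlier leaves it unchanged or decreases it slightly. |
Mode | The mode is the most frequent value. As an outlier is an unusual value, it will not be the mode. Hence, including an outlier does not change the mode. |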
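The effect of an outlier on each measure of centre can be checked numerically. The short Python sketch below uses a small hypothetical data set, invented for illustration and not taken from the lesson, and compares the three measures with and without a high outlier.

```python
from statistics import mean, median, mode

without_outlier = [2, 3, 3, 4, 5]        # hypothetical scores
with_outlier = without_outlier + [40]    # the same scores plus a high outlier

for label, scores in [("without", without_outlier), ("with", with_outlier)]:
    print(label, mean(scores), median(scores), mode(scores))
# without 3.4 3 3
# with 9.5 3.5 3
```

In this example the mean rises sharply, the median rises only slightly, and the mode does not change.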
Consider the following set of data:
$37,46,35,56,56,35,125,36,48,56$
Fill in this table of summary statistics.
Mean | $\editable{}$ |
---|---|
Median | $\editable{}$ |
Mode | $\editable{}$ |
Which data value is an outlier?
Fill in this table of summary statistics after removing the outlier.
Mean | $\editable{}$ |
---|---|
Median | $\editable{}$ |
Mode | $\editable{}$ |
Let $A$ be the original data set and $B$ be the data set without the outlier.
Fill in this table using the symbols $>$, $<$ and $=$ to compare the statistics before and after removing the outlier.
Statistic | With outlier ($A$) compared with without outlier ($B$) |
---|---|
Mean | $A\editable{}B$ |
Median | $A\editable{}B$ |
Mode | $A\editable{}B$ |
The mean, median and mode can all be used to describe the centre of a data set. Sometimes one measure represents the data better than another, and sometimes we want just one statistic for an article or report rather than detail on the different measures. When deciding which to use, consider which measure would best represent the data, taking into account features such as its shape and skew. Some main considerations are:
The salaries of part-time employees at a company are given in the dot plot below. Which measure of centre best reflects the typical wage of a part-time employee?
The mean.
The mode.
The median.