In statistics we often tend to assume that our data will fit some kind of trend, though we expect a certain amount of variability. This is why we look at measures of central tendency, such as the mean, median and mode, and measures of spread, such as the range and interquartile range.
An outlier is a value that appears inconsistent with the rest of the data. It may stem from an erroneous event that results in a score that is significantly above or below the rest of the data. Outliers can affect some of the measures of central tendency and spread, so we need to understand how to identify outliers and what to do when we find them in our data set.
For example, suppose there are ten people in a long jump contest. Nine of those people made a jump between $4$ and $5$ metres, while the remaining person only made a jump of $1$ metre. This last score is an outlier, as the jump is much shorter than the jumps of everyone else in the group.
In the example above, it seems a trivial exercise to identify the $1$ metre jump as an outlier, as it is so obviously at odds with the rest of the data set. When data is more spread out, however, it is necessary to define boundaries beyond which outliers are found.
In the previous lesson, we explored the five number summary and interquartile range of a data set. Once we know the minimum, lower quartile, median, upper quartile and maximum, we can use this information to determine whether a particular data point can be considered an outlier. To do this, we calculate values that we call the fences.
Once we have calculated the lower fence and the upper fence, any data point that falls between the fences is not an outlier. Any data point that falls outside the fences is considered an outlier.
A data point is classified as an outlier if it lies above the upper fence or below the lower fence. These are calculated as follows:
Lower fence | $=$ | $\text{lower quartile}-1.5\times\text{interquartile range}$ |
 | $=$ | $Q_1-1.5\times IQR$ |
Upper fence | $=$ | $\text{upper quartile}+1.5\times\text{interquartile range}$ |
 | $=$ | $Q_3+1.5\times IQR$ |
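The fence formulas above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function names `quartiles`, `fences` and `outliers` are hypothetical, the jump distances are made-up values in the spirit of the long jump example, and the quartiles are assumed to be the medians of the lower and upper halves of the ordered data (with the middle score left out of both halves when the number of scores is odd). Other quartile conventions exist and can give different fences.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: for an odd number of scores, the middle score is
    excluded from both halves. Other conventions exist.
    """
    if len(data) < 2:
        raise ValueError("need at least two scores")
    ordered = sorted(data)
    half = len(ordered) // 2
    lower_half = ordered[:half]    # scores below the median position
    upper_half = ordered[-half:]   # scores above the median position
    return median(lower_half), median(upper_half)


def fences(data):
    """Return (lower fence, upper fence) using the 1.5 x IQR rule."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(data):
    """Return the scores that lie outside the fences."""
    lower, upper = fences(data)
    return [x for x in data if x < lower or x > upper]


# Hypothetical jump distances in metres: nine between 4 and 5, one of 1.
jumps = [4.2, 4.5, 4.8, 4.1, 4.6, 4.9, 4.3, 4.7, 4.4, 1.0]

print(fences(jumps))
print(outliers(jumps))
```

With these made-up distances, only the $1$ metre jump falls outside the fences.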
Identify the outlier(s) in the data set $\left\{73,77,81,86,131\right\}$.
Consider the dot plot below.
Determine the median, lower quartile score and the upper quartile score.
Median $=$ $\editable{}$
Lower quartile $=$ $\editable{}$
Upper quartile $=$ $\editable{}$
Hence, calculate the interquartile range.
Calculate $1.5\times IQR$, where $IQR$ is the interquartile range.
An outlier is a score that is more than $1.5\times IQR$ above the upper quartile or more than $1.5\times IQR$ below the lower quartile. State the outlier.
Consider the following set of data:
$9$ $5$ $3$ $2$ $6$ $1$
Complete the five-number summary for this data set.
Minimum | $\editable{}$ |
Lower quartile | $\editable{}$ |
Median | $\editable{}$ |
Upper quartile | $\editable{}$ |
Maximum | $\editable{}$ |
Calculate the interquartile range.
Calculate the value of the lower fence.
Calculate the value of the upper fence.
Would the value $-3$ be considered an outlier?
No
Yes
Outliers can skew or change the shape of our data.
If we compare two data sets that are identical apart from one outlier, we will see the following:
In particular, adding an outlier to a set of data will change the shape as follows:
Adding a low outlier | Adding a high outlier |
---|---|
The range increases significantly | The range increases significantly |
The median could decrease a little | The median could increase a little |
The mean decreases significantly | The mean increases significantly |
The mode will not change | The mode will not change |
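The effects listed in the table can be checked on a small example. The sketch below uses Python's standard `statistics` module. The base data set and the two added outliers are hypothetical values chosen for illustration, and `summarise` is a helper name introduced here, not something from the lesson.

```python
from statistics import mean, median, mode


def summarise(data):
    """Return the mean, median, mode and range of a list of scores."""
    return {
        "mean": mean(data),
        "median": median(data),
        "mode": mode(data),
        "range": max(data) - min(data),
    }


# Hypothetical scores with no outliers.
base = [12, 14, 14, 15, 16, 17, 18]
with_low = base + [1]     # same data plus one low outlier
with_high = base + [40]   # same data plus one high outlier

for label, data in [("original", base),
                    ("low outlier", with_low),
                    ("high outlier", with_high)]:
    stats = summarise(data)
    print(label, {name: round(value, 2) for name, value in stats.items()})
```

For this data, the mean and range move a long way when either outlier is added, the median shifts by only half a score, and the mode stays at $14$.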
Consider the following set of data: $36,36,70,45,52,48,36,43,44,51$
(a) Calculate the mean of the data set, rounding your answer to one decimal place if necessary.
Think: The mean is the average of the scores.
Do:
$\text{Mean}$ | $=$ | $\frac{36+36+70+45+52+48+36+43+44+51}{10}$ |
 | $=$ | $\frac{461}{10}$ |
 | $=$ | $46.1$ |
(b) Calculate the median of the data set.
Think: The median is the middle score once the data set is ordered.
Do: Let's put the data in ascending order first: $36,36,36,43,44,45,48,51,52,70$
There are ten numbers in this data set, so the median will be the $5.5$th score. This will be the average of the fifth and sixth numbers in the set, and so we see that the median is $\frac{44+45}{2}=44.5$.
(c) Calculate the mode of the data set.
Think: Which score occurs most frequently?
Do: The mode is $36$.
(d) Calculate the range of the data set.
Think: The range is the difference between the highest and lowest scores.
Do: The range is $70-36=34$.
(e) Remove the outlier from the data set and recalculate the mean, rounding your answer to one decimal place if necessary.
Think: Is the highest or the lowest score significantly different from the other scores?
Do: We're going to remove $70$ from the data set because it is much higher than the other scores in the set.
Now let's calculate the mean. Remember that there are only nine scores in the set now:
$\text{Mean}$ | $=$ | $\frac{36+36+45+52+48+36+43+44+51}{9}$ |
 | $=$ | $43.444\ldots$ |
 | $=$ | $43.4$ ($1$ d.p.) |
Notice that this is substantially lower than the mean when the outlier was included.
(f) With the outlier removed from the data set, recalculate the median.
Think: The median will now be the fifth score in the set.
Do: So the median is $44$, which is only slightly lower than the median when the outlier was included.
(g) With the outlier removed from the data set, recalculate the mode.
Think: Which score occurs most frequently now?
Do: The mode hasn't changed, and is still $36$.
(h) With the outlier removed from the data set, recalculate the range.
Think: Let's calculate the difference between the new highest and lowest scores.
Do: The new range is $52-36=16$, which is substantially smaller than the range was with the outlier included.
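The answers to parts (a) to (h) can be double-checked with a short script. This is only a sketch using Python's standard `statistics` module. The variable names `scores` and `trimmed` are introduced here for illustration, and $70$ is removed by hand because it was chosen by inspection in part (e).

```python
from statistics import mean, median, mode

# Data set from the worked example.
scores = [36, 36, 70, 45, 52, 48, 36, 43, 44, 51]

# Same data with the score of 70 removed, as in part (e).
trimmed = [x for x in scores if x != 70]

for label, data in [("with outlier", scores), ("without outlier", trimmed)]:
    print(
        label,
        "mean:", round(mean(data), 1),
        "median:", median(data),
        "mode:", mode(data),
        "range:", max(data) - min(data),
    )
```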
Consider the following set of data:
$53,46,25,50,30,30,40,30,47,109$
Fill in this table of summary statistics.
Mean | $\editable{}$ |
---|---|
Median | $\editable{}$ |
Mode | $\editable{}$ |
Range | $\editable{}$ |
Which data value is an outlier?
Fill in this table of summary statistics after removing the outlier $109$.
Mean | $\editable{}$ |
---|---|
Median | $\editable{}$ |
Mode | $\editable{}$ |
Range | $\editable{}$ |
Let $A$ be the original data set and $B$ be the data set without the outlier.
Fill in this table using the symbols $>$, $<$ and $=$ to compare the statistics before and after removing the outlier.
With outlier | Without outlier | |
---|---|---|
Mean: | $A\,\editable{}\,B$ | |
Median: | $A\,\editable{}\,B$ | |
Mode: | $A\,\editable{}\,B$ | |
Range: | $A\,\editable{}\,B$ | |