3.05 Outliers

Lesson

In statistics, we often tend to assume that our data will fit some kind of trend, though we expect a certain amount of variability. This is why we look at measures of central tendency, such as the mean, median and mode, and measures of spread, such as the range and interquartile range.

An outlier is a value that appears inconsistent with the rest of the data. It may stem from an erroneous event that results in a score that is significantly above or below the rest of the data. Outliers can affect some of the measures of central tendency and spread, so we need to understand how to identify outliers and what to do when we find them in our data set.

For example, suppose there are ten people in a long jump contest. Nine of those people made a jump between $4$ and $5$ metres, while the other person made a jump of $1$ metre. This last score is an outlier, as the jump is much shorter than the jumps of everyone else in the group.

Quantifying outliers

Some outliers are more obvious than others, so there is a formal rule for deciding whether a particular data value is an outlier.

Determining outliers

A data point is classified as an outlier if it lies more than 1.5 interquartile ranges above the upper quartile or more than 1.5 interquartile ranges below the lower quartile. 

Below $Q_1-1.5\times\text{IQR}$

OR

Above $Q_3+1.5\times\text{IQR}$

These boundaries are also referred to as the "lower fence" and the "upper fence".
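
The fence rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `quartiles`, `fences` and `outliers` are our own, the jump distances are made-up values chosen to match the long jump description above, and the quartiles are taken as the medians of the lower and upper halves of the sorted data (other quartile conventions can give slightly different fences).

```python
from statistics import median


def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves method (an assumption)."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # scores below the median position
    upper = ordered[half + n % 2:]   # scores above it (middle score skipped if n is odd)
    return median(lower), median(ordered), median(upper)


def fences(data):
    """Return (lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(data):
    """Return every score that lies strictly outside the fences."""
    low, high = fences(data)
    return [x for x in data if x < low or x > high]


# Hypothetical jump distances in metres: nine between 4 and 5, one of 1.
jumps = [4.2, 4.5, 4.4, 4.8, 4.1, 4.9, 4.6, 4.3, 4.7, 1.0]
print(outliers(jumps))  # [1.0]
```

Only the $1$ metre jump falls below the lower fence, which agrees with the informal judgement made earlier.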

Practice questions

Question 1

Consider the dot plot below.

  1. Determine the median, lower quartile score and the upper quartile score.

    Median $=$ $\editable{}$

    Lower quartile $=$ $\editable{}$

    Upper quartile $=$ $\editable{}$

  2. Hence, calculate the interquartile range.

  3. Calculate $1.5\times\text{IQR}$, where IQR is the interquartile range.

  4. An outlier is a score that is more than $1.5\times\text{IQR}$ above the upper quartile or more than $1.5\times\text{IQR}$ below the lower quartile. State the outlier.

Question 2

Consider the following set of data:

$9,5,3,2,6,1$

  1. Complete the five-number summary for this data set.

    Minimum $\editable{}$
    Lower quartile $\editable{}$
    Median $\editable{}$
    Upper quartile $\editable{}$
    Maximum $\editable{}$
  2. Calculate the interquartile range.

  3. Calculate the value of the lower fence.

  4. Calculate the value of the upper fence.

  5. Would the value $-3$ be considered an outlier?

    A. No

    B. Yes

Question 3

Consider the following set of data:

$1,6,4,9,8,5,2$

  1. Complete the five-number summary for this data set.

    Minimum $\editable{}$
    Lower quartile $\editable{}$
    Median $\editable{}$
    Upper quartile $\editable{}$
    Maximum $\editable{}$
  2. Would the value $15$ be considered an outlier?

    A. Yes

    B. No

 

Box plots and outliers

Using the five-number summary and the upper and lower fences, we can construct a box-and-whisker plot and identify any outliers.

The above diagram shows how a box plot is constructed and where the upper and lower fences sit. Any data points outside the fences are outliers.

Worked example

Example 1

Consider the data set below:

$1,1,3,21,7,9,10,6,11$

a) Construct a five-number summary and determine if there are any outliers.

Think: The five-number summary consists of the minimum and maximum scores, the lower and upper quartiles, and the median. To find these values, we should first sort the data set to be in order.

Do: Ordering the data set, we have:

$1,1,3,6,7,9,10,11,21$

Now we can clearly see that the minimum is $1$ and the maximum is $21$.

This data set has an odd number of scores, so the median is the middle score of $7$.

To find the quartiles, we take the parts of the list on either side of the median. The lower part is $1,1,3,6$, and the upper part is $9,10,11,21$. From this, we find that $Q_1=\frac{1+3}{2}=2$, and $Q_3=\frac{10+11}{2}=10.5$.

So our five-number summary is:

Minimum $1$
$Q_1$ $2$
Median $7$
$Q_3$ $10.5$
Maximum $21$

We can now check for outliers by calculating the lower and upper fences. For this set, the IQR is $Q_3-Q_1=10.5-2=8.5$.

Lower fence $=Q_1-1.5\times\text{IQR}$
  $=2-1.5\times8.5$
  $=-10.75$
Upper fence $=Q_3+1.5\times\text{IQR}$
  $=10.5+1.5\times8.5$
  $=23.25$
 

Every score lies between $-10.75$ and $23.25$, so there are no outliers in this data set.

Reflect: Notice that although $21$ is quite a bit larger than the other scores in the set, it is not far enough away to be considered an outlier. We should always check the upper and lower fence values to determine if a score is an outlier or not.
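
The working in part a) can be checked with a short Python sketch. It is an illustration only: the variable names are our own, and it assumes the same quartile method as the working above, where the median is left out of both halves when the number of scores is odd.

```python
from statistics import median

data = [1, 1, 3, 21, 7, 9, 10, 6, 11]
ordered = sorted(data)
n = len(ordered)

# Split around the median; for an odd count the median itself is left out.
lower_half = ordered[: n // 2]
upper_half = ordered[(n + 1) // 2:]

summary = {
    "min": ordered[0],
    "q1": median(lower_half),
    "median": median(ordered),
    "q3": median(upper_half),
    "max": ordered[-1],
}

iqr = summary["q3"] - summary["q1"]
lower_fence = summary["q1"] - 1.5 * iqr
upper_fence = summary["q3"] + 1.5 * iqr

# Scores strictly outside the fences count as outliers.
found = [x for x in ordered if x < lower_fence or x > upper_fence]

print(summary)
print(lower_fence, upper_fence, found)
```

The list `found` comes back empty, matching the conclusion that $21$ is not an outlier for this set.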

b) Draw a box plot to represent this data.

Think: We have already constructed a five-number summary for this set, which we can use to draw the box plot.

Do:

Practice question

Question 4

A set of data has the following box plot.

[Box plot on a number line marked from 2 to 18 in steps of 2]

  1. Calculate the interquartile range.

  2. Calculate the value of the lower fence.

  3. Calculate the value of the upper fence.

Effects of outliers

Outliers can skew or change the shape of our data. In particular, they can distort the mean, standard deviation and variance, as these measures take every score in the data set into account.

If we compare two data sets that are identical apart from one outlier, we will see the following:

  • The set with an outlier will have a much larger range.
  • The median of each set will be close together and might be the same.
  • The mean of each set will be noticeably different.
  • The mode of each set will be the same since an outlier added to a data set will never be the most frequent score.

In particular, adding an outlier to a set of data will change the shape as follows:

| Adding a low outlier | Adding a high outlier |
| --- | --- |
| The range increases significantly | The range increases significantly |
| The median could decrease a little | The median could increase a little |
| The mean decreases significantly | The mean increases significantly |
| The mode will not change | The mode will not change |

 

Worked example

Example 2

Consider the following set of data: $36,36,70,45,52,48,36,43,44,51$

a) Calculate the mean of the data set, rounding your answer to one decimal place if necessary.

Think: The mean is the average of the scores.

Do:

$\text{Mean}=\frac{36+36+70+45+52+48+36+43+44+51}{10}$
  $=\frac{461}{10}$
  $=46.1$

 

b) Calculate the median of the data set.

Do: Let's put the data in ascending order first: $36,36,36,43,44,45,48,51,52,70$

There are ten numbers in this data set, so the median will be the $5.5$th score. This will be the average of the fifth and sixth numbers in the set, and so we see that the median is $\frac{44+45}{2}=44.5$.

c) Calculate the mode of the data set.

Do: The mode is $36$.

d) Calculate the range of the data set.

Do: The range is $70-36=34$.

e) Remove the outlier from the data set and recalculate the mean, rounding your answer to one decimal place if necessary.

Do: We're going to remove $70$ from the data set because it is much higher than the other scores in the set.

Now let's calculate the mean. Remember that there are only nine scores in the set now:

$\text{Mean}=\frac{36+36+45+52+48+36+43+44+51}{9}$
  $=43.444\ldots$
  $=43.4$ ($1$ d.p.)

 

Notice that this is substantially lower than the mean when the outlier was included.

f) With the outlier removed from the data set, recalculate the median.

Think: The median will now be the fifth score in the set.

Do: So the median is $44$, which is only slightly lower than the median when the outlier was included.

g) With the outlier removed from the data set, recalculate the mode.

Do: The mode hasn't changed, and is still $36$.

h) With the outlier removed from the data set, recalculate the range.

Do: The new range is $52-36=16$, which is substantially smaller than the range was with the outlier included.
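
The before-and-after comparison in this example can be reproduced with Python's built-in `statistics` module. This is a minimal sketch: the helper name `describe` is our own, and the score $70$ is removed by hand, as in the example, rather than detected by the fence rule.

```python
from statistics import mean, median, mode


def describe(scores):
    """Return the four summary statistics used in this example."""
    return {
        "mean": round(mean(scores), 1),    # rounded to one decimal place
        "median": median(scores),
        "mode": mode(scores),
        "range": max(scores) - min(scores),
    }


scores = [36, 36, 70, 45, 52, 48, 36, 43, 44, 51]
without_70 = [x for x in scores if x != 70]  # drop the score treated as the outlier

before = describe(scores)
after = describe(without_70)

for name in before:
    print(name, before[name], after[name])
```

The printed rows show the pattern described above: the mean and range change noticeably, the median moves only slightly, and the mode stays the same.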

Practice question

Question 5

Consider the following set of data:

$53,46,25,50,30,30,40,30,47,109$

  1. Fill in this table of summary statistics.

    Mean $\editable{}$
    Median $\editable{}$
    Mode $\editable{}$
    Range $\editable{}$
  2. Which data value is an outlier?

  3. Fill in this table of summary statistics after removing the outlier $109$.

    Mean $\editable{}$
    Median $\editable{}$
    Mode $\editable{}$
    Range $\editable{}$
  4. Let $A$ be the original data set and $B$ be the data set without the outlier.

    Fill in this table using the symbols $>$, $<$ and $=$ to compare the statistics before and after removing the outlier.

      With outlier ($A$) compared with without outlier ($B$)
    Mean: $A\ \editable{}\ B$
    Median: $A\ \editable{}\ B$
    Mode: $A\ \editable{}\ B$
    Range: $A\ \editable{}\ B$

Outcomes

MA12-8

solves problems using appropriate statistical processes
