
11.05 Outliers

Lesson

In statistics, we tend to assume that our data will fit some kind of trend and that most things will fit into a "normal" range. This is why we look at measures of central tendency, such as the mean, median and mode.

An outlier is a score that is very different from the rest of the data: it sits well above or well below the other scores. For example, if there are five people in a group and four of them are between $120$ cm and $130$ cm tall, whereas Jim is $165$ cm tall, Jim would be an outlier as he is much taller than everyone else in the group.

 

Effects of Outliers

Outliers can skew or change the shape of our data. This can be a problem (especially for small data sets) because the mean, median and range might not properly represent the situation. We can counteract this by removing outliers. Removing outliers will have the following effects.

If we remove...
A really low outlier | A really high outlier
The range will decrease | The range will decrease
The median might increase | The median might decrease
The mean will increase | The mean will decrease
The mode will not change | The mode will not change
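
The effects listed above can be checked with a short Python sketch. The data set `scores` is hypothetical (it does not come from this lesson) and the helper name `summary` is only illustrative; the sketch assumes Python's standard `statistics` module.

```python
from statistics import mean, median, mode

def summary(data):
    """Return the mean, median, mode and range of a list of scores."""
    return {
        "mean": mean(data),
        "median": median(data),
        "mode": mode(data),
        "range": max(data) - min(data),
    }

# Hypothetical scores: 3 is a really low outlier.
scores = [3, 20, 21, 21, 22, 24, 25]
without_low = [x for x in scores if x != 3]

before = summary(scores)
after = summary(without_low)

# Print each statistic before and after removing the low outlier.
for name in before:
    print(name, before[name], "->", after[name])
```

For this particular data set the range gets smaller, the mean and median both go up, and the mode stays the same. Other data sets may behave slightly differently for the median, which is why the table says it "might" change.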

 

Worked example

Consider the following set of data:

$36, 36, 70, 45, 52, 48, 36, 43, 44, 51$

A) Calculate the mean of the data set. Leave your answer to one decimal place if necessary.

Think: The mean is the average of the scores.

Do:

$\text{Mean} = \frac{36+36+70+45+52+48+36+43+44+51}{10}$
  $= \frac{461}{10}$
  $= 46.1$
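
As a quick check of the arithmetic, here is a minimal Python sketch of the same calculation. The variable names `scores`, `total` and `mean` are chosen only for illustration.

```python
# The ten scores from the worked example.
scores = [36, 36, 70, 45, 52, 48, 36, 43, 44, 51]

total = sum(scores)          # add up all the scores
mean = total / len(scores)   # divide by how many scores there are

print(total)           # 461
print(round(mean, 1))  # 46.1
```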

 

B) Calculate the median of the data set. Leave your answer to one decimal place if necessary.

Think: The median is the middle score once the data set is ordered.

Do:

Let's put the data in ascending order first:

$36, 36, 36, 43, 44, 45, 48, 51, 52, 70$

There are ten numbers in this data set, so the median will be the average of the fifth and sixth numbers in the ordered set:

$\frac{44+45}{2} = 44.5$
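
The same steps (sort the scores, then pick the middle one or average the middle two) can be sketched in Python. This is only an illustration of the method above; the variable names are assumptions.

```python
# The ten scores from the worked example.
scores = [36, 36, 70, 45, 52, 48, 36, 43, 44, 51]

ordered = sorted(scores)  # put the scores in ascending order
n = len(ordered)

if n % 2 == 0:
    # Even number of scores: average the two middle scores.
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
else:
    # Odd number of scores: take the single middle score.
    median = ordered[n // 2]

print(median)  # 44.5
```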

 

C) Calculate the mode of the data set.

Think: Which score occurs most frequently?

Do: $36$

 

D) Calculate the range of the data set.

Think: The range is the difference between the highest and lowest scores.

Do: $70 - 36 = 34$
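
The mode and range found in parts C and D can be checked with a small Python sketch. It assumes the standard `collections.Counter` class to count how often each score occurs; the variable names are illustrative.

```python
from collections import Counter

# The ten scores from the worked example.
scores = [36, 36, 70, 45, 52, 48, 36, 43, 44, 51]

# Count how many times each score appears and take the most frequent one.
counts = Counter(scores)
mode, frequency = counts.most_common(1)[0]

# The range is the highest score minus the lowest score.
data_range = max(scores) - min(scores)

print(mode, frequency)  # 36 3
print(data_range)       # 34
```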

 

E) Remove the outlier from the data set and recalculate the mean. Leave your answer to one decimal place if necessary.

Think: Is the highest or the lowest score significantly different from the other scores?

Do:

We're going to remove $70$ from the data set because it is much higher than the other scores in the set.

Now let's calculate the mean. Remember that there are only nine scores in the set now:

$\text{Mean} = \frac{36+36+45+52+48+36+43+44+51}{9}$
  $= \frac{391}{9}$
  $= 43.444\ldots$
  $\approx 43.4$

F) Is this higher or lower than the mean that was calculated with the outlier?

Think: Let's compare the two means.

Do: The mean is lower than the mean that was calculated with the outlier.

 

G) With the outlier removed from the data set, recalculate the median.

Think: With the outlier removed there are nine scores, so the median will now be the fifth score in the ordered set.

Do: $44$

 

H) Is this higher or lower than the median that was calculated with the outlier?

The median is lower than the median that included the outlier.

 

I) With the outlier removed from the data set, recalculate the mode.

Think: Which score occurs most frequently now?

Do: $36$

 

J) Is this higher or lower than the mode that was calculated with the outlier?

This is the same as the mode that was calculated with the outlier.

 

K) With the outlier removed from the data set, recalculate the range.

Think: Let's calculate the difference between the new highest and lowest scores.

Do: $52 - 36 = 16$

 

L) Is this greater or smaller than the range that was calculated with the outlier?

This range is smaller than the range that was calculated with the outlier.
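
The whole worked example can be summarised in one Python sketch that computes all four statistics with and without the outlier. It assumes Python's standard `statistics` module; the helper name `describe` is illustrative, and the outlier $70$ is removed by hand because it was identified by inspection in part E.

```python
from statistics import mean, median, mode

# The ten scores from the worked example.
scores = [36, 36, 70, 45, 52, 48, 36, 43, 44, 51]

# Remove the outlier (70) that was identified by inspection.
trimmed = [x for x in scores if x != 70]

def describe(data):
    """Return (mean to 1 d.p., median, mode, range) for a list of scores."""
    return (round(mean(data), 1), median(data), mode(data), max(data) - min(data))

print(describe(scores))   # (46.1, 44.5, 36, 34)
print(describe(trimmed))  # (43.4, 44, 36, 16)
```

Comparing the two results matches parts F to L: after removing the high outlier the mean, median and range are all lower, and the mode is unchanged.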

 

Practice questions

Question 1

Identify the outlier in the data set:

$5, 6, 7, 8, 9, 9, 18$

Question 2

The dot plot shows the temperature (in $^\circ\text{C}$) in a town over a period of several weeks. Identify the temperature that is an outlier.

Question 3

Consider the following set of data:

$27, 50, 24, 37, 47, 41, 27, 126, 44, 27$

  1. Fill in this table of summary statistics.

    Mean: ______
    Median: ______
    Mode: ______
    Range: ______
  2. Which data value is an outlier?

  3. Fill in this table of summary statistics after removing the outlier.

    Mean: ______
    Median: ______
    Mode: ______
    Range: ______
  4. Let $A$ be the original data set and $B$ be the data set without the outlier.

    Fill in this table using the symbols $>$, $<$ and $=$ to compare the statistics before and after removing the outlier.

      With outlier ($A$) compared with without outlier ($B$)
    Mean: $A$ ____ $B$
    Median: $A$ ____ $B$
    Mode: $A$ ____ $B$
    Range: $A$ ____ $B$

Outcomes

MA4-20SP

analyses single sets of data using measures of location, and range
