9.03 Outliers and measures of centre

Lesson

Identifying outliers

An outlier is a data point that differs significantly from the body of the data: a value that is either much larger or much smaller than the other observations. In this lesson, we will visually identify outliers and examine their impact on the measures of centre. Outliers are important to identify because they point to unusual data that may require further investigation. For example, if we had data on the temperature of a volcanic lake and found a high outlier, it would be worth investigating whether the sensor is faulty or whether a nearby town needs to be prepared for evacuation.

Exploration

Consider the dot plot below. We would call $9$ an outlier as it is well above the body of the data.

Practice questions

Question 1

Identify the outlier(s) in the data set $\left\{73,77,81,86,131\right\}$.

Question 2

The graph shows the annual net profit (in millions) of a company over several years. Identify the year in which the annual net profit is an outlier.

[Graph: Net Profit (millions), axis from 10 to 120, plotted against Year, 2004 to 2009]

 

Effect of outliers

Once we identify an outlier, we should investigate its underlying cause. If the outlier is simply a mistake, it should be removed from the data. Mistakes often occur when data is recorded or transferred by hand, or when a survey respondent does not take the questionnaire seriously. If the outlier is not a mistake, it should not be removed from the data set: although it is unusual, it is representative of possible outcomes. For example, you would not remove a very tall student's height from the data for a class just because it was unusual for the class.

When data contains an outlier we should be aware of its impact on any calculations we make. Let's look at the effect that outliers have on the three measures of centre - mean, median and mode:

Measure of centre Effect of outlier
Mean

The mean will be significantly affected by the inclusion of an outlier:

  • Including a high outlier will increase the mean
  • Including a low outlier will decrease the mean
Median

The median is the middle value of a data set. Including an outlier will not generally have a significant impact on the median unless there is a large gap in the centre of the data. Thus, generally:

  • Including a high outlier may increase the median slightly or it may remain unchanged
  • Including a low outlier may decrease the median slightly or it may remain unchanged
Mode

The mode is the most frequent value. As an outlier is an unusual value, it will not be the mode. Hence:

  • The inclusion of an outlier will have no impact on the mode
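
The effects summarised above can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the data set and the helper name `centre_summary` are made up for illustration and do not come from the practice questions.

```python
from statistics import mean, median, mode


def centre_summary(values):
    """Return the mean, median and mode of a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
    }


# Hypothetical data: a small body of values, then the same values
# with one high outlier (30) included.
body = [4, 5, 5, 6, 7]
with_outlier = body + [30]

before = centre_summary(body)
after = centre_summary(with_outlier)

# Print each measure of centre without and with the outlier.
for name in ("mean", "median", "mode"):
    print(f"{name}: {before[name]} without outlier, {after[name]} with outlier")
```

For this particular data, including the high outlier raises the mean from 5.4 to 9.5, raises the median only slightly from 5 to 5.5, and leaves the mode at 5.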

Practice questions

Question 3

A set of data has a mean of $x$. The outlier is removed and the mean rises. The outlier must have had:

    A. a value, but we cannot tell if it was larger or smaller
    B. a value smaller than the values that remain
    C. a value larger than the values that remain

Question 4

Consider the following set of data:

$37,46,35,56,56,35,125,36,48,56$

  1. Fill in this table of summary statistics.

    Mean $\editable{}$
    Median $\editable{}$
    Mode $\editable{}$
  2. Which data value is an outlier?

  3. Fill in this table of summary statistics after removing the outlier.

    Mean $\editable{}$
    Median $\editable{}$
    Mode $\editable{}$
  4. Let $A$ be the original data set and $B$ be the data set without the outlier.

    Fill in this table using the symbols $>$, $<$ and $=$ to compare the statistics before and after removing the outlier.

      With outlier Without outlier
    Mean: $A\editable{}B$
    Median: $A\editable{}B$
    Mode: $A\editable{}B$

 

The suitability of a measure of centre

We can use the mean, median or mode to describe the centre of a data set. Sometimes one measure may better represent the data than another and sometimes we want just one statistic for an article or report rather than detail on the different measures. When deciding which to use we need to ask ourselves which measure would best represent the type of data we have. Some main considerations are:

  • Is there a repeated value? If there are no repeated values, or only a couple of randomly repeated values, then the mode will not be representative of the data. If there are one or two highly frequent data points, these may be a fair representation of the centre of the data.
  • Is there an outlier? As we have seen, an outlier will significantly affect the mean, which may give a distorted view of the centre of the data. For example, if we had a list of houses sold in an area and a historic mansion was sold for a price well above the other houses in the area, then the median would be a better representation of average house prices in the area than the mean.
  • Do you need all the data values to be taken into account? Only the mean uses all the values in its calculation. 
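
To make the house price example concrete, here is a short sketch in Python. The sale prices are invented for illustration (in thousands of dollars, with the last value standing in for the historic mansion); they are not real data.

```python
from statistics import mean, median

# Hypothetical sale prices in thousands of dollars.
# The final value represents the historic mansion (the outlier).
prices = [420, 450, 465, 480, 510, 2900]

mean_price = mean(prices)
median_price = median(prices)

# Count how many houses sold for less than the mean price.
below_mean = sum(1 for p in prices if p < mean_price)

print(f"mean:   {mean_price:.1f}")
print(f"median: {median_price:.1f}")
print(f"{below_mean} of {len(prices)} houses sold for less than the mean")
```

With these made-up prices, five of the six houses sold for less than the mean, so the mean describes none of the typical sales well, while the median sits among them.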

 

Practice questions

Question 5

The salaries of part-time employees at a company are given in the dot plot below. Which measure of centre best reflects the typical wage of a part-time employee?

    A. The mean.
    B. The mode.
    C. The median.

Question 6

A journalist wanted to report on road speed cameras being used as revenue raisers. She obtained data showing the number of fines that each of $20$ speed cameras issued to motorists in one month. The results were:

$101,102,115,115,121,124,127,128,130,130,143,143,146,162,162,163,178,183,194,977$

  1. Determine the mean number of times a speed camera issued a fine in that month. Give your answer correct to one decimal place.

  2. Determine the median number of times a speed camera issued a fine in that month. Give your answer correct to one decimal place.

  3. Which measure is most representative of the number of fines issued by each speed camera in one month?

    A. the mean
    B. the median
  4. Which score causes the mean to be much greater than the median?

  5. The journalist wants to give the impression that speed cameras are just being used to raise revenue. Which statement should she make?

    A. A sample of $20$ speed cameras found that the median number of fines in one month was $136.5$.
    B. A sample of $20$ speed cameras found that, on average, $182.2$ fines were issued in one month.

Outcomes

ACMEM047

recognise and identify outliers

ACMEM051

investigate the suitability of measures of central tendency in various real-world contexts

ACMEM052

investigate the effect of outliers on the mean and the median

ACMEM054

use informal ways of describing spread, such as spread out/dispersed, tightly packed, clusters, gaps, more/less dense regions, outliers
