
8.05 Outliers and measures of center

Outliers

An outlier is a data point that varies significantly from the rest of the data. An outlier will be a value that is either significantly larger or smaller than other observations.

Consider the dot plot below. We would call 9 an outlier as it is well above the rest of the data.

The image shows a dot plot with numbers 0 to 10 on the axis. Ask your teacher for more information.

Examples

Example 1

Identify the outlier(s) in the data set: 63,\,67,\,71,\,76,\,111

Worked Solution
Create a strategy

Choose the value that is much smaller or greater than the rest of the data set.

Apply the idea

\text{Outlier} = 111

Example 2

Rochelle recorded the heights (in centimeters) of all the students in her class. She recorded the following: 148,\,153,\,137,\,142,\,140,\,146,\,136,\,143,\,135,\,144,\,189,\,138,\,149,\,139,\,145,\,150

Identify the outlier.

Worked Solution
Create a strategy

Write the data from smallest to largest and choose the value that is much greater or smaller than the rest of the data set.

Apply the idea

Heights in order: 135,\,136,\,137,\,138,\,139,\,140,\,142,\,143,\,144,\,145,\,146,\,148,\,149,\,150,\,153,\,189

The outlier is 189.
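
The strategy above (sort the data, then look for a value that sits far from its neighbors) can be sketched in Python. This is only an illustration: the helper name `largest_gap` is hypothetical, and "widest gap between neighbors" is an informal check, not a formal rule for deciding what counts as an outlier.

```python
def largest_gap(values):
    """Return (gap, lower, upper) for the widest gap between sorted neighbors."""
    ordered = sorted(values)
    # Pair each value with the next one and measure the distance between them.
    gaps = [(b - a, a, b) for a, b in zip(ordered, ordered[1:])]
    return max(gaps)


# Heights from Example 2, in centimeters.
heights = [148, 153, 137, 142, 140, 146, 136, 143,
           135, 144, 189, 138, 149, 139, 145, 150]

gap, lower, upper = largest_gap(heights)
print(f"Widest gap: {gap} cm, between {lower} and {upper}")
```

For these heights the widest gap is the 36 cm between 153 and 189, which agrees with picking 189 as the outlier.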

Idea summary

An outlier is a data point that is either significantly larger or smaller than other observations.

Effect of outliers

Once we identify an outlier, we should investigate its underlying cause. If the outlier is a mistake, it should be removed from the data. If it is not a mistake, it should stay in the data set: although it is unusual, it is representative of possible outcomes.

For example, you would not remove a very tall student's height from data for a class just because it was unusual for the class.

When data contains an outlier we should be aware of its impact on any calculations we make. Let's look at the effect that outliers have on the three measures of center - mean, median and mode:

The mean will be significantly affected when an outlier is included:

  • Including a high outlier will increase the mean.

  • Including a low outlier will decrease the mean.

The median is not usually significantly affected when an outlier is included, unless there is a large gap in the center of the data.

  • Including a high outlier may increase the median slightly or it may remain unchanged.

  • Including a low outlier may decrease the median slightly or it may remain unchanged.

The mode is the most frequent value. Since an outlier is an unusual value, it will not be frequent. Therefore, including an outlier will have no effect on the mode.
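
As a rough check of these claims, the sketch below reuses the data set from Example 1 and compares the mean and median with and without the outlier 111. It relies on Python's standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median

data = [63, 67, 71, 76, 111]          # data set from Example 1
without_outlier = [63, 67, 71, 76]    # the same data with 111 removed

# Compare each measure with and without the high outlier.
print("Mean:  ", mean(data), "vs", mean(without_outlier))
print("Median:", median(data), "vs", median(without_outlier))
```

Including 111 lifts the mean from 69.25 to 77.6, while the median moves only from 69 to 71.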

Examples

Example 3

A set of data has a mean of x. After the outlier is removed, the mean rises. The outlier must have had:

A
A value, but we cannot tell if it was larger or smaller.
B
A value smaller than the values that remain.
C
A value larger than the values that remain.
Worked Solution
Apply the idea

The correct option is B. Removing a value raises the mean only if that value is smaller than the mean, so the outlier must have been smaller than the values that remain.
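
A small numerical check can make this reasoning concrete. The two data sets below are hypothetical, chosen only for illustration, and the helper `mean_before_and_after` is not part of the lesson.

```python
from statistics import mean

# Hypothetical data, chosen only for illustration.
low_case = [2, 50, 52, 54, 57]      # 2 is a low outlier
high_case = [50, 52, 54, 57, 102]   # 102 is a high outlier


def mean_before_and_after(values, outlier):
    """Return (mean with the outlier, mean after removing one copy of it)."""
    remaining = list(values)
    remaining.remove(outlier)
    return mean(values), mean(remaining)


print(mean_before_and_after(low_case, 2))     # mean rises after removal
print(mean_before_and_after(high_case, 102))  # mean falls after removal
```

Removing the low outlier raises the mean, and removing the high outlier lowers it, which matches option B.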

Example 4

Consider the following set of data: 37,\,46,\,35,\,56,\,56,\,35,\,125,\,36,\,48,\,56

a

Find the mean, median and mode.

Worked Solution
Create a strategy

To find the mean, use the formula: \text{mean}=\dfrac{\text{sum of values}}{\text{number of values}}

To find the median, find the middle value. To find the mode, find the most frequent value.

Apply the idea
\begin{aligned}
\text{mean} &= \dfrac{35+35+36+37+46+48+56+56+56+125}{10} && \text{Substitute all the values} \\
&= \dfrac{530}{10} && \text{Evaluate the addition} \\
&= 53 && \text{Evaluate the division}
\end{aligned}

To find the median, order the values: 35, \,35, \,36, \,37, \,46, \,48, \,56, \,56, \,56, \,125

The middle values are: 46,\,48

\begin{aligned}
\text{median} &= \dfrac{46+48}{2} && \text{Find the average of the middle values} \\
&= 47 && \text{Evaluate}
\end{aligned}

To find the mode, choose the value which occurs most often.

\text{mode}=56

b

Which data value is an outlier?

Worked Solution
Create a strategy

Choose the value that is much greater or much smaller than the rest of the data set.

Apply the idea

\text{Outlier}=125

c

Find the mean, median and mode after removing the outlier.

Worked Solution
Apply the idea
\begin{aligned}
\text{mean} &= \dfrac{35+35+36+37+46+48+56+56+56}{9} && \text{Substitute all the values} \\
&= \dfrac{405}{9} && \text{Evaluate the addition} \\
&= 45 && \text{Evaluate the division}
\end{aligned}

To find the median, order the values: 35, \,35, \,36, \,37, \,46, \,48, \,56, \,56, \,56

The middle value is: 46

\text{median} = 46

To find the mode, choose the value which occurs most often.

\text{mode}=56

d

Let A be the original data set and B be the data set without the outlier.

Complete the table using the symbols >,< and = to compare the statistics before and after removing the outlier.

           With outlier          Without outlier
Mean:           A        ___           B
Median:         A        ___           B
Mode:           A        ___           B

Worked Solution
Create a strategy

Compare the statistics in part (a) and in part (c).

Apply the idea

Statistics from part (a) with the outlier:

Mean: 53
Median: 47
Mode: 56

Statistics in part (c) without the outlier:

Mean: 45
Median: 46
Mode: 56

Comparison table:

           With outlier          Without outlier
Mean:           A         >            B
Median:         A         >            B
Mode:           A         =            B

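
The calculations in Example 4 can be checked with Python's standard `statistics` module. This is a minimal sketch: the names `data_a` and `data_b` simply mirror the labels A and B used in part (d).

```python
from statistics import mean, median, mode

data_a = [37, 46, 35, 56, 56, 35, 125, 36, 48, 56]   # original data set
data_b = [x for x in data_a if x != 125]             # outlier removed

# Print mean, median and mode for each version of the data.
for label, data in (("With outlier", data_a), ("Without outlier", data_b)):
    print(label, "mean:", mean(data), "median:", median(data), "mode:", mode(data))
```

The results agree with the worked solution: 53, 47 and 56 with the outlier, and 45, 46 and 56 without it.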
Idea summary

The mean will be significantly affected when an outlier is included:

  • Including a high outlier will increase the mean.

  • Including a low outlier will decrease the mean.

The median is not usually significantly affected when an outlier is included.

The mode is the most frequent value. Since an outlier is an unusual value, it will not be frequent. Therefore, including an outlier will have no effect on the mode.

Best choice of center

Recall that we can use the mean, median or mode to describe the center of a data set.

Sometimes one measure may better represent the data than another. When deciding which to use we need to ask ourselves "Which measure would best represent the type of data we have?"

The following summarizes when each choice of center is most appropriate. It's a good idea to verify this using different data sets.

Measure    Best choice when...
Mean       There are no extreme values in the data set.
Median     There are extreme values in the data set. There are no big gaps in the middle of the data.
Mode       The data set has repeated numbers.

Examples

Example 5

The salaries of part-time employees at a company are given in the dot plot below. Which measure of center best reflects the typical wage of a part-time employee?

A line plot titled Salaries in thousand dollars, ranging from 18 to 38 in steps of 1. Ask your teacher for more information.
Worked Solution
Create a strategy

Choose the measure that is appropriate for data with extreme values.

Apply the idea

The median best reflects the typical wage of a part-time employee, because it is less affected by extreme values than the mean.

Example 6

A journalist wanted to report on road speed cameras being used as revenue raisers. She obtained data that showed the number of times 20 speed cameras issued a fine to motorists in one month. The results were: 101,\,102,\,115,\,115,\,121,\,124,\,127,\,128,\,130,\,130,\,143,\,143,\,146,\,162,\,162,\,163,\,178,\,183,\,194,\,977

a

Determine the mean number of times a speed camera issued a fine in that month. Give your answer correct to one decimal place.

Worked Solution
Create a strategy

To find the mean, use the formula: \text{mean}=\dfrac{\text{sum of values}}{\text{number of values}}

Apply the idea

Add up the number of fines issued by each speed camera and divide by the total number of cameras:

\begin{aligned}
\text{mean} &= \dfrac{3644}{20} && \text{Find the sum of the values} \\
&= 182.2 && \text{Divide and round your answer}
\end{aligned}

b

Determine the median number of times a speed camera issued a fine in that month. Give your answer correct to one decimal place.

Worked Solution
Create a strategy

The median in a data set with an even number of values is the average of the two middle data values in the set.

Apply the idea

Values below the middle pair: 101,\,102,\,115,\,115,\,121,\,124,\,127,\,128,\,130

Values above the middle pair: 143,\,146,\,162,\,162,\,163,\,178,\,183,\,194,\,977

The two middle values of the set: 130,\,143

\begin{aligned}
\text{median} &= \dfrac{130+143}{2} && \text{Find the average of the middle values} \\
&= \dfrac{273}{2} && \text{Evaluate the addition} \\
&= 136.5 && \text{Evaluate the division}
\end{aligned}

c

Which measure is most representative of the number of fines issued by each speed camera in one month?

Worked Solution
Create a strategy

Use the measure that is less affected by outliers.

Apply the idea

The correct measure is the median, because it is less affected by the outlier than the mean.

d

Which value causes the mean to be much greater than the median?

Worked Solution
Create a strategy

Choose the value which is really far away from the other values.

Apply the idea

\text{Outlier} = 977
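
To see how strongly this one value pulls the mean, the sketch below recomputes the mean and median from parts (a) and (b) with and without 977, using Python's standard `statistics` module. The variable names are illustrative only. Removing 977 here is just for comparison, not a claim that the value is a mistake.

```python
from statistics import mean, median

# Number of fines issued by each of the 20 speed cameras.
fines = [101, 102, 115, 115, 121, 124, 127, 128, 130, 130,
         143, 143, 146, 162, 162, 163, 178, 183, 194, 977]
without_977 = [x for x in fines if x != 977]

# Mean (rounded to one decimal place) and median for each version.
print("With 977:   ", round(mean(fines), 1), median(fines))
print("Without 977:", round(mean(without_977), 1), median(without_977))
```

Without 977 the mean falls from 182.2 to about 140.4, while the median moves only from 136.5 to 130.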

e

The journalist wants to give the impression that speed cameras are just being used to raise revenue. Which statement should she make?

A
A sample of 20 speed cameras found that the median number of fines in one month was 136.5.
B
A sample of 20 speed cameras found that, on average, 182.2 fines were issued in one month.
Worked Solution
Create a strategy

Choose the option which uses the larger number out of the mean and median.

Apply the idea

The correct option is B: A sample of 20 speed cameras found that, on average, 182.2 fines were issued in one month.

Idea summary

Measure    Best choice when...
Mean       There are no extreme values in the data set.
Median     There are extreme values in the data set. There are no big gaps in the middle of the data.
Mode       The data set has repeated numbers.

Outcomes

6.SP.B.5

Summarize numerical data sets in relation to their context, such as by:

6.SP.B.5.C

Giving quantitative measures of center (median and/or mean) and variability (interquartile range and/or mean absolute deviation), as well as describing any overall pattern and any striking deviations from the overall pattern with reference to the context in which the data was gathered.

6.SP.B.5.D

Relating the choice of measures of center and variability to the shape of the data distribution and the context in which the data was gathered
