An outlier is a data point that varies significantly from the rest of the data. An outlier will be a value that is either significantly larger or smaller than other observations.
Consider the dot plot below. We would call 9 an outlier as it is well above the rest of the data.
Identify the outlier(s) in the data set: 63, 67, 71, 76, 111
Rochelle recorded the heights (in centimeters) of all the students in her class. She recorded the following: 148, 153, 137, 142, 140, 146, 136, 143, 135, 144, 189, 138, 149, 139, 145, 150
Identify the outlier.
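The lesson identifies outliers by inspection. As a rough cross-check, here is a minimal Python sketch that flags values with the 1.5 × IQR rule, a common convention that is not part of this lesson. The function name `flag_outliers`, the cutoff `k`, and the quartile method (the default of `statistics.quantiles`) are assumptions made for illustration.

```python
from statistics import quantiles

def flag_outliers(data, k=1.5):
    """Return values lying more than k * IQR outside the quartiles.

    The 1.5 * IQR cutoff is an assumed convention; the lesson itself
    identifies outliers by looking at the data.
    """
    q1, _, q3 = quantiles(data, n=4)  # quartiles, default "exclusive" method
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

# Rochelle's class heights in centimeters
heights = [148, 153, 137, 142, 140, 146, 136, 143,
           135, 144, 189, 138, 149, 139, 145, 150]
print(flag_outliers(heights))  # [189]
```

On very small data sets the quartiles are unstable, so a rule like this can disagree with what is visually obvious. Looking at the data remains the first step.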
An outlier is a data point that is either significantly larger or smaller than other observations.
Once we identify an outlier, we should investigate its underlying cause. If the outlier is a mistake, it should be removed from the data. If it is not a mistake, it should stay in the data set: although it is unusual, it is representative of possible outcomes.
For example, you would not remove a very tall student's height from data for a class just because it was unusual for the class.
When data contains an outlier we should be aware of its impact on any calculations we make. Let's look at the effect that outliers have on the three measures of center - mean, median and mode:
The mean will be significantly affected when an outlier is included:
Including a high outlier will increase the mean.
Including a low outlier will decrease the mean.
The median is not usually significantly affected when an outlier is included, unless there is a large gap in the center of the data.
Including a high outlier may increase the median slightly or it may remain unchanged.
Including a low outlier may decrease the median slightly or it may remain unchanged.
The mode is the most frequent value. Since an outlier is an unusual value, it will not be frequent. Therefore, including an outlier will have no effect on the mode.
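To see these effects on actual numbers, the sketch below reuses the class heights from earlier in this lesson and compares the mean and median with and without the 189 cm value. It uses Python's standard `statistics` module; the variable names are chosen only for illustration.

```python
from statistics import mean, median

# Class heights in centimeters, including the high outlier 189
heights = [148, 153, 137, 142, 140, 146, 136, 143,
           135, 144, 189, 138, 149, 139, 145, 150]
without_outlier = [h for h in heights if h != 189]

print(f"mean:   {mean(heights):.3f} -> {mean(without_outlier):.3f}")
# mean:   145.875 -> 143.000
print(f"median: {median(heights):.1f} -> {median(without_outlier):.1f}")
# median: 143.5 -> 143.0
```

Removing the high outlier lowers the mean by almost 3 cm, while the median moves by only 0.5 cm.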
A set of data has a mean of x. The outlier is removed and the mean rises. The outlier must have had:
Consider the following set of data: 37, 46, 35, 56, 56, 35, 125, 36, 48, 56
Find the mean, median and mode.
Which data value is an outlier?
Find the mean, median and mode after removing the outlier.
Let A be the original data set and B be the data set without the outlier.
Complete the table using the symbols >, < and = to compare the statistics before and after removing the outlier.
Statistic | With outlier | | Without outlier |
---|---|---|---|
Mean: | A | ⬚ | B |
Median: | A | ⬚ | B |
Mode: | A | ⬚ | B |
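The comparisons in the table can be checked with a short Python sketch. It assumes 125 is the outlier being removed, and the helper `compare` is a hypothetical name introduced here only to turn two numbers into one of the symbols >, < or =.

```python
from statistics import mean, median, mode

a = [37, 46, 35, 56, 56, 35, 125, 36, 48, 56]  # data set A (with outlier)
b = [x for x in a if x != 125]                 # data set B (outlier removed)

def compare(left, right):
    """Return the symbol that makes 'left ? right' a true statement."""
    if left > right:
        return ">"
    if left < right:
        return "<"
    return "="

# Print each statistic for A and B with the matching comparison symbol
for name, stat in [("Mean", mean), ("Median", median), ("Mode", mode)]:
    print(f"{name}: A = {stat(a)}, B = {stat(b)}, A {compare(stat(a), stat(b))} B")
```

Running it is a way to confirm answers worked out by hand, not a replacement for doing the calculation.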
Recall that we can use the mean, median or mode to describe the center of a data set.
Sometimes one measure may better represent the data than another. When deciding which to use we need to ask ourselves "Which measure would best represent the type of data we have?"
The following summarizes when each choice of center is most appropriate. It's a good idea to verify this using different data sets.
Measure | Best choice when... |
---|---|
Mean | There are no extreme values in the data set. |
Median | There are extreme values in the data set. There are no big gaps in the middle of the data. |
Mode | The data set has repeated numbers. |
The salaries of part-time employees at a company are given in the dot plot below. Which measure of center best reflects the typical wage of a part-time employee?
A journalist wanted to report on road speed cameras being used as revenue raisers. She obtained data that showed the number of times 20 speed cameras issued a fine to motorists in one month. The results were: 101, 102, 115, 115, 121, 124, 127, 128, 130, 130, 143, 143, 146, 162, 162, 163, 178, 183, 194, 977
Determine the mean number of times a speed camera issued a fine in that month. Give your answer correct to one decimal place.
Determine the median number of times a speed camera issued a fine in that month. Give your answer correct to one decimal place.
Which measure is most representative of the number of fines issued by each speed camera in one month?
Which value causes the mean to be much greater than the median?
The journalist wants to give the impression that speed cameras are just being used to raise revenue. Which statement should she make?
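The mean and median asked for above can be checked with Python's standard `statistics` module. This is only a sketch for verifying hand calculations; the variable names are illustrative.

```python
from statistics import mean, median

# Number of fines issued by each of the 20 speed cameras in one month
fines = [101, 102, 115, 115, 121, 124, 127, 128, 130, 130,
         143, 143, 146, 162, 162, 163, 178, 183, 194, 977]

print(round(mean(fines), 1))    # 182.2
print(round(median(fines), 1))  # 136.5

# Dropping the largest value shows how much it pulls the mean upward
rest = sorted(fines)[:-1]
print(round(mean(rest), 1))     # 140.4
```

A single camera with 977 fines lifts the mean by more than 40, while the median stays close to what most cameras recorded.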