Outliers

In statistics, we tend to assume that our data will fit some kind of trend and that most things will fit into a "normal" range. This is why we look at measures of center, such as the mean, median and mode.

A measure of center is a way to describe where the center of a set of data is. However, not all measures describe the center in the same way and some measures are heavily affected by extreme data values, or outliers.

Outlier

A data value that is an abnormal distance from the other data values in the set (much larger or much smaller)

Exploration

Drag the blue point to a position that is either much larger or smaller than other data points.

  1. What do you notice about the mean as you move the position of the blue point to be much larger than the data?

  2. What do you notice about the mean as you move the position of the blue point to be much smaller than the data?

  3. What do you notice about the median as you move the position of the blue point to be much larger than the data?

  4. What do you notice about the median as you move the position of the blue point to be much smaller than the data?

Outliers are data points that lie far outside the majority of a data set and can significantly affect the measures of center (mean, median, and mode) as well as the range.

  • The mean is most affected by outliers. Extreme data values cause the mean to increase or decrease significantly.
  • The median is less affected by outliers because it depends on which value sits in the middle of the ordered data. Adding or removing a value can shift the middle position, but how extreme that value is does not matter.
  • The mode is least affected by an outlier. An outlier is far away from the rest of the data, so it is unlikely to change the mode, which is the value that appears most often in the set.
  • The range is extremely affected by outliers because an outlier greatly increases the distance between the largest and smallest data value.
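
To see these four effects side by side, here is a minimal Python sketch. It assumes a small illustrative data set (18, 19, 17, 18, 18, 16, with 5 as a low outlier) and uses a helper named `summarize`, which is our own name for this example rather than a built-in function.

```python
from statistics import mean, median, mode

def summarize(values):
    """Return the mean, median, mode, and range of a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "range": max(values) - min(values),
    }

# Illustrative data: six typical values, then the same set with a low outlier (5).
typical = [18, 19, 17, 18, 18, 16]
with_outlier = typical + [5]

without_stats = summarize(typical)
with_stats = summarize(with_outlier)

# Print each measure without and with the outlier, rounded to 2 decimal places.
for name in ("mean", "median", "mode", "range"):
    print(f"{name:>6}: without outlier = {without_stats[name]:.2f}, "
          f"with outlier = {with_stats[name]:.2f}")
```

In this sketch the mean and the range change noticeably when 5 is included, while the median and the mode both stay at 18.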

Often, analysts choose to remove outliers to better understand the trends in the majority of the data, providing a more accurate picture of the patterns within a data set.

Outliers are important to identify because they point to unusual pieces of data that may require further investigation. For example, if we had data on the temperature in a volcanic lake and we had a high outlier, it would be worth investigating whether the sensor is faulty or whether we need to prepare a nearby town for evacuation.

The following data sets show some examples of outliers:

The image shows a set of numbers: 18, 19, 17, 18, 18, 16, 5.
5 is the outlier
A stem and leaf plot of the list of data points listed above: 8, 9, 7, 8, 8, 6, 15.
15 is the outlier
A dot plot ranging from 47 to 59 in steps of 1. The number of dots is as follows: at 47, 1; at 53, 2; at 54, 3; at 56, 1; at 57, 4; at 58, 2; at 59, 1.
47 is the outlier
A line graph with the following points: (1, 8), (2, 9), (3, 7), (4, 8), (5, 8), (6, 6), (7, 15).
15 is the outlier

Examples

Example 1

Identify the outlier in the data set:

63,\, 67,\, 71,\, 76,\, 111

Worked Solution
Create a strategy

Identify the value that is much greater or much smaller than the other values.

Apply the idea

The outlier is 111.
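
One way to make "much greater or much smaller" concrete is to look at the gaps between neighboring values once the data is ordered. The sketch below is only an illustration of that idea, not a formal outlier test: the helper `most_isolated_end` is a hypothetical name, it assumes at least three values, and it always returns one of the two end values even when a data set has no outlier.

```python
def most_isolated_end(values):
    """Return the smallest or largest value, whichever is farther from its neighbor.

    Hypothetical helper for illustration; it is not a formal outlier test.
    """
    ordered = sorted(values)
    low_gap = ordered[1] - ordered[0]      # distance from the smallest value to the next one
    high_gap = ordered[-1] - ordered[-2]   # distance from the largest value to the one before it
    return ordered[0] if low_gap > high_gap else ordered[-1]

data = [63, 67, 71, 76, 111]

# Gaps between consecutive values in the ordered data.
ordered = sorted(data)
gaps = [b - a for a, b in zip(ordered, ordered[1:])]

print("Gaps between neighbors:", gaps)
print("Most isolated end value:", most_isolated_end(data))
```

For this data set the gaps are 4, 4, 5, and 35, so the value across the largest gap, 111, stands apart from the rest.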

Example 2

The stem and leaf plot shows the number of hours worked per week by a group of people. Identify the outlier(s).

A stem and leaf plot. The left column is titled Stem, with numbers 0 through 7, and right column titled Leaf. Ask your teacher for more information.
Worked Solution
Create a strategy

Choose the value that is much greater or much smaller than the rest of the data set.

Apply the idea

Most of the values in the data set are located between 52 and 78, except for the value of 7.

\text{Outlier} = 7

Most people work between 52 and 78 hours per week, but one person only worked 7 hours each week, which is the outlier of the data set.

Example 3

Sarah celebrated her 13th birthday at a bowling alley. She invited 20 friends, and they played a game of bowling. The scores for the game were:

12,\, 17,\, 23,\, 31,\, 35,\, 42,\, 45,\, 49,\, 49,\, 49,\, 49,\, 53,\, 56,\, 65,\, 69,\, 75,\, 75,\, 83,\, 83,\, 300

This image shows a dot plot of the bowling game scores. Ask your teacher for more information.
a

State the mean and median of the data.

Worked Solution
Create a strategy

To find the mean, use the formula: \text{Mean}=\dfrac{\text{Sum of scores}}{\text{Number of scores}}

To find the median, find the middle score.

Apply the idea
\text{Mean} = \dfrac{1260}{20}   (Add all the scores and divide by the total number of scores, 20)

\text{Mean} = 63   (Evaluate the division)

Since the number of scores is even, the median is the average of the two middle scores. Both middle scores are 49, so:

\text{Median }= 49

b

Identify the outlier.

Worked Solution
Create a strategy

Identify the game score that is much greater or smaller than most of the scores.

Apply the idea

We can see that the dot for 300 is far away from the rest of the dots.

\text{Outlier }=300

c

State the mean and median of the data without the outlier.

Worked Solution
Create a strategy

To find the mean, use the formula: \text{Mean}=\dfrac{\text{Sum of scores}}{\text{Number of scores}}

To find the median, find the middle score.

Apply the idea
\text{Mean} = \dfrac{960}{19}   (Add all the scores and divide by the total number of scores, 19)

\text{Mean} \approx 50.53   (Evaluate the division)

Since the number of scores is odd, the median is the single middle score, which is 49.

\text{Median }= 49

Reflect and check

Notice that the mean changed from 63 to about 50.53 while the median stayed the same at 49. This is because the mean is strongly affected by outliers while the median is much less affected.
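
The calculations in this example can be checked with a short Python sketch. It uses the standard `statistics` module; the variable names are our own choices for illustration.

```python
from statistics import mean, median

# Bowling scores from the example, including the outlier 300.
scores = [12, 17, 23, 31, 35, 42, 45, 49, 49, 49, 49,
          53, 56, 65, 69, 75, 75, 83, 83, 300]

outlier = 300
scores_without = [s for s in scores if s != outlier]

mean_with = mean(scores)
median_with = median(scores)
mean_without = mean(scores_without)
median_without = median(scores_without)

print(f"With outlier:    mean = {mean_with:.2f}, median = {median_with:.2f}")
print(f"Without outlier: mean = {mean_without:.2f}, median = {median_without:.2f}")
```

The mean drops by more than 12 points when the score of 300 is left out, but the median does not move.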

Example 4

The data set 6,\,8,\,10,\,10,\,12 has measures of:

  • Mean =9.2

  • Median =10

  • Mode =10

  • Range =6

Suppose we add the number 20 to the data set. Predict how the addition of this outlier will affect the mean, median, mode, and range of the new data set.

a

Will the mean be higher, lower, or remain the same? Explain.

Worked Solution
Create a strategy

The mean is the average of all the data values. Values significantly higher or lower than the mean will cause the mean to shift in the direction of that high or low value.

Apply the idea

The mean will be higher.

The mean is calculated as the sum of all values divided by the number of values. Adding a number as high as 20 significantly increases the sum of the data set (numerator) while only increasing the number of values (denominator) by 1. Since 20 is much higher than the original mean, 9.2, the new mean must be higher.

Reflect and check

Let's check our prediction by calculating the new mean.

To find the mean, use the formula: \text{Mean}=\dfrac{\text{Sum of Values}}{\text{Number of Values}}

\text{Mean} = \dfrac{6+8+10+10+12+20}{6}   (The sum of all values divided by the total number of values)

\text{Mean} = \dfrac{66}{6}   (Evaluate the addition)

\text{Mean} = 11   (Evaluate the division)

The new mean, 11, is higher than the original mean, 9.2.

b

Will the median be higher, lower, or remain the same? Explain.

Worked Solution
Create a strategy

The median of a data set is the middle value when the data is organized from least to greatest. Adding a value could cause a shift in the location of the middle of the data.

Apply the idea

Originally, the median was 10, which is the middle value when the numbers are ordered.

After adding the number 20, the ordered data set becomes 6,\,8,\,10,\,10,\,12,\,20. The median will now fall between the two middle values, 10 and 10.

The median remains the same.

Reflect and check

Let's check our prediction by calculating the new median.

First, we list the data set with the new value added. 6,\,8,\,10,\,10,\,12,\,20

There are now 6 values in the data set, so the median will fall between the 3rd and 4th terms.

The 3rd and 4th terms are both 10. To find the average of these values, we find their sum and divide by two.

\dfrac{10+10}{2}=\dfrac{20}{2}=10

The median remains the same at 10.

c

Will the mode be higher, lower, or remain the same? Explain.

Worked Solution
Create a strategy

The mode is the most frequently occurring value in the data set.

Apply the idea

The original mode is 10, which appears twice. The added number 20 was not in the original data set, so it appears only once in the new set and cannot appear more frequently than 10.

The mode will remain the same.

d

Will the range be higher, lower, or remain the same? Explain.

Worked Solution
Create a strategy

The range is the difference between the largest and smallest values in the data set.

Apply the idea

Originally, the data went from 6 to 12, which is a range of 6 units. By adding 20, the data will go from 6 to 20, which covers a larger spread of values than the original set.

The range will be higher.

Reflect and check

Let's check our prediction by calculating the new range.

The range is calculated by finding the difference between the highest and lowest values:

20-6=14

The new range of 14 is higher than the original range of 6.
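
All four predictions in this example can also be checked together. The following Python sketch assumes the same data set and uses a small helper, `data_range`, that we define here for illustration.

```python
from statistics import mean, median, mode

def data_range(values):
    """Range: the difference between the largest and smallest values."""
    return max(values) - min(values)

original = [6, 8, 10, 10, 12]
extended = original + [20]   # the same data with the outlier 20 added

measures = {"mean": mean, "median": median, "mode": mode, "range": data_range}

# Evaluate every measure on both versions of the data set.
before = {name: func(original) for name, func in measures.items()}
after = {name: func(extended) for name, func in measures.items()}

for name in measures:
    print(f"{name:>6}: before = {before[name]}, after = {after[name]}")
```

Only the mean and the range change when 20 is added; the median and the mode both stay at 10.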

Idea summary

An outlier is a data point that varies significantly from the rest of the data. An outlier will be a value that is either significantly larger or smaller than other observations.

Removing outliers will have the following effects on the summary statistics:

Removing a really low outlier:

  • The range will decrease
  • The median might increase
  • The mean will increase
  • The mode will not change

Removing a really high outlier:

  • The range will decrease
  • The median might decrease
  • The mean will decrease
  • The mode will not change
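
The effects summarized above can be explored with a short Python sketch. The helpers `direction` and `removal_effects` are hypothetical names chosen for this example, and the two data sets are illustrative: one has a low outlier (47) and one has a high outlier (15).

```python
from statistics import mean, median, mode

def direction(before, after):
    """Describe how a measure changed."""
    if after > before:
        return "increase"
    if after < before:
        return "decrease"
    return "no change"

def removal_effects(values, outlier):
    """Compare each measure before and after removing one copy of the outlier."""
    remaining = list(values)
    remaining.remove(outlier)
    measures = {
        "range": lambda v: max(v) - min(v),
        "median": median,
        "mean": mean,
        "mode": mode,
    }
    return {name: direction(func(values), func(remaining))
            for name, func in measures.items()}

low_case = [47, 53, 53, 54, 54, 54, 56, 57, 57, 57, 57, 58, 58, 59]
high_case = [8, 9, 7, 8, 8, 6, 15]

low_effects = removal_effects(low_case, 47)
high_effects = removal_effects(high_case, 15)

print("Removing the low outlier 47: ", low_effects)
print("Removing the high outlier 15:", high_effects)
```

In the second data set the median does not change when 15 is removed, which is why the summary says the median "might" increase or decrease rather than "will".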

Outcomes

6.PS.2

The student will represent the mean as a balance point and determine the effect on statistical measures when a data point is added, removed, or changed.

6.PS.2c

Observe patterns in data to identify outliers and determine their effect on mean, median, mode, or range.
