Univariate Data

NZ Level 5

Effects of Outliers

Lesson

In statistics, we tend to assume that our data will fit some kind of trend and that most things will fit into a "normal" range. This is why we look at measures of central tendency, such as the mean, median and mode, and talk about the normal distribution.

An outlier is a data value that is very different from the rest of the data set, lying well above or below the typical values. For example, if four of the five people in a group are between $120$ cm and $130$ cm tall, whereas Jim is $165$ cm tall, Jim would be an outlier as he is *much* taller than everyone else in the group.

Outliers can skew or change the shape of our data. When an outlier is included in a data set:

- The range will increase significantly.
- The median will change a little bit.
- The mean will change significantly.
- The mode will stay the same, because an outlier will never be the most frequent occurrence.

If we have... | |
---|---|
A really low outlier | A really high outlier |
The range increases significantly | The range increases significantly |
The median decreases a little bit | The median increases a little bit |
The mean decreases significantly | The mean increases significantly |
The mode will not change | The mode will not change |
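The pattern in the table can be checked with a short Python sketch. The `base` data set, the two outlier values and the helper name `summary` are all hypothetical, chosen only for illustration; they do not come from the lesson.

```python
from statistics import mean, median, mode

def summary(data):
    """Return (mean, median, mode, range) for a list of numbers."""
    return mean(data), median(data), mode(data), max(data) - min(data)

# Hypothetical data set with no outlier (an assumption, not lesson data).
base = [20, 21, 23, 25, 25]

low = base + [2]     # the same data with a really low outlier added
high = base + [60]   # the same data with a really high outlier added

for label, data in [("base", base), ("low outlier", low), ("high outlier", high)]:
    m, med, mo, r = summary(data)
    print(f"{label:>12}: mean={m:.1f} median={med} mode={mo} range={r}")
```

With these particular values the mean and range move a lot, the median moves by one, and the mode does not move at all. Other data sets may behave a little differently, for example the median may not change at all.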

Consider the following set of data:

$36$, $36$, $70$, $45$, $52$, $48$, $36$, $43$, $44$, $51$

**A)** Calculate the mean of the data set. Leave your answer to one decimal place if necessary.

**Think:** The mean is the average of the scores.

**Do:**

$$\begin{aligned}
\text{Mean} &= \frac{36+36+70+45+52+48+36+43+44+51}{10} \\
&= \frac{461}{10} \\
&= 46.1
\end{aligned}$$
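As a quick check of the arithmetic above, the same calculation can be sketched in Python. The variable names are illustrative only.

```python
# The ten scores from the worked example.
scores = [36, 36, 70, 45, 52, 48, 36, 43, 44, 51]

total = sum(scores)                   # add up all the scores
count = len(scores)                   # how many scores there are
mean_value = round(total / count, 1)  # mean, rounded to one decimal place

print(total, count, mean_value)  # 461 10 46.1
```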

**B)** Calculate the median of the data set. Leave your answer to one decimal place if necessary.

**Think:** The median is the middle score once the data set is ordered.

**Do:**

Let's put the data in ascending order first:

$36$, $36$, $36$, $43$, $44$, $45$, $48$, $51$, $52$, $70$

There are ten numbers in this data set, so the median will be the $5.5^{\text{th}}$ score. This is the average of the fifth and sixth numbers in the ordered set:

$\frac{44+45}{2}=44.5$
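The "order, then find the middle" procedure can be sketched in Python as follows. This is one straightforward way to write it, with illustrative variable names. It handles both an even and an odd number of scores.

```python
# The ten scores from the worked example.
scores = [36, 36, 70, 45, 52, 48, 36, 43, 44, 51]

ordered = sorted(scores)  # ascending order
n = len(ordered)

if n % 2 == 0:
    # Even number of scores: average the two middle scores.
    upper = n // 2
    median_value = (ordered[upper - 1] + ordered[upper]) / 2
else:
    # Odd number of scores: take the single middle score.
    median_value = ordered[n // 2]

print(ordered)
print(median_value)  # 44.5
```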

**C)** Calculate the mode of the data set.

**Think:** Which score occurs most frequently?

**Do:** $36$

**D)** Calculate the range of the data set.

**Think:** The range is the difference between the highest and lowest scores.

**Do:** $70-36=34$
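The mode and range found in parts C and D can be confirmed with a small Python sketch. It uses `collections.Counter` from the standard library to count how often each score occurs. Note that `most_common(1)` reports only one value, so this sketch assumes the data has a single mode, as it does here.

```python
from collections import Counter

# The ten scores from the worked example.
scores = [36, 36, 70, 45, 52, 48, 36, 43, 44, 51]

# Count how many times each score occurs and take the most frequent one.
counts = Counter(scores)
mode_value, frequency = counts.most_common(1)[0]

# The range is the highest score minus the lowest score.
range_value = max(scores) - min(scores)

print(mode_value, frequency, range_value)  # 36 3 34
```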

**E)** Remove the outlier from the data set and recalculate the mean. Leave your answer to one decimal place if necessary.

**Think:** Is the highest or the lowest score significantly different from the other scores?

**Do:**

We're going to remove $70$ from the data set because it is much higher than the other scores in the set.

Now let's calculate the mean. Remember that there are only nine scores in the set now:

$$\begin{aligned}
\text{Mean} &= \frac{36+36+45+52+48+36+43+44+51}{9} \\
&= 43.444\ldots \\
&\approx 43.4
\end{aligned}$$

**F)** Is this higher or lower than the mean that was calculated with the outlier?

**Think:** Let's compare the two means.

**Do:** The new mean, $43.4$, is lower than the mean that was calculated with the outlier, $46.1$.

**G)** With the outlier removed from the data set, recalculate the median.

**Think:** There are now nine scores, so the median will be the fifth score in the ordered set.

**Do:** $44$

**H)** Is this higher or lower than the median that was calculated with the outlier?

The median is lower than the median that included the outlier.

**I)** With the outlier removed from the data set, recalculate the mode.

**Think:** Which score occurs most frequently now?

**Do:** $36$

**J)** Is this higher or lower than the mode that was calculated with the outlier?

This is the same as the mode that was calculated with the outlier.

**K)** With the outlier removed from the data set, recalculate the range.

**Think:** Let's calculate the difference between the new highest and lowest scores.

**Do:** $52-36=16$

**L)** Is this greater or smaller than the range that was calculated with the outlier?

This range is smaller than the range that was calculated with the outlier.
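Parts A to L can be summarised in a single Python sketch that computes all four statistics with and without the outlier. The helper name `summarise` is hypothetical, and the functions `mean`, `median` and `mode` come from Python's standard `statistics` module.

```python
from statistics import mean, median, mode

def summarise(data):
    """Return the four summary statistics used in this lesson."""
    return {
        "mean": round(mean(data), 1),     # rounded to one decimal place
        "median": median(data),
        "mode": mode(data),
        "range": max(data) - min(data),
    }

with_outlier = [36, 36, 70, 45, 52, 48, 36, 43, 44, 51]
# Remove the outlier 70, as in part E.
without_outlier = [x for x in with_outlier if x != 70]

before = summarise(with_outlier)
after = summarise(without_outlier)

# Print each statistic with the outlier, then without it.
for name in before:
    print(f"{name:>6}: {before[name]} -> {after[name]}")
```

The output agrees with the worked answers: the mean drops from $46.1$ to $43.4$, the median from $44.5$ to $44$, the mode stays at $36$, and the range shrinks from $34$ to $16$.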

Consider the following set of data:

$53$, $46$, $25$, $50$, $30$, $30$, $40$, $30$, $47$, $109$

Fill in this table of summary statistics.

Statistic | Value |
---|---|
Mean | |
Median | |
Mode | |
Range | |

Which data value is an outlier?

Fill in this table of summary statistics after removing the outlier $109$.

Statistic | Value |
---|---|
Mean | |
Median | |
Mode | |
Range | |

Let $A$ be the original data set and $B$ be the data set without the outlier. Fill in this table using the symbols $>$, $<$ and $=$ to compare the statistics before and after removing the outlier.

Statistic | With outlier | | Without outlier |
---|---|---|---|
Mean | $A$ | | $B$ |
Median | $A$ | | $B$ |
Mode | $A$ | | $B$ |
Range | $A$ | | $B$ |

Plan and conduct surveys and experiments using the statistical enquiry cycle:

- determining appropriate variables and measures;
- considering sources of variation;
- gathering and cleaning data;
- using multiple displays, and re-categorising data to find patterns, variations, relationships, and trends in multivariate data sets;
- comparing sample distributions visually, using measures of centre, spread, and proportion;
- presenting a report of findings.