Similar to other data displays, we can describe the shape of a boxplot by saying it is symmetrical or skewed.

Recall that each quartile represents 25% of the data, regardless of its shape. This means a longer quartile has the same number of data values as a shorter quartile. However, the data is more spread out in the longer quartiles.

It is easy to assume that the longer section of the boxplot in a skewed data set contains more data. But always remember that each quartile contains 25% of the data no matter its size. A stretched quartile simply has data points that are more spread out, and a narrower quartile has data points that are closer together.
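
The idea that every quarter holds the same number of values can be checked with a minimal Python sketch. The data set here is hypothetical (it does not come from this lesson) and is chosen so that it splits evenly into four groups of three.

```python
# Hypothetical, positively skewed data set (an assumption for illustration).
data = sorted([1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 22, 30])

# Split the sorted data into four equal-sized groups, one per quarter.
size = len(data) // 4
quarters = [data[i * size:(i + 1) * size] for i in range(4)]

# Count the values in each quarter and measure how spread out they are.
counts = [len(q) for q in quarters]
spreads = [q[-1] - q[0] for q in quarters]

for q, count, spread in zip(quarters, counts, spreads):
    print(f"values={q}  count={count}  spread={spread}")
```

Each quarter contains three values, but the gap between the smallest and largest value in the last quarter is much larger than in the first. The longer section is more spread out, not more crowded.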

The stem-and-leaf plot displays the scores of students in a class on an exam.

Stem | Leaf |
---|---|
6 | 7 7 9 |
7 | 0 0 2 3 4 5 5 |
8 | 0 1 3 3 5 |

Key: 6 | 1 = 61

a

Construct the five-number summary.

Worked Solution
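
One way to check this part is the short Python sketch below. It assumes the median-of-halves convention for quartiles (the median itself is left out of both halves when the number of values is odd). The helper name `five_number_summary` is hypothetical, and other quartile conventions can give slightly different values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # values below the middle position
    upper = data[half + n % 2:]  # skips the median itself when n is odd
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Exam scores read from the stem-and-leaf plot (key: 6 | 1 = 61).
scores = [67, 67, 69, 70, 70, 72, 73, 74, 75, 75, 80, 81, 83, 83, 85]

summary = five_number_summary(scores)
print(summary)  # (67, 70, 74, 81, 85)
```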

b

Construct a boxplot for the data.

Worked Solution

c

Describe the shape of the boxplot.

Worked Solution

Idea summary

We can describe the shape based on the distribution of the data set.

Symmetrical boxplots are symmetrical about the median.

Uniform boxplots have all quartiles the same width.

Positively skewed boxplots have a longer tail toward the higher values, with most of the data clustered at the lower values.

Negatively skewed boxplots have a longer tail toward the lower values, with most of the data clustered at the higher values.

An **outlier** is a data point that varies significantly from the rest of the data. An outlier will be a value that is either significantly larger or smaller than other observations.

Outliers are important to identify as they point to unusual bits of data that may require further investigation. For example, if we had data on the temperature in a volcanic lake and we had a high outlier, it is worth investigating if the sensor is faulty or if we need to prepare a nearby town for evacuation.

There are formal ways to determine if a data point is an outlier, but for now we will only look at data points that are obviously larger or smaller than the rest and how they affect the measures of center and spread.

Drag point P to various positions to explore how the point may skew the data.

Move the point to the position of a really low outlier, then move the point closer to the data. Complete the sentences in the table that follows:

Removing a really low outlier |
---|
The range will ⬚. |
The median might ⬚. |
The mean will ⬚. |
The mode will ⬚. |

Move the point to the position of a really high outlier, then move the point closer to the data. Complete the sentences in the table that follows:

Removing a really high outlier |
---|
The range will ⬚. |
The median might ⬚. |
The mean will ⬚. |
The mode will ⬚. |

Outliers can skew or change the shape of our data. This can be a problem (especially for small data sets) because the mean, median and range might not properly represent the situation. We can counteract this by removing outliers.

Removing outliers will have the following effects:

Removing a really low outlier | Removing a really high outlier |
---|---|
The range will decrease. | The range will decrease. |
The median might increase. | The median might decrease. |
The mean will increase. | The mean will decrease. |
The mode will not change. | The mode will not change. |
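
The effects of removing a really low outlier can be seen in a small Python sketch. The data set is hypothetical (an assumption for illustration), with 2 playing the role of the low outlier, and `describe` is a helper name invented here.

```python
from statistics import mean, median, mode

def describe(values):
    """Collect the four summary statistics discussed above."""
    return {
        "range": max(values) - min(values),
        "median": median(values),
        "mean": mean(values),
        "mode": mode(values),
    }

# Hypothetical data set: 2 is far below the other values.
with_outlier = [2, 20, 21, 21, 22, 23, 24]
without_outlier = [v for v in with_outlier if v != 2]

before = describe(with_outlier)
after = describe(without_outlier)

for name in before:
    print(f"{name}: {before[name]} -> {after[name]}")
```

In this example the range drops from 22 to 4, the median rises from 21 to 21.5, the mean rises from 19 to about 21.8, and the mode stays at 21.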

Keep in mind:

- When describing skewed distributions, it's better to use the median and interquartile range because they are less impacted by outliers.
- When describing symmetrical or uniform distributions, it's better to use mean and range because they take the values of all data points into account.

The numbers of fatal accidents from 2000 to 2014 for different airlines are listed in the set and displayed in the boxplot:

{0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 10, 11, 11, 12, 15, 24}

a

Identify and interpret the range of the data set.

Worked Solution

b

Identify and interpret the IQR of the data set.

Worked Solution

c

Explain what will happen to the range and IQR if the outlier at 24 is removed.

Worked Solution
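
The range and IQR for this data set, with and without the value 24, can be compared using the Python sketch below. It assumes the median-of-halves convention for quartiles, and `range_and_iqr` is a hypothetical helper name. A different quartile convention may give a slightly different IQR.

```python
from statistics import median

def range_and_iqr(values):
    """Return (range, IQR) using the median-of-halves convention for quartiles."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])          # median of the lower half
    q3 = median(data[half + n % 2:])  # median of the upper half
    return (data[-1] - data[0], q3 - q1)

# Fatal accidents per airline, 2000 to 2014, as listed in the question.
accidents = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
             4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 10, 11, 11, 12, 15, 24]

# Drop the outlier at 24, as described in part (c).
without_outlier = [v for v in accidents if v != 24]

print(range_and_iqr(accidents))        # (24, 4.5)
print(range_and_iqr(without_outlier))  # (15, 4)
```

Under this convention the range falls by 9 when the outlier is removed, while the IQR changes by only 0.5.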

Yaretzi works at a coffee shop and tracks the number of customers that come in each day. The data she collected is shown:

90, 85, 88, 86, 95, 101, 98, 84, 35, 82, 87, 90, 92, 97

a

Formulate a question that could be answered using a boxplot.

Worked Solution

b

Describe the data collection method that Yaretzi used.

Worked Solution

c

Construct a boxplot using the data points Yaretzi collected.

Worked Solution

d

Answer the formulated question from part (a) using the boxplot and explain whether the answer is reasonable.

Worked Solution

e

Construct a second boxplot after the outlier has been removed, but mark the outlier as a point. Compare the shape of the new boxplot with the first boxplot.

Worked Solution
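
To compare the two boxplots, it can help to compute the five-number summary before and after removing the outlier. The Python sketch below treats 35 as the outlier (an assumption, since it is far below every other value) and uses the median-of-halves convention for quartiles. The helper name `five_number_summary` is hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]          # values below the middle position
    upper = data[half + n % 2:]  # skips the median itself when n is odd
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Daily customer counts collected at the coffee shop.
customers = [90, 85, 88, 86, 95, 101, 98, 84, 35, 82, 87, 90, 92, 97]

# Assumption: 35 is the outlier to be removed.
without_outlier = [v for v in customers if v != 35]

print(five_number_summary(customers))        # (35, 85, 89.0, 95, 101)
print(five_number_summary(without_outlier))  # (82, 85.5, 90, 96.0, 101)
```

With the outlier included, the lower whisker stretches from 85 down to 35. Once it is removed, the minimum moves up to 82, while the quartiles and median shift only slightly.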

f

Answer the formulated question from part (a) using the boxplot that represents the data set after the outlier was removed.

Worked Solution

Idea summary

An **outlier** is a data point that varies significantly from the body of the data. An outlier will be a value that is either significantly larger or smaller than other observations.

Removing outliers will have the following effects on the summary statistics:

A really low outlier | A really high outlier |
---|---|
The range will decrease | The range will decrease |
The median might increase | The median might decrease |
The mean will increase | The mean will decrease |
The mode will not change | The mode will not change |

The IQR is resistant to outliers because it describes the middle half of the data, not the extremes.