Shape and spread

Similar to other data displays, we can describe the shape of a boxplot by saying it is symmetrical or skewed.

Recall that each quartile represents 25\% of the data, regardless of its shape. This means a longer quartile has the same number of data values as a shorter quartile. However, the data is more spread out in the longer quartiles.

Symmetrical
A symmetrical boxplot on a number line from 0 to 35.

A symmetrical data set is distributed around the center with a similar amount of data on the left and right side of the median.

Notice that the same number of data points are in each quartile. In this example, the data in the box is a little less spread out than in the whiskers, but is still symmetrical.

Uniform
A uniform boxplot on a number line from 0 to 35.

A uniform data set is evenly distributed across all values. In other words, all four quartiles appear to be the same size.

In this boxplot, notice that the whiskers and each part of the box are 7 units in length.

Negative (left) skew
A negatively skewed boxplot on a number line from 0 to 35.

A data set has a negative skew when the majority of the data points have higher values, with some data points at lower values. It is sometimes called a left skew because it looks stretched to the left.

Remember, each quartile still contains the same number of data points.

In this boxplot, notice that the top 50\% of the data is between the values 25 and 31. However, the bottom half of the data is much more spread out.

Positive (right) skew
A positively skewed boxplot on a number line from 0 to 35.

A data set has a positive skew when the majority of the data points have lower values, with some data points at higher values. It is sometimes called a right skew because it looks stretched to the right.

In this boxplot, notice that the bottom 50\% of the data is between the values 5 and 12. However, the top half of the data is much more spread out.

It is easy to assume that the longer section of the boxplot in a skewed data set contains more data. But always remember that each quartile contains 25\% of the data no matter its size. A stretched quartile simply has data points that are more spread out, and a narrower quartile has data points that are very close together.

Examples

Example 1

The stem-and-leaf plot displays the scores of students in a class on an exam.

Stem | Leaf
6 | 7 7 9
7 | 0 0 2 3 4 5 5
8 | 0 1 3 3 5

Key: 6 | 1 = 61

a

Construct the five-number summary.

Worked Solution
Create a strategy
  • To find the lower extreme, locate the smallest score.

  • To find the upper extreme, locate the highest score.

  • Calculate the median by finding the middle score if the number of scores is odd, or by averaging the two middle scores if even.

  • Determine the lower quartile by identifying the middle score of the values below the median.

  • Identify the upper quartile as the middle score of the values above the median.

Apply the idea

By listing the data, we can easily identify the quartiles:

A data set: 67, 67, 69, 70, 70, 72, 73, 74, 75, 75, 80, 81, 83, 83, 87. The lower extreme, lower quartile, median, upper quartile, and upper extreme are labeled.

There are 15 scores, so the median is the 8th score, 74. The 7 scores below the median have the 4th score as their middle value, so the lower quartile is 70. Likewise, the middle value of the 7 scores above the median is 81, so the upper quartile is 81.

Here's the five-number summary:

Minimum: 67
Lower quartile: 70
Median: 74
Upper quartile: 81
Maximum: 87
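
The steps in the strategy can also be checked with a short program. This is only a sketch: the function name `five_number_summary` is our own, and it assumes the quartiles are found as the middle of the values below and above the median, as in this lesson. Other tools may use slightly different quartile rules. It uses the listed scores from the worked solution.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum).

    Assumes at least two values. The quartiles are the medians of the
    values below and above the median position.
    """
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]               # values below the median position
    upper_half = data[len(data) - half:]   # values above the median position
    return (data[0], median(lower_half), median(data),
            median(upper_half), data[-1])


# Exam scores as listed in the worked solution
scores = [67, 67, 69, 70, 70, 72, 73, 74, 75, 75, 80, 81, 83, 83, 87]
print(five_number_summary(scores))  # (67, 70, 74, 81, 87)
```

The printed values match the five-number summary found by hand.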
b

Construct a boxplot for the data.

Worked Solution
Create a strategy

Use the five-number summary in part (a) to construct the boxplot.

Apply the idea
A boxplot titled "Exam scores" on a number line from 60 to 90, drawn from the five-number summary in part (a).
c

Describe the shape of the boxplot.

Worked Solution
Create a strategy

Observe the distribution of data.

  • Symmetrical boxplots are symmetrical about the median.

  • Negatively skewed boxplots have the majority of data points with higher values.

  • Positively skewed boxplots have the majority of the data points with lower values.

  • Uniform boxplots have all quartiles the same width.

Apply the idea
A boxplot titled "Exam scores" on a number line from 60 to 90.

In this boxplot, most of the data have smaller values because the box and whisker above the median are more spread out.

Because the data above the median is more spread out, the boxplot is positively skewed.

Reflect and check

The direction of the skew will always be the side that is more spread out. In this example, we said the data is positively skewed (or skewed right) because the right side is more spread out.

This is because the data points on the right side make the distribution "crooked". If those data points were closer to the rest of the data, the distribution would be more symmetric.
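
One rough way to express this rule in code is to compare how far the data stretches on each side of the median. This is a simplified sketch: the function `describe_shape` is a name we made up, and it compares only the distance from the median to each extreme, while a full description would also look at the two parts of the box.

```python
def describe_shape(minimum, median, maximum):
    """Name the skew direction from the more spread-out side of the median."""
    below = median - minimum   # spread of the lower half
    above = maximum - median   # spread of the upper half
    if above > below:
        return "positive (right) skew"
    if below > above:
        return "negative (left) skew"
    return "symmetrical"


# Values from the exam-score five-number summary: min 67, median 74, max 87
print(describe_shape(67, 74, 87))  # positive (right) skew
```

Here the upper half covers 13 points and the lower half covers 7, so the right side is more spread out.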

Idea summary

We can describe the shape based on the distribution of the data set.

  • Symmetrical boxplots are symmetrical about the median.

  • Uniform boxplots have all quartiles the same width.

  • Positively skewed boxplots have the majority of data points with lower values.

  • Negatively skewed boxplots have the majority of the data points with higher values.

Outliers

An outlier is a data point that varies significantly from the rest of the data. An outlier will be a value that is either significantly larger or smaller than other observations.

Outliers are important to identify as they point to unusual bits of data that may require further investigation. For example, if we had data on the temperature in a volcanic lake and we had a high outlier, it is worth investigating if the sensor is faulty or if we need to prepare a nearby town for evacuation.

There are formal ways to determine if a data point is an outlier, but for now we are only going to look at data points that are obviously larger or smaller than the rest and at how they affect the measures of center and spread.

Exploration

Drag point P to various positions to explore how the point may skew the data.

  1. Move the point to the position of a really low outlier, then move the point closer to the data. Complete the sentences in the table that follows:

    Removing a really low outlier
    The range will ____.
    The median might ____.
    The mean will ____.
    The mode will ____.
  2. Move the point to the position of a really high outlier, then move the point closer to the data. Complete the sentences in the table that follows:

    Removing a really high outlier
    The range will ____.
    The median might ____.
    The mean will ____.
    The mode will ____.

Outliers can skew or change the shape of our data. This can be a problem (especially for small data sets) because the mean, median and range might not properly represent the situation. We can counteract this by removing outliers.

Removing outliers will have the following effects:

Removing a really low outlier | Removing a really high outlier
The range will decrease. | The range will decrease.
The median might increase. | The median might decrease.
The mean will increase. | The mean will decrease.
The mode will not change. | The mode will not change.
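
We can see these effects on a small data set. The numbers below are invented purely for illustration (they are not from the lesson), and `summarize` is a helper name of our own. The sketch compares the range, median, mean, and mode before and after removing a really low outlier.

```python
from statistics import mean, median, mode


def summarize(values):
    """Return the range, median, mean, and mode of a data set."""
    return {
        "range": max(values) - min(values),
        "median": median(values),
        "mean": mean(values),
        "mode": mode(values),
    }


# Hypothetical data with a really low outlier at 2
with_outlier = [2, 20, 21, 21, 23, 24, 25]
without_outlier = [x for x in with_outlier if x != 2]

before = summarize(with_outlier)
after = summarize(without_outlier)
for name in before:
    print(name, before[name], "->", after[name])
```

For this data set the range drops from 23 to 5, the median rises from 21 to 22, the mean rises, and the mode stays at 21.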

Keep in mind:

  • When describing skewed distributions, it's better to use the median and interquartile range because they are less impacted by outliers.
  • When describing symmetrical or uniform distributions, it's better to use mean and range because they take the values of all data points into account.

Examples

Example 2

The number of fatal accidents from 2000 to 2014 for different airlines is listed in the set and displayed in the boxplot:

\{0,\,0,\,0,\,0,\,0,\,0,\,1,\,1,\,1,\,1,\,2,\,2,\,2,\,2,\,2,\,2,\,4,\,4,\,4,\,5,\,5,\,5,\,5,\,5,\,6,\,7,\,10,\,11,\,11,\,12,\,15,\,24 \}

A boxplot titled "Number of fatal accidents" on a number line from -2 to 26.
a

Identify and interpret the range of the data set.

Worked Solution
Create a strategy

The range of the data set is the distance between the minimum value and the maximum value.

Apply the idea

The maximum of the data set is at 24 and the minimum is at 0. So, 24-0=24 is the range.

From 2000 to 2014, the number of fatal accidents varied by as much as 24 accidents from one airline to another.

Reflect and check

All statements about the range that describe variance (how the data varies) should be sentences phrased in terms of the variable of interest and include the correct units of measurement.

b

Identify and interpret the IQR of the data set.

Worked Solution
Create a strategy

The IQR of the data set is the distance between the upper and lower quartiles.

Apply the idea

The upper quartile is at 5.5, and the lower quartile is at 1 so:

IQR = Q_3 - Q_1    (Formula for IQR)
    = 5.5 - 1      (Substitute Q_3=5.5 and Q_1=1)
    = 4.5          (Evaluate the subtraction)

From 2000-2014 the number of fatal accidents for the middle half of all airlines varied by 4.5 accidents.

Reflect and check

Since each quartile of a boxplot represents about 25\% of the data set, the IQR represents the middle 50\% of the data set.

c

Explain what will happen to the range and IQR if the outlier at 24 is removed.

Worked Solution
Create a strategy

From parts (a) and (b) we know that the range and IQR are 24 and 4.5, respectively. We need to recalculate, or estimate, the new range and IQR without the point at 24.

Apply the idea

Without the point at 24, the maximum of the data set will change to 15 and the minimum is still 0. So, 15-0=15 is the new range. This is a reduction in the range by 9 accidents.

With the point at 24 removed, the lower quartile remains at 1 and the upper quartile lowers slightly to 5. So, 5-1=4 is the new interquartile range. This is a reduction by 0.5 accidents.
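
The same comparison can be sketched in code. The helper `range_and_iqr` is a name we chose, and it assumes the lesson's quartile rule (the middle of the values below and above the median), so other software may report slightly different quartiles.

```python
from statistics import median


def range_and_iqr(values):
    """Return (range, IQR) using the median-of-halves quartile rule."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])               # middle of the lower half
    q3 = median(data[len(data) - half:])   # middle of the upper half
    return (data[-1] - data[0], q3 - q1)


accidents = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
             4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 10, 11, 11, 12, 15, 24]
without_24 = [x for x in accidents if x != 24]

print(range_and_iqr(accidents))    # (24, 4.5)
print(range_and_iqr(without_24))   # (15, 4)
```

The range falls by 9 while the IQR falls by only 0.5, which matches the calculation above.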

Reflect and check

Like the mean, the range is greatly affected when extreme data points are added or removed. The median may change a little, but the interquartile range is least affected.

Example 3

Yartezi works at a coffee shop and tracks the number of customers that come in each day. The data she collected is shown:

90,\, 85,\, 88,\, 86,\, 95,\, 101,\, 98,\, 84,\, 35,\, 82,\, 87,\, 90,\, 92,\, 97

a

Formulate a question that could be answered using a boxplot.

Worked Solution
Create a strategy

Boxplots help us see how clustered or spread out the data is, so the question could focus on the spread of the data. Boxplots also help us see how the data breaks down into quarters (or quartiles).

Apply the idea

One possible question about this data that could be answered using a boxplot is, "What is the typical number of customers that come into the coffee shop each day?"

Reflect and check

Other questions might be, "How does the number of customers who come into the coffee shop in a day vary?" or "How often can I expect an unusually high number of customers to visit the coffee shop?"

b

Describe the data collection method that Yartezi used.

Worked Solution
Create a strategy

Consider whether Yartezi collected the data herself or whether she acquired data that was already collected by someone else.

If Yartezi collected the data herself, decide whether she used an observation, measurement, a survey, or an experiment to collect the data.

Apply the idea

Since Yartezi tracks the number of customers herself, she collected the data rather than acquired it from elsewhere. Yartezi used an observation to collect the data by counting the number of customers as they came into the shop.

Reflect and check

Yartezi did not measure anything, and she did not need to ask the customers anything. She did not use an experiment because she was not interested in the factors that caused customers to come into the shop.

c

Construct a boxplot using the data points Yartezi collected.

Worked Solution
Create a strategy

To create a boxplot, we need to identify the five critical values:

  • Lower extreme

  • Lower quartile (Q_1)

  • Median

  • Upper quartile (Q_3)

  • Upper extreme

Apply the idea

First, we need to put the data values in order:

35,\, 82,\, 84,\, 85,\, 86,\, 87,\, 88,\, 90,\, 90,\, 92,\, 95,\, 97,\, 98,\, 101

Now, we can find the lower extreme, upper extreme, and the median (the middle value). After that, we can find the middle value of the data between the lower extreme and the median (this will be the lower quartile) and the middle value of the data between the median and the upper extreme (this will be the upper quartile).

A data set: 35, 82, 84, 85, 86, 87, 88, 90, 90, 92, 95, 97, 98, 101. The lower extreme, lower quartile, median, upper quartile, and upper extreme are labeled.
A boxplot titled "Number of customers" on a number line from 30 to 110.
d

Answer the formulated question from part (a) using the boxplot and explain whether the answer is reasonable.

Worked Solution
Create a strategy

The question from part (a) is, "What is the typical number of customers that come into the coffee shop each day?" To answer this question, we could consider a measure of center, such as the median.

Apply the idea

The median is 89, so we could say that 89 customers typically come into the coffee shop each day.

This answer seems reasonable because 89 is near the middle of the data set, so it represents most of the daily counts well.

Reflect and check

The mean of this data set is

\dfrac{35+82+84+85+86+87+88+90+90+92+95+97+98+101}{14}\approx 86.4

This value would be an unreasonable estimate of the center of the data because 86 is toward the bottom 25\% of the data.
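
As a quick check of these two measures of center, here is a minimal sketch using Python's built-in `statistics` module. The variable names are our own.

```python
from statistics import mean, median

# Daily customer counts collected at the coffee shop
customers = [90, 85, 88, 86, 95, 101, 98, 84, 35, 82, 87, 90, 92, 97]

print(round(mean(customers), 1))  # 86.4
print(median(customers))          # 89.0
```

The single low value of 35 pulls the mean below the median.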

e

Construct a second boxplot after the outlier has been removed, but mark the outlier as a point. Compare the shape of the new boxplot with the first boxplot.

Worked Solution
Create a strategy

First, we need to identify the outlier. Since the left whisker is very long, the outlier will be the minimum value. Then, we will need to recalculate the five-number summary after removing the outlier.

Apply the idea

The outlier of the data set is 35 because it is much lower than the other data values.

We can now find the new five-number summary without the outlier.

A data set: 82, 84, 85, 86, 87, 88, 90, 90, 92, 95, 97, 98, 101. The lower extreme, lower quartile, median, upper quartile, and upper extreme are labeled.

The lower quartile is Q_1=\dfrac{85+86}{2}=85.5, and the upper quartile is Q_3=\dfrac{95+97}{2}=96.
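
To compare the two five-number summaries side by side, we could use a sketch like the one below. The function name `five_number_summary` is our own, and it assumes the quartile rule used in this lesson (the middle of the values below and above the median).

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]               # values below the median position
    upper_half = data[len(data) - half:]   # values above the median position
    return (data[0], median(lower_half), median(data),
            median(upper_half), data[-1])


customers = [90, 85, 88, 86, 95, 101, 98, 84, 35, 82, 87, 90, 92, 97]
no_outlier = [x for x in customers if x != 35]

print(five_number_summary(customers))   # (35, 85, 89.0, 95, 101)
print(five_number_summary(no_outlier))  # (82, 85.5, 90, 96.0, 101)
```

Removing 35 moves the lower extreme from 35 to 82, while the quartiles and median each change by 1 or less.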

A boxplot titled "Number of customers" on a number line from 30 to 110, with the outlier at 35 marked as a separate point.

With the outlier, the original boxplot was very negatively skewed with a very large spread in the first quartile. The boxplot without the outlier has a slight positive skew and a much smaller spread.

Reflect and check

Although the outlier may skew the data, we should still show it on the graph to give the full picture of the data. Removing the outlier when calculating measures of center or spread gives us a better understanding of the majority of the data, but we should not disregard the outlier completely.

f

Answer the formulated question from part (a) using the boxplot that represents the data set after the outlier was removed.

Worked Solution
Create a strategy

Previously, we looked at the median to determine the typical number of customers that come into the coffee shop each day. We now want to reconsider that answer, as it is likely that the median has changed since the outlier was removed.

Apply the idea

Before the outlier was removed, the median was 89 customers. After removing the outlier, the median is 90 customers.

Now, we can estimate that 90 customers typically come into the coffee shop each day.

Reflect and check

Without the outlier, this new median is a better measure of the center of the data. However, we should not completely disregard the outlier. The outlier could potentially prompt us to ask a new question about the data, such as "What factors cause a drop in the number of customers?"

A number of factors could have caused the outlier in this context, such as a heavy snowstorm, a power outage, or a city event happening in a different part of the town. To determine the factors that cause outliers, we would need to run an experiment.

Idea summary

An outlier is a data point that varies significantly from the body of the data. An outlier will be a value that is either significantly larger or smaller than other observations.

Removing outliers will have the following effects on the summary statistics:

A really low outlier | A really high outlier
The range will decrease | The range will decrease
The median might increase | The median might decrease
The mean will increase | The mean will decrease
The mode will not change | The mode will not change

The IQR is resistant to outliers because it describes the middle half of the data, not the extremes.

Outcomes

8.PS.2

The student will apply the data cycle (formulate questions; collect or acquire data; organize and represent data; and analyze data and communicate results) with a focus on boxplots.

8.PS.2a

Formulate questions that require the collection or acquisition of data with a focus on boxplots.

8.PS.2b

Determine the data needed to answer a formulated question and collect the data (or acquire existing data) using various methods (e.g., observations, measurement, surveys, experiments).

8.PS.2d

Organize and represent a numeric data set of no more than 20 items, using boxplots, with and without the use of technology.

8.PS.2e

Identify and describe the lower extreme (minimum), upper extreme (maximum), median, upper quartile, lower quartile, range, and interquartile range given a data set, represented by a boxplot.

8.PS.2f

Describe how the presence of an extreme data point (outlier) affects the shape and spread of the data distribution of a boxplot.

8.PS.2g

Analyze data represented in a boxplot by making observations and drawing conclusions.
