
6.03 Measures of spread

Introduction

We compared the center of different sets of data in lesson 7.02 Measures of center. We will now compare the spread of data sets, going beyond the concept of range from 6th grade. In this lesson, we will first examine measures of spread for data sets without outliers. Then we will examine how measures of spread are affected by the inclusion of outliers.

Measures of spread

We've seen the range of a set of data, the difference between the maximum and minimum values of the data, and the interquartile range (IQR), which is the difference between the third quartile and first quartile of a data set.

Recall that the mean absolute deviation (MAD) of a data set is the average distance between each data point and the mean. This value gives us an idea of the spread of a set of data. A new value we may consider when evaluating the spread of a data set is the standard deviation.

Standard deviation

A measure of spread that describes how far the data values typically lie from the mean, giving us a meaningful estimate of the variability in a data set.

Exploration

Consider the dot plots shown:

A dot plot, ranging from 0 to 10 in steps of 1. The number of dots is as follows: at 1, 7; at 2, 6; at 3, 5; at 4, 4; at 5, 3; at 6, 4; at 7, 5; at 8, 6; at 9, 7. The following information is shown: mean = 5, median = 5, standard deviation = 2.84, and MAD = 2.55.
A dot plot, ranging from 0 to 10 in steps of 1. The number of dots is as follows: at 1, 3; at 2, 3; at 3, 3; at 4, 3; at 5, 3; at 6, 3; at 7, 3; at 8, 3; at 9, 3. The following information is shown: mean = 5, median = 5, standard deviation = 2.58, and MAD = 2.22.
A dot plot, ranging from 0 to 10 in steps of 1. The number of dots is as follows: at 1, 1; at 2, 1; at 3, 2; at 4, 3; at 5, 4; at 6, 3; at 7, 2; at 8, 1; at 9, 1. The following information is shown: mean = 5, median = 5, standard deviation = 2, and MAD = 1.56.
A dot plot, ranging from 0 to 10 in steps of 1. The number of dots is as follows: at 3, 1; at 4, 2; at 5, 8; at 6, 2; at 7, 1. The following information is shown: mean = 5, median = 5, standard deviation = 0.93, and MAD = 0.57.
A dot plot, ranging from 0 to 10 in steps of 1. The number of dots is as follows: at 5, 7. The following information is shown: mean = 5, median = 5, standard deviation = 0, and MAD = 0.
  1. What do you notice about the dot plots and the measures of spread?

  2. Make a conjecture about the MAD and standard deviation of a set of data.

For sets of data that are more spread out from the mean and median, both the MAD and standard deviation are larger. As the data clusters more tightly around the center, the MAD and standard deviation become smaller, reaching 0 when every data point has the same value. In general, MAD and standard deviation both tell us how close the data is to the center of the data set, but the standard deviation is never smaller than the MAD and is usually larger.
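To check the values shown on the dot plots, we can recompute them with technology. The sketch below is one possible way to do this in Python. It assumes the standard deviation on each plot is the population standard deviation (dividing by the number of data points), an assumption that reproduces the values shown. The helper names `expand` and `mean_absolute_deviation` are illustrative and not part of the lesson.

```python
from statistics import fmean, pstdev


def expand(counts):
    """Turn a {value: number_of_dots} mapping into a flat list of data points."""
    return [value for value, n in counts.items() for _ in range(n)]


def mean_absolute_deviation(data):
    """Average distance between each data point and the mean."""
    center = fmean(data)
    return fmean([abs(x - center) for x in data])


# Dot counts read from the five dot plots in the exploration.
dot_plots = [
    {1: 7, 2: 6, 3: 5, 4: 4, 5: 3, 6: 4, 7: 5, 8: 6, 9: 7},
    {value: 3 for value in range(1, 10)},
    {1: 1, 2: 1, 3: 2, 4: 3, 5: 4, 6: 3, 7: 2, 8: 1, 9: 1},
    {3: 1, 4: 2, 5: 8, 6: 2, 7: 1},
    {5: 7},
]

summary = []
for counts in dot_plots:
    data = expand(counts)
    # Assumption: population standard deviation (divide by n, not n - 1).
    sd = round(pstdev(data), 2)
    mad = round(mean_absolute_deviation(data), 2)
    summary.append((sd, mad))

for sd, mad in summary:
    print(f"standard deviation = {sd}, MAD = {mad}")
```

In each of the five plots the standard deviation is at least as large as the MAD, which is consistent with the conjecture from the exploration.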

When comparing data sets, the standard deviation alone will tell us how variable or consistent the values in the data set are.

  • A larger value indicates a wider spread (more variable) data set.
  • A smaller value indicates a more tightly packed (less variable) data set.

Just as the IQR can be used to describe the middle half of a data set, the mean and standard deviation together can be used to describe the majority of a data set. For data that is roughly mound-shaped, we say that the majority of the data lies within one standard deviation of the mean, that is, between \text{Mean}-\text{standard deviation} and \text{Mean}+\text{standard deviation}.
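We can test this rule of thumb on the dot plots from the exploration. The sketch below counts the share of data points that fall within one standard deviation of the mean. It again assumes the population standard deviation, and the function name `share_within_one_sd` and the labels `mound`, `peaked`, and `u_shaped` are our own.

```python
from statistics import fmean, pstdev


def share_within_one_sd(data):
    """Fraction of data points between mean - SD and mean + SD (inclusive)."""
    center = fmean(data)
    sd = pstdev(data)  # assumption: population standard deviation
    inside = [x for x in data if center - sd <= x <= center + sd]
    return len(inside) / len(data)


# Third dot plot from the exploration (mound-shaped).
mound = [1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9]
# Fourth dot plot from the exploration (tightly packed around 5).
peaked = [3, 4, 4] + [5] * 8 + [6, 6, 7]
# First dot plot from the exploration (most dots near the two ends).
first_counts = [7, 6, 5, 4, 3, 4, 5, 6, 7]
u_shaped = [v for v, n in zip(range(1, 10), first_counts) for _ in range(n)]

shares = {
    "mound": share_within_one_sd(mound),
    "peaked": share_within_one_sd(peaked),
    "u_shaped": share_within_one_sd(u_shaped),
}

for name, share in shares.items():
    print(f"{name}: {share:.0%} of the data within one standard deviation")
```

For the two plots that peak in the middle, roughly 78% and 57% of the data fall in the interval. For the first plot, where most dots sit near the ends, only about 45% do, so the rule works best when the data is piled up near the center.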

Examples

Example 1

Shown below are histograms comparing the test scores from two different groups of students. A table of values shows the mean and standard deviation of the scores for each group. Determine what the mean and standard deviation of the groups tells us.

Two histograms and a table are shown. The left histogram, titled Group A, has numbers 0 through 60 in steps of 10 on the y-axis and bars labeled at their endpoints from 30 to 70 in steps of 10 on the x-axis. The 30 through 40 bar goes to 14 on the y-axis, 40 through 50 goes to 17, 50 through 60 goes to 24, and 60 through 70 goes to 16. The right histogram, titled Group B, has numbers 0 through 60 in steps of 10 on the y-axis and bars labeled at their endpoints from 30 to 70 in steps of 10 on the x-axis. The 30 through 40 bar goes to 4 on the y-axis, 40 through 50 goes to 55, 50 through 60 goes to 13, and 60 through 70 goes to 2. The table has 2 columns titled Mean and Standard Deviation, and 2 rows titled Group A and Group B. The data is as follows: Group A: Mean, 50.86, Standard Deviation, 10.49; Group B: Mean, 46.86, Standard Deviation, 5.43.
Worked Solution
Apply the idea

The majority of students in Group A scored between 40.37 and 61.35.

The majority of students in Group B scored between 41.43 and 52.29.

The mean indicates that Group A scored higher on average on the test. The standard deviation tells us that Group B had less variability and more consistent scores.
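The two intervals above come from subtracting the standard deviation from the mean and adding it to the mean. As a minimal sketch of that arithmetic, using only the values from the table (the variable names are our own):

```python
# Mean and standard deviation for each group, copied from the table in Example 1.
groups = {
    "Group A": (50.86, 10.49),
    "Group B": (46.86, 5.43),
}

intervals = {}
for name, (mean, sd) in groups.items():
    # One standard deviation below and above the mean, rounded to 2 decimal places.
    intervals[name] = (round(mean - sd, 2), round(mean + sd, 2))

for name, (low, high) in intervals.items():
    print(f"{name}: majority of scores between {low} and {high}")
```

The interval for Group B is about half as wide as the interval for Group A, which matches the narrower, taller histogram.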

Reflect and check

The graphs show how, for data sets of similar size and similar range, a smaller standard deviation usually results in a higher peak.

Idea summary

The spread of a data set can be described by using the range, IQR, and standard deviation.

  • The range describes the spread of the data.

  • The IQR describes the spread of the middle half of the data.

  • The standard deviation describes the spread of the majority of the data.

Outliers

Exploration

Begin by dragging Point P closer to the other points in the data set. Then, move Point P further away from the data set.

  1. What happens to the standard deviation as you move Point P to the position of an outlier?

  2. What happens to the range and IQR as you move Point P to the position of an outlier?

  3. What can you conclude about an outlier's impact on measures of spread?

We saw that outliers can cause the data to skew, and thus influence the mean, which is sometimes used to describe the center of a set of data. Just as they affect certain measures of center, outliers have an impact on certain measures of spread.

The range of a data set will be impacted by the inclusion of an outlier, since an outlier is typically the maximum or minimum value. Outliers will also affect the MAD and the standard deviation. This makes sense because the mean is used as part of the calculation for both measures of spread, and the mean is affected by outliers.

When analyzing sets of data, if the data has outliers, it is best to use the IQR to describe the data since the IQR is more resistant to outliers.

Examples

Example 2

The number of fatal accidents from 2000 to 2014 for different airlines is displayed in the box plot:

\text{Number of fatal accidents}= \{0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 10, 11, 11, 12, 15, 24\}

A box plot titled Number of fatal accidents, with an axis ranging from -2 to 26 in steps of 2.
a

Identify and interpret the range of the data set.

Worked Solution
Create a strategy

The range of the data set is the distance between the minimum value and the maximum value.

Apply the idea

The maximum of the data set is at 24 and the minimum is at 0. So, 24-0=24 is the range.

From 2000 to 2014, the number of fatal accidents for different airlines varied by as much as 24 accidents.

Reflect and check

All statements about the range that describe variability should be sentences phrased in terms of the variable of interest and include the correct units of measurement.

b

Identify and interpret the IQR of the data set.

Worked Solution
Create a strategy

The IQR of the data set is the distance between the upper and lower quartiles.

Apply the idea

The upper quartile is at 5.5, and the lower quartile is at 1 so:

\begin{aligned}
IQR &= Q_3-Q_1 && \text{Formula for IQR} \\
&= 5.5-1 && \text{Substitute } Q_3=5.5 \text{ and } Q_1=1 \\
&= 4.5 && \text{Evaluate the subtraction}
\end{aligned}

From 2000 to 2014, the number of fatal accidents for the middle half of all airlines varied by 4.5 accidents.

Reflect and check

Since each of the four sections of a box plot represents \approx 25\% of the data set, the IQR represents the middle 50\% of the data set.

c

Explain what will happen to the range and IQR if the outlier at 24 is removed.

Worked Solution
Create a strategy

From parts (a) and (b) we know that the range and IQR are 24 and 4.5, respectively. We need to recalculate, or estimate, the new range and IQR without the point at 24.

Apply the idea

Without the point at 24, the maximum of the data set will change to 15 and the minimum is still 0. So, 15-0=15 is the new range. This is a reduction in the range by 9 accidents.

With the point at 24 removed, the lower quartile remains at 1 and the upper quartile lowers slightly to 5. So, 5-1=4 is the new interquartile range. This is a reduction by 0.5 accidents.

Reflect and check

Just as the mean is affected more than the median, the range is greatly affected when extreme data points are added or removed, while the interquartile range changes very little.
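The results from parts (a) through (c) can be checked with technology. The sketch below is one possible approach in Python. It assumes the quartiles are found as the medians of the lower and upper halves of the ordered data, leaving out the overall median when the number of data points is odd. Other software may use a different quartile rule and give slightly different values. The function names `quartiles` and `spread` are our own.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: for an odd number of points, the overall median is
    left out of both halves.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return median(lower), median(upper)


def spread(data):
    """Return (range, IQR) for a data set."""
    q1, q3 = quartiles(data)
    return max(data) - min(data), q3 - q1


# Number of fatal accidents, copied from Example 2.
accidents = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 4, 4, 4,
             5, 5, 5, 5, 5, 6, 7, 10, 11, 11, 12, 15, 24]
without_24 = [x for x in accidents if x != 24]

print("with 24:   ", spread(accidents))
print("without 24:", spread(without_24))
```

Removing the single point at 24 lowers the range from 24 to 15 but lowers the IQR only from 4.5 to 4.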

Example 3

Compare the spread of the data sets showing the fuel efficiency for cars versus trucks:

A plot titled Car fuel efficiency (Miles per gallon), with an axis ranging from 10 to 60 in steps of 5.
A dot plot titled Truck fuel efficiency in miles per gallon, ranging from 12 to 27. The number of dots is as follows: at 12, 1; at 13, 2; at 14, 1; at 15, 3; at 16, 2; at 17, 1; at 19, 1; at 27, 1.
Measure               Car fuel efficiency    Truck fuel efficiency
Range                 35                     15
Outlier(s)            50                     27
Standard deviation    9.14                   3.79
Worked Solution
Create a strategy

Since the data sets have outliers, it is best to describe the spread of the data using the IQR.

Apply the idea
\begin{aligned}
IQR &= Q_3-Q_1 && \text{Formula for IQR} \\
&= 28.25-20 && \text{Substitute } Q_3=28.25 \text{ and } Q_1=20 \\
&= 8.25 && \text{Evaluate the subtraction}
\end{aligned}

The IQR for the car fuel efficiency is approximately 8.25 miles per gallon, meaning the middle 50\% of cars vary by 8.25 miles per gallon.

We need to calculate the 5-number summary of the fuel efficiency for trucks and then calculate the IQR. We can do this by using technology or splitting the data into four quarters and finding the upper and lower quartiles.

A sequence of numbers arranged in a single row: 12, 13, 13, 14, 15, 15, 15, 16, 16, 17, 19, 27. The numbers are grouped into 4: first group: 12, 13, 13; second group: 14, 15, 15; third group: 15, 16, 16; fourth group: 17, 19, 27. The space between the first and second group is labeled lower quartile, space between the second and third group is labeled median, space between the third and fourth group is labeled upper quartile.
\begin{aligned}
IQR &= Q_3-Q_1 && \text{Formula for IQR} \\
&= 16.5-13.5 && \text{Substitute } Q_3=16.5 \text{ and } Q_1=13.5 \\
&= 3 && \text{Evaluate the subtraction}
\end{aligned}

The IQR for truck fuel efficiency is approximately 3 miles per gallon, meaning that the middle 50\% of trucks vary by 3 miles per gallon.

The fuel efficiency of the middle 50\% of cars varies by 8.25 miles per gallon. The fuel efficiency of the middle 50\% of trucks varies by 3 miles per gallon. While cars get better gas mileage overall, their fuel efficiency is more variable than that of trucks: the IQR for cars is about 5 miles per gallon greater.

By comparing the range of each graph, we could see that the spread for fuel efficiency for cars is larger. But, once we examine the IQR, we can see that the difference in spread is not as drastic as it initially appears.
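The truck values can be checked with technology, and the same check shows how much each measure depends on the outlier at 27. The sketch below is one possible approach in Python. It assumes the quartiles are the medians of the lower and upper halves of the ordered data, and that the standard deviation in the table is the population standard deviation. The function names `iqr` and `describe` are our own.

```python
from statistics import median, pstdev


def iqr(data):
    """IQR using the medians of the lower and upper halves of the sorted data."""
    ordered = sorted(data)
    half = len(ordered) // 2  # for odd sizes the middle value is left out
    return median(ordered[-half:]) - median(ordered[:half])


def describe(data):
    """Return (range, IQR, population standard deviation to 2 decimal places)."""
    return max(data) - min(data), iqr(data), round(pstdev(data), 2)


# Truck fuel efficiency in miles per gallon, read from the dot plot.
trucks = [12, 13, 13, 14, 15, 15, 15, 16, 16, 17, 19, 27]
trucks_without_27 = [x for x in trucks if x != 27]

print("all trucks:    ", describe(trucks))
print("without the 27:", describe(trucks_without_27))
```

With all twelve trucks, the range is 15, the IQR is 3, and the standard deviation is 3.79, matching the table. Without the truck at 27, the range falls to 7 and the standard deviation to about 1.91, while the IQR stays at 3.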

Idea summary

The range and standard deviation are both strongly influenced by outliers. When an outlier is removed from the data set, the standard deviation will typically decrease, and the range will decrease if the outlier was the maximum or minimum value. The IQR is resistant to outliers because it describes the middle half of the data, not the extremes.

Outcomes

S.ID.A.1

Represent data with plots on the real number line (dot plots, histograms, and box plots).

S.ID.A.2

Use statistics appropriate to the shape of the data distribution to compare center (median, mean) and spread (interquartile range, standard deviation) of two or more different data sets.

S.ID.A.3

Interpret differences in shape, center, and spread in the context of the data sets, accounting for possible effects of extreme data points (outliers).
