We compared the center of different sets of data in lesson 7.02 Measures of center. We will now compare the spread of data sets, going beyond the concept of range from 6th grade. In this lesson, we will examine measures of spread for data sets without outliers. Then we will examine how measures of spread are affected by the inclusion of outliers.
We've seen the range of a set of data, the difference between the maximum and minimum values of the data, and the interquartile range (IQR), which is the difference between the third quartile and first quartile of a data set.
Recall that the mean absolute deviation (MAD) of a data set is the average distance between each data point and the mean. This value gives us an idea of the spread of a set of data. A new value we may consider when evaluating the spread of a data set is the standard deviation.
Consider the dot plots shown:
What do you notice about the dot plots and the measures of spread?
Make a conjecture about the MAD and standard deviation of a set of data.
For sets of data that are more spread out from the mean and median, both the MAD and standard deviation are higher numbers. As data takes on a more symmetrical and centered shape, the MAD and standard deviation become lower in value. In general, MAD and standard deviation give us information about how close the data is to the center of the data set. The standard deviation is never smaller than the MAD, and it is usually larger.
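To make the comparison concrete, the sketch below computes both measures for a small made-up data set. The values are an assumption chosen so the arithmetic is easy to check by hand; they are not data from this lesson. The sketch also assumes the population form of the standard deviation (dividing by the number of data points), since the lesson does not state which form it uses.

```python
from statistics import mean, pstdev


def mean_absolute_deviation(data):
    """Average distance between each data point and the mean."""
    center = mean(data)
    return sum(abs(x - center) for x in data) / len(data)


# Hypothetical data set (mean is 5), not taken from the lesson.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mad = mean_absolute_deviation(values)
sd = pstdev(values)  # population standard deviation (an assumption)

print(f"MAD = {mad}")                # MAD = 1.5
print(f"Standard deviation = {sd}")  # Standard deviation = 2.0
```

Both numbers measure distance from the mean, but the standard deviation squares each distance before averaging, so points far from the mean count for more. That is one way to see why it comes out at least as large as the MAD.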
When comparing data sets, the standard deviation on its own tells us how variable or consistent the values in each data set are: the smaller the standard deviation, the more consistent the values.
Just like the IQR can be used to describe the middle half of a data set, the mean and standard deviation together can be used to describe the majority of a data set. We say that the majority of the data lies in the interval \text{Mean}\pm \text{standard deviation}, that is, between one standard deviation below the mean and one standard deviation above it.
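As a rough illustration of this interval, the sketch below uses a made-up list of test scores (an assumption, not the scores from the histograms in this lesson) and counts how many values fall within one standard deviation of the mean. It again assumes the population standard deviation.

```python
from statistics import mean, pstdev

# Hypothetical test scores, not taken from the lesson.
scores = [60, 70, 70, 70, 75, 75, 85, 95]

center = mean(scores)
sd = pstdev(scores)  # population standard deviation (an assumption)

# One standard deviation below and above the mean.
low, high = center - sd, center + sd
inside = [x for x in scores if low <= x <= high]
share = len(inside) / len(scores)

print(f"Interval: {low} to {high}")  # Interval: 65.0 to 85.0
print(f"{len(inside)} of {len(scores)} scores ({share:.0%}) fall inside")
# 6 of 8 scores (75%) fall inside
```

For this data set the interval captures most of the scores, but not all of them. The exact share depends on the shape of the data.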
Shown below are histograms comparing the test scores from two different groups of students. A table of values shows the mean and standard deviation of the scores for each group. Determine what the mean and standard deviation of the groups tells us.
The spread of a data set can be described by using the range, IQR, and standard deviation.
The range describes the spread of the entire data set, from the minimum value to the maximum value.
The IQR describes the spread of the middle half of the data.
The standard deviation describes the spread of the majority of the data.
Begin by dragging Point P closer to the other points in the data set. Then, move Point P further away from the data set.
What happens to the standard deviation as you move Point P to the position of an outlier?
What happens to the range and IQR as you move Point P to the position of an outlier?
What can you conclude about an outlier's impact on measures of spread?
We saw that outliers can cause the data to skew, and thus influence the mean that is sometimes used to describe the center of a set of data. Just as they affect certain measures of center, outliers have an impact on certain measures of spread.
The range of a data set will be impacted by the inclusion of an outlier, since an outlier will be a maximum or minimum value. Outliers will also affect the MAD and the standard deviation. This makes sense because the mean is used as part of the calculation for both of these measures of spread, and the mean is affected by outliers.
When analyzing sets of data, if the data has outliers, it is best to use the IQR to describe the spread of the data, since the IQR is more resistant to outliers than the range, MAD, or standard deviation.
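The effect can be checked numerically. The sketch below compares each measure of spread for a made-up data set with and without one extreme value. The data, the helper names `mad`, `iqr`, and `summarize`, and the quartile rule (median of the lower and upper halves, leaving out the middle value when the count is odd) are all assumptions for illustration; other quartile conventions give slightly different IQR values.

```python
from statistics import mean, median, pstdev


def mad(data):
    """Mean absolute deviation: average distance from the mean."""
    center = mean(data)
    return sum(abs(x - center) for x in data) / len(data)


def iqr(data):
    """IQR using the median-of-halves rule (one common convention)."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]
    # For an odd count, the middle value is left out of both halves.
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return median(upper) - median(lower)


def summarize(label, data):
    print(f"{label}: mean={mean(data):.2f}  range={max(data) - min(data)}  "
          f"IQR={iqr(data)}  MAD={mad(data):.2f}  SD={pstdev(data):.2f}")


base = list(range(1, 13))    # hypothetical data: 1, 2, ..., 12
with_outlier = base + [40]   # same data plus one extreme value

summarize("without outlier", base)
summarize("with outlier   ", with_outlier)
```

In this example the range jumps from 11 to 39 and the standard deviation more than doubles, while the IQR only moves from 6 to 7.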
The number of fatal accidents from 2000 to 2014 for different airlines is displayed in the box plot:
\text{Number of fatal accidents} = \{0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 10, 11, 11, 12, 15, 24\}
Identify and interpret the range of the data set.
Identify and interpret the IQR of the data set.
Explain what will happen to the range and IQR if the outlier at 24 is removed.
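One way to check answers to these questions is to compute the values directly from the listed data. The sketch below assumes quartiles are found as the medians of the lower and upper halves of the sorted data (leaving out the middle value when the count is odd). The helper names `quartiles` and `spread` are hypothetical, and a different quartile convention may give a slightly different IQR.

```python
from statistics import median


def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves (an assumed rule)."""
    s = sorted(data)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half:] if len(s) % 2 == 0 else s[half + 1:]
    return median(lower), median(upper)


def spread(data):
    """Return (range, IQR) for a data set."""
    q1, q3 = quartiles(data)
    return max(data) - min(data), q3 - q1


accidents = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2,
             4, 4, 4, 5, 5, 5, 5, 5, 6, 7, 10, 11, 11, 12, 15, 24]

full_range, full_iqr = spread(accidents)

# Remove the value 24 and recompute.
trimmed = [x for x in accidents if x != 24]
trimmed_range, trimmed_iqr = spread(trimmed)

print(f"All airlines: range = {full_range}, IQR = {full_iqr}")
# All airlines: range = 24, IQR = 4.5
print(f"Without 24:   range = {trimmed_range}, IQR = {trimmed_iqr}")
# Without 24:   range = 15, IQR = 4
```

Under this quartile rule, removing 24 lowers the range by 9 but changes the IQR by only 0.5.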
Compare the spread of the data sets showing the fuel efficiency for cars versus trucks:
| | Car fuel efficiency | Truck fuel efficiency |
|---|---|---|
| Range | 35 | 15 |
| Outlier(s) | 50 | 27 |
| Standard deviation | 9.14 | 3.79 |
The range and standard deviation are both strongly influenced by outliers. When any outlier is removed from the data set, both the range and standard deviation will decrease. The IQR is resistant to outliers because it describes the middle half of the data, not the extremes.