Measures of spread in a quantitative (numerical) data set seek to describe whether the scores in a data set are very similar and clustered together, or whether there is a lot of variation in the scores and they are very spread out.
In this section, we will look at the range and interquartile range as measures of spread. We will also explore how to compare data sets using parallel box plots.
The range is the simplest measure of spread in a quantitative (numerical) data set. It is the difference between the maximum and minimum scores in a data set.
We subtract the lowest score in the set from the highest score in the set. That is: \text{Range}=\text{Highest score}-\text{Lowest score}
For example, at one school the ages of students in Year 7 vary between 11 and 14. So the range for this set is 14-11=3.
As a different example, if we looked at the ages of people waiting at a bus stop, the youngest person might be a 7 year old and the oldest person might be a 90 year old. The range of this set of data is 90-7=83, which is a much larger range of ages.
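The two calculations above can be checked with a few lines of Python. This is only a minimal sketch: the function name `data_range` and the individual ages in each list are made up for illustration, chosen so that the youngest and oldest ages match the examples in the text.

```python
def data_range(scores):
    """Return the range: highest score minus lowest score."""
    return max(scores) - min(scores)


# Hypothetical ages; only the minimum and maximum come from the text.
year_7_ages = [11, 12, 12, 13, 13, 14]
bus_stop_ages = [7, 23, 35, 41, 68, 90]

print(data_range(year_7_ages))    # 3
print(data_range(bus_stop_ages))  # 83
```

Notice that only the two extreme scores affect the result. Changing any of the middle ages leaves the range unchanged.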
Whilst the range is very simple to calculate, it is based on the sparse information provided by the upper and lower limits of the data set. To get a better picture of the internal spread in a data set, it is often more useful to find the set's quartiles, from which the interquartile range (IQR) can be calculated.
To calculate the interquartile range: \text{Interquartile range}=Q_{3}-Q_{1}
Quartiles are scores at particular locations in the data set - similar to the median, but instead of dividing a data set into halves, they divide a data set into quarters.
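As a rough illustration, the quartiles and the interquartile range can be found by splitting the ordered scores into a lower half and an upper half and taking the median of each. The sketch below assumes the "median of halves" convention, where the middle score is left out of both halves when the number of scores is odd. Other conventions exist and can give slightly different quartiles. The function names and the scores are hypothetical.

```python
from statistics import median


def quartiles(scores):
    """Return (Q1, median, Q3) using the median-of-halves convention.

    Assumption: with an odd number of scores, the middle score is
    excluded from both halves. Needs at least two scores.
    """
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]       # scores below the middle
    upper_half = ordered[n - half:]   # scores above the middle
    return median(lower_half), median(ordered), median(upper_half)


def interquartile_range(scores):
    """Return Q3 - Q1."""
    q1, _, q3 = quartiles(scores)
    return q3 - q1


# Hypothetical scores, made up for illustration.
scores = [1, 3, 4, 6, 7, 8, 10, 12, 15]

print(quartiles(scores))            # (3.5, 7, 11.0)
print(interquartile_range(scores))  # 7.5
```

Here the range is 15-1=14, while the interquartile range of 7.5 describes the spread of the middle half of the scores only.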
Parallel box plots are used to compare two sets of data visually. When comparing box plots, we compare the five-number summary values of each data set (the quartiles and minimum and maximum values) and also the range and interquartile range. The two box plots are drawn parallel to each other.
It is important to clearly label each box plot. Below are parallel box plots comparing the time it took two different groups of people to complete an online task.
We can see that overall the under 30s were faster at completing the task. Both the under 30s box plot and the over 30s box plot are slightly negatively skewed. Over 75% of the under 30s completed the task in under 22 seconds, which is the median time taken by the over 30s. 100% of the under 30s had finished the task before 75% of the over 30s had completed it.
Overall, the under 30s performed better and had a smaller spread of times. The over 30s group was more spread out, with a range of 24 seconds compared to 20 seconds for the under 30s.
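A comparison like this one can also be made numerically by computing the five-number summary, range and interquartile range of each group. The sketch below is an illustration only: the two lists of times are made up and are not the data behind the box plots discussed above, and the quartiles again assume the median-of-halves convention.

```python
from statistics import median


def five_number_summary(scores):
    """Return the minimum, Q1, median, Q3 and maximum as a dictionary.

    Assumption: quartiles use the median-of-halves convention.
    """
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    return {
        "min": ordered[0],
        "q1": median(ordered[:half]),
        "median": median(ordered),
        "q3": median(ordered[n - half:]),
        "max": ordered[-1],
    }


def spread(summary):
    """Return the range and interquartile range from a five-number summary."""
    return {
        "range": summary["max"] - summary["min"],
        "iqr": summary["q3"] - summary["q1"],
    }


# Hypothetical completion times in seconds, NOT read from the box plots.
group_a = [10, 12, 14, 15, 16, 18, 20, 22]
group_b = [14, 17, 19, 21, 23, 26, 30, 34]

for name, times in (("Group A", group_a), ("Group B", group_b)):
    summary = five_number_summary(times)
    print(name, summary, spread(summary))
```

With these made-up times, Group A has the lower median and the smaller range and interquartile range, so it is both faster and less spread out than Group B.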
The two box plots below show the data collected by the manufacturers on the life-span of light bulbs, measured in thousands of hours.
Complete the following table using the two box plots. Write each answer in terms of hours (remember to multiply the values on the data display by 1000).
| | Manufacturer A | Manufacturer B |
|---|---|---|
| Median | | |
| Lower quartile | | |
| Upper quartile | | |
| Range | | |
| Interquartile range | | |
Which manufacturer produces light bulbs with the best lifespan?
The box plots show the number of goals scored by two football players in each season.
Who scored the most goals in a season?
How many more goals did Holly score in her best season compared to Sophie in her best season?
What is the difference between the median number of goals scored in a season by each player?
What is the difference between the interquartile ranges of the two players?
When comparing box plots, we compare the five-number summary values of each data set (the quartiles and minimum and maximum values) and the range and interquartile range.
We have already seen how data can be displayed in histograms and in box plots. These two displays are useful for identifying key features of the shape of the data, as well as the range and, in the case of the box plot, the interquartile range and the median.
We should expect, then, that the shape of the data would be the same whether it is represented in a frequency polygon, box plot or histogram. Remember that the shape of data can be symmetric, left (negatively) skewed or right (positively) skewed. This is shown below:
All representations of symmetric data:
All representations of positively skewed data:
All representations of negatively skewed data:
Looking at the diagrams above, can you see the similarities in the representations?
We can see the skewed tails, where the bulk of the data sits, and the general shape. These are some of the features you can use to match histograms and box plots. We can also compare the range of the data.
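One way to make the idea of a longer tail concrete is to compare how far the minimum and the maximum sit from the median. The sketch below is a rough heuristic and not a formal test of skewness. The function name `describe_shape` and the `tolerance` parameter are assumptions introduced for illustration, and the example values are made up.

```python
def describe_shape(minimum, med, maximum, tolerance=0.1):
    """Give a rough shape label from the minimum, median and maximum.

    Heuristic (an assumption, not a formal rule): if the two tails
    differ by no more than `tolerance` times the range, call the data
    approximately symmetric. Otherwise the longer tail names the skew.
    """
    left_tail = med - minimum
    right_tail = maximum - med
    total = maximum - minimum
    if total == 0 or abs(right_tail - left_tail) <= tolerance * total:
        return "approximately symmetric"
    if right_tail > left_tail:
        return "positively skewed"
    return "negatively skewed"


# Hypothetical summaries, made up for illustration.
print(describe_shape(0, 5, 10))  # approximately symmetric
print(describe_shape(0, 3, 10))  # positively skewed
print(describe_shape(0, 7, 10))  # negatively skewed
```

A longer tail on the right of the median suggests positively skewed data, and a longer tail on the left suggests negatively skewed data, whichever display is used.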
Match the box plot shown to the correct histogram.