
11.03 The spread of data sets

Lesson

Introduction

Measures of spread in a quantitative (numerical) data set seek to describe whether the scores in a data set are very similar and clustered together, or whether there is a lot of variation in the scores and they are very spread out.

In this section, we will look at the range and interquartile range as measures of spread. We will also explore how to compare data sets using parallel box plots.

Spread and parallel box plots

The range is the simplest measure of spread in a quantitative (numerical) data set. It is the difference between the maximum and minimum scores in a data set.

We subtract the lowest score in the set from the highest score in the set. That is: \text{Range}=\text{Highest score}-\text{Lowest score}

For example, at one school the ages of students in Year 7 vary between 11 and 14. So the range for this set is 14-11=3.

As a different example, if we looked at the ages of people waiting at a bus stop, the youngest person might be a 7 year old and the oldest person might be a 90 year old. The range of this set of data is 90-7=83, which is a much larger range of ages.
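The two range calculations above can be sketched in Python. This is only an illustration: the function name `data_range` and the individual ages in each list are hypothetical, and only the minimum and maximum ages come from the examples in the text.

```python
def data_range(scores):
    """Return the highest score minus the lowest score."""
    return max(scores) - min(scores)

# Hypothetical ages: only the youngest and oldest values are taken from the text.
year_7_ages = [11, 12, 12, 13, 13, 14]
bus_stop_ages = [7, 19, 34, 52, 90]

print(data_range(year_7_ages))    # 14 - 11 = 3
print(data_range(bus_stop_ages))  # 90 - 7 = 83
```

Notice that the scores between the minimum and maximum have no effect on the result, which is why the range gives only a limited picture of spread.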

Whilst the range is very simple to calculate, it is based on the sparse information provided by the upper and lower limits of the data set. To get a better picture of the internal spread in a data set, it is often more useful to find the set's quartiles, from which the interquartile range (IQR) can be calculated.

To calculate the interquartile range: \text{Interquartile range}=Q_{3}-Q_{1}

Quartiles are scores at particular locations in the data set - similar to the median, but instead of dividing a data set into halves, they divide a data set into quarters.
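As a rough sketch of how quartiles and the interquartile range might be computed, the Python below assumes the "median of each half" method, where the middle score is left out of both halves when the number of scores is odd. Other conventions exist and can give slightly different quartiles. The function names and the sample scores are hypothetical.

```python
from statistics import median

def quartiles(scores):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # For an odd number of scores, skip the middle score itself.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower_half), median(ordered), median(upper_half)

def interquartile_range(scores):
    """Return Q3 - Q1."""
    q1, _, q3 = quartiles(scores)
    return q3 - q1

# Hypothetical data set with seven scores.
scores = [1, 3, 5, 7, 9, 11, 13]
print(quartiles(scores))            # (3, 7, 11)
print(interquartile_range(scores))  # 8
```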

Parallel box plots are used to compare two sets of data visually. When comparing box plots, we compare the five-number summary values of each data set (the quartiles and minimum and maximum values) and also the range and interquartile range. The two box plots are drawn parallel to each other.

It is important to clearly label each box plot. Below are parallel box plots comparing the time it took two different groups of people to complete an online task.

Two box plots comparing the time with a number line below labeled as seconds. Ask your teacher for more information.

We can see that overall the under 30s were faster at completing the task. Both the under 30s box plot and the over 30s box plot are slightly negatively skewed. Over 75\% of the under 30s completed the task in under 22s, which is the median time taken by the over 30s. 100\% of the under 30s had finished the task before 75\% of the over 30s had completed it.

Overall the under 30s performed better and had a smaller spread of scores. There was more variation within the over 30s group, which had a range of 24 seconds compared to 20 seconds for the under 30s.

Examples

Example 1

The two box plots below show the data collected by the manufacturers on the life-span of light bulbs, measured in thousands of hours.

Two box plots showing the data between Manufacturers A and B. Ask your teacher for more information.
a

Complete the following table using the two box plots. Write each answer in terms of hours (remember to multiply the values on the data display by 1000).

| | Manufacturer A | Manufacturer B |
|---|---|---|
| Median | | |
| Lower quartile | | |
| Upper quartile | | |
| Range | | |
| Interquartile range | | |
Worked Solution
Create a strategy
  • To find the lower quartile, median, and upper quartile, read the values at the left edge of the box, the line inside the box, and the right edge of the box, respectively.

  • For the range, use the formula: \text{Range}=\text{Highest score}-\text{Lowest score}

  • For the interquartile range, use the formula: \text{IQR}=Q_{3}-Q_{1}

Apply the idea

Since the scale is in thousands of hours, we can multiply the numbers on the scale by a thousand to find the number of hours.

| | Manufacturer A | Manufacturer B |
|---|---|---|
| Median | 4\times 1000 = 4000 | 5\times 1000 = 5000 |
| Lower quartile | 2.5\times 1000 = 2500 | 3.5\times 1000 = 3500 |
| Upper quartile | 4.5\times 1000 = 4500 | 6\times 1000 = 6000 |
| Range | (5-1)\times 1000 = 4000 | (8-1.5)\times 1000 = 6500 |
| Interquartile range | (4.5-2.5)\times 1000 = 2000 | (6-3.5)\times 1000 = 2500 |
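The same calculations can be checked with a short Python sketch. The five-number summaries below are the values used in the worked solution, in thousands of hours. The dictionary layout and the function name `spread_in_hours` are illustrative choices, not part of the lesson.

```python
# Five-number summaries from the worked solution, in thousands of hours.
summaries = {
    "Manufacturer A": {"min": 1, "q1": 2.5, "median": 4, "q3": 4.5, "max": 5},
    "Manufacturer B": {"min": 1.5, "q1": 3.5, "median": 5, "q3": 6, "max": 8},
}

def spread_in_hours(summary, scale=1000):
    """Convert the readings to hours and compute the range and IQR."""
    return {
        "median": summary["median"] * scale,
        "lower quartile": summary["q1"] * scale,
        "upper quartile": summary["q3"] * scale,
        "range": (summary["max"] - summary["min"]) * scale,
        "interquartile range": (summary["q3"] - summary["q1"]) * scale,
    }

results = {name: spread_in_hours(s) for name, s in summaries.items()}
for name, values in results.items():
    print(name, values)
```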
b

Which manufacturer produces light bulbs with the best lifespan?

Worked Solution
Create a strategy

Choose the manufacturer with the greater median.

Apply the idea

The data set for Manufacturer B has a median of 5000 hours, while the median of the data set for Manufacturer A is 4000 hours. So Manufacturer B produces light bulbs with the best lifespan.

Reflect and check

In fact, the best light bulb produced by Manufacturer A has a lifespan of 5000 hours, which is the same as the median of Manufacturer B. This means that about half of the light bulbs produced by Manufacturer B have a greater lifespan than all of the light bulbs produced by Manufacturer A.

Example 2

The box plots show the number of goals scored by two football players in each season.

Two box plots showing the data of goals between Sophie and Holly. Ask your teacher for more information.
a

Who scored the most goals in a season?

Worked Solution
Create a strategy

Compare the maximum value of both box plots.

Apply the idea

By looking at the endpoints of the right whiskers, Sophie's best season was 18 goals, while Holly's best season was 19 goals. So Holly scored the most goals in a season.

b

How many more goals did Holly score in her best season compared to Sophie in her best season?

Worked Solution
Create a strategy

Subtract Sophie's maximum from Holly's maximum.

Apply the idea
\begin{aligned}
\text{Number of goals} &= 19-18 && \text{Subtract 18 from 19} \\
&= 1 && \text{Evaluate the subtraction}
\end{aligned}

Holly scored 1 more goal in her best season.

c

What is the difference between the median number of goals scored in a season by each player?

Worked Solution
Create a strategy

Find the difference of the medians.

Apply the idea

Sophie's median is 11 and Holly's median is 10.

\begin{aligned}
\text{Difference} &= 11-10 && \text{Subtract 10 from 11} \\
&= 1 && \text{Evaluate the subtraction}
\end{aligned}

d

What is the difference between the interquartile ranges of the two players?

Worked Solution
Create a strategy

Use the formula: \text{IQR}=Q_{3}-Q_{1}

Apply the idea
\begin{aligned}
\text{Sophie's IQR} &= 14-7 && \text{Substitute the quartiles} \\
&= 7 && \text{Evaluate} \\
\text{Holly's IQR} &= 15-6 && \text{Substitute the quartiles} \\
&= 9 && \text{Evaluate} \\
\text{Difference} &= 9-7 && \text{Subtract the IQRs} \\
&= 2 && \text{Evaluate}
\end{aligned}

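The comparisons in Example 2 can be reproduced with a small Python sketch. The quartiles, medians and maximums are the values read from the box plots in the worked solutions. The dictionary layout and the helper name `iqr` are assumptions made for illustration.

```python
# Values read from the box plots in the worked solutions (goals per season).
players = {
    "Sophie": {"q1": 7, "median": 11, "q3": 14, "max": 18},
    "Holly": {"q1": 6, "median": 10, "q3": 15, "max": 19},
}

def iqr(summary):
    """Interquartile range: Q3 - Q1."""
    return summary["q3"] - summary["q1"]

# Part b: difference between the best seasons.
max_difference = players["Holly"]["max"] - players["Sophie"]["max"]
# Part c: difference between the medians.
median_difference = players["Sophie"]["median"] - players["Holly"]["median"]
# Part d: difference between the interquartile ranges.
iqr_difference = iqr(players["Holly"]) - iqr(players["Sophie"])

print(max_difference, median_difference, iqr_difference)  # 1 1 2
```

Holly has the higher maximum and the larger IQR, while Sophie has the higher median, so which player looks "better" depends on the measure being compared.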
Idea summary

When comparing box plots, we compare the five-number summary values of each data set (the quartiles and minimum and maximum values) and the range and interquartile range.

Different graphs of the same data

We have already seen how data can be displayed in histograms and in box plots. These two displays are useful for identifying key features of the shape of the data, as well as the range and, in the case of the box plot, the interquartile range and the median.

We should therefore expect the shape of the data to be the same whether it is represented in a frequency polygon, box plot or histogram. Remember that the shape of data can be symmetric, negatively (left) skewed or positively (right) skewed. This is shown below:

All representations of symmetric data:

The image shows a symmetric curve.

Common shape of a symmetric distribution

The image shows a histogram with approximately symmetric data. Ask your teacher for more information.

Histogram of approximately symmetric data

The image shows box plot for symmetric data. The box plot is symmetric about the median.

Box plot of symmetric data

All representations of positively skewed data:

The image shows a curve that has a longer tail to the right.

General shape of positively skewed data

The image shows a histogram with high columns on the left. Ask your teacher for more information.

Histogram of positively skewed data

The image shows a box plot with a long right whisker and short left whisker.

Box plot of positively skewed data

All representations of negatively skewed data:

The image shows a curve that has a longer tail to the left.

General shape of negatively skewed data

The image shows a histogram with high frequency columns on the right. Ask your teacher for more information.

Histogram of negatively skewed data

The image shows a box plot with a long left whisker and short right whisker.

Box plot of negatively skewed data

Looking at the diagrams above, can you see the similarities in the representations?

We can see the skewed tails, where the bulk of the data sits, and the general shape. These are some of the features you can use to match histograms and box plots. We can also look at the range of the data.
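One way to make the whisker comparison concrete is the rough Python heuristic below. It is a sketch under stated assumptions, not a formal test of skewness: it compares only the two whisker lengths, and the function name `describe_shape`, the `tolerance` parameter and the sample five-number summaries are all hypothetical.

```python
def describe_shape(minimum, q1, median, q3, maximum, tolerance=0.1):
    """Guess the shape of a data set from the whiskers of its box plot.

    The whiskers count as equal if they differ by no more than
    `tolerance` times the range (an arbitrary threshold for this sketch).
    The median is accepted for completeness but not used here.
    """
    left_whisker = q1 - minimum
    right_whisker = maximum - q3
    data_range = maximum - minimum
    if abs(right_whisker - left_whisker) <= tolerance * data_range:
        return "symmetric"
    if right_whisker > left_whisker:
        return "positively skewed"  # long right whisker
    return "negatively skewed"      # long left whisker

# Hypothetical five-number summaries on a scale from 0 to 10.
print(describe_shape(0, 4, 5, 6, 10))  # symmetric
print(describe_shape(0, 1, 2, 4, 10))  # positively skewed
print(describe_shape(0, 6, 8, 9, 10))  # negatively skewed
```

A histogram of the same data should show the same feature: a tail of low columns on the side where the box plot has its long whisker.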

Examples

Example 3

Match the box plot shown to the correct histogram.

A box plot drawn above a number line marked from 0 to 10.

A
The image shows a histogram with high columns on the left. Ask your teacher for more information.
B
The image shows a histogram with symmetric data. Ask your teacher for more information.
C
The image shows a histogram with high columns on the right. Ask your teacher for more information.
D
The image shows a histogram with high columns on the right. Ask your teacher for more information.
Worked Solution
Create a strategy

Look for characteristics of skewed and symmetric distributions of data.

Apply the idea

The given box plot shows symmetric data. Only option B shows symmetric data. The correct answer is option B.

Idea summary

We should expect the shape of the data to be the same whether it is represented in a frequency polygon, box plot or histogram. Remember that the shape of data can be symmetric, negatively (left) skewed or positively (right) skewed.

Outcomes

VCMSP351

Compare shapes of box plots to corresponding histograms and dot plots and discuss the distribution of data
