# 11.03 The spread of data sets


Measures of spread in a quantitative (numerical) data set seek to describe whether the scores in a data set are very similar and clustered together, or whether there is a lot of variation in the scores and they are very spread out.

In this section, we will look at the range and interquartile range as measures of spread. We will also explore how to compare data sets using parallel box plots.

## Range

The range is the simplest measure of spread in a quantitative (numerical) data set. It is the difference between the maximum and minimum scores in a data set.

To calculate the range, subtract the lowest score in the set from the highest score in the set. That is,

$\text{Range}=\text{highest score}-\text{lowest score}$

For example, at one school the ages of students in Year $7$ vary between $11$ and $14$. So the range for this set is $14-11=3$.

As a different example, if we looked at the ages of people waiting at a bus stop, the youngest person might be a $7$ year old and the oldest person might be a $90$ year old. The range of this set of data is $90-7=83$, which is a much larger range of ages.
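The two calculations above can be sketched in a few lines of Python. The age lists are hypothetical: only their lowest and highest values come from the examples, and the function name `data_range` is just an illustrative choice.

```python
def data_range(scores):
    """Return the range: highest score minus lowest score."""
    return max(scores) - min(scores)

# Hypothetical ages; only the minimum and maximum match the examples above
year_7_ages = [11, 12, 12, 13, 13, 14]
bus_stop_ages = [7, 19, 34, 35, 52, 90]

print(data_range(year_7_ages))    # 3
print(data_range(bus_stop_ages))  # 83
```

Notice that the scores in the middle of each list have no effect on the result, since the range only uses the two extreme values.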

## Interquartile range

Whilst the range is very simple to calculate, it is based only on the sparse information provided by the upper and lower limits of the data set. To get a better picture of the internal spread in a data set, it is often more useful to find the set's quartiles, from which the interquartile range (IQR) can be calculated. The interquartile range is the difference between the upper quartile and the lower quartile.

Quartiles are scores at particular locations in the data set - similar to the median, but instead of dividing a data set into halves, they divide a data set into quarters. Let's look at how we would divide up some data sets into quarters now.
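As a minimal sketch of one way to find quartiles, the code below splits the ordered data into a lower half and an upper half and takes the median of each. It assumes the convention that the middle score is left out of both halves when the number of scores is odd; other conventions exist and can give slightly different values. The data sets and function names are hypothetical, and the data set needs at least two scores.

```python
from statistics import median

def quartiles(scores):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumption: with an odd number of scores, the middle score is
    left out of both halves.
    """
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

def interquartile_range(scores):
    """Return the IQR: upper quartile minus lower quartile."""
    q1, _, q3 = quartiles(scores)
    return q3 - q1

even_set = [1, 3, 4, 6, 7, 8, 10, 12]   # hypothetical data, 8 scores
odd_set = [2, 4, 5, 7, 9, 11, 13]       # hypothetical data, 7 scores

print(quartiles(even_set))            # (3.5, 6.5, 9.0)
print(interquartile_range(odd_set))   # 7
```

The three values returned by `quartiles` cut the ordered data into four groups of equal size, in the same way that the median cuts it into two.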

### Parallel box plots

Parallel box plots are used to compare two sets of data visually. When comparing box plots, we compare the five-number summary values of each data set (the quartiles and minimum and maximum values) and also the range and interquartile range. The two box plots are drawn parallel to each other.

It is important to clearly label each box plot. Below are parallel box plots comparing the time it took two different groups of people to complete an online task.

You can see that overall the under $30$s were faster at completing the task. Both the under $30$s box plot and the over $30$s box plot are slightly negatively skewed. Over $75\%$ of the under $30$s completed the task in under $22$ seconds, which is the median time taken by the over $30$s. $100\%$ of the under $30$s had finished the task before $75\%$ of the over $30$s had completed it.
Overall the under $30$s performed better and had a smaller spread of scores. There was more variation within the over $30$ group, with a range of $24$ seconds compared to $20$ seconds for the under $30$s.
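The kind of comparison described above can also be done numerically. The sketch below computes a five-number summary, range and interquartile range for two groups. The completion times are hypothetical values, not read from the box plots; they were only chosen so that the ranges are $20$ and $24$ seconds. The quartiles assume the median-of-halves convention, which leaves out the middle score when the number of scores is odd.

```python
from statistics import median

def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: quartiles are the medians of the lower and upper halves,
    leaving out the middle score when the number of scores is odd.
    """
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

def spread(scores):
    """Return (range, interquartile range) for a data set."""
    minimum, q1, _, q3, maximum = five_number_summary(scores)
    return maximum - minimum, q3 - q1

# Hypothetical completion times in seconds (not taken from the box plots)
under_30 = [8, 12, 15, 17, 18, 20, 21, 24, 28]
over_30 = [14, 17, 19, 21, 22, 25, 29, 33, 38]

for label, times in [("Under 30s", under_30), ("Over 30s", over_30)]:
    print(label, five_number_summary(times), spread(times))
```

For these made-up times, the over $30$ group has both the larger range and the larger interquartile range, which is the same kind of conclusion we draw by eye from parallel box plots.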

#### Worked example

##### Example 1

The box plots show the distances, in centimetres, jumped by two high jumpers.

a) Who has a higher median jump?

Think: The median is shown by the line in the middle of the box. Whose median line has a higher value?

Do: John

b) Who made the highest jump?

Think: The highest jump is the end of the whisker for each jumper. Bill doesn't have an upper whisker as his highest jump was the same as the upper quartile height. Whose jump was the highest?

Do: John

c) Who made the lowest jump?

Think: The lowest jump is shown on each box plot by the lower whisker.

Do: Both John and Bill had the same lowest jump of $60$ cm.

### Different graphs of the same data

We have already seen how data can be displayed in histograms and in box plots. These two displays are great for identifying key features of the shape of the data, as well as the range and, in the case of the box plot, the interquartile range and the median.

We should expect then that the shape of data would be the same whether it is represented in a polygon, box plot or histogram. Remember that the shape of data can be symmetric, left skewed or right skewed. This is shown below:

 All representations of symmetric data
 All representations of positively skewed data
 All representations of negatively skewed data

Looking at the diagrams above, can you see the similarities in the representations?

We can see the skewed tails, where the bulk of the data sits, and the general shape. These are some of the features you can use to match histograms and box-and-whisker plots. You can also look at the data range.
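One of these features, the direction of the longer tail, can be turned into a rough rule of thumb. The sketch below is a heuristic only, not a formal test for skewness: it compares the distance from the median to the maximum with the distance from the median to the minimum. The function name, the `tolerance` parameter and the sample summaries are all hypothetical.

```python
def describe_shape(summary, tolerance=0.1):
    """Guess the shape of a data set from its five-number summary.

    summary is (minimum, Q1, median, Q3, maximum). Only the minimum,
    median and maximum are used in this rough heuristic.
    """
    minimum, _q1, med, _q3, maximum = summary
    total = maximum - minimum
    if total == 0:
        return "no spread"
    lower_side = med - minimum   # how far the data stretches to the left
    upper_side = maximum - med   # how far the data stretches to the right
    difference = (upper_side - lower_side) / total
    if difference > tolerance:
        return "positively skewed"   # longer tail on the right
    if difference < -tolerance:
        return "negatively skewed"   # longer tail on the left
    return "approximately symmetric"

print(describe_shape((0, 4, 5, 6, 10)))    # approximately symmetric
print(describe_shape((1, 2, 3, 6, 12)))    # positively skewed
print(describe_shape((0, 6, 9, 10, 11)))   # negatively skewed
```

A histogram of the same data should show its longer tail on the same side, which is one way to check a match between a box plot and a histogram.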

Let's see if you can match histograms to their correct box plot representation.

#### Practice questions

##### Question 1

The two box plots below show the data collected by the manufacturers on the life-span of light bulbs, measured in thousands of hours.

1. Complete the following table using the two box plots. Write each answer in terms of hours.

| | Manufacturer A | Manufacturer B |
|---|---|---|
| Median | | |
| Lower Quartile | | |
| Upper Quartile | | |
| Range | | |
| Interquartile Range | | |

2. Hence, which manufacturer produces light bulbs with the best lifespan?

A. Manufacturer A

B. Manufacturer B

##### Question 2

Match the box plot shown to the correct histogram.


1. A

B

C

D


##### Question 3

The box plots show the number of goals scored by two football players in each season.

1. Who scored the most goals in a season?

A. Sophie

B. Holly

2. How many more goals did Holly score in her best season compared to Sophie in her best season?

3. What is the difference between the median number of goals scored in a season by each player?

4. What is the difference between the interquartile range for both players?

### Outcomes

#### VCMSP351

Compare shapes of box plots to corresponding histograms and dot plots and discuss the distribution of data