7. Statistics

Lesson

Measures of spread in a numerical data set seek to describe whether the scores in a data set are very similar and clustered together, or whether there is a lot of variation in the scores and they are very spread out.

In this section, we will look at the **range** and **interquartile range** as measures of spread.

The range is the simplest measure of spread in a numerical data set. It is the difference between the maximum and minimum scores in a data set.

Two bus drivers, Kenji and Björn, track how many passengers board their buses each day for a week. Their results are displayed in this table:

| Driver | Mon | Tue | Wed | Thu | Fri |
|---|---|---|---|---|---|
| Kenji | $10$ | $13$ | $14$ | $16$ | $11$ |
| Björn | $2$ | $27$ | $13$ | $5$ | $17$ |

Both data sets have the same median and the same mean, but the sets are quite different. To calculate the range, we start by finding the greatest and least number of passengers for each driver:

| Driver | Greatest | Least |
|---|---|---|
| Kenji | $16$ | $10$ |
| Björn | $27$ | $2$ |

Now we subtract the least from the greatest to find the difference, which is the **range**:

| Driver | Range |
|---|---|
| Kenji | $16-10=6$ |
| Björn | $27-2=25$ |

Notice how Kenji's range is quite small, at least compared to Björn's. We might say that Kenji's route is more predictable and that Björn's route is much more variable.

We can see that the range does not say anything about the size of the scores, just their spread.

Summary

The range of a numerical data set is the difference between the greatest and the least score.

$\text{Range}=\text{Greatest score}-\text{Least score}$
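As a quick check of the bus example, the range calculation can be sketched in Python. This is only an illustrative sketch, not part of the lesson's method: the function name `data_range` and the variable names are assumptions chosen for this example, while `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median


def data_range(scores):
    """Return the greatest score minus the least score."""
    return max(scores) - min(scores)


# Daily passenger counts from the table above (one value per day).
kenji = [10, 13, 14, 16, 11]
bjorn = [2, 27, 13, 5, 17]

for name, scores in [("Kenji", kenji), ("Björn", bjorn)]:
    # The mean and median agree for both drivers; only the range differs.
    print(name, "mean:", mean(scores), "median:", median(scores),
          "range:", data_range(scores))
```

Both drivers come out with the same mean and median, while the ranges are $6$ and $25$, which is why a measure of spread is needed to tell the two data sets apart.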

To get a better picture of the internal spread in a data set, it is often more useful to find the set's quartiles, from which the interquartile range (IQR) can be calculated.

Quartiles are scores at particular locations in the data set. They are similar to the **median**, but instead of dividing a data set into halves, they divide it into quarters. Let's look at how we would divide some data sets into quarters.

Careful!

Make sure the data set is ordered before finding the quartiles or the median.

- Here is a data set with $8$ scores:

| $1$ | $3$ | $4$ | $7$ | $11$ | $12$ | $14$ | $19$ |

First locate the median, between the $4$th and $5$th scores:

| | | | | Median $\downarrow$ | | | | |
|---|---|---|---|---|---|---|---|---|
| $1$ | $3$ | $4$ | $7$ | | $11$ | $12$ | $14$ | $19$ |

Now there are four scores in each half of the data set, so split each half in two to find the quartiles. The first quartile, $Q_1$, is between the $2$nd and $3$rd scores; that is, the lower half has two scores on either side of $Q_1$. Similarly, the third quartile, $Q_3$, is between the $6$th and $7$th scores:

| | | $Q_1$ $\downarrow$ | | | Median $\downarrow$ | | | $Q_3$ $\downarrow$ | | |
|---|---|---|---|---|---|---|---|---|---|---|
| $1$ | $3$ | | $4$ | $7$ | | $11$ | $12$ | | $14$ | $19$ |

- Now let's look at a situation with $9$ scores:

| | | $Q_1$ $\downarrow$ | | | Median $\downarrow$ | | | $Q_3$ $\downarrow$ | | |
|---|---|---|---|---|---|---|---|---|---|---|
| $8$ | $8$ | | $10$ | $11$ | $13$ | $14$ | $18$ | | $22$ | $25$ |

This time, the $5$th score is the median. There are four scores on either side of the median, as in the set with eight scores. So $Q_1$ is still between the $2$nd and $3$rd scores. The upper half is now the $6$th to $9$th scores, so $Q_3$ is between the $7$th and $8$th scores.

- Finally, let's look at a set with $10$ scores:

| | | $Q_1$ $\downarrow$ | | | Median $\downarrow$ | | | $Q_3$ $\downarrow$ | | |
|---|---|---|---|---|---|---|---|---|---|---|
| $12$ | $13$ | $14$ | $19$ | $19$ | | $21$ | $22$ | $22$ | $28$ | $30$ |

For this set, the median is between the $5$th and $6$th scores. This time, however, there are $5$ scores on either side of the median. So $Q_1$ is the $3$rd score and $Q_3$ is the $8$th score.
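The three cases above can be sketched with one small Python function. It assumes the method used in this lesson: order the data, split it into a lower and an upper half (leaving out the middle score when the number of scores is odd), then take the median of each half. The name `quartiles` is an illustrative choice, and other textbooks and software packages define quartiles slightly differently, so their results may not always match.

```python
from statistics import median


def quartiles(scores):
    """Return (Q1, median, Q3) using the median-of-halves method.

    Assumes at least two scores. For an odd number of scores,
    the middle score is left out of both halves.
    """
    ordered = sorted(scores)          # quartiles need ordered data
    n = len(ordered)
    half = n // 2                     # number of scores in each half
    lower = ordered[:half]            # scores below the median position
    upper = ordered[n - half:]        # scores above the median position
    return median(lower), median(ordered), median(upper)


eight = [1, 3, 4, 7, 11, 12, 14, 19]
nine = [8, 8, 10, 11, 13, 14, 18, 22, 25]
ten = [12, 13, 14, 19, 19, 21, 22, 22, 28, 30]

for data in (eight, nine, ten):
    q1, q2, q3 = quartiles(data)
    print(len(data), "scores -> Q1:", q1, "median:", q2, "Q3:", q3)
```

For the set with ten scores, each half has five scores, so the function returns actual scores from the data set ($14$ and $22$). For the sets with eight and nine scores, each half has four scores, so each quartile is the average of two neighbouring scores.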

The quartiles split an ordered data set into four sections, each containing approximately $25\%$ of the scores. The least score to the first quartile is approximately $25\%$ of the data, the first quartile to the median is another $25\%$, the median to the third quartile is another $25\%$, and the third quartile to the greatest score represents the last $25\%$ of the data. We can combine these sections; for example, approximately $50\%$ of the scores in a data set lie between the first and third quartiles.

Quartiles can also be described as percentiles. A percentile is a value below which a given percentage of the observations in a data set fall. For example, a score at the $75$th percentile in a statistical test is higher than approximately $75\%$ of all scores. The median represents the $50$th percentile, or the halfway point in a data set.

- $Q_1$ is the first quartile (sometimes called the lower quartile). It is the middle score in the bottom half of the data set, and it represents the $25$th percentile.
- $Q_2$ is the second quartile, and is usually called the median, which we have already learned about. It represents the $50$th percentile of the data set.
- $Q_3$ is the third quartile (sometimes called the upper quartile). It is the middle score in the top half of the data set, and it represents the $75$th percentile.

The interquartile range (IQR) is the difference between the **third quartile** and the **first quartile**. Approximately $50\%$ of the scores lie within the IQR because it spans the data from the first quartile to the median, as well as from the median to the third quartile.

Since it focuses on the middle $50\%$ of the data set, the interquartile range often gives a better indication of the internal spread than the range does. It is also less affected by outliers, which are individual scores that are unusually high or low.

To calculate the interquartile range

Subtract the first quartile from the third quartile. That is,

$\text{IQR}=Q_3-Q_1$
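To see why the IQR is less sensitive to unusual scores than the range, here is a small Python sketch. It assumes the median-of-halves method from this lesson, and the function names `iqr` and `data_range` are illustrative. The second list is a hypothetical variation of the nine-score set from earlier, with the greatest score changed from $25$ to $250$ purely for illustration.

```python
from statistics import median


def data_range(scores):
    """Greatest score minus least score."""
    return max(scores) - min(scores)


def iqr(scores):
    """Q3 - Q1, with each quartile taken as the median of one half.

    Assumes at least two scores; the middle score is left out of
    both halves when the number of scores is odd.
    """
    ordered = sorted(scores)
    half = len(ordered) // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[len(ordered) - half:])
    return q3 - q1


original = [8, 8, 10, 11, 13, 14, 18, 22, 25]
with_outlier = [8, 8, 10, 11, 13, 14, 18, 22, 250]  # hypothetical outlier

for label, data in [("original", original), ("with outlier", with_outlier)]:
    print(label, "range:", data_range(data), "IQR:", iqr(data))
```

Changing one score moves the range from $17$ to $242$, while the IQR stays at $11$ because the quartiles depend only on the scores near the middle of each half.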

Consider the following set of data: $1,1,3,5,7,9,9,10,15$.

**(a)** Identify the median.

**Think:** There are nine numbers in the set, so we can say that $n=9$. We can also see that the data set is already arranged in ascending order. We identify the median as the **middle** score either by the "cross-out" method or as the $\frac{n+1}{2}$th score.

**Do:**

| $\text{Position of median}$ | $=$ | $\frac{9+1}{2}$ | Substituting $n=9$ |
| | $=$ | $5$th score | Simplifying the fraction |

Counting through the set to the $5$th score gives us $7$ as the median.

**(b)** Identify $Q_1$ (the lower quartile) and $Q_3$ (the upper quartile).

**Think:** We identify $Q_1$ and $Q_3$ as the middle scores in the lower and upper halves of the data set respectively. We can use the "cross-out" method, or any other method for finding the median, applied to just the lower or upper half of the data set.

**Do:** The lower half of the data set is all the scores to the left of the median, which is $1,1,3,5$. There are four scores here, so $n=4$. So we can find the position of $Q_1$ as follows:

| $\text{Position of }Q_1$ | $=$ | $\frac{4+1}{2}$ | Substituting $n=4$ |
| | $=$ | $2.5$th score | Simplifying the fraction |

$Q_1$ is therefore the mean of the $2$nd and $3$rd scores. So we see that:

| $Q_1$ | $=$ | $\frac{1+3}{2}$ | Taking the average of the $2$nd and $3$rd scores |
| | $=$ | $2$ | Simplifying the fraction |

The upper half of the data set is all the scores to the right of the median, which is $9,9,10,15$. Since there are also $n=4$ scores, $Q_3$ will be the mean of the $2$nd and $3$rd scores in this upper half.

| $Q_3$ | $=$ | $\frac{9+10}{2}$ | Taking the average of the $2$nd and $3$rd scores in the upper half |
| | $=$ | $9.5$ | Simplifying the fraction |

**(c)** Calculate the $\text{IQR}$ of the data set.

**Think:** Remember that $\text{IQR}=Q_3-Q_1$, and we just found $Q_1$ and $Q_3$.

**Do:**

| $\text{IQR}$ | $=$ | $9.5-2$ | Substituting $Q_3=9.5$ and $Q_1=2$ |
| | $=$ | $7.5$ | Simplifying the subtraction |
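The position-based working in this example can be mirrored in a short Python sketch. It follows the same steps: find the $\frac{n+1}{2}$th position, and average the two neighbouring scores when that position ends in $.5$. The helper names `score_at_position` and `middle_score` are illustrative assumptions, not standard functions.

```python
def score_at_position(ordered, position):
    """Return the score at a 1-based position in an ordered list.

    A position ending in .5 (such as 2.5) gives the mean of the
    scores on either side of it.
    """
    lower_index = int(position) - 1       # 0-based index at or just below the position
    if position == int(position):         # whole-number position: a single score
        return ordered[lower_index]
    return (ordered[lower_index] + ordered[lower_index + 1]) / 2


def middle_score(ordered):
    """Return the score at position (n + 1) / 2 of an ordered list."""
    n = len(ordered)
    return score_at_position(ordered, (n + 1) / 2)


data = [1, 1, 3, 5, 7, 9, 9, 10, 15]      # already in ascending order
n = len(data)

med = middle_score(data)                  # position (9 + 1) / 2 = 5
lower_half = data[: n // 2]               # scores to the left of the median
upper_half = data[n - n // 2:]            # scores to the right of the median
q1 = middle_score(lower_half)             # position (4 + 1) / 2 = 2.5
q3 = middle_score(upper_half)
iqr = q3 - q1

print("median:", med, "Q1:", q1, "Q3:", q3, "IQR:", iqr)
```

Running this reproduces the values found by hand: a median of $7$, $Q_1=2$, $Q_3=9.5$ and an IQR of $7.5$.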

Practice questions

Look at the data sets below. Which data set has the largest range?

A. $101,105,118,129,136$

B. $19,23,25,28,29$

C. $22,25,43,64$

D. $104,107,113,120,125$

Answer the following, given this set of scores:

$33,38,50,12,33,48,41$

Sort the scores in ascending order.

Find the number of scores.

Find the median.

Find the first quartile of the set of scores.

Find the third quartile of the set of scores.

Find the interquartile range.

Understand that a set of data collected to answer a statistical question has a distribution which can be described by its center, spread, and overall shape.

Recognize that a measure of center for a numerical data set summarizes all of its values with a single number, while a measure of variation describes how its values vary with a single number.

Summarize numerical data sets in relation to their context.

Find the quantitative measures of center (median and/or mean) for a numerical data set and recognize that this value summarizes the data set with a single number. Interpret mean as an equal or fair share. Find measures of variability (range and interquartile range) as well as informally describe the shape and the presence of clusters, gaps, peaks, and outliers in a distribution.