Standard Level

12.05 Spread of data

Lesson

Measures of spread in a quantitative (numerical) data set seek to describe whether the scores in a data set are very similar and clustered together, or whether there is a lot of variation in the scores and they are very spread out.

There are several methods to describe the spread of data, which vary greatly in complexity. We can simply look at the numerical range of the entire data set, or we can break the data into chunks (such as quartiles, deciles, or percentiles) to examine limited ranges within. We can also compare the spread of data to the mean, which can then be normalised for a meaningful comparison to other data sets.

In this section, we will look at the range, interquartile range, standard deviation and variance as measures of spread. We will also explore how to break data into quantiles of any number, but particularly quartiles, deciles, and percentiles.

 

Range

The range is the simplest measure of spread in a quantitative (numerical) data set. It is the difference between the maximum and minimum scores in a data set.

To calculate the range

Subtract the lowest score in the set from the highest score in the set. That is,

$\text{Range}=\text{highest score}-\text{lowest score}$

For example, at one school the ages of students in Year $7$ vary between $11$ and $14$. So the range for this set is $14-11=3$.

As a different example, if we looked at the ages of people waiting at a bus stop, the youngest person might be a $7$ year old and the oldest person might be a $90$ year old. The range of this set of data is $90-7=83$, which is a much larger range of ages.

 

How do scores affect range?

Remember, the range only changes if the highest or lowest score in a data set is changed. Otherwise it will remain the same.
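
The idea above can be checked with a minimal Python sketch. The function name `data_range` and the individual ages are made up for illustration; only the lowest age of $11$ and the highest age of $14$ come from the Year $7$ example.

```python
def data_range(scores):
    """Return the range: highest score minus lowest score."""
    return max(scores) - min(scores)

# Hypothetical Year 7 ages; only the extremes (11 and 14) come from the lesson.
ages = [11, 12, 12, 13, 13, 14]
print(data_range(ages))  # 3

# Changing a middle score leaves the range unchanged.
middle_changed = [11, 12, 13, 13, 13, 14]
print(data_range(middle_changed))  # 3

# Changing the highest score changes the range.
highest_changed = [11, 12, 12, 13, 13, 16]
print(data_range(highest_changed))  # 5
```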

 

Practice questions

Question 1

Find the range of the following set of scores:

$10,19,19,7,20,14,2,11$

Question 2

The range of a set of scores is $8$, and the highest score is $19$.

What is the lowest score in the set?

Question 3

In a study, a group of people were shown $30$ names, and after $1$ minute they were asked to recite as many names from memory as possible. The results are presented in the dot plot.

  1. Each dot represents:

    A. One person in the group

    B. One name remembered
  2. How many people took part in the study?

  3. What is the largest number of names someone remembered?

  4. What was the smallest number of names someone remembered?

  5. What is the range?

 

Interquartile range

Whilst the range is very simple to calculate, it is based on the sparse information provided by the upper and lower limits of the data set. To get a better picture of the internal spread in a data set, it is often more useful to find the set's quartiles, from which the interquartile range (IQR) can be calculated.

Quartiles are scores at particular locations in the data set - similar to the median, but instead of dividing a data set into halves, they divide a data set into quarters. Let's look at how we would divide up some data sets into quarters now.

Careful!

Make sure the data set is ordered before finding the quartiles or the median.

 

Exploration

  • Here is a data set with $8$8 scores:
$1 \quad 3 \quad 4 \quad 7 \quad 11 \quad 12 \quad 14 \quad 19$

 

First locate the median, between the $4$th and $5$th scores:

$1 \quad 3 \quad 4 \quad 7 \quad \overset{\text{Median}}{\downarrow} \quad 11 \quad 12 \quad 14 \quad 19$

 

Now there are $4$ scores in each half of the data set, so split each group of four scores in half to find the quartiles. We can see that the first quartile ($Q_1$) is between the $2$nd and $3$rd scores; there are two scores from the lower half on either side of $Q_1$. Similarly, the upper quartile ($Q_3$) is between the $6$th and $7$th scores:

$1 \quad 3 \quad \overset{Q_1}{\downarrow} \quad 4 \quad 7 \quad \overset{\text{Median}}{\downarrow} \quad 11 \quad 12 \quad \overset{Q_3}{\downarrow} \quad 14 \quad 19$

 

 

  • Now let's look at a situation with $9$ scores:
$8 \quad 8 \quad \overset{Q_1}{\downarrow} \quad 10 \quad 11 \quad \overset{\text{Median}}{13} \quad 14 \quad 18 \quad \overset{Q_3}{\downarrow} \quad 22 \quad 25$

This time, the $5$th term is the median. There are four terms on either side of the median, like for the set with eight scores. So $Q_1$ is still between the $2$nd and $3$rd scores, but $Q_3$ is now between the $7$th and $8$th scores, because the upper half starts at the $6$th score.

 

  • Finally, let's look at a set with $10$ scores:
$12 \quad 13 \quad \overset{Q_1}{14} \quad 19 \quad 19 \quad \overset{\text{Median}}{\downarrow} \quad 21 \quad 22 \quad \overset{Q_3}{22} \quad 28 \quad 30$

For this set, the median is between the $5$th and $6$th scores. This time, however, there are $5$ scores on either side of the median. So $Q_1$ is the $3$rd term and $Q_3$ is the $8$th term.
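
The three cases above all follow the same "median of each half" procedure, which can be sketched in Python. This is only an illustrative sketch: the function name `quartiles_by_halves` is made up here, and when the number of scores is odd it leaves the median out of both halves, as in the $9$-score case.

```python
from statistics import median

def quartiles_by_halves(scores):
    """Return (Q1, median, Q3) using the median of the lower and upper halves."""
    ordered = sorted(scores)          # quartiles need ordered data
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # scores below the median position
    # For an odd number of scores, skip the middle score itself.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)

eight = [1, 3, 4, 7, 11, 12, 14, 19]
nine = [8, 8, 10, 11, 13, 14, 18, 22, 25]
ten = [12, 13, 14, 19, 19, 21, 22, 22, 28, 30]

print(quartiles_by_halves(eight))  # Q1 = 3.5, median = 9, Q3 = 13
print(quartiles_by_halves(nine))   # Q1 = 9, median = 13, Q3 = 20
print(quartiles_by_halves(ten))    # Q1 = 14, median = 20, Q3 = 22
```

Where a quartile falls between two scores, the sketch takes the mean of those two scores.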

 

What do the quartiles represent?

Each quartile represents $25\%$ of the data set. The lowest score to the first quartile represents $25\%$ of the data; the first quartile to the median represents another $25\%$; the median to the third quartile is another $25\%$; and the third quartile to the highest score represents the last $25\%$ of the data. We can combine these quartiles together - for example, $50\%$ of the scores in a data set lie between the first and third quartiles.

These quartiles are sometimes named as percentiles. A percentile is a value below which a given percentage of the observations in a data set fall. For example, if a score is at the $75$th percentile in a statistical test, it is higher than $75\%$ of all scores. The median represents the $50$th percentile, or the halfway point in a data set.

 

Naming the quartiles

 

The first quartile ($Q_1$)

$Q_1$ is the first quartile (sometimes called the lower quartile). It is the middle score in the bottom half of the data set, and it represents the $25$th percentile.

The first quartile is the $\frac{n+1}{4}$th score, where $n$ is the total number of scores.

 

The median

$Q_2$ is the second quartile, and is usually called the median, which we have already learnt about. It represents the $50$th percentile of the data set.

The median is the $\frac{n+1}{2}$th score, where $n$ is the total number of scores.

 

The third quartile ($Q_3$)

$Q_3$ is the third quartile (sometimes called the upper quartile). It is the middle score in the top half of the data set, and represents the $75$th percentile.

The third quartile is the $\frac{3\left(n+1\right)}{4}$th score, where $n$ is the total number of scores.

 

Calculating the interquartile range

The interquartile range (IQR) is the difference between the third quartile and the first quartile. $50\%$ of scores lie within the IQR because two full quartiles lie in this range. Since it focuses on the middle $50\%$ of the data set, the interquartile range often gives a better indication of the internal spread than the range does, and it is less affected by individual scores that are unusually high or low (called outliers).

 

To calculate the interquartile range

Subtract the first quartile from the third quartile. That is,

$\text{IQR}=Q_3-Q_1$

 

Worked example

Example 1

Consider the following set of data: $1,1,3,5,7,9,9,10,15$.

(a) Identify the median.

Think: There are nine numbers in the set, so $n=9$. We can also see that the data set is already arranged in ascending order. We identify the median as the middle score either by the "cross out" method or as the $\frac{n+1}{2}$th score.

Do:

$\text{Middle position}=\frac{9+1}{2}=5$th score

Counting through the set to the $5$th score, we find that the median is $7$.

(b) Identify $Q_1$ (lower quartile) and $Q_3$ (upper quartile).

Think: We identify $Q_1$ and $Q_3$ as the middle scores in the lower and upper halves of the data set respectively, either by the "cross out" method, or as $Q_1$ being the $\frac{n+1}{4}$th score and $Q_3$ being the $\frac{3\left(n+1\right)}{4}$th score.

Do:

$Q_1\text{ position}=\frac{9+1}{4}=2.5$th score

$Q_1$ is therefore the mean of the $2$nd and $3$rd scores. So we see that

$Q_1=\frac{1+3}{2}=2$

Similarly,

$Q_3\text{ position}=\frac{3\left(9+1\right)}{4}=7.5$th score

$Q_3$ is therefore the mean of the $7$th and $8$th scores. So we see that

$Q_3=\frac{9+10}{2}=9.5$

(c) Calculate the $IQR$ of the data set.

Think: We remember that $IQR=Q_3-Q_1$, and we just found $Q_1$ and $Q_3$.

Do:

$IQR=9.5-2=7.5$
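
As a cross-check of Example 1, Python's standard library offers `statistics.quantiles`. With `method="exclusive"` it places the quartiles at the $\frac{n+1}{4}$th, $\frac{n+1}{2}$th and $\frac{3\left(n+1\right)}{4}$th positions and averages between neighbouring scores where needed, which matches the positional formulas used above. A graphics calculator or other software may use a different convention, so treat this as a sketch rather than the only valid method.

```python
from statistics import quantiles

data = [1, 1, 3, 5, 7, 9, 9, 10, 15]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q2, q3)  # 2.0 7.0 9.5
print(iqr)         # 7.5
```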

 

Practice questions

QUESTION 4

Answer the following, given this set of scores:

$33,38,50,12,33,48,41$

  1. Sort the scores in ascending order.

  2. Find the number of scores.

  3. Find the median.

  4. Find the first quartile of the set of scores.

  5. Find the third quartile of the set of scores.

  6. Find the interquartile range.

QUESTION 5

Answer the following using this set of scores:

$-3,-3,1,9,9,6,-9$

  1. Sort the scores in ascending order.

  2. Find the number of scores.

  3. Find the median.

  4. Find the first quartile of the set of scores.

  5. Find the third quartile of the set of scores.

  6. Find the interquartile range.

QUESTION 6

For the following set of scores in the bar chart to the right:

[Figure: bar chart of Frequency against Scores, with scores $30$, $40$, $50$, $60$ and $70$]

  1. Input the data in the following distribution table:

    | Score $\left(x\right)$ | Freq $\left(f\right)$ | $fx$ | Cumulative Freq $\left(cf\right)$ |
    |---|---|---|---|
    | $30$ | | | |
    | $40$ | | | |
    | $50$ | | | |
    | $60$ | | | |
    | $70$ | | | |
    | Totals | | | |

  2. Find the median score using the distribution table above.

  3. Find the first quartile score.

  4. Find the third quartile score.

  5. Find the interquartile range.

 

Standard deviation and variance

Standard deviation is a measure of spread that gives us a meaningful estimate of the variability in a data set. While the quartiles we just looked at are related to the median, the standard deviation is related to the mean. A small standard deviation indicates that most scores are close to the mean, while a large standard deviation indicates that the scores are more spread out from the mean. Variance is the standard deviation squared.

We can calculate the standard deviation for a population or a sample. In this course we focus on the population standard deviation and variance.

The symbols used are:

$\text{Population standard deviation}=\sigma$ (lowercase sigma)
$\text{Population variance}=\sigma^2$ (lowercase sigma, squared)


Simply put, population standard deviation describes the spread of data by comparing the distance of each score to the mean. It is complicated to calculate, so we will use our graphics calculator to find it. It gives a lot of information about the spread of data because it takes into account every data point in the set.

Standard deviation is also a very powerful way of comparing different data sets, particularly if there are different means and population numbers.
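
To see what the calculator is doing behind the scenes, here is a minimal Python sketch. It assumes the usual definition of population variance (the mean of the squared distances from the mean), with the standard deviation as its square root. The data set is made up for illustration and does not come from this lesson.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance

# Illustrative scores only (not taken from the lesson).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

mu = mean(scores)                                   # the mean is 5
squared_distances = [(x - mu) ** 2 for x in scores]
variance = sum(squared_distances) / len(scores)     # population variance
std_dev = sqrt(variance)                            # population standard deviation

print(variance, std_dev)  # 4.0 2.0

# The standard library gives the same population values directly.
print(pvariance(scores) == variance)  # True
print(pstdev(scores) == std_dev)      # True
```

Note that `pstdev` and `pvariance` are the population versions; `stdev` and `variance` in the same module are the sample versions, which divide by one less than the number of scores.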

 

Practice questions

Question 7

The mean income of people in Country A is $\$19069$. This is the same as the mean income of people in Country B. The standard deviation of Country A is greater than the standard deviation of Country B. In which country is there likely to be the greatest difference between the incomes of the rich and poor?

    A. Country A

    B. Country B

Question 8

Find the population standard deviation of the following set of scores, to two decimal places, by using the statistics mode on the calculator:

$8,20,9,9,8,19,9,18,5,10$

Question 9

The table shows the number of goals scored by a football team in each game of the year.

| Score ($x$) | Frequency ($f$) |
|---|---|
| $0$ | $3$ |
| $1$ | $1$ |
| $2$ | $5$ |
| $3$ | $1$ |
| $4$ | $5$ |
| $5$ | $5$ |

  1. In how many games were $0$ goals scored?

  2. Determine the median number of goals scored. Leave your answer to one decimal place if necessary.

  3. Calculate the mean number of goals scored each game. Leave your answer to two decimal places if necessary.

  4. Use your calculator to find the population standard deviation. Leave your answer to two decimal places if necessary.

Question 10

Fill in the table and answer the questions below.

  1. Complete the table given below.

    | Class | Class Centre | Frequency | $fx$ |
    |---|---|---|---|
    | $1-9$ | | $8$ | |
    | $10-18$ | | $6$ | |
    | $19-27$ | | $4$ | |
    | $28-36$ | | $6$ | |
    | $37-45$ | | $8$ | |
    | Totals | | | |

  2. Use the class centres to estimate the mean of the data set, correct to two decimal places.

  3. Use the class centres to estimate the population standard deviation, correct to two decimal places.

  4. If we used the original ungrouped data to calculate standard deviation, do you expect that the ungrouped data would have a higher or lower standard deviation?

    A. Higher standard deviation

    B. Lower standard deviation

 

Quantiles

When we find the median value of a data set, we are finding a central value with the property that there are as many data points below this value as there are above it. That is, we are finding a value that splits the data set into two equal parts.

We then extended this idea in order to define the quartiles. These are three numbers that split a data set into four equal parts - so the first quartile is the median of the lower half of the set, and the third quartile is the median of the upper half. (The second quartile is, of course, just the median of the whole set.)

Extending this idea further still leads to the concept of quantiles. If there are $N$ numbers in the data set and we want to divide the set into $k$ parts, the first of the associated quantiles is a number which has, as nearly as possible, $\frac{N}{k}$ of the numbers below it in the ordered set, with the remaining data points above it. Each later quantile has a further $\frac{N}{k}$ of the numbers below it.

Most often we choose either $k=10$, and the resulting quantiles are called deciles, or $k=100$, and the resulting quantiles are called percentiles.

Clearly, the median should be the same as the $50$th percentile, the first quartile should be the same as the $25$th percentile and the third quartile should be the same as the $75$th percentile. Similarly, for example, the $4$th decile is the same as the $40$th percentile. Thus, if we know how to calculate the percentiles, we automatically have a way of determining the quartiles and the deciles.

There are different methods for determining the percentiles of a data set, each giving slightly different results. The differences disappear when the data sets are large.

The simplest method is the following: to find the $p$th percentile of a data set with $N$ elements, calculate $\frac{p}{100}\times N$. The smallest integer that is greater than or equal to the result is the rank of the number in the ordered data that will be taken to be the required percentile.

 

Worked example

Example 2

Find the $30$th percentile of the following set of nine numbers: $14,19,23,24,31,33,40,42,56$.

Think: Note that, once again, the data set is already arranged in ascending order. So the $30$th percentile can be found $\frac{30}{100}$ of the way along the data set. Remember that there are $n=9$ scores.

Do:

$\text{Position}=\frac{30}{100}\times9=2.7$th score

The nearest integer above $2.7$ is $3$. So, we take the third score to be the $30$th percentile.

So, for this data set, the $30$th percentile is $23$.

Reflect: Note that the $25$th percentile would also be $23$ for this set of scores, which happens because the data set is so small. If the data set was much larger, the two percentiles would likely be different.
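
The "simplest method" used in Example 2 can be sketched in Python as follows. The function name `percentile` is chosen here for illustration, and the sketch assumes $p$ is a whole number from $1$ to $100$. It rounds up with integer arithmetic rather than decimals so that ranks which should be exact whole numbers are not pushed up by rounding error.

```python
def percentile(scores, p):
    """Return the p-th percentile using the rank ceil(p/100 * N)."""
    ordered = sorted(scores)
    n = len(ordered)
    # Smallest integer greater than or equal to p * n / 100.
    rank = -(-p * n // 100)
    rank = max(rank, 1)        # guard so the rank is never below 1
    return ordered[rank - 1]   # ranks start at 1, list indices at 0

data = [14, 19, 23, 24, 31, 33, 40, 42, 56]

print(percentile(data, 30))  # 23
print(percentile(data, 25))  # 23
print(percentile(data, 40))  # 24, which is also the 4th decile
```

Other percentile methods, such as those built into calculators or spreadsheets, can give slightly different answers for small data sets like this one.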

 

Practice question

QUESTION 11

Consider the data set $9,5,6,3,9,8,4,2,3,2$.

  1. Calculate the mean to two decimal places.

  2. Calculate the median.

  3. Calculate the value of quartile $1$.

  4. Calculate the value of quartile $3$.

  5. Calculate the value of decile $2$.

  6. Calculate the value of decile $8$.

  7. Calculate the value of the percentile $43$.

  8. Calculate the value of the percentile $88$.

 

Measures of spread from cumulative frequency graphs
 

Worked example

Example 3

(a) Construct a cumulative frequency ogive from the table below:

| Class interval | Frequency | Cumulative frequency |
|---|---|---|
| $50\le t<55$ | $5$ | $5$ |
| $55\le t<60$ | $10$ | $15$ |
| $60\le t<65$ | $25$ | $40$ |
| $65\le t<70$ | $26$ | $66$ |
| $70\le t<75$ | $40$ | $106$ |
| $75\le t<80$ | $49$ | $155$ |
| $80\le t<85$ | $28$ | $183$ |
| Total | $183$ | |

Think: Remember that the cumulative frequency ogive is a line graph connecting cumulative frequencies at the upper endpoint of each class interval.

Do:

[Graph: cumulative frequency ogive for the table above]

(b) State the number of people with a life expectancy of $65$ years or less.

Think: Cumulative frequency ogives show us the total of frequencies up to a certain score.

Do: Look along the horizontal axis until you get to $65$. Draw an imaginary line up from $65$ to the graph, then across to the vertical axis. $40$ people have a life expectancy of $65$ or less.

(c) Find the approximate median life expectancy from the graph.

Think: Median means the halfway mark, or $50\%$ mark. Find $50\%$ of the total number (total frequency) and look for this number on the vertical axis.

Do: $50\%$ of $183$ is $91.5$. Look up the vertical axis to $91.5$, then draw an imaginary line across to the graph and down to the horizontal axis. The median life expectancy is approximately $74$ years old.

(d) Find the approximate interquartile range from the graph.

Think: This time we need to take $25\%$ and $75\%$ of the total frequency and read across from these numbers on the vertical axis to find the corresponding ages on the horizontal axis. We then subtract the numbers to find the $IQR$.

Do: $25\%$ of $183$ is $45.75$ and $75\%$ of $183$ is $137.25$.

Reading from the graph we find $Q_1$ is approximately $66$ and $Q_3$ is approximately $78$. Therefore the $IQR$ is approximately $12$.
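
Reading across and down from an ogive amounts to interpolating along the straight line segments of the graph. The Python sketch below does this with the cumulative frequencies from the table in part (a). The function name `read_ogive` is made up for illustration, and the starting point of $0$ people at $50$ years is an assumption, since the table begins at the first class.

```python
# Class endpoints and cumulative frequencies from the table in part (a).
# The first point (50, 0) is an assumption: nobody is below the first class.
endpoints = [50, 55, 60, 65, 70, 75, 80, 85]
cumulative = [0, 5, 15, 40, 66, 106, 155, 183]

def read_ogive(target):
    """Return the horizontal value where the ogive reaches `target`."""
    for i in range(1, len(endpoints)):
        if target <= cumulative[i]:
            x0, x1 = endpoints[i - 1], endpoints[i]
            c0, c1 = cumulative[i - 1], cumulative[i]
            # Move along the straight segment joining the two points.
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target exceeds the total frequency")

total = cumulative[-1]          # 183
q1 = read_ogive(0.25 * total)   # read across from 45.75
med = read_ogive(0.50 * total)  # read across from 91.5
q3 = read_ogive(0.75 * total)   # read across from 137.25

print(round(q1, 1), round(med, 1), round(q3, 1))  # 66.1 73.2 78.2
print(round(q3 - q1, 1))                          # 12.1
```

These interpolated values are only as precise as the grouped table allows, and readings taken by eye from a drawn graph can differ from them by a year or so.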

 

Practice question

QUESTION 12

Use the cumulative frequency histogram given to answer the following.

  1. Determine the range of scores.

  2. Determine the mode.

  3. Determine the median score.

  4. Calculate the mean.

    Round your answer to two decimal places.

  5. Use your calculator to find the standard deviation.

    Round your answer to one decimal place.
