
3.04 Box plots

Lesson

When we are trying to understand what our data is telling us, we usually find measures of central tendency (e.g. median, mean and mode) as well as measures of spread, such as the range. The range is easily affected by outliers, so to get a better picture of the spread in a data set we often find the set's quartiles.

Quartiles are scores at particular locations in the data set - similar to the median, but instead of dividing a data set into halves, they divide a data set into quarters.

Exploration

  • Here is a data set with $8$ scores:
$1$   $3$   $4$   $7$   $11$   $12$   $14$   $19$

 

First locate the median, between the $4$th and $5$th scores:

        Median        
              $\downarrow$              
$1$   $3$   $4$   $7$   $11$   $12$   $14$   $19$

 

Now there are $4$ scores in each half of the data set, so split each group of four scores in half to find the quartiles. We can see the first quartile (Q1) is between the $2$nd and $3$rd scores; there are two scores of the lower half on either side of Q1. Similarly, the third quartile (Q3) is between the $6$th and $7$th scores:

    Q1   Median   Q3    
      $\downarrow$       $\downarrow$       $\downarrow$      
$1$   $3$   $4$   $7$   $11$   $12$   $14$   $19$
 
  • Now let's look at a situation with $9$ scores:
    Q1   Median   Q3    
      $\downarrow$         $\downarrow$         $\downarrow$      
$8$   $8$   $10$   $11$   $13$   $14$   $18$   $22$   $25$

 

This time, the $5$th term is the median. There are four terms on either side of the median, as in the set with eight scores. So Q1 is still between the $2$nd and $3$rd scores. Because the median itself now occupies the $5$th position, the upper half runs from the $6$th to the $9$th score, and Q3 is between the $7$th and $8$th scores.

  • Finally, let's look at a set with $10$ scores:
    Q1   Median   Q3    
        $\downarrow$         $\downarrow$         $\downarrow$        
$12$   $13$   $14$   $19$   $19$   $21$   $22$   $22$   $28$   $30$
 

For this set, the median is between the $5$th and $6$th scores. This time, however, there are $5$ scores on either side of the median. So Q1 is the $3$rd term and Q3 is the $8$th term.
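The three cases above can be checked with a short script. This is a minimal sketch of the "halve each half" approach used in the exploration, assuming the median is left out of both halves when the number of scores is odd. The function names `median_of_sorted` and `quartiles` are made up for this illustration.

```python
def median_of_sorted(sorted_scores):
    """Return the median of a list that is already in ascending order."""
    n = len(sorted_scores)
    mid = n // 2
    if n % 2 == 1:
        return sorted_scores[mid]                             # single middle score
    return (sorted_scores[mid - 1] + sorted_scores[mid]) / 2  # mean of the two middle scores


def quartiles(scores):
    """Return (Q1, median, Q3) by taking the median of each half of the data."""
    data = sorted(scores)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Assumption: with an odd number of scores, the median belongs to neither half.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median_of_sorted(lower), median_of_sorted(data), median_of_sorted(upper)


print(quartiles([1, 3, 4, 7, 11, 12, 14, 19]))              # 8 scores
print(quartiles([8, 8, 10, 11, 13, 14, 18, 22, 25]))        # 9 scores
print(quartiles([12, 13, 14, 19, 19, 21, 22, 22, 28, 30]))  # 10 scores
```

Other textbooks and software packages use slightly different quartile conventions, so results from a calculator or spreadsheet may not match exactly.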

What do the quartiles represent?

Each quartile represents $25\%$ of the data set. In other words, the lowest score to the lower quartile represents $25\%$ of the data, the lower quartile to the median represents another $25\%$, the median to the upper quartile is another $25\%$ and the upper quartile to the highest score represents another $25\%$. We can add these quarters together. For example, $50\%$ of the scores in a data set lie between the lower and upper quartiles.

These quartiles are sometimes named as percentiles. A percentile is a value below which a given percentage of the observations in a group fall. For example, if a score is at the $75$th percentile in a statistical test, it is higher than $75\%$ of all other scores. The median represents the $50$th percentile, the halfway point in a data set.

Naming the quartiles

The first quartile (Q1)

The first quartile is also called the lower quartile. It is the middle score between the lowest score and the median and it represents the $25$th percentile.

The first quartile score is the $\frac{n+1}{4}$th score, where $n$ is the total number of scores.

The median (Q2)

The second quartile is the median, and it represents the $50$th percentile.

The median is the $\frac{n+1}{2}$th score, where $n$ is the number of scores.

The third quartile (Q3)

The third quartile is also called the upper quartile. It is the middle score between the median and the highest score. It represents the $75$th percentile.

The third quartile is the $\frac{3\left(n+1\right)}{4}$th score, where $n$ is the total number of scores.
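As a quick illustration of the three position formulas, take a hypothetical data set with $n=11$ scores (a value chosen only because it gives whole-number positions):

```latex
\begin{align*}
\text{Q1 position}     &= \frac{n+1}{4}  = \frac{11+1}{4}    = 3 \\
\text{Median position} &= \frac{n+1}{2}  = \frac{11+1}{2}    = 6 \\
\text{Q3 position}     &= \frac{3(n+1)}{4} = \frac{3(11+1)}{4} = 9
\end{align*}
```

So for $11$ ordered scores, Q1 is the $3$rd score, the median is the $6$th and Q3 is the $9$th. When a formula does not give a whole number, the quartile falls between scores, and halving each half of the data (as in the exploration) locates it.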

The interquartile range (IQR)

The interquartile range (IQR) is the difference between the upper quartile and the lower quartile. The middle $50\%$ of scores lie between the lower and upper quartiles, because $2$ full quarters of the data lie in this range.

The video below demonstrates how to find the quartiles and the IQR for a small data set.

We can also use the cumulative frequency to help us find the quartiles.

Worked example

Example 1

For the following set of scores in the histogram:

(a) Input the data in the following distribution table:

Score ($x$)   Frequency ($f$)   $f\times x$   Cumulative Frequency (cf)
$30$          $5$               $150$         $5$
$40$          $5$               $200$         $10$
$50$          $5$               $250$         $15$
$60$          $1$               $60$          $16$
$70$          $3$               $210$         $19$
Totals        $19$              $870$

 

(b) Find the median using the distribution table above.

Think: Which score represents the middle number?

Do: 

$\text{Middle score}=\frac{n+1}{2}$
$=\frac{19+1}{2}$
$=\text{10th score}$

The tenth score is the median. The cumulative frequency column shows that the $6$th to $10$th scores are all $40$, so the median is $40$.

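(c) Find the first quartile.

Think: We can use the frequency table to work out which score lies halfway between the lowest score and the median.

Do: The first quartile is the $\frac{19+1}{4}=5$th score. The cumulative frequency column shows that the $5$th score is $30$, so the first quartile is $30$.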

(d) Find the third quartile.

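Think: Which score is the middle score between the median and the highest score?

Do: The third quartile is the $\frac{3\left(19+1\right)}{4}=15$th score. The cumulative frequency column shows that the $15$th score is $50$, so the third quartile is $50$.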

(e) Find the interquartile range.

Think: We need to find the difference between the third quartile and the first quartile.

Do: $50-30=20$
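The look-ups in Example 1 can be sketched in code. The helper `score_at_position` below is a hypothetical name, not part of any library. It walks down the cumulative frequency column and returns the first score whose cumulative frequency reaches the requested position, and it assumes that position is a whole number.

```python
from itertools import accumulate


def score_at_position(scores, frequencies, position):
    """Return the score found at a given position in the ordered data."""
    # accumulate() builds the cumulative frequency column: 5, 10, 15, 16, 19
    for score, cumulative in zip(scores, accumulate(frequencies)):
        if cumulative >= position:
            return score
    raise ValueError("position is beyond the last score")


scores = [30, 40, 50, 60, 70]
frequencies = [5, 5, 5, 1, 3]
n = sum(frequencies)                                               # 19 scores in total

median = score_at_position(scores, frequencies, (n + 1) // 2)      # 10th score
q1 = score_at_position(scores, frequencies, (n + 1) // 4)          # 5th score
q3 = score_at_position(scores, frequencies, 3 * (n + 1) // 4)      # 15th score
print(q1, median, q3, q3 - q1)
```

Integer division works here only because $n+1=20$ is divisible by $4$. For other values of $n$ the position may fall between two scores and needs separate handling.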

Practice questions

Question 1

Answer the following, given this set of scores:

$33,38,50,12,33,48,41$

  1. Sort the scores in ascending order.

  2. Find the number of scores.

  3. Find the median.

  4. Find the first quartile of the set of scores.

  5. Find the third quartile of the set of scores.

  6. Find the interquartile range.

Question 2

Answer the following using this set of scores:

$-3,-3,1,9,9,6,-9$

  1. Sort the scores in ascending order.

  2. Find the number of scores.

  3. Find the median.

  4. Find the first quartile of the set of scores.

  5. Find the third quartile of the set of scores.

  6. Find the interquartile range.

Question 3

For the following set of scores in the bar chart to the right:

[Bar chart: horizontal axis "Scores" marked $30$, $40$, $50$, $60$, $70$; vertical axis "Frequency"]

  1. Input the data in the following distribution table:

    Score ($x$)   Freq ($f$)   $fx$   Cumulative Freq ($cf$)
    $30$          ____         ____   ____
    $40$          ____         ____   ____
    $50$          ____         ____   ____
    $60$          ____         ____   ____
    $70$          ____         ____   ____
    Totals        ____         ____

  2. Find the median score using the distribution table above.

  3. Find the first quartile score.

  4. Find the third quartile score.

  5. Find the interquartile range.

The five-number summary

Using the quartiles we can determine 5 key values of a data set known as the five-number summary. These are the:

1. Minimum value 

2. Lower quartile 

3. Median 

4. Upper quartile

5. Maximum value

The five numbers from the five-number summary break up our set of scores into four parts. Have a look at the diagram here:

So knowing the five numbers can help us identify key regions of $25\%$, $50\%$, and $75\%$ of the scores.

It also leads nicely to the development of box plots.
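A five-number summary can be computed in a few lines. This is a minimal sketch, assuming the same "halve each half" quartile method as the exploration (with the median left out of both halves for an odd number of scores) and at least two scores. The function name `five_number_summary` and the sample data are made up for illustration.

```python
from statistics import median


def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum) for a list of scores."""
    data = sorted(scores)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Assumption: for an odd number of scores the median sits in neither half.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return data[0], median(lower), median(data), median(upper), data[-1]


example = [7, 2, 13, 5, 10, 4, 8]          # hypothetical scores
print(five_number_summary(example))        # (2, 4, 7, 10, 13)
```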

Practice questions

Question 4

The table shows the number of points scored by a basketball team in each game of their previous season.

$59$   $67$   $73$   $82$   $91$   $58$   $79$   $88$
$69$   $84$   $55$   $80$   $98$   $64$   $82$
  1. Sort the data in ascending order.

  2. State the maximum value of the set.

  3. State the minimum value of the set.

  4. Find the median value.

  5. Find the lower quartile.

  6. Find the upper quartile.

Question 5

To gain a place in the main race of a car rally, teams must compete in a qualifying round. The median time in the qualifying round determines the cut off time to make it through to the main race. Below are some results from the qualifying round.

$75\%$ of teams finished in $159$ minutes or less.

$25\%$ of teams finished in $132$ minutes or less.

$25\%$ of teams finished with a time between $132$ and $142$ minutes.

  1. Determine the cut off time required in the first round to make it through to the main race.

  2. Determine the interquartile range in the qualifying round.

  3. In the qualifying round, the ground was wet, while in the main race, the ground was dry. To make the times more comparable, the finishing time of each team from the qualifying round is reduced by $5$ minutes. What would be the new median time from the qualifying round?

Box plots

Box plots (also known as box-and-whisker plots) are a way of showing the five-number summary for a data set.

The diagram below shows a nice summary of all this information:

As you can see, the box plot consists of a number line, a rectangle with a line inside (the box), and 2 horizontal lines (the whiskers). The box represents the middle $50\%$ of the scores and its size tells us the interquartile range.

Worked example

Example 2

Using the box-and-whisker plot above:

a) what percentage of scores lie between:

$10.9$ and $11.2$

$10.8$ and $10.9$

$11.1$ and $11.3$

$10.9$ and $11.3$

$10.8$ and $11.2$

Think: For these five questions, think about how many quartiles are in that range. Remember that one quartile represents $25\%$ of the data set.

Do:

$50\%$ of the scores lie between Q1 and Q3.

$25\%$ of the scores lie between the lowest score and Q1.

$50\%$ of the scores lie between the median and the highest score.

$75\%$ of the scores lie between Q1 and the highest score.

$75\%$ of the scores lie between the lowest score and Q3.

 

b) In which quartile (or quartiles) is the data the most spread out?

Think: Which quartile takes up the longest space on the graph?

Do: The second quartile is the most spread out.

Example 3

Using the information in the table, create a box plot to represent this data:

Minimum          $5$
Lower Quartile   $25$
Median           $40$
Upper Quartile   $45$
Maximum          $65$

Think: Where does each of these values sit on a box-and-whisker plot?

Do: Here is our graph. Notice how the values in our table correspond to particular places on the box-and-whisker plot.
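The table in Example 3 also tells us how long each part of the box plot will be before we draw it. The sketch below uses a hypothetical helper, `quarter_widths`, to measure the four sections (lower whisker, the two parts of the box, upper whisker) from a five-number summary.

```python
def quarter_widths(minimum, q1, median, q3, maximum):
    """Return the width of each quarter of a box plot, from left to right."""
    return [q1 - minimum, median - q1, q3 - median, maximum - q3]


# Values from the table in Example 3
widths = quarter_widths(5, 25, 40, 45, 65)
print(widths)               # [20, 15, 5, 20]

iqr = 45 - 25               # length of the box
data_range = 65 - 5         # distance between the ends of the whiskers
print(iqr, data_range)      # 20 60
```

Each section holds $25\%$ of the scores, so a wider section means those scores are more spread out. Here the scores between the median and the upper quartile are the most tightly packed.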

Practice questions

Question 6

For the box plot shown below, find each of the following:

[Box plot drawn on a number line labelled "score", marked from $0$ to $20$ in steps of $2$]
  1. Lowest score: ____
    Highest score: ____
    Range: ____
    Median: ____
    Interquartile Range: ____

Question 7

The box plot below shows the age at which a group of people got their driving licences.

[Box plot showing the distribution of ages at which a group of people obtained their driving licences. The horizontal axis is labelled "Age" and is marked from $15$ to $35$ in steps of $5$.]
  1. What is the oldest age at which someone got their licence?

  2. What is the youngest age at which someone got their licence?

  3. What percentage of people were aged from $18$ to $22$?

    A. $10\%$
    B. $25\%$
    C. $50\%$

  4. The middle $50\%$ of responders were within how many years of one another?

    A. $9$
    B. $6$
    C. $7$
    D. $8$

  5. In which quartile are the ages least spread out?

    A. $4$th
    B. $1$st
    C. $3$rd
    D. $2$nd

  6. The bottom $50\%$ of responders were within how many years of one another?

    A. $5$
    B. $4$
    C. $6$

Question 8

Using the information in the table, create a box plot to represent this data:

Minimum          $5$
Lower Quartile   $25$
Median           $35$
Upper Quartile   $60$
Maximum          $75$
  1. [Blank number line marked from $0$ to $100$ in steps of $10$, on which to draw the box plot]

Parallel box plots

Parallel box plots are used to compare two sets of data visually. When comparing box plots, we compare the five-number summary values of each data set (the quartiles and minimum and maximum values) and also the range and interquartile range. The two box plots are drawn parallel to each other.

It is important to clearly label each box plot. Below are parallel box plots comparing the time it took two different groups of people to complete an online task. 

You can see that overall the under $30$s were faster at completing the task. Both the under $30$s box plot and the over $30$s box plot are slightly negatively skewed. Over $75\%$ of the under $30$s completed the task in under $22$ seconds, which is the median time taken by the over $30$s. $100\%$ of the under $30$s had finished the task before $75\%$ of the over $30$s had completed it.
Overall the under $30$s performed better and had a smaller spread of scores. There was more variation within the over $30$s group, with a range of $24$ seconds compared to $20$ seconds for the under $30$s.
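The same comparison can be made numerically once the five-number summary of each group is known. The sketch below uses hypothetical summaries: the values are not read from the plot and were chosen only to be consistent with the ranges of $20$ and $24$ seconds and the median of $22$ seconds quoted above. The function name `spread` is also made up.

```python
def spread(summary):
    """Return the range and IQR for a (min, Q1, median, Q3, max) summary."""
    minimum, q1, median, q3, maximum = summary
    return {"range": maximum - minimum, "iqr": q3 - q1}


# Hypothetical five-number summaries, in seconds (illustration only)
under_30s = (10, 15, 18, 21, 30)
over_30s = (14, 19, 22, 31, 38)

print("Under 30s:", spread(under_30s))
print("Over 30s: ", spread(over_30s))
```

A smaller range and a smaller IQR both point to a more consistent group. The IQR is the safer measure because it is not affected by a single extreme score.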

Worked example

Example 4

The box plots show the distances, in centimetres, jumped by two high jumpers.

a) Who has a higher median jump?

Think: The median is shown by the line in the middle of the box. Whose median line has a higher value?

Do: John

 

b) Who made the highest jump?

Think: The highest jump is the end of the whisker for each jumper. Bill doesn't have an upper whisker as his highest jump was the same as the upper quartile height. Whose jump was the highest?

Do: John 

 

c) Who made the lowest jump?

Think: The lowest jump is shown on each box plot by the lower whisker. 

Do: John and Bill both had a lowest jump of $60$ cm.

Practice questions

Question 9

The box plots show the monthly profits (in thousands of dollars) of two derivatives traders over a year.

[Two box plots, labelled "Ned" and "Tobias", each drawn on a number line marked from $5$ to $60$ in steps of $5$]

  1. Who made a higher median monthly profit?

    A. Ned
    B. Tobias

  2. Whose profits had a higher interquartile range?

    A. Tobias
    B. Ned

  3. Whose profits had a higher range?

    A. Ned
    B. Tobias

  4. How much more did Ned make in his most profitable month than Tobias did in his most profitable month?

Question 10

The two box plots below show the data collected by the manufacturers on the life-span of light bulbs, measured in thousands of hours.

  1. Complete the following table using the two box plots. Write each answer in terms of hours.

                          Manufacturer A   Manufacturer B
    Median                ____             $5000$
    Lower Quartile        ____             ____
    Upper Quartile        $4500$           ____
    Range                 ____             $6500$
    Interquartile Range   ____             ____

  2. Hence, which manufacturer produces light bulbs with the best lifespan?

    A. Manufacturer A.
    B. Manufacturer B.

Question 11

The box plots below represent the daily sales made by Carl and Angelina over the course of one month.

[Two box plots, labelled "Angelina's Sales" and "Carl's Sales", each drawn on a number line marked from $0$ to $70$ in steps of $10$]
  1. What is the range in Angelina's sales?

  2. What is the range in Carl’s sales?

  3. By how much did Carl’s median sales exceed Angelina's?

  4. Considering the middle $50\%$ of sales for both sales people, whose sales were more consistent?

    A. Carl
    B. Angelina

  5. Which salesperson had a more successful sales month?

    A. Angelina
    B. Carl

Outcomes

MA12-8

solves problems using appropriate statistical processes
