
7.05 Box plots

Lesson

Five number summary

When we are trying to understand what our data is telling us, we usually find statistics that tell us the location of the data (such as the mean or median) as well as measures of spread, such as the range.

To get a better picture of the distribution of a data set, with a concise set of values, we often use the five number summary.

The five number summary is made up of the minimum and maximum values, the median, and two other values, known as upper and lower quartiles.

 

Medians and quartiles

We are familiar with the median as the middle value in a data set when the values are arranged in order.  The median is a useful statistic that tells us the location of the data.

Quartiles are values at particular locations in the data set – similar to the median, but instead of dividing a data set into halves, they divide a data set into quarters.

Exploration

  • Here is a data set with $8$ values:
$1$   $3$   $4$   $7$   $11$   $12$   $14$   $19$

 

First locate the median, between the $4$th and $5$th values:

        Median        
              $\downarrow$              
$1$   $3$   $4$   $7$   $11$   $12$   $14$   $19$

 

Now there are $4$ values in each half of the data set, so split each half in two to find the quartiles. The lower quartile is between the $2$nd and $3$rd values, with two values of the lower half on either side of it. Similarly, the upper quartile is between the $6$th and $7$th values:

    lower quartile   Median   upper quartile    
      $\downarrow$       $\downarrow$       $\downarrow$      
$1$   $3$   $4$   $7$   $11$   $12$   $14$   $19$

 

We can see that each of the four intervals created by the quartiles contains two values, which is one quarter of the total number of values in the data set.

 

Calculating quartiles

The lower quartile is also called the first quartile, or $Q_1$. It is the middle value between the minimum value and the median. To calculate the lower quartile, we identify the scores less than the median (which we call the lower half). Then we determine the middle value of this lower half.

The median is also known as the second quartile, or $Q_2$. As we have already learnt, it is the middle value in the sorted data set.

The median is the $\frac{n+1}{2}$th value in the sorted data set, where $n$ is the number of values in the data set. When $n$ is even, this position falls halfway between two values, and the median is the mean of those two values.

The upper quartile is also called the third quartile, or $Q_3$. It is the middle value between the median and the maximum value. The upper quartile can be found by identifying the scores in the upper half (above the median). Then we determine the middle value of this upper half.

The range is the difference between the maximum value and minimum value in the data set.

The interquartile range, or IQR, is the difference between the upper quartile and the lower quartile. Half of the values in the data set lie within the interquartile range.

The interquartile range is a useful measure of the spread of data because, unlike the range, it is not affected by unusually large or small values.
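
The steps described above can be sketched in Python. This is a minimal illustration, not necessarily the method a CAS calculator uses internally: the function name `five_number_summary` is our own, and it assumes the rule from this section that the quartiles are the medians of the lower and upper halves (when the number of values is odd, the median itself is left out of both halves). It is applied to the eight-value data set from the Exploration.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: quartiles are the medians of the lower and upper halves,
    and the data set has at least two values.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # scores below the median position
    upper = data[half + n % 2:]   # scores above it (skips the middle score when n is odd)
    return data[0], median(lower), median(data), median(upper), data[-1]


# Data set from the Exploration.
minimum, q1, med, q3, maximum = five_number_summary([1, 3, 4, 7, 11, 12, 14, 19])
data_range = maximum - minimum   # range = maximum - minimum
iqr = q3 - q1                    # interquartile range = Q3 - Q1

print(minimum, q1, med, q3, maximum)  # 1 3.5 9.0 13.0 19
print(data_range, iqr)                # 18 9.5
```

Other conventions for locating quartiles exist, so different software may report slightly different quartiles for some data sets.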

 

Five number summary

The five number summary is the set of values made up of the:

  • minimum
  • lower quartile ($Q_1$)
  • median ($Q_2$)
  • upper quartile ($Q_3$)
  • maximum

These values break our data set into four parts, each containing one quarter of the values.

Knowing the five number summary can help us identify key regions of the data set.

  • One quarter of values in the data set lie below the lower quartile;
  • One quarter of values lie above the upper quartile;
  • One half of values lie below the median; one half of the values lie above the median;
  • One half of the values in the data set lie within the interquartile range, between the lower and upper quartiles.

 

Five number summary with CAS

The individual values required for a five number summary are readily obtained using the Statistics mode on a CAS calculator.

Worked examples

Example 1

Determine the five number summary and interquartile range for this data set:

$-2,10,-1,6,9,6,-6,1,7$.

ClassPad

This data can be entered in the Statistics mode, "list1", without needing to sort the values in ascending order.

Use the Calc -> One-variable menu to calculate the values for the five number summary (and other statistics).

  • Minimum and maximum values are reported as $minX=-6$ and $maxX=10$, respectively.
  • The median is reported as $Med=6$.
  • Lower and upper quartiles are reported as $Q_1=-1.5$ and $Q_3=8$, respectively.

Hence, the five number summary is set out below:

Minimum Lower quartile Median Upper quartile Maximum
$-6$ $-1.5$ $6$ $8$ $10$

Note that, in this example, neither quartile is a value from the data set because the positions of the quartiles fall between values.

The interquartile range is the difference between the upper quartile and the lower quartile:

Interquartile range $=$ $8-(-1.5)$
  $=$ $9.5$
Example 2

Use class centres to determine the five number summary and interquartile range for the data represented by the histogram:

The histogram data can be represented by the frequency table below:

Class Class centre Frequency
$30-<40$ $35$ $5$
$40-<50$ $45$ $5$
$50-<60$ $55$ $7$
$60-<70$ $65$ $1$
$70-<80$ $75$ $3$

ClassPad

Using Statistics mode, enter class centres into "list1" and frequencies into "list2"

Use the Calc -> One-variable menu to calculate the values for the five number summary (and other statistics), using the "Freq" setting to select frequencies from "list2"

  • $minX=35$ and $maxX=75$. Note that these are class centres, so the data may include values below $35$ or above $75$. However, in practice, box plots are usually used to summarise continuous data, with class intervals small enough that the class centres give reasonably accurate estimates of the summary statistics.
  • The median is reported as $Med=55$.
  • Lower and upper quartiles are reported as $Q_1=40$ and $Q_3=55$, respectively.

Hence, the five number summary is set out below:

Minimum Lower quartile Median Upper quartile Maximum
$35$ $40$ $55$ $55$ $75$

In this case, the median and the upper quartile have the same value.

The interquartile range is the difference between the upper quartile and the lower quartile:

Interquartile range $=$ $55-40$
  $=$ $15$
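
As a cross-check on the calculator output, the frequency table can be expanded into a full list of class centres and summarised in Python. This is only a sketch under the assumption that the quartiles are the medians of the lower and upper halves, as described earlier in this lesson. The variable names are our own.

```python
from statistics import median

# Class centres and frequencies from the table in Example 2.
centres = [35, 45, 55, 65, 75]
frequencies = [5, 5, 7, 1, 3]

# Repeat each class centre as many times as its frequency.
data = []
for centre, freq in zip(centres, frequencies):
    data.extend([centre] * freq)

n = len(data)                      # 21 values in total
half = n // 2
q1 = median(data[:half])           # median of the lower half
med = median(data)
q3 = median(data[half + n % 2:])   # median of the upper half
iqr = q3 - q1

print(min(data), q1, med, q3, max(data))
print(iqr)
```

The values agree with the five number summary and interquartile range found above.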

 

Practice questions

Question 1

The table shows the number of points scored by a basketball team in each game of their previous season.

$59$ $67$ $73$ $82$ $91$ $58$ $79$ $88$
$69$ $84$ $55$ $80$ $98$ $64$ $82$  
  1. Sort the data in ascending order.

  2. State the maximum value of the set.

  3. State the minimum value of the set.

  4. Find the median value.

  5. Find the lower quartile.

  6. Find the upper quartile.

 

 
Question 2

Answer the following questions using the given frequency table.

Score Frequency
$15$ $13$
$16$ $9$
$17$ $23$
$18$ $19$
$19$ $8$
$20$ $13$
  1. Complete the five number summary using a CAS calculator.

    Minimum: $\editable{}$

    Lower quartile: $\editable{}$

    Median: $\editable{}$

    Upper quartile: $\editable{}$

    Maximum: $\editable{}$

  2. Calculate the interquartile range.

 

Question 3

To gain a place in the main race of a car rally, teams must compete in a qualifying round. The median time in the qualifying round determines the cut off time to make it through to the main race. Below are some results from the qualifying round.

$75%$ of teams finished in $159$ minutes or less.

$25%$ of teams finished in $132$ minutes or less.

$25%$ of teams finished with a time between $132$ and $142$ minutes.

  1. Determine the cut off time required in the qualifying round to make it through to the main race.

  2. Determine the interquartile range in the qualifying round.

  3. In the qualifying round, the ground was wet, while in the main race, the ground was dry. To make the times more comparable, the finishing time of each team from the qualifying round is reduced by $5$ minutes. What would be the new median time from the qualifying round?

 

Box plots

Box plots, sometimes called box-and-whisker plots, can be a useful way of displaying quantitative (numerical) data as they clearly show the five values from a five number summary of a data set. In particular, a box plot highlights the middle $50%$ of the scores in the data set, between $Q_1$ and $Q_3$.
 

Features of a box plot

We start with a number line that covers the full range of values in our data set.

We then plot the values from the five number summary above the number line, and connect them in a certain way to create a box plot. Here is an example:

The two vertical edges of the box show the upper and lower quartiles of the data set. The left hand side of the box is $Q_1$ and the right hand side of the box is $Q_3$. The vertical line inside the box shows the median.

Then there are two lines that extend from the box outwards. The endpoint of the left line is at the minimum value, while the endpoint of the right line is at the maximum value.

The box plot must be drawn parallel to a number line so that the values for the five number summary can be easily read from the graph.

The example above represents the five number summary set out below:

Minimum Lower quartile Median Upper quartile Maximum
$18$ $51$ $68$ $87$ $100$

The interquartile range (IQR) is the difference between the upper quartile and the lower quartile.

For this example, the IQR is $87-51=36$.

Since the marks of the box plot represent quartiles, each region represents $25%$ of the values in the data set. Hence, in this example, we can make statements such as:

  • $50%$ of values lie in the range from $51$ to $87$ (the interquartile range)
  • $25%$ of values are less than $51$
  • $75%$ of values are between $18$ and $87$
  • the top $25%$ of values are least spread out because this region is the shortest
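
The comparison of regions can be checked with a little arithmetic. The sketch below uses the five number summary of the example box plot; the region labels and variable names are our own, chosen only for illustration.

```python
# Five number summary read from the example box plot.
summary = {"minimum": 18, "Q1": 51, "median": 68, "Q3": 87, "maximum": 100}

points = list(summary.values())
labels = ["lowest 25%", "second 25%", "third 25%", "top 25%"]

# Width of each region = distance between neighbouring summary values.
widths = {label: high - low for label, low, high in zip(labels, points, points[1:])}

iqr = summary["Q3"] - summary["Q1"]
least_spread = min(widths, key=widths.get)

print(widths)        # {'lowest 25%': 33, 'second 25%': 17, 'third 25%': 19, 'top 25%': 13}
print(iqr)           # 36
print(least_spread)  # top 25%
```

Each region holds the same share of the values, so a shorter region means those values are packed more closely together.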

 

Practice questions

Question 4

Question 5

The box plot below shows the age at which a group of people got their driving licences.

[Box plot: the distribution of ages at which a group of people obtained their driving licences. The horizontal axis is labelled "Age" and is marked from 15 to 35.]
  1. What is the oldest age at which someone got their licence?

  2. What is the youngest age at which someone got their licence?

  3. What percentage of people were aged from $18$ to $22$?

    A. $10%$
    B. $25%$
    C. $50%$

  4. The middle $50%$ of responders were within how many years of one another?

    A. $9$
    B. $6$
    C. $7$
    D. $8$

  5. In which quartile are the ages least spread out?

    A. $4$th
    B. $1$st
    C. $3$rd
    D. $2$nd

  6. The bottom $50%$ of responders were within how many years of one another?

    A. $5$
    B. $4$
    C. $6$

 

Box plots with CAS

Worked examples

Example 3

Use a CAS calculator to construct a box plot for this data set, and determine the upper quartile:

$48,4,8,36,8,28,20,40,44$.

ClassPad

Using the Statistics mode:

  • Enter values into "list1"
  • Set the graph Type to "MedBox"
  • Use the Analysis -> Trace function, with the arrow keys to read values from the five-number summary


 

 

Example 4

Use a CAS calculator to construct a box plot for the data set represented by this frequency table:

Score Frequency
$12$ $1$
$13$ $0$
$14$ $8$
$15$ $11$
$16$ $14$
$17$ $7$

ClassPad

Using the Statistics mode:

  • Enter values into "list1"
  • Enter frequencies into "list2"
  • Set the graph Type to "MedBox"

For this data set the box plot shows that values are most spread out below the lower quartile.

 

Outliers and box plots

Sometimes a data set contains unusually high or low values. These unusual values are called outliers and may arise from data collection errors or from the natural variation of the data.

We often want to identify the outlier values, and see the characteristics of the data without the effect of the outliers. In this case, we can construct a modified box plot to show the outlier values separately.

Worked example

Example 6

For the data set used in Example 4, construct a box plot and determine the range with the outlier value displayed separately.

ClassPad

Using the Statistics mode:

  • Enter values into "list1"
  • Enter frequencies into "list2"
  • Set the graph Type to "MedBox" with "Show Outliers" enabled.

The box plot shows a dot to indicate that the value of $12$ is an outlier.

With this outlier displayed separately, the minimum score is $14$ and the lower $25%$ of values no longer appears to be unusually spread out.
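
The outcome for this topic names the $Q_1-1.5\times IQR$ and $Q_3+1.5\times IQR$ criteria for identifying possible outliers. The sketch below applies those criteria to the frequency table with scores $12$ to $17$ given earlier. It assumes the quartiles are the medians of the lower and upper halves, and the names `lower_fence` and `upper_fence` are our own labels for the two cut-off values. Whether the calculator's "Show Outliers" setting uses exactly this rule is an assumption here.

```python
from statistics import median

# Scores and frequencies from the frequency table used in this example.
scores = [12, 13, 14, 15, 16, 17]
frequencies = [1, 0, 8, 11, 14, 7]

# Expand the table into a sorted list of individual scores.
data = []
for score, freq in zip(scores, frequencies):
    data.extend([score] * freq)

n = len(data)
half = n // 2
q1 = median(data[:half])           # median of the lower half
q3 = median(data[half + n % 2:])   # median of the upper half
iqr = q3 - q1

# Values beyond these cut-offs are flagged as possible outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = sorted({x for x in data if x < lower_fence or x > upper_fence})
remaining = [x for x in data if lower_fence <= x <= upper_fence]

print(q1, q3, iqr)               # 15.0 16.0 1.0
print(lower_fence, upper_fence)  # 13.5 17.5
print(outliers)                  # [12]
print(min(remaining), max(remaining))  # 14 17
```

The only flagged score is $12$, and the smallest remaining score is $14$, which matches the modified box plot.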

 

Practice questions

Question 6

Answer the following questions using the given grouped frequency table.

Class Class centre Frequency
$40\le x<45$ $42.5$ $3$
$45\le x<50$ $47.5$ $4$
$50\le x<55$ $52.5$ $7$
$55\le x<60$ $57.5$ $3$
$60\le x<65$ $62.5$ $3$
$65\le x<70$ $67.5$ $9$
$70\le x<75$ $72.5$ $4$
$75\le x<80$ $77.5$ $5$
  1. Complete the five number summary using a CAS calculator.

    Minimum: $\editable{}$

    Lower quartile: $\editable{}$

    Median: $\editable{}$

    Upper quartile: $\editable{}$

    Maximum: $\editable{}$

  2. Calculate the interquartile range.

 
Question 7

Salaries earned by employees at a software company are given in the histogram below.

  1. Use your CAS calculator to construct a box plot, using the class centres.

    [Number line for the box plot, marked from 65000 to 105000 in steps of 5000.]

  2. Calculate the interquartile range.

  3. Using the box plot, approximately what percentage of salaries lie in the range $\$90000$ to $\$100000$?

  4. Complete the following statement.

    The highest $25%$ of salaries lie between $\$$ $\editable{}$ and $\$$ $\editable{}$ inclusive.

 

Parallel box plots

Parallel box plots are used to compare two sets of data visually.

We call these parallel box plots as they are presented parallel to each other along the same number line for comparison. They must therefore be in the same scale, so a visual comparison is fairly straightforward.

It is important to clearly label each box plot. Here we have plotted two sets of data, comparing the time it took two different groups of people to complete an online task.

We will see in later lessons that this format is very useful for comparing the characteristics of two (or more) data sets.

 

Practice question

Question 8

The heights (in metres) of the boys and girls in a class of $30$ students were recorded. The results are given in the table below.

Boys: $1.65$ $1.66$ $1.67$ $1.68$ $1.63$ $1.62$ $1.61$ $1.60$ $1.75$ $1.76$ $1.77$ $1.78$ $1.73$ $1.72$ $1.71$
Girls: $1.55$ $1.56$ $1.57$ $1.58$ $1.53$ $1.52$ $1.51$ $1.50$ $1.69$ $1.70$ $1.71$ $1.72$ $1.67$ $1.66$ $1.65$
  1. Complete the table for the given data of the heights of boys in the class.

    Minimum $\editable{}$
    First quartile $\editable{}$
    Median $\editable{}$
    Third quartile $\editable{}$
    Maximum $\editable{}$
  2. Complete the table for the given data of the heights of girls in the class.

    Minimum $\editable{}$
    First quartile $\editable{}$
    Median $\editable{}$
    Third quartile $\editable{}$
    Maximum $\editable{}$
  3. Draw a parallel box plot for this data.

Outcomes

2.1.10

construct and use parallel box plots (including the use of the ‘Q1 – 1.5 x IQR’ and ‘Q3 + 1.5 x IQR’ criteria for identifying possible outliers) to compare groups in terms of location (median), spread (IQR and range) and outliers, and interpret and communicate the differences observed in the context of the data
