
7.05 Box plots

Lesson

Five number summary

We previously looked at the quartiles of a data set, and found the first quartile, the median, and the third quartile. Remember that the quartiles can be useful to give some basic insight into the internal spread of data, whereas the range only uses the difference between the two extreme data points, the maximum and minimum. We can use the quartiles in combination with the two extremes of a data set to simplify the data into a five number summary:

  • Minimum value: The minimum value is the lowest score in a data set.
  • Lower Quartile $\left(Q_1\right)$: The lower quartile is also called the first quartile. It is the middle score between the lowest score and the median, and it represents the $25$th percentile.
  • Median: The median is the middle score in a data set. It is calculated as the $\frac{n+1}{2}$th score, where $n$ is the total number of scores.
  • Upper Quartile $\left(Q_3\right)$: The upper quartile is also called the third quartile. It is the middle score between the highest score and the median, and it represents the $75$th percentile.
  • Maximum value: The maximum value is the highest score in a data set.
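
As a rough illustration, the five values can be computed with a short Python sketch. It assumes the quartiles are found as the medians of the lower and upper halves of the sorted scores (leaving out the median itself when the number of scores is odd); other quartile conventions exist and can give slightly different values. The function name `five_number_summary` and the scores are made up for this example.

```python
from statistics import median

def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the two halves."""
    data = sorted(scores)
    n = len(data)
    half = n // 2
    lower = data[:half]            # scores below the median position
    upper = data[half + n % 2:]    # scores above it (skips the median when n is odd)
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical scores, not taken from the lesson
scores = [7, 3, 12, 9, 5, 10, 15]
print(five_number_summary(scores))  # (3, 5, 9, 12, 15)
```

Sorting first matters: the quartiles are positions in the ordered list, not in the order the scores were recorded.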

 

The five numbers from the five number summary break up a set of scores into four parts with $25\%$ of the scores in each quartile. Have a look at the diagram here:

So knowing these five key numbers can help us identify regions, such as the top $25\%$, $50\%$, and $75\%$ of the scores.

 

Practice question

Question 1

The table shows the number of points scored by a basketball team in each game of their previous season.

$59$ $67$ $73$ $82$ $91$ $58$ $79$ $88$
$69$ $84$ $55$ $80$ $98$ $64$ $82$
  1. Sort the data in ascending order.

  2. State the maximum value of the set.

  3. State the minimum value of the set.

  4. Find the median value.

  5. Find the lower quartile.

  6. Find the upper quartile.

 

Box plots

Box plots, sometimes called box-and-whisker plots, are a useful way of getting a quick overview of a numerical data set as they visually display the five number summary. 

Creating a box plot:

  • Start with a horizontal scale that covers the full range of values in the data set. Label and mark the scale.
  • Plot the values from the five number summary above the number line.
  • Create vertical lines at $Q_1$ and $Q_3$ and join the tops and bottoms of these lines to frame the box. The left-hand side of the box is $Q_1$ and the right-hand side of the box is $Q_3$. The box highlights the middle $50\%$ of the scores in the data set.
  • Draw a vertical line inside the box to show the median of the data set.
  • Draw two lines (whiskers) that extend outwards from the box. The endpoint of the left whisker is at the minimum score, while the endpoint of the right whisker is at the maximum score.
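
The steps above can be mimicked with a small text-only sketch in Python. This is only an illustration, not a replacement for drawing the plot to scale: it assumes every summary value is a whole number, and the function name `ascii_box_plot` and its symbols are invented here. The values used are the ones read off in Worked example 1 (minimum $3$, $Q_1=8$, median $10$, $Q_3=15$, maximum $18$).

```python
def ascii_box_plot(minimum, q1, med, q3, maximum, lo, hi):
    """Return a one-line text box plot on a whole-number scale from lo to hi."""
    cells = []
    for x in range(lo, hi + 1):
        if x < minimum or x > maximum:
            cells.append(" ")      # outside the data: nothing drawn
        elif x == med:
            cells.append("M")      # median line inside the box
        elif x == q1:
            cells.append("[")      # left-hand side of the box
        elif x == q3:
            cells.append("]")      # right-hand side of the box
        elif q1 < x < q3:
            cells.append("=")      # interior of the box (middle 50%)
        elif x in (minimum, maximum):
            cells.append("|")      # end of a whisker
        else:
            cells.append("-")      # whisker
    return "".join(cells)

# One character per unit on a scale from 0 to 20
print(ascii_box_plot(3, 8, 10, 15, 18, lo=0, hi=20))
```

Each character stands for one unit on the scale, so the printed line shows whiskers ending at $3$ and $18$, a box from $8$ to $15$, and the median marked at $10$.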

 

Worked examples

Example 1

For the box plot above, find the:

(a) Range

Think: The range is the difference between the highest score and the lowest score. That is, the difference between the scores at the ends of the whiskers.

Do: For this data set, the range is $18-3=15$.

(b) Median

Think: The median is shown by the line inside the rectangular box.

Do: For this data set, the median line is at the score $10$.

(c) Interquartile range (IQR)

Think: The IQR is the difference between the upper quartile and the lower quartile.

Do: For this set, the lower quartile (at the left end of the box) is $8$, while the upper quartile (at the right end of the box) is $15$. This means that the IQR is $15-8=7$.

(d) What percentage of scores are in the range $8$ to $18$ inclusive?

Think: $8$ is the first quartile and $18$ is the maximum value, and $25\%$ of the data lies in each of the four sections of the box plot. Three sections lie between $Q_1$ and the maximum.

Do: $75\%$ of the data lies between these values.
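
The arithmetic in this example can be checked with a few lines of Python. This is just a sketch using the values read from the box plot; the helper `percent_between` is a made-up name, and it relies on the rule that each gap between neighbouring summary values holds about $25\%$ of the scores.

```python
# Five number summary read from the box plot in Example 1
minimum, q1, med, q3, maximum = 3, 8, 10, 15, 18
markers = [minimum, q1, med, q3, maximum]

data_range = maximum - minimum   # difference between the ends of the whiskers
iqr = q3 - q1                    # width of the box

def percent_between(a, b):
    """Approximate percentage of scores between two summary values a <= b."""
    # Each gap between neighbouring summary values holds about 25% of the scores.
    return 25 * (markers.index(b) - markers.index(a))

print(data_range, iqr, percent_between(8, 18))  # 15 7 75
```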

 

Outliers and box plots

Sometimes a data set contains unusually high or low values. These unusual values are called outliers and may arise from data collection errors or from natural variation in the data.

We often want to identify the outlier values, and see the characteristics of the data without the effect of the outliers. In this case, we can construct a modified box plot to show the outlier values separately.

We will look more closely at identifying and displaying outliers in sets of data in our next lesson on describing distributions.
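
As a preview, here is a minimal Python sketch of the criterion listed in this lesson's outcomes, which treats a score $x$ as a possible outlier when it falls outside $Q_1-1.5\times IQR\le x\le Q_3+1.5\times IQR$. The function name `possible_outliers` and the scores are hypothetical, and the quartiles are passed in directly rather than computed.

```python
def possible_outliers(scores, q1, q3):
    """Return the scores lying outside Q1 - 1.5*IQR <= x <= Q3 + 1.5*IQR."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [x for x in scores if x < lower_fence or x > upper_fence]

# Hypothetical scores with Q1 = 8 and Q3 = 15, so the fences are -2.5 and 25.5
scores = [3, 8, 9, 10, 12, 15, 40]
print(possible_outliers(scores, q1=8, q3=15))  # [40]
```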

 

Practice question

Question 2

 

Parallel box plots

Parallel box plots are used to compare two or more sets of data visually. The box plots are drawn parallel to each other along the same number line, so they share one scale and can be compared directly. It is important to clearly label each box plot.

Key Comparisons:

When comparing two sets of data we can compare the location of each value of the five number summary. We can also ask ourselves the following questions:

  • How does the spread of each data set compare? Compare the range and interquartile range.
  • How does the skew of each data set compare? Is one set of data more symmetrical? 
  • Is there a big difference in the medians?
  • Is there a significant difference in location of any of the quartiles? For example, does one box plot have a whisker that lies completely above the range of the other box plot? This would mean $25\%$ of the scores in the first data set are larger than any score in the second set.
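
These comparisons can be sketched in Python once each data set has been reduced to its five number summary. The `Summary` type, the `compare` function, and both sets of values below are hypothetical and chosen only to illustrate the questions above.

```python
from typing import NamedTuple

class Summary(NamedTuple):
    minimum: float
    q1: float
    median: float
    q3: float
    maximum: float

def compare(a, b):
    """Compare two five number summaries on location and spread."""
    return {
        "median_difference": a.median - b.median,
        "range_difference": (a.maximum - a.minimum) - (b.maximum - b.minimum),
        "iqr_difference": (a.q3 - a.q1) - (b.q3 - b.q1),
        # True when the upper whisker of a lies completely above all of b
        "top_quarter_of_a_above_all_of_b": a.q3 > b.maximum,
    }

# Hypothetical summaries for two groups
group_a = Summary(20, 32, 38, 45, 50)
group_b = Summary(5, 12, 18, 24, 30)
print(compare(group_a, group_b))
```

A positive difference means the first data set has the larger value, so here group A has the higher median and the slightly larger spread.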

 

Worked example

Example 2

The parallel box plot below shows two sets of data, comparing the time it took two different groups of people to complete an online task.

(a) Which group was generally faster?

Think: Which box plot has its main values further to the left? Is this consistent for all of the values in the five number summary? Are the differences significant? In particular note the difference in the median.

Do: We can see that overall the under $30$s were faster at completing the task. Each of the numbers in the five number summary is smaller for the under $30$s, and their median is $4$ seconds faster than that of the over $30$s. More than $75\%$ of the under $30$s completed the task in under $22$ seconds, which is the median time taken by the over $30$s. In fact, $100\%$ of the under $30$s had finished the task before $75\%$ of the over $30$s had completed it.

(b) Which group had more consistent completion times?

Think: For consistency note the difference in range and interquartile range. Recall, the smaller a measure of spread the more consistent the scores are.

Do: Overall the under $30$s had a smaller spread of scores, so their completion times were more consistent. The over $30$s were more spread out, with a range of $24$ seconds compared to $20$ seconds for the under $30$s. The interquartile range was also $3$ seconds smaller for the under $30$s.

Practice question

Question 3

The box plots below represent the daily sales made by Carl and Angelina over the course of one month.

[Figure: two box plots, "Angelina's Sales" and "Carl's Sales", each drawn on a scale from 0 to 70 marked in steps of 10.]

  1. What is the range in Angelina's sales?

  2. What is the range in Carl’s sales?

  3. By how much did Carl’s median sales exceed Angelina's?

  4. Considering the middle $50\%$ of sales for both salespeople, whose sales were more consistent?

    A. Carl

    B. Angelina

  5. Which salesperson had a more successful sales month?

    A. Angelina

    B. Carl

 

Constructing box plots with technology

Select your brand of calculator below to work through an example of determining the five number summary and constructing box plots efficiently with the aid of technology.

 

Casio ClassPad

Calculator example coming soon.

TI Nspire

Calculator example coming soon.
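
Data given as a frequency table can also be handled with general-purpose software. The Python sketch below is one possible approach and not the calculator procedure: it expands a frequency table into a list of scores and then takes the medians of the two halves as the quartiles. That quartile convention is an assumption, and a calculator or other software may use a different one. The names `expand` and `five_number_summary` and the table values are made up for illustration.

```python
from statistics import median

def expand(frequency_table):
    """Turn {score: frequency} into a sorted list with each score repeated."""
    scores = []
    for score, frequency in sorted(frequency_table.items()):
        scores.extend([score] * frequency)
    return scores

def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the two halves."""
    data = sorted(scores)
    n = len(data)
    half = n // 2
    lower = data[:half]            # scores below the median position
    upper = data[half + n % 2:]    # scores above it (skips the median when n is odd)
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical frequency table: score -> frequency
table = {1: 2, 2: 3, 3: 4, 4: 1}
print(five_number_summary(expand(table)))  # (1, 2, 2.5, 3, 4)
```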

 

Practice questions

Question 4

Answer the following questions using the given frequency table.

Score Frequency
$15$ $13$
$16$ $9$
$17$ $23$
$18$ $19$
$19$ $8$
$20$ $13$
  1. Complete the five number summary using a CAS calculator.

    Minimum: $\editable{}$

    Lower quartile: $\editable{}$

    Median: $\editable{}$

    Upper quartile: $\editable{}$

    Maximum: $\editable{}$

  2. Calculate the interquartile range.

Question 5

Salaries earned by employees at a software company are given in the histogram below.

  1. Use your CAS calculator to construct a box plot, using the class centres.

    [Box plot axis: salaries from 65000 to 105000, marked in steps of 5000.]

  2. Calculate the interquartile range.

  3. Using the box plot, approximately what percentage of salaries lie in the range $\$90000$ to $\$100000$?

  4. Complete the following statement.

    The highest $25\%$ of salaries lie between $\$\quad$ $\editable{}$ and $\$\quad$ $\editable{}$ inclusive.

Outcomes

2.3.2.1

construct and use parallel box plots (including the use of the Q1 − 1.5 × IQR ≤ 𝑥 ≤ Q3 + 1.5 × IQR criteria for identifying possible outliers) to compare datasets in terms of median, spread (IQR and range) and outliers to interpret and communicate the differences observed in the context of the data
