Australia VIC
VCE 12 General 2023

1.04 Boxplots and outliers

Lesson

Introduction

Box plots are a great way of displaying numerical data, as they clearly show all of the quartiles in a data set. Let's take a look at the features of box plots.

Five number summary

To create a box plot it is easiest to first calculate the five number summary for the data set.

A five number summary consists of the:

  • Minimum value

  • Lower quartile (Q_1)

  • Median

  • Upper quartile (Q_3)

  • Maximum value

Once we have these five values we can plot them on a number line and create our box plot.

The diagram below shows a nice summary of all this information:

A box plot with corresponding labels. Ask your teacher for more information.

The two vertical edges of the box show the quartiles of the data range. The left hand side of the box is the lower quartile (Q_1) and the right hand side of the box is the upper quartile (Q_3). The vertical line inside the box shows the median (the middle score) of the data.

Then there are two lines that extend from the box outwards. The endpoint of the left line is at the minimum score, while the endpoint of the right line is at the maximum score.

Examples

Example 1

For the following box plot:

The image shows a box plot of scores on an axis from 0 to 20.

a

Find the lowest score.

Worked Solution
Create a strategy

The lowest score is at the end of the left whisker.

Apply the idea

\text{Lowest score}=3

b

Find the highest score.

Worked Solution
Create a strategy

The highest score is at the end of the right whisker.

Apply the idea

\text{Highest score}=18

c

Find the range.

Worked Solution
Create a strategy

The range is the difference between the highest score and the lowest score.

Apply the idea
\begin{aligned}
\text{Range} &= 18-3 && \text{Find the difference of the scores} \\
&= 15 && \text{Evaluate the subtraction}
\end{aligned}

d

Find the median.

Worked Solution
Create a strategy

The median is marked by a line between the lower and upper quartile.

Apply the idea

\text{Median}=10

e

Find the interquartile range (\text{IQR}).

Worked Solution
Create a strategy

The interquartile range (\text{IQR}) is the difference between the upper quartile and the lower quartile.

Apply the idea
\begin{aligned}
\text{Interquartile range (IQR)} &= 15-7 && \text{Find the difference between the quartiles} \\
&= 8 && \text{Evaluate the subtraction}
\end{aligned}

Example 2

Using the following box plot:

The image shows a box plot of glass width on an axis from 10.7 to 11.4.

a

What percentage of scores lie between:

  • 10.9 and 11.2

  • 10.8 and 10.9

  • 11.1 and 11.3

  • 10.9 and 11.3

  • 10.8 and 11.2

Worked Solution
Create a strategy

Think about how many quartiles are in that range. One quartile represents 25\% of the data set.

Apply the idea

50\% of scores lie between Q1 and Q3. So 50\% of scores lie between 10.9 and 11.2.

25\% of the scores lie between the lowest score and Q1. So 25\% of scores lie between 10.8 and 10.9.

50\% of scores lie between the median and the highest score. So 50\% of scores lie between 11.1 and 11.3.

75\% of scores lie between Q1 and the highest score. So 75\% of scores lie between 10.9 and 11.3.

75\% of scores lie between the lowest score and Q3. So 75\% of scores lie between 10.8 and 11.2.
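The counting in this solution can be sketched in Python. This is only an illustration: the function name `percent_between` is made up for this sketch, the five values are the ones read from the box plot, and it assumes both endpoints are values from the five number summary, with about 25\% of the scores between neighbouring values.

```python
# Five number summary read from the Example 2 box plot (glass width):
# minimum, Q1, median, Q3, maximum.
summary = [10.8, 10.9, 11.1, 11.2, 11.3]


def percent_between(low, high, summary):
    """Approximate percentage of scores between two summary values.

    Both values must appear in the five number summary. Each gap between
    neighbouring summary values holds about 25% of the scores.
    """
    quarters = summary.index(high) - summary.index(low)
    return 25 * quarters


pairs = [(10.9, 11.2), (10.8, 10.9), (11.1, 11.3), (10.9, 11.3), (10.8, 11.2)]
for low, high in pairs:
    print(low, "to", high, ":", percent_between(low, high, summary), "%")
```

The five results are 50, 25, 50, 75 and 75 percent, in the same order as the question.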

b

In which quartile (or quartiles) is the data the most spread out?

Worked Solution
Create a strategy

Which quartile takes up the longest space on the graph?

Apply the idea

The second quartile is the most spread out.

Example 3

Consider the following data set: 20, \, 36, \, 52, \, 56, \, 24, \, 16, \, 40, \, 4, \, 28

a

Complete the table for the given data:

Minimum:
Lower quartile:
Median:
Upper quartile:
Maximum:

Worked Solution
Create a strategy

Order the numbers from smallest to largest to find the values of the five number summary.

Apply the idea

Ordered data: 4, \, 16, \, 20, \, 24, \, 28, \, 36, \, 40, \, 52, \, 56

\begin{aligned}
\text{Minimum} &= 4 && \text{The first score} \\
\text{Maximum} &= 56 && \text{The last score} \\
\text{Median} &= 28 && \text{The middle score} \\
Q_1 &= \dfrac{16+20}{2} && \text{Average the middle scores of } 4, \, 16, \, 20, \, 24 \\
&= 18 && \text{Evaluate} \\
Q_3 &= \dfrac{40+52}{2} && \text{Average the middle scores of } 36, \, 40, \, 52, \, 56 \\
&= 46 && \text{Evaluate}
\end{aligned}

The completed table is shown:

Minimum: 4
Lower quartile: 18
Median: 28
Upper quartile: 46
Maximum: 56

b

Construct a box plot for the data.

Worked Solution
Create a strategy

Use the five number summary to construct the box plot.

Apply the idea
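
The image shows a box plot on a data axis from 0 to 60, with the minimum at 4, the lower quartile at 18, the median at 28, the upper quartile at 46 and the maximum at 56.
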
Idea summary

The features of a box-and-whisker plot are shown below:

The image shows a box plot on an axis from 10.7 to 11.4. Ask your teacher for more information.

A list of the minimum, lower quartile, median, upper quartile, and maximum values is often called the five number summary.

One quartile represents 25\% of the data set.

Creating a box-and-whisker plot:

  1. Put the data in ascending order (from smallest to largest).

  2. Find the median (middle value) of the data.

  3. To divide the data into quarters, find the median (middle value) between the minimum value and the median, as well as between the median and the maximum value.
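The steps above can be sketched in Python. This is a minimal sketch rather than a definitive implementation: the function name `five_number_summary` is made up here, and it assumes the convention used in the worked examples, where the quartiles are the medians of the lower and upper halves and the middle score is left out of both halves when the number of scores is odd. Calculators and software may use slightly different quartile conventions.

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of scores.

    Quartiles are the medians of the lower and upper halves of the
    ordered data. With an odd number of scores, the middle score is
    excluded from both halves.
    """
    ordered = sorted(data)            # step 1: ascending order
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # scores below the median position
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (
        ordered[0],
        median(lower),                # step 3: median of the lower half
        median(ordered),              # step 2: median of all the data
        median(upper),                # step 3: median of the upper half
        ordered[-1],
    )


# Data set from Example 3.
scores = [20, 36, 52, 56, 24, 16, 40, 4, 28]
print(five_number_summary(scores))  # (4, 18.0, 28, 46.0, 56)
```

The printed values match the completed table in Example 3.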

Shape of data from box plots

Data can be displayed in both histograms and box plots. Both displays make it easy to identify the shape and the range of the data, and the box plot also shows the IQR and the median.

We should therefore expect the shape of the data to be the same whether it is represented in a box plot or a histogram. Remember that the shape of data can be symmetric, positively skewed or negatively skewed.

Symmetric

The image shows a symmetrical curve, histogram, and box plot. Ask your teacher for more information.

Positive skew

The image shows a positively skewed curve, histogram, and box plot. Ask your teacher for more information.

Negative skew

The image shows a negatively skewed curve, histogram, and box plot. Ask your teacher for more information.

Let's now match some histograms to their correct box plot representation.

Examples

Example 4

Match each histogram to a box plot.

The image shows 3 histograms and 3 box plots. Ask your teacher for more information.
Worked Solution
Create a strategy

Look for characteristics of skewed and symmetric distributions of data.

Apply the idea
  • Boxplot A and histogram 3 have long right tails, so they are both right skewed.

  • Boxplot C and histogram 2 have long left tails, so they are both left skewed.

  • Boxplot B and histogram 1 are both approximately symmetric.

Example 5

Consider the following pairs of histograms and box plots.

a

Which two of these histograms and box plots are correctly paired?

A
A histogram and box plot that are both negatively skewed. Ask your teacher for more information.
B
A histogram and box plot that are both symmetrical. Ask your teacher for more information.
C
A histogram that is symmetrical and box plot that is positively skewed. Ask your teacher for more information.
D
A histogram that is symmetrical and box plot that is negatively skewed. Ask your teacher for more information.
Worked Solution
Create a strategy

Look for characteristics of skewed and symmetric distributions of data.

Apply the idea

In Option A, both the histogram and box plot are negatively skewed.

In Option B, the histogram and box plot are both symmetrical.

In Option C, the histogram is roughly symmetrical, but the box plot is positively skewed.

In Option D, the histogram is roughly symmetrical, but the box plot is negatively skewed.

The matching pairs are options A and B.

b

In part (a) we determined that the following histogram/box plot were an incorrect match:

A histogram that is symmetrical and box plot that is positively skewed. Ask your teacher for more information.

Which two of the options correctly describe why?

A
The box plot has a long tail to the right which indicates positive skew, while the histogram does not appear to be skewed.
B
The data on the histogram is widely spread, while the box plot indicates that the data is mostly located around the median.
C
The median for the histogram is roughly in the middle, while the median of the box plot is located further to the left.
Worked Solution
Create a strategy

Compare the differences on the features of the histogram and box plot.

Apply the idea

Comparing the histogram and box plot, we can see that the box plot has a long tail to the right, which indicates positive skew, and its median is to the left. However, the histogram is symmetrical and its median would be in the middle.

So, the correct answers are Options A and C.

Idea summary

We can compare the shape of the data in histograms and box plots by checking for symmetry and skew.

  • For symmetric data, both graphs should be symmetrical about the median.

  • For positively skewed data, the histogram would have most of the data to the left and a shape with the tail pointing right. The box plot would have the box to the left and a long right whisker.

  • For negatively skewed data, the histogram would have most of the data to the right and a shape with the tail pointing left. The box plot would have the box to the right and a long left whisker.
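As a rough illustration of the box plot descriptions above, the Python sketch below labels the shape by comparing the spread above the median with the spread below it. It is a simple heuristic written for this summary, not a formal test of skewness. The function name `describe_shape`, the `tolerance` value of 0.2 and the three sample summaries are all assumptions chosen for illustration.

```python
def describe_shape(summary, tolerance=0.2):
    """Rough shape label from a five number summary.

    summary is (minimum, Q1, median, Q3, maximum). Only the minimum,
    median and maximum are needed for this crude comparison.
    tolerance is an arbitrary choice: the fraction of the range by which
    the two sides may differ and still count as roughly symmetric.
    """
    minimum, _q1, median, _q3, maximum = summary
    below = median - minimum   # spread to the left of the median
    above = maximum - median   # spread to the right of the median
    total = maximum - minimum
    if total == 0 or abs(above - below) <= tolerance * total:
        return "approximately symmetric"
    if above > below:
        return "positively skewed"
    return "negatively skewed"


# Made-up five number summaries, one for each shape.
print(describe_shape((1, 2, 3, 6, 12)))    # positively skewed
print(describe_shape((0, 6, 9, 10, 11)))   # negatively skewed
print(describe_shape((0, 3, 5, 7, 10)))    # approximately symmetric
```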

Outliers

An outlier is an event that is very different from the norm and results in a score that is far away from the rest of the scores. For example, suppose there are 10 people in a long jump contest. Nine of those people managed a jump between 4 and 5 metres, while one person only jumped 1 metre. That one person is an outlier, as their jump is so much shorter than everyone else's.

To determine whether a data value is an outlier, we use a rule that involves the interquartile range (IQR). This rule gives us the lower and upper fences of a box plot. The fences are the lower and upper boundaries for typical scores, and any score that lies outside the fences is classified as an outlier.

Calculating outliers:

A data point is classified as an outlier if it lies above the upper fence or below the lower fence. These are calculated as follows:

\text{Lower fence} = \text{Lower quartile} -1.5 \times \text{Interquartile range}

\text{Upper fence} = \text{Upper quartile} +1.5 \times \text{Interquartile range}

Using the five number summary and the upper and lower fences, we can construct a box-and-whisker plot and identify any outliers.

A box plot showing the first quartile, third quartile, lower fence and upper fence. Ask your teacher for more information.

The above diagram shows the construction of the box plots and how the upper and lower fences are constructed. Any data points outside of these are outliers.

Examples

Example 6

Consider the following set of data: 9,\,5,\,3,\,2,\,6,\,1

a

Complete the five-number summary for this data set.

Minimum
Lower quartile
Median
Upper quartile
Maximum
Worked Solution
Create a strategy

Order the numbers from smallest to largest to find the values of the five number summary.

Apply the idea

Ordered data: 1,\,2,\,3,\,5,\,6,\,9

\begin{aligned}
\text{Minimum} &= 1 && \text{The first score} \\
\text{Maximum} &= 9 && \text{The last score} \\
\text{Median} &= \dfrac{3+5}{2} && \text{Average of the middle scores of the sorted data} \\
&= 4 && \text{Evaluate} \\
Q_1 &= 2 && \text{Lower quartile: the middle score of } 1,\,2,\,3 \\
Q_3 &= 6 && \text{Upper quartile: the middle score of } 5,\,6,\,9
\end{aligned}

The completed table is shown:

Minimum: 1
Lower quartile: 2
Median: 4
Upper quartile: 6
Maximum: 9

b

Calculate the interquartile range.

Worked Solution
Create a strategy

We can use the interquartile range formula: \text{IQR} = Q_{3} - Q_{1}

Apply the idea
\begin{aligned}
\text{IQR} &= 6-2 && \text{Substitute the quartiles} \\
&= 4 && \text{Evaluate}
\end{aligned}

c

Calculate the value of the lower fence.

Worked Solution
Create a strategy

We can use the formula: \text{Lower fence} = \text{lower quartile}-1.5\times \text{interquartile range}.

Apply the idea
\begin{aligned}
\text{Lower fence} &= 2-1.5 \times 4 && \text{Substitute the lower quartile and the IQR} \\
&= -4 && \text{Evaluate}
\end{aligned}

d

Calculate the value of the upper fence.

Worked Solution
Create a strategy

We can use the formula: \text{Upper fence} = \text{upper quartile}+1.5\times \text{interquartile range}.

Apply the idea
\begin{aligned}
\text{Upper fence} &= 6+1.5 \times 4 && \text{Substitute the upper quartile and the IQR} \\
&= 12 && \text{Evaluate}
\end{aligned}

e

Would the value -3 be considered an outlier?

Worked Solution
Create a strategy

Remember that an outlier should be less than the lower fence or greater than the upper fence.

Apply the idea

We found in part (c) that the lower fence is -4, and the value -3 is not lower than it. In part (d) the upper fence is 12, and the value -3 is not higher than it. So this means that -3 is not an outlier.
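The fence calculations in parts (c) to (e) can be checked with a short Python sketch. The function names `fences` and `is_outlier` are made up for this illustration, and the value 13 is an extra made-up test value that is not part of the example.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(value, q1, q3):
    """True if value is below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return value < lower or value > upper


# Quartiles from Example 6: Q1 = 2 and Q3 = 6.
print(fences(2, 6))           # (-4.0, 12.0)
print(is_outlier(-3, 2, 6))   # False: -3 lies between the fences
print(is_outlier(13, 2, 6))   # True: 13 is above the upper fence
```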

Idea summary

Calculating outliers:

A data point is classified as an outlier if it lies above the upper fence or below the lower fence. These are calculated as follows:

\text{Lower fence} = \text{Lower quartile} -1.5 \times \text{Interquartile range}

\text{Upper fence} = \text{Upper quartile} +1.5 \times \text{Interquartile range}

Effects of outliers

Outliers can skew or change the shape of our data. If we compare two data sets that are identical apart from one outlier, we will see the following:

  • The set with an outlier will have a much larger range.

  • The median of each set will be close together, and might be the same.

  • The mean of each set will be noticeably different.

  • The mode of each set will be the same, since an outlier added to a data set will never be the most frequent score.

In particular, adding an outlier to a set of data will change the shape as follows:

Adding a low outlier:

  • The range increases significantly.

  • The median could decrease a little.

  • The mean decreases significantly.

  • The mode will not change.

Adding a high outlier:

  • The range increases significantly.

  • The median could increase a little.

  • The mean increases significantly.

  • The mode will not change.

Examples

Example 7

Consider the following set of data: 53,\,46,\,25,\,50,\,30,\,30,\,40,\,30,\,47,\,109

a

Complete the following table of summary statistics.

Mean:
Median:
Mode:
Range:

Worked Solution
Create a strategy

To find the mean, use the formula: \text{Mean}=\dfrac{\text{Sum of scores}}{\text{Number of scores}}

To find the median, find the middle score. To find the mode, find the most frequent score.

To find the range, use the formula: \text{Range}=\text{Highest score}-\text{Lowest score}

Apply the idea
\begin{aligned}
\text{Mean} &= \dfrac{53+46+25+50+30+30+40+30+47+109}{10} && \text{Use the formula} \\
&= \dfrac{460}{10} && \text{Evaluate the addition} \\
&= 46 && \text{Evaluate the division}
\end{aligned}

To find the median, order the scores: 25,\,30,\,30,\,30,\,40,\,46,\,47,\,50,\,53,\,109

The middle scores are: 40,\,46

\begin{aligned}
\text{Median} &= \dfrac{40+46}{2} && \text{Find the average of the middle values} \\
&= 43 && \text{Evaluate}
\end{aligned}

To find the mode, choose the score which occurs most often.

\text{Mode}=30

To find the range:

\begin{aligned}
\text{Range} &= 109-25 && \text{Substitute the values} \\
&= 84 && \text{Evaluate the subtraction}
\end{aligned}

The completed table is shown:

Mean: 46
Median: 43
Mode: 30
Range: 84

b

Which data value is an outlier?

Worked Solution
Create a strategy

Choose the value that is much greater or much smaller than the rest of the data set.

Apply the idea

\text{Outlier}=109
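The choice above is made by eye. As a check, the fence rule from the Outliers section can be applied to the same data. The Python sketch below assumes the quartiles are found as the medians of the lower and upper halves of the ordered data, as in Example 6. The variable names are chosen for this illustration only.

```python
from statistics import median

scores = sorted([53, 46, 25, 50, 30, 30, 40, 30, 47, 109])

# Ten scores, so the lower half is the first five scores
# and the upper half is the last five scores.
half = len(scores) // 2
q1 = median(scores[:half])    # median of 25, 30, 30, 30, 40
q3 = median(scores[half:])    # median of 46, 47, 50, 53, 109
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any score outside the fences is classified as an outlier.
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print(q1, q3, lower_fence, upper_fence)  # 30 50 0.0 80.0
print(outliers)                          # [109]
```

Since 109 is above the upper fence of 80, the rule agrees that 109 is an outlier.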

c

Complete the following table of summary statistics after removing the outlier 109:

Mean:
Median:
Mode:
Range:

Worked Solution
Apply the idea
\begin{aligned}
\text{Mean} &= \dfrac{25+30+30+30+40+46+47+50+53}{9} && \text{Substitute all the scores} \\
&= \dfrac{351}{9} && \text{Evaluate the addition} \\
&= 39 && \text{Evaluate the division}
\end{aligned}

To find the median, order the scores: 25,\,30,\,30,\,30,\,40,\,46,\,47,\,50,\,53

The middle score is 40.

\text{Median}=40

To find the mode, choose the score which occurs most often.

\text{Mode}=30

To find the range:

\begin{aligned}
\text{Range} &= 53-25 && \text{Substitute the values} \\
&= 28 && \text{Evaluate the subtraction}
\end{aligned}

The completed table is shown:

Mean: 39
Median: 40
Mode: 30
Range: 28

d

Let A be the original data set and B be the data set without the outlier.

Complete the table using the symbols >, < and = to compare the statistics before and after removing the outlier.

With outlier (A) compared to without outlier (B):
Mean: A ___ B
Median: A ___ B
Mode: A ___ B
Range: A ___ B

Worked Solution
Create a strategy

Compare the statistics in part (a) and in part (c).

Apply the idea

Statistics from parts (a) and (c):

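Mean: 46 with outlier, 39 without outlier
Median: 43 with outlier, 40 without outlier
Mode: 30 with outlier, 30 without outlier
Range: 84 with outlier, 28 without outlier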

Comparison table:

With outlier (A) compared to without outlier (B):
Mean: A > B
Median: A > B
Mode: A = B
Range: A > B

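The comparison in this example can be reproduced with Python's standard `statistics` module. This is a sketch for checking the arithmetic. The helper name `summarise` and the variable names are made up for this illustration.

```python
from statistics import mean, median, mode


def summarise(data):
    """Return the mean, median, mode and range of a list of scores."""
    return {
        "mean": mean(data),
        "median": median(data),
        "mode": mode(data),
        "range": max(data) - min(data),
    }


# Data set A from Example 7, and data set B with the outlier 109 removed.
with_outlier = [53, 46, 25, 50, 30, 30, 40, 30, 47, 109]
without_outlier = [x for x in with_outlier if x != 109]

a = summarise(with_outlier)
b = summarise(without_outlier)

# Print each statistic for A and then B, side by side.
for name in a:
    print(name, a[name], b[name])
```

The mean and the range change the most when the outlier is removed, the median moves only slightly, and the mode stays the same.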
Idea summary

In particular, adding an outlier to a set of data will change the shape as follows:

Adding a low outlier:

  • The range increases significantly.

  • The median could decrease a little.

  • The mean decreases significantly.

  • The mode will not change.

Adding a high outlier:

  • The range increases significantly.

  • The median could increase a little.

  • The mean increases significantly.

  • The mode will not change.

Outcomes

U3.AoS1.4

the five-number summary and boxplots (including the designation and display of possible outliers)
