Boxplots

The range is a measure of spread based on the minimum and maximum of a data set, but it does not tell us about the spread of the data that falls between these two values. To find the range of a data set, subtract the smallest value from the largest value.

The median is a measure of center, and tells us where the middle of the data set is.

Exploration

Investigate how to divide the data into various percentages by checking the boxes. To investigate with different sets of data, click the "New data" button.

  1. What percentage of the data lies below Q_1? Hence, what does Q_1 represent?

  2. What percentage of the data lies below the median? Hence, what does the median represent?

  3. What percentage of the data lies below Q_3? Hence, what does Q_3 represent?

  4. What percentage of the data lies between the minimum value and the maximum value?

  5. Which section of the data is the most spread out?

    • From the minimum to Q_1

    • From Q_1 to the median

    • From the median to Q_3

    • From Q_3 to the maximum

To get a better picture of the internal spread in a data set, we can find the quartiles of a data set. Quartiles are scores at particular locations in the data set. Instead of dividing a data set into halves, like the median, they divide a data set into 4 quarters, where each quarter contains the same number of data values.

Let's look at how we would divide up a data set into quarters. When runners train for a marathon, they gradually increase the number of miles they run in the months before the marathon. This data set represents the number of miles Alessia ran each week of training.

A data set with scores 1, 3, 4, 7, 11, 12, 14, 19.

We want to make sure the data set is ordered from smallest to largest before finding the quartiles or the median.

A data set with scores 1, 3, 4, 7, 11, 12, 14, 19. The median is located between 7 and 11.

First we will locate the median, which is the middle data value. In this case there is no single middle value, so the median is halfway between the two middle values: \dfrac{7+11}{2}=9

Now there are four values in each half of the data set, so we will split each group of four values in half to find the quartiles.

Scores 1, 3, 4, 7, 11, 12, 14, 19. Quartile 1 is between 3 and 4, quartile 3 is between 12 and 14.

We can see the first quartile, Q_{1}, is between the 2nd and 3rd values: \dfrac{3+4}{2}=3.5

Similarly, the third quartile, Q_{3}, is between the 6th and 7th values: \dfrac{12+14}{2}=13

We can now summarize the data by looking at these five critical points:

  • Lower extreme (minimum): lowest value
  • Lower quartile (Q_1): about 25\% of the data is below this value
  • Median (Q_2): about 50\% of the data lies on either side of this value
  • Upper quartile (Q_3): about 25\% of the data is above this value
  • Upper extreme (maximum): highest value

Lower extreme: 1
Lower quartile: 3.5
Median: 9
Upper quartile: 13
Upper extreme: 19
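
The median-of-halves procedure used above can be sketched in Python. This is a minimal illustration rather than part of the original lesson: the function name `five_number_summary` is hypothetical, and the sketch assumes the data set has at least two values.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    data = sorted(values)                 # order from smallest to largest
    half = len(data) // 2                 # size of each half (middle value excluded if odd)
    lower = data[:half]                   # values below the median position
    upper = data[len(data) - half:]       # values above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])


alessia_miles = [1, 3, 4, 7, 11, 12, 14, 19]
print(five_number_summary(alessia_miles))
```

For Alessia's data this prints `(1, 3.5, 9.0, 13.0, 19)`, which agrees with the values in the table.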

These values, known as the five-number summary, can be easily displayed in a boxplot or box-and-whisker plot.

A boxplot with its different parts: Lower extreme (minimum), Lower quartile (Q_1), Median, Upper quartile (Q_3), Upper extreme (maximum).

The number line at the bottom helps us read the values in the boxplot. Above that, you will see that there are two lines or "whiskers" that extend from the box outwards. The box and the whiskers help us easily identify the four different quartiles. Each quartile represents approximately 25\% of the data set.

A boxplot with its different parts: Lower extreme (minimum), Lower quartile (Q_1), Median, Upper quartile (Q_3), Upper extreme (maximum). About 25\% of the data is on the left whisker, in the left side of the box, in the right side of the box, and on the right whisker.

The interquartile range (IQR) is the difference between the third quartile and the first quartile. About 50\% of the scores lie between these two quartiles.

Since it focuses on the middle 50\% of the data set, the interquartile range often gives a better indication of the internal spread than the range does, and it is less affected by outliers. IQR = \text{Upper quartile} - \text{Lower quartile}

In the previous example, the IQR of Alessia's training data is 13-3.5 = 9.5

This tells us that the middle 50\% of the data differs by 9.5 miles.
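
To illustrate why the IQR is less affected by outliers than the range, here is a small Python sketch. The second data set is hypothetical: it is Alessia's data with the last value changed from 19 to 40 purely for illustration. The helper names `quartiles` and `spread` are also assumptions, and the quartiles are found with the same median-of-halves method used above.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[len(data) - half:])


def spread(values):
    """Return the range and the interquartile range of a data set."""
    q1, q3 = quartiles(values)
    return {"range": max(values) - min(values), "iqr": q3 - q1}


original = [1, 3, 4, 7, 11, 12, 14, 19]
with_outlier = [1, 3, 4, 7, 11, 12, 14, 40]  # hypothetical outlier, not from the lesson

print(spread(original))
print(spread(with_outlier))
```

The range grows from 18 to 39 when the outlier is introduced, while the IQR stays at 9.5 in both cases.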

To create a boxplot:

  1. Put the data in ascending order (from smallest to largest).

  2. Find the median (middle value) of the data.

  3. To divide the data into quarters, find the middle value between the minimum value and the median, as well as between the median and the maximum value.
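
Once the steps above have produced the five values, the drawing itself can be sketched as a one-line text picture. This is only a rough illustration: the function `text_boxplot` and its character choices are hypothetical, and it assumes the maximum is greater than the minimum.

```python
def text_boxplot(minimum, q1, median, q3, maximum, width=37):
    """Draw a one-line text boxplot scaled to `width` characters."""
    span = maximum - minimum  # assumed to be greater than zero

    def column(value):
        # Map a data value to a character column between 0 and width - 1.
        return round((value - minimum) * (width - 1) / span)

    cells = [" "] * width
    for i in range(column(minimum), column(q1)):
        cells[i] = "-"                      # left whisker
    for i in range(column(q1), column(q3) + 1):
        cells[i] = "="                      # box
    for i in range(column(q3) + 1, column(maximum) + 1):
        cells[i] = "-"                      # right whisker
    cells[column(minimum)] = "|"            # lower extreme
    cells[column(maximum)] = "|"            # upper extreme
    cells[column(q1)] = "["                 # lower quartile
    cells[column(q3)] = "]"                 # upper quartile
    cells[column(median)] = "M"             # median
    return "".join(cells)


# Alessia's five-number summary from earlier in the lesson.
line = text_boxplot(1, 3.5, 9, 13, 19)
print(line)
```

Each marker lands at a column proportional to its value, so longer sections of the picture correspond to quarters where the data is more spread out.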

When working through the data cycle, boxplots are a useful tool for answering statistical questions related to the spread of the data.

Examples

Example 1

For the following boxplot:

A boxplot drawn above a number line marked from 0 to 20 in steps of 2.

a

Find the lower extreme.

Worked Solution
Create a strategy

The lower extreme is the smallest value in the data set, also called the minimum. It is represented by the end of the left whisker.

Apply the idea

The lower extreme is 3.

b

Find the upper extreme.

Worked Solution
Create a strategy

The upper extreme is the largest value in the data set, also called the maximum. It is represented by the end of the right whisker.

Apply the idea

The upper extreme is 18.

c

Find the range.

Worked Solution
Create a strategy

The range is the difference between the largest data value and the smallest data value.

Apply the idea

\text{Range} = 18-3 \quad \text{(Find the difference of the extreme values)}
= 15 \quad \text{(Evaluate the subtraction)}

d

Find the median.

Worked Solution
Create a strategy

The median is marked by the line in the middle of the box.

Apply the idea

The median is 10.

Reflect and check

The median represents the middle value of the data.

e

Find the interquartile range (IQR).

Worked Solution
Create a strategy

The interquartile range (IQR) is the difference between the upper quartile and the lower quartile.

Apply the idea

The upper quartile (Q_3) is marked by the right side of the box. In this boxplot, Q_{3}=15.

The lower quartile (Q_1) is marked by the left side of the box. In this boxplot, Q_{1}=8.

IQR = \text{Upper quartile}-\text{Lower quartile} \quad \text{(Formula for IQR)}
= 15-8 \quad \text{(Substitute known values)}
= 7 \quad \text{(Evaluate the subtraction)}
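
The values read from this boxplot can be collected in a small Python structure that derives the range and IQR. This is a sketch only: the class `FiveNumberSummary` and its property names are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FiveNumberSummary:
    """The five critical points read from a boxplot."""
    minimum: float
    q1: float
    median: float
    q3: float
    maximum: float

    @property
    def value_range(self):
        # Range: difference between the upper and lower extremes.
        return self.maximum - self.minimum

    @property
    def iqr(self):
        # Interquartile range: upper quartile minus lower quartile.
        return self.q3 - self.q1


example_1 = FiveNumberSummary(minimum=3, q1=8, median=10, q3=15, maximum=18)
print(example_1.value_range, example_1.iqr)
```

This prints `15 7`, matching the answers to parts (c) and (e).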

Example 2

You have been asked to represent this data in a boxplot: 20,\,36,\,52,\,56,\,24,\,16,\,40,\,4,\,28

a

Complete the table for the given data.

Minimum
Lower quartile
Median
Upper quartile
Maximum
Interquartile range
Worked Solution
Create a strategy

To find the minimum, median, and maximum values, order the numbers from smallest to largest and find the first, middle, and last value. Then, find the quartiles.

Apply the idea

To find the minimum, median, and maximum values, first put them in order: 4,\,16,\,20,\,24,\,28,\,36,\,40,\,52,\,56

\text{Minimum}=4

\text{Maximum}=56

The middle value is: 28

\text{Median}=28

To find the lower quartile, find the middle value of the lower half of the values: 4,\,16,\,20,\,24

The middle values are: 16,\,20

\text{Lower quartile} = \dfrac{16+20}{2} \quad \text{(Find the average of the middle values)}
= \dfrac{36}{2} \quad \text{(Evaluate the addition)}
= 18 \quad \text{(Evaluate the division)}

To find the upper quartile, find the middle value of the upper half of the values: 36,\,40,\,52,\,56

The middle values are: 40,\,52

\text{Upper quartile} = \dfrac{40+52}{2} \quad \text{(Find the average of the middle values)}
= \dfrac{92}{2} \quad \text{(Evaluate the addition)}
= 46 \quad \text{(Evaluate the division)}

The interquartile range is the difference between the upper and lower quartiles: IQR=46-18=28

Minimum: 4
Lower quartile: 18
Median: 28
Upper quartile: 46
Maximum: 56
Interquartile range: 28

b

Construct a boxplot for the data.

Worked Solution
Create a strategy

Use the answers from part (a) to construct a boxplot.

Apply the idea

A boxplot drawn above a number line marked from 0 to 60 in steps of 10, showing the five values from part (a).

Reflect and check

We can use technology to check that we calculated the five critical points and constructed the boxplot correctly.

  1. Enter the data in a single column.

    A screenshot of the GeoGebra Statistics tool showing how to enter a given data set.
  2. Select all of the cells containing data and choose "One Variable Analysis".

    A screenshot of the GeoGebra Statistics tool showing the menu that contains the One Variable Analysis option.
  3. In the dropdown menu, change the histogram to a boxplot.

    A screenshot of the GeoGebra Statistics tool showing how to construct a boxplot of a given set of data.
  4. Select "Show Statistics", the button with \sum \text{x}, to reveal a list of statistical values, including the five-number summary.

    A screenshot of the GeoGebra Statistics tool showing how to calculate the statistics of the data set provided.

This confirms that the five critical points in our boxplot are correct.
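
As an alternative check, the same five values can be computed with Python's standard `statistics` module. This is a sketch and not the tool used in the lesson. Software packages follow different quartile conventions, so results can differ from the median-of-halves method for other data sets; for this data set, the default "exclusive" method of `statistics.quantiles` gives the same quartiles.

```python
from statistics import quantiles

data = [20, 36, 52, 56, 24, 16, 40, 4, 28]

# n=4 asks for the three cut points that split the data into quarters.
# The default method is "exclusive"; the function sorts the data itself.
q1, q2, q3 = quantiles(data, n=4)

summary = (min(data), q1, q2, q3, max(data))
print(summary)
print("IQR =", q3 - q1)
```

The summary is `(4, 18.0, 28.0, 46.0, 56)` and the IQR is `28.0`, in agreement with the table from part (a).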

Example 3

The box-and-whisker plot represents the thickness of the glass on various dining tables.

A boxplot titled Glass width (mm), drawn above a number line marked from 10.7 to 11.4 in steps of 0.1.

a

Which formulated question could be answered by analyzing the given boxplot?

A
How many dining room tables have a thickness of 11.1\text{ mm}?
B
How does the size of a dining room table affect the thickness of the glass?
C
What proportion of glass tables have a thickness of 11\text{ mm}?
D
What range of thickness is most common for the glass of dining room tables?
Worked Solution
Create a strategy

Boxplots represent univariate data, which means they are only used to represent one variable (in this case, glass thickness). They do not show individual data points, but they do show how the data values in the set vary (how spread out they are).

Apply the idea

Option A - This question requires us to know the number of individual glass tables that have a thickness of 11.1\text{ mm}. Since boxplots do not show individual data points, this question cannot be answered by the boxplot.

Option B - The second question considers two different variables: the size of the table and the thickness of the glass. This boxplot only shows data on the thickness of the glass, so it cannot be used to answer this question.

Option C - Similar to the first question, the third question requires us to look at individual glass tables with a thickness of 11\text{ mm}, and compare that to the total number of data values in the set. This question cannot be answered by the boxplot.

Option D - The last question is related to a measure of spread, so the boxplot can be used to answer this question. In particular, we could use the interquartile range shown in the boxplot to answer the question.

Question D can be answered by analyzing the boxplot.

Reflect and check

According to the boxplot, half of the data lies between 10.9\text{ mm} and 11.2\text{ mm}. We can say that it is most common for the glass of a dining room table to have a thickness between 10.9\text{ mm} and 11.2\text{ mm}.

b

What percentage of values lie between:

  • 10.9 and 11.2

  • 10.8 and 10.9

  • 11.1 and 11.3

  • 10.9 and 11.3

  • 10.8 and 11.2

Worked Solution
Create a strategy

First, determine whether each of the values represents a critical point on the boxplot. Then, use the fact that one quartile represents 25\% of the data set to find the percentages.

The image shows a box-and-whisker plot titled Glass width (mm) on a number line marked at 10.7, 10.8, 10.9, 11.0, 11.1, 11.2, 11.3, and 11.4.

Apply the idea

10.9 and 11.2 represent the lower and upper quartiles.

50\% of values lie between 10.9 and 11.2.

10.8 and 10.9 are the lower extreme and the lower quartile, respectively.

25\% of the values lie between 10.8 and 10.9.

11.1 and 11.3 are the median and the upper extreme, respectively.

50\% of values lie between 11.1 and 11.3.

10.9 is the lower quartile and 11.3 is the upper extreme.

75\% of values lie between 10.9 and 11.3.

10.8 and 11.2 are the lower extreme and the upper quartile, respectively.

75\% of values lie between 10.8 and 11.2.
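
The counting argument used in this part can be sketched in Python: each step from one critical point to the next covers about 25\% of the data. The function name `percent_between` is hypothetical, and the five values are the ones identified in the worked solution above.

```python
# Lower extreme, lower quartile, median, upper quartile, upper extreme (in mm).
GLASS_SUMMARY = (10.8, 10.9, 11.1, 11.2, 11.3)


def percent_between(summary, low, high):
    """Approximate percentage of the data between two critical points.

    Raises ValueError if either value is not one of the five critical points,
    because the boxplot alone cannot give a percentage in that case.
    """
    start = summary.index(low)
    end = summary.index(high)
    return 25 * (end - start)   # each quarter holds about 25% of the data


for low, high in [(10.9, 11.2), (10.8, 10.9), (11.1, 11.3), (10.9, 11.3), (10.8, 11.2)]:
    print(low, "to", high, "->", percent_between(GLASS_SUMMARY, low, high), "%")
```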

c

In which quartile (or quartiles) is the data the most spread out?

Worked Solution
Create a strategy

Look for the quartile that takes up the most space along the number line.

Apply the idea

The second quartile is the most spread out.

Reflect and check

Although the second quartile is the most spread out, it does not have more data values than the other quartiles. Each of the four quartiles has approximately the same number of data values, but the data values differ more in the second quartile.
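
The comparison of section lengths can also be sketched in Python. The function name `widest_quarters` is hypothetical; it measures the distance between neighboring critical points and reports which section or sections are longest. Widths are rounded to avoid tiny floating-point differences between decimals such as 0.1.

```python
def widest_quarters(summary):
    """Return the 1-based positions of the longest sections of a boxplot."""
    # Width of each section: distance between neighboring critical points.
    widths = [round(b - a, 10) for a, b in zip(summary, summary[1:])]
    widest = max(widths)
    return [i + 1 for i, w in enumerate(widths) if w == widest]


# Critical points of the glass thickness boxplot (in mm).
glass_summary = (10.8, 10.9, 11.1, 11.2, 11.3)
print(widest_quarters(glass_summary))
```

For the glass thickness boxplot this prints `[2]`, meaning the section from the lower quartile to the median is the longest.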

Example 4

Lucille wants to track the amount of time she spends on her phone or tablet outside of school. Her goal is to only spend one to two hours on her devices each day. The statistical question she writes for her study is "How does the amount of time I spend on my phone or tablet vary each day?"

a

Determine the data Lucille must collect to answer her statistical question.

Worked Solution
Create a strategy

Consider whether Lucille needs to collect univariate or bivariate data, and determine the variable(s) of interest.

Apply the idea

Lucille must collect univariate data, and the variable of interest is the amount of time she spends on her phone or tablet every day.

Reflect and check

The way the statistical question is worded has a large effect on the type of data that needs to be collected. For example, if Lucille's statistical question was, "When do I spend the most time on my devices throughout the week?", she would need to collect data on the amount of time she spends on her devices and the time of day that she is on her devices.

b

Which method of data collection would lead to the least amount of statistical bias for Lucille's study?

A
Collecting data on the amount of time she spent on her phone and tablet in the past three weeks
B
Collecting data on the amount of time she spends on her phone and tablet over the next three weeks
Worked Solution
Create a strategy

Determine how statistical bias might be introduced in the data collection process by considering whether the data will be representative of Lucille's typical amount of screen time. Also consider whether Lucille's overall goal of the study will impact the collection of the data.

Apply the idea

Because Lucille's goal is to reduce the amount of time she spends on her devices each day, she may begin trying to reduce her screen time from the moment she begins the study. This means the amount of time she spends on her devices in the next three weeks may not be representative of the amount of time she spends on her devices normally.

Collecting data on the amount of time she spent on her phone and tablet in the past three weeks would lead to less bias. The correct answer is option A.

c

According to Lucille's phone and tablet settings, the total amount of time she spent on her devices (in hours) each day over the past three weeks is shown: \{3.2,\, 7.5,\, 6.1,\, 8.0,\, 1.8,\, 2.5,\, 4.8,\, 5.0,\, 3.2,\, 2.0,\, 0.5,\, 1.2,\, 2.8,\,4.5,\,3.6,\,5.5,\,7.1,\,6.2,\,4.2,\,2.3\}

Represent the data in a boxplot.

Worked Solution
Create a strategy

We can use technology to represent the given data in a boxplot. To do so, we will follow these steps:

  1. Enter the data in a single column.

  2. Select all of the cells containing data and choose "One Variable Analysis."

  3. In the dropdown menu, change the histogram to a boxplot.

  4. Select "Show Statistics" to reveal the five critical points of the boxplot.

Apply the idea

A screenshot of the GeoGebra Statistics tool showing how to construct a boxplot and calculate related statistics of the data set provided.

We can use the five-number summary to copy the boxplot.

The image shows a box plot titled Screen time (in hours).

d

How does the amount of time Lucille spends on her phone or tablet vary each day?

Worked Solution
Create a strategy

Consider the spread of the data by looking at the range and interquartile range. These statistics tell us how much the data varies.

Apply the idea

The lower extreme of the data is 0.5 hours, and the upper extreme is 8 hours. Overall, the data varies by 8-0.5=7.5\text{ hours}. This shows that there is a big difference in the amount of time Lucille spends on her phone or tablet each day.

To get a better idea of how the data varies between the lower and upper extremes, we can look at the interquartile range. The lower quartile is 2.4 hours, and the upper quartile is 5.8 hours. The interquartile range is 5.8-2.4=3.4\text{ hours}. This shows that the middle 50\% of the data varies by 3.4 hours. Although the difference here is smaller, it is still relatively large when considering Lucille's overall goal.

Because Lucille's goal is to spend only one to two hours on her devices each day, the data shows that her screen time varies a lot each day and it varies more than she would like.
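
The range and interquartile range quoted in this solution can be reproduced with a short Python sketch. It uses the median-of-halves method from earlier in the lesson; the variable names are illustrative only.

```python
from statistics import median

# Lucille's daily screen time in hours, as listed in part (c).
screen_time = [3.2, 7.5, 6.1, 8.0, 1.8, 2.5, 4.8, 5.0, 3.2, 2.0,
               0.5, 1.2, 2.8, 4.5, 3.6, 5.5, 7.1, 6.2, 4.2, 2.3]

data = sorted(screen_time)
half = len(data) // 2
q1 = median(data[:half])                  # middle of the lower half
q3 = median(data[len(data) - half:])      # middle of the upper half

data_range = data[-1] - data[0]
iqr = q3 - q1

print(f"range = {data_range:.1f} hours, IQR = {iqr:.1f} hours")
```

Rounded to one decimal place, this reports a range of 7.5 hours and an IQR of 3.4 hours.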

Reflect and check

Going one step further, we can find what percentage of days Lucille meets her goal of spending one to two hours on her device daily. The lower quartile is 2.4 hours, and 25\% of the data lies below this value. This means that Lucille achieves her goal less than 25\% of the time.
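
Because the raw data is available in this example, the estimate can be checked directly. The sketch below assumes that "one to two hours" means at least 1 hour and at most 2 hours; that reading of the goal is an assumption, as are the variable names.

```python
# Lucille's daily screen time in hours, as listed in part (c).
screen_time = [3.2, 7.5, 6.1, 8.0, 1.8, 2.5, 4.8, 5.0, 3.2, 2.0,
               0.5, 1.2, 2.8, 4.5, 3.6, 5.5, 7.1, 6.2, 4.2, 2.3]

# Count the days that fall inside the goal, assumed to be 1 to 2 hours inclusive.
days_in_goal = sum(1 for hours in screen_time if 1 <= hours <= 2)
percent_in_goal = 100 * days_in_goal / len(screen_time)

print(days_in_goal, "of", len(screen_time), "days:", percent_in_goal, "%")
```

Under that assumption, 3 of the 20 recorded days (15\%) fall within the goal, which is consistent with the boxplot's conclusion of less than 25\%.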

Idea summary

A list of the minimum, lower quartile, median, upper quartile, and maximum values is often called the five-number summary.

  • The lower extreme (minimum) is the smallest value in the data set.
  • The lower quartile (Q_{1} or the first quartile) is the middle score in the bottom half of data.

  • The median (Q_{2} or the second quartile) is the middle value of a data set.

  • The upper quartile (Q_{3} or the third quartile) is the middle score in the top half of the data set.

  • The upper extreme (maximum) is the largest value in the data set.

One quartile represents 25\% of the data set.

These features are shown in a boxplot:

A boxplot with its different parts: Lower extreme (minimum), Lower quartile (Q_1), Median, Upper quartile (Q_3), Upper extreme (maximum).

Creating a boxplot:

  1. Put the data in ascending order (from smallest to largest).

  2. Find the median (middle value) of the data.

  3. To divide the data into quarters, find the middle value between the minimum value and the median, as well as between the median and the maximum value.

To calculate the interquartile range:

IQR=Q_{3}-Q_{1}

where IQR is the interquartile range, Q_{1} is the first quartile, and Q_{3} is the third quartile.

Outcomes

8.PS.2

The student will apply the data cycle (formulate questions; collect or acquire data; organize and represent data; and analyze data and communicate results) with a focus on boxplots.

8.PS.2a

Formulate questions that require the collection or acquisition of data with a focus on boxplots.

8.PS.2b

Determine the data needed to answer a formulated question and collect the data (or acquire existing data) using various methods (e.g., observations, measurement, surveys, experiments).

8.PS.2c

Determine how statistical bias might affect whether the data collected from the sample is representative of the larger population.

8.PS.2d

Organize and represent a numeric data set of no more than 20 items, using boxplots, with and without the use of technology.

8.PS.2e

Identify and describe the lower extreme (minimum), upper extreme (maximum), median, upper quartile, lower quartile, range, and interquartile range given a data set, represented by a boxplot.

8.PS.2g

Analyze data represented in a boxplot by making observations and drawing conclusions.
