
4.04 Compare data sets

Comparing data using boxplots

We have seen that the range and interquartile range can be used to measure the spread of data and both can be seen on a boxplot. We can also see the median, upper and lower quartiles, and sometimes extreme values.

Parallel boxplots are used to compare two sets of data visually. The data sets must use the same numerical variable, but for two different groups or categories.

It is important to clearly label each boxplot. Here are two parallel boxplots comparing the time it took two different groups of people to complete an online task.

Two boxplots comparing the time with a number line below labeled as seconds. Ask your teacher for more information.

The boxplots must be drawn on the same scale to properly compare them.

Examples

Example 1

These two boxplots show the data collected by the manufacturers on the lifespan of light bulbs, measured in thousands of hours.

Two boxplots showing the data between Manufacturers A and B. Ask your teacher for more information.
a

Complete this table using the two boxplots. Write each answer in terms of hours, remembering to multiply the values on the data display by 1000.

Measure | Manufacturer A | Manufacturer B
Median | ___ | ___
Lower quartile | ___ | ___
Upper quartile | ___ | ___
Range | ___ | ___
Interquartile range | ___ | ___
Worked Solution
Create a strategy
  • To find the lower quartile, median, and upper quartile, read the values at the left edge of the box, the line inside the box, and the right edge of the box, respectively.

  • For the range, use the formula: \text{Range}=\text{Highest score}-\text{Lowest score}

  • For the interquartile range, use the formula: \text{IQR}=Q_{3}-Q_{1}

Apply the idea

Since the scale is in thousands of hours, we can multiply the numbers on the scale by a thousand to find the number of hours.

Measure | Manufacturer A | Manufacturer B
Median | 4\cdot 1000 = 4000 | 5\cdot 1000 = 5000
Lower quartile | 2.5\cdot 1000 = 2500 | 3.5\cdot 1000 = 3500
Upper quartile | 4.5\cdot 1000 = 4500 | 6\cdot 1000 = 6000
Range | (5-1)\cdot 1000 = 4000 | (8-1.5)\cdot 1000 = 6500
Interquartile range | (4.5-2.5)\cdot 1000 = 2000 | (6-3.5)\cdot 1000 = 2500
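
The same arithmetic can be sketched in a few lines of Python. This is only an illustration: the five values for each manufacturer are the readings used in the worked solution (in thousands of hours), and the names `readings` and `summarize` are made up for this example.

```python
# Readings taken from the two boxplots, in thousands of hours:
# (minimum, lower quartile, median, upper quartile, maximum)
readings = {
    "Manufacturer A": (1, 2.5, 4, 4.5, 5),
    "Manufacturer B": (1.5, 3.5, 5, 6, 8),
}


def summarize(five_numbers, scale=1000):
    """Convert boxplot readings to hours and compute the range and IQR."""
    minimum, q1, median, q3, maximum = (value * scale for value in five_numbers)
    return {
        "Median": median,
        "Lower quartile": q1,
        "Upper quartile": q3,
        "Range": maximum - minimum,
        "Interquartile range": q3 - q1,
    }


table = {name: summarize(values) for name, values in readings.items()}
for name, row in table.items():
    print(name, row)
```

Multiplying by the scale before subtracting gives the same result as subtracting first, so either order matches the table.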
b

Which manufacturer produces light bulbs with the best lifespan?

Worked Solution
Create a strategy

Choose the manufacturer with the greater median.

Apply the idea

The data set for Manufacturer B has a median of 5000 hours, while the median of the data set for Manufacturer A is 4000 hours, so Manufacturer B produces light bulbs with the best lifespan.

Reflect and check

In fact, the best light bulb produced by Manufacturer A has a lifespan of 5000 hours, which is the same as the median of Manufacturer B. This means that about half of the light bulbs produced by Manufacturer B have a greater lifespan than all of the light bulbs produced by Manufacturer A.

Example 2

Sophie and Holly have been playing soccer for 20 years. These boxplots represent the total number of goals Sophie and Holly scored in each of their 20 seasons.

Two boxplots showing the data of goals between Sophie and Holly. Ask your teacher for more information.
a

Who had the highest scoring season?

Worked Solution
Create a strategy

Compare the maximum value of both boxplots.

Apply the idea

By looking at the endpoints of the right whiskers, Sophie scored 18 goals, while Holly scored 19 goals. So Holly had the highest scoring season.

b

How many more goals did Holly score in her best season compared to Sophie in her best season?

Worked Solution
Create a strategy

Subtract Sophie's maximum from Holly's maximum.

Apply the idea
\begin{aligned}
\text{Number of goals} &= 19-18 && \text{Subtract 18 from 19} \\
&= 1 && \text{Evaluate}
\end{aligned}

Holly scored 1 more goal in her best season.

c

What is the difference between the median number of goals scored in a season by each player?

Worked Solution
Create a strategy

Find the difference of the medians.

Apply the idea

Sophie's median is 11 and Holly's median is 10.

\begin{aligned}
\text{Difference} &= 11-10 && \text{Subtract 10 from 11} \\
&= 1 && \text{Evaluate}
\end{aligned}
d

Which player was more consistent?

Worked Solution
Create a strategy

We can look at the range and interquartile ranges and see whose is lower. The lower the spread, the more consistent.

We can use that: \text{IQR}=\text{Upper quartile}-\text{Lower quartile}

Apply the idea
\begin{aligned}
\text{Sophie's IQR} &= 14-7 && \text{Substitute the quartiles} \\
&= 7 && \text{Evaluate} \\
\text{Holly's IQR} &= 15-6 && \text{Substitute the quartiles} \\
&= 9 && \text{Evaluate} \\
\text{Sophie's Range} &= 18-4 && \text{Substitute the upper and lower extremes} \\
&= 14 && \text{Evaluate} \\
\text{Holly's Range} &= 19-4 && \text{Substitute the upper and lower extremes} \\
&= 15 && \text{Evaluate}
\end{aligned}

Since Sophie has the lower IQR and range, her season goal totals are more consistent.
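
As a rough check, the comparison can be written as a short Python sketch. The five-number summaries are the values read from the boxplots in this example, and the helper name `spread` is made up for illustration. The rule that both measures must agree before naming a player is an assumption of this sketch, not a rule from the lesson.

```python
# Five-number summaries read from the boxplots (goals per season).
sophie = {"min": 4, "q1": 7, "median": 11, "q3": 14, "max": 18}
holly = {"min": 4, "q1": 6, "median": 10, "q3": 15, "max": 19}


def spread(summary):
    """Return (range, IQR) for a five-number summary."""
    return summary["max"] - summary["min"], summary["q3"] - summary["q1"]


sophie_range, sophie_iqr = spread(sophie)
holly_range, holly_iqr = spread(holly)

# A player is called "more consistent" here only if both measures agree.
if sophie_range < holly_range and sophie_iqr < holly_iqr:
    more_consistent = "Sophie"
elif holly_range < sophie_range and holly_iqr < sophie_iqr:
    more_consistent = "Holly"
else:
    more_consistent = "unclear"

print(more_consistent)
```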

Example 3

The advertised fuel efficiency, in miles per gallon, for 12 cars and 12 trucks was recorded in this table.

Cars | 15 | 17 | 18 | 22 | 22 | 22 | 23 | 25 | 26 | 31 | 35 | 50
Trucks | 12 | 13 | 13 | 14 | 15 | 15 | 15 | 16 | 16 | 17 | 19 | 27

The car data was represented with a boxplot.

A boxplot titled Car fuel efficiency (Miles per gallon) on a number line from 10 to 60 in steps of 5.

The truck data was represented with a dot plot.

A dot plot titled Truck fuel efficiency in miles per gallon, ranging from 12 to 27. The number of dots is as follows: at 12, 1; at 13, 2; at 14, 1; at 15, 3; at 16, 2; at 17, 1; at 19, 1; at 27, 1.
a

Convert the dot plot to a boxplot and draw a parallel boxplot comparing cars to trucks.

Worked Solution
Create a strategy

There are 12 data values on the dot plot, so we can start by identifying the upper and lower quartiles, extremes, and the median.

Apply the idea

It may be helpful to list the data from the dot plot to find the key values for the boxplot.

A sequence of numbers arranged in a single row: 12, 13, 13, 14, 15, 15, 15, 16, 16, 17, 19, 27. The numbers are grouped into 4 groups: first group: 12, 13; with another 13 between the groups; second group: 14, 15; with another 15 between the groups; third group: 15, 16; with another 16 between the groups; fourth group: 17, 19. 27 is circled and labeled as the outlier. The 13 between the first and second groups is labeled lower quartile, the 15 between the second and third groups is labeled median, and the 16 between the third and fourth groups is labeled upper quartile.
Lower extreme (min) | 12
Lower quartile | 13
Median | 15
Upper quartile | 16
Upper extreme (max) | 19

The value of 27 \text{ mpg} is very far from the rest of the data, so we can consider it to be an outlier (extreme value).

This means that the upper whisker would end at 19 and 27 would be shown as a dot.
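
The key values in the table can be checked with a few lines of Python. This is a sketch under two assumptions: 27 is set aside as the outlier first, as in the worked solution, and each quartile is found as the median of the lower or upper half, with the middle value left out of both halves when the count is odd. Other quartile conventions exist and can give slightly different values. The function name `five_number_summary` is made up for this example.

```python
from statistics import median

trucks = [12, 13, 13, 14, 15, 15, 15, 16, 16, 17, 19, 27]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    lower_half = data[: n // 2]
    upper_half = data[(n + 1) // 2 :]
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])


# Treat 27 as the outlier, as the worked solution does.
without_outlier = [value for value in trucks if value != 27]
summary = five_number_summary(without_outlier)
print(summary)  # (12, 13, 15, 16, 19)
```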

A boxplot titled Truck fuel efficiency (Miles per gallon) on a number line from 10 to 30 in steps of 2.

When we create a parallel boxplot, we must use the same scale. This gives us:

Two boxplots showing the data of fuel efficiency (miles per gallon) between car and truck. Ask your teacher for more information.
Reflect and check

We can also use technology by:

  1. Inputting the data in two separate columns.

    A screenshot of the GeoGebra statistics tool showing how to enter a given data set. Speak to your teacher for more details.
  2. Highlighting the data and selecting "Multiple Variable Analysis" from the data drop-down menu.
    A screenshot of the GeoGebra statistics tool showing how to select the Multiple Variable Analysis option. Speak to your teacher for more details.
  3. Adjusting the window size to get a good view.
    A screenshot of the GeoGebra statistics tool showing how to generate the parallel box plot of a given data set. Speak to your teacher for more details.
b

Select a measure of center from both data sets to compare the fuel efficiency of cars versus trucks.

Worked Solution
Create a strategy

Since we know that outliers influence the mean of a data set, it's best to compare the median of each data set.

Apply the idea

The median miles per gallon of a car is 22.5 and the median miles per gallon of a truck is 15. That means that a car typically gets 7.5 miles more per gallon in fuel efficiency.

Reflect and check

Since both data sets have a high-valued outlier, we can expect that the mean of each set is higher than the median and does not represent the typical value as well as the median does.
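
A quick way to confirm this is to compute both measures from the table. The sketch below uses Python's standard `statistics` module. The list names `cars` and `trucks` are just labels for the two rows of the table.

```python
from statistics import mean, median

cars = [15, 17, 18, 22, 22, 22, 23, 25, 26, 31, 35, 50]
trucks = [12, 13, 13, 14, 15, 15, 15, 16, 16, 17, 19, 27]

for label, data in (("Cars", cars), ("Trucks", trucks)):
    print(label, "mean:", mean(data), "median:", median(data))

# Difference between the typical car and the typical truck, using medians.
median_difference = median(cars) - median(trucks)
print("Median difference:", median_difference)
```

For these values the car mean is 25.5 against a median of 22.5, and the truck mean is 16 against a median of 15, so each mean is pulled upward by the high value in its data set.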

c

Compare the spread of the data sets.

Worked Solution
Create a strategy

Since the data sets have outliers, it is best to describe the spread of the data using the IQR.

Apply the idea
\begin{aligned}
\text{IQR for cars} &= Q_3-Q_1 && \text{Formula for IQR} \\
&= 28.25-20 && \text{Substitute } Q_3=28.25 \text{ and } Q_1=20 \\
&= 8.25 && \text{Evaluate the subtraction}
\end{aligned}

The IQR for the car fuel efficiency is approximately 8.25 miles per gallon, meaning the middle 50\% of cars vary by 8.25 miles per gallon.

\begin{aligned}
\text{IQR for trucks} &= Q_3-Q_1 && \text{Formula for IQR} \\
&= 16-13 && \text{Substitute } Q_3=16 \text{ and } Q_1=13 \\
&= 3 && \text{Evaluate the subtraction}
\end{aligned}

The IQR for truck fuel efficiency is approximately 3 miles per gallon, meaning that the middle 50\% of trucks vary by 3 miles per gallon.

While cars get better gas mileage overall, the spread is greater compared to trucks, so car mileage is less consistent.

Reflect and check

By comparing the range for cars and trucks, we could see that the spread for fuel efficiency for cars is significantly larger: 35 for cars and 15 for trucks. But once we examine the IQR, we can see that the difference in spread is not as drastic as it initially appears.

Idea summary
Two boxplots showing the data of thousands of hours between Manufacturer A and Manufacturer B. Ask your teacher for more information.

When comparing two or more categories for the same variable, it can be helpful to draw a parallel boxplot.

We can compare key values of each data set (the quartiles and minimum and maximum values) and the range and interquartile range.

Choosing the best data display

Different displays help us to identify different key features like the center and spread of a data set.

Let's start with the displays we have seen for categorical data.

A pictograph is titled Pizza choice. Ask your teacher for more information.
Pictograph shows the count or frequency for each category using a key
A bar graph shows on the vertical axis with a scale from 0 to 15. Ask your teacher for more information.
Bar graph shows the count or frequency for each category using a bar
A dot plot is shown for each topping: Cheese, 9; Pepperoni, 15; Vegetarian, 4; Shawarma, 5; Hawaiian, 4.
Dot plots (line plots) show the frequency for categories or numerical data with a small range
A circle graph is divided into 5 unequal sections. Ask your teacher for more information.
Circle graphs show the proportion or parts of a whole for a small number of categories

For most numerical data displays, we can identify or estimate frequencies, shape, measures of center, measures of spread, and whether there are outliers.

Measures of center summarize the data set with a single value. We often use this to generalize which group performed better.

  • Mean
  • Median
  • Mode

Measures of spread summarize the spread of a data set. We often use this to generalize which group was more consistent. The lower the spread the more consistent the data.

  • Range
  • Interquartile range

For numerical data, the shape shows us how the data is spread out and where there are clusters, peaks, or gaps.

A histogram titled Age of first car purchase with Frequency on the y-axis, with numbers 0 to 6, and Age on the x-axis. Ask your teacher for more information.
A histogram groups numerical data into intervals and shows the shape and clusters
A stem and leaf plot titled Age of first car purchase. The left column is titled Stem, and right column titled Leaf. Ask your teacher for more information.
Stem-and-leaf plots sort the original data and show the shape in some cases
A dot plot with an axis at the bottom that has the numbers 3 through 15. Ask your teacher for more information.
Dot plots can also be used for numerical data, but are best for small amounts of data
A circle graph titled Temperatures on birthdays, divided into 7 unequal portions. Ask your teacher for more information.
Circle graphs allow us to see the proportion of data points in each interval

From boxplots, we can only identify:

  • Shape
  • Range and interquartile range
  • Median
  • Outliers

We should expect then that the shape of data would be the same whether it is represented in a boxplot or histogram.

Boxplots divide data into four groups, each containing about 25\% of the data values, using the lower extreme (minimum), lower quartile, median, upper quartile, and upper extreme (maximum).

A boxplot on a number line ranging from 0 to 40 with an interval of 2. A line extends from 4 to 7, a box extends from 7 to 28 with a median plotted at 10 represented by a vertical segment in the box. A line extends from 28 to 34.

Exploration

A high school has scholarship programs for gymnastics and basketball. The histogram, dot plot, and boxplot summarize the heights of 31 students in a class, in inches:

A histogram titled Classmate Height with Number of Students on the y-axis, with numbers 0 through 15, and Height in inches on the x-axis, with bars labeled at their endpoint 60 to 80 in steps of 5. The 60 through 65 bar goes to 11 on the y-axis, 65 through 70 goes to 13, 70 through 75 goes to 1, and 75 through 80 goes to 6.
A line plot titled Classmate Heights in inches, ranging from 60 to 80 in steps of 1. The number of dots is as follows: at 60, 2; at 61, 3; at 62, 2; at 63, 1; at 64, 3; at 65, 2; at 66, 4; at 67, 2; at 68, 5; at 74, 1; at 75, 1; at 76, 2; at 78, 2; and at 79, 1.
A boxplot titled Classmate Heights. A line extends from 60 to 63, a box extends from 63 to 68 with a vertical line plotted at 66, and a line extends from 68 to 79.
  1. What do you notice about the different displays for the same data?

  2. What can you see from the histogram and dot plot that you can't see from the boxplot?

  3. What can you see from the boxplot and dot plot that you can't see from the histogram?

Histograms, boxplots, and dot plots may display the same data, but the different displays have their own strengths and weaknesses.

Neither histograms nor boxplots show every individual data value, but histograms will show intervals where there may be gaps or a lower frequency of data.

Boxplots provide a quick, efficient overall view of the shape, center and spread of the data if we're not interested in where there may be gaps in the data.
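
To see how the three Exploration displays relate, the sketch below rebuilds the histogram counts and the boxplot's key values from the dot plot frequencies. It rests on two assumptions that the lesson does not state: each histogram interval includes its left endpoint only, and quartiles are the medians of the lower and upper halves. The variable names are made up for illustration.

```python
from statistics import median

# Dot plot frequencies from the Exploration: height in inches -> number of dots.
dot_plot = {60: 2, 61: 3, 62: 2, 63: 1, 64: 3, 65: 2, 66: 4, 67: 2,
            68: 5, 74: 1, 75: 1, 76: 2, 78: 2, 79: 1}

# Expand the frequencies into a sorted list of individual heights.
heights = sorted(h for h, count in dot_plot.items() for _ in range(count))

# Histogram counts for the intervals starting at 60, 65, 70, and 75.
histogram = {start: 0 for start in range(60, 80, 5)}
for h in heights:
    histogram[60 + 5 * ((h - 60) // 5)] += 1

# Five-number summary using the median-of-halves method.
n = len(heights)
box = (heights[0], median(heights[: n // 2]), median(heights),
       median(heights[(n + 1) // 2 :]), heights[-1])

print(histogram)  # {60: 11, 65: 13, 70: 1, 75: 6}
print(box)        # (60, 63, 66, 68, 79)
```

The dot plot keeps every value, so both other displays can be derived from it. Going the other way is not possible: neither the four bar heights nor the five box values are enough to recover the individual heights.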

Some people choose data displays that can be misleading. It's important to choose a data display that shows a true picture of the data.

Examples

Example 4

Determine the best type of data display(s) for each statistical question:

a

How much variation is there in the number of zucchinis produced by a single plant?

Worked Solution
Create a strategy

We should consider the size and possible range of values of the data set, as well as what key features the question is focusing on.

Apply the idea

A boxplot

Since we are focusing on variation, we want to quickly identify the spread of the data, so a boxplot is appropriate.

Reflect and check

If we used a histogram, we would be able to estimate the range, but the interquartile range would not be easy to see. The range on its own can also be very high due to extreme values.

b

What types of vegetables are most popular to grow among urban gardeners?

Worked Solution
Apply the idea

This is categorical data, so the options are pictograph, bar graph, line plot, or circle graph.

Since "most" is related to the mode or category with the largest frequency, we want a display where that is easy to see.

Depending on what the data set looks like, a bar graph or circle graph could be appropriate.

Reflect and check

If there were a very large number of categories, then the bar graph would be easier to read.

Example 5

Shown are the quiz score percentages from Mr. Sanchez's first period math class: \left\{20,\,25,\,26,\,30,\,30,\,40,\,43,\,63,\,65,\,67,\,70,\,70,\,75,\,90,\,93 \right\}

a

Construct a boxplot of the quiz scores.

Worked Solution
Create a strategy

Recall how to find the five-number summary using the data provided or use technology, as shown in the example:

  1. Enter the data in a single column.

    A screenshot of the GeoGebra statistics tool showing the data 20, 25, 26, 30, 30, 40, 43, 63, 65, 67, 70, 70, 75, 90, and 93 entered in column A, rows 1 to 15. Speak to your teacher for more details.
  2. Select all of the cells containing data and choose "One Variable Analysis".

    A screenshot of the GeoGebra statistics tool showing the cells containing 20, 25, 26, 30, 30, 40, 43, 63, 65, 67, 70, 70, 75, 90, and 93 selected. The menu from the second leftmost icon is shown. Speak to your teacher for more details.
  3. Select "Show Statistics" to reveal a list of statistical values, including the five-number summary.

    A screenshot of the GeoGebra statistics tool. From left to right, the following are shown: the cells containing 20, 25, 26, 30, 30, 40, 43, 63, 65, 67, 70, 70, 75, 90, and 93 selected, a list of statistical values, and a histogram. Speak to your teacher for more details.

Use the minimum, Q1 (first quartile), median, Q3 (third quartile), and maximum from the statistics listed to create the boxplot.
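
If GeoGebra is not available, the five-number summary can also be sketched in Python. This example assumes the median-of-halves method for quartiles, where the middle value is left out of both halves because the number of scores is odd. Statistics software may use a different quartile convention, so its values are worth checking against this one.

```python
from statistics import median

scores = [20, 25, 26, 30, 30, 40, 43, 63, 65, 67, 70, 70, 75, 90, 93]

data = sorted(scores)
n = len(data)
five_numbers = {
    "minimum": data[0],
    "Q1": median(data[: n // 2]),        # median of the lower 7 scores
    "median": median(data),
    "Q3": median(data[(n + 1) // 2 :]),  # median of the upper 7 scores
    "maximum": data[-1],
}
print(five_numbers)
```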

Apply the idea
A boxplot titled Mr. Sanchez's Period 1 Math Quiz Results on a number line ranging from 0 to 100 in intervals of 10. A line extends from 20 to 30, and a box extends from 30 to 70 with the median represented by a vertical segment between 60 and 70. A line extends from 70 to a number between 90 and 100.
b

What are the advantages and disadvantages of a boxplot?

Worked Solution
Apply the idea

Boxplots visually summarize large sets of data, although they can be used for small sets too, like this one. In a boxplot, it is easy to see and estimate the shape, center (median), and spread of data. However, if we were not given the individual data points, we would not know how many students are in the set of data and their individual quiz scores.

Even without that information, the boxplot can still provide important information about the quiz. We can see at a glance that most of the students did not do very well on the quiz. We can see that 75\% of the students scored 70 or below, 50\% of the students scored below about 63 and 25\% got a score of less than 30.

Reflect and check

25 \% of the quiz scores lie between the minimum and first quartile, the first quartile and the median, the median and the third quartile, and the third quartile and the maximum value. This is true even when the quarters of the boxplot are uneven in length.

c

Explain whether a dot plot or a histogram could be a better display for the data.

Worked Solution
Create a strategy

Use the size of the data set and its range to determine which would be better.

Apply the idea

While the data set is made up of only 15 students' scores, the scores are spread out from 20 \% to 93 \%, so a dot plot would need a very long axis for so few data points. This display would not be suitable.

A histogram could be graphed by organizing the data into 10\% intervals. One advantage of the histogram is that we would be able to see the gaps in the 50s and 80s which are lost in the boxplot.
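
The grouping into 10\% intervals can be sketched in Python to show where the gaps fall. This is a text-only illustration rather than a drawn histogram, and it assumes intervals of the form 20 to 29, 30 to 39, and so on.

```python
from collections import Counter

scores = [20, 25, 26, 30, 30, 40, 43, 63, 65, 67, 70, 70, 75, 90, 93]

# Group each score by the start of its 10-point interval (20-29, 30-39, ...).
counts = Counter((score // 10) * 10 for score in scores)

# List every interval from the lowest to the highest, including empty ones.
intervals = range(min(counts), max(counts) + 10, 10)
for start in intervals:
    print(f"{start}-{start + 9}: {'#' * counts[start]}")

gaps = [start for start in intervals if counts[start] == 0]
print("Empty intervals start at:", gaps)
```

The empty rows for 50 to 59 and 80 to 89 are the gaps mentioned above.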

Reflect and check

A stem-and-leaf plot would be another good choice for this data set.

Example 6

Match the boxplot shown to the correct histogram.

A boxplot on a number line from 0 to 10 in steps of 1.
A
The image shows a histogram with high columns on the left. Ask your teacher for more information.
B
The image shows a histogram with symmetric data. Ask your teacher for more information.
C
The image shows a histogram with high columns on the right. Ask your teacher for more information.
D
The image shows a histogram with high columns on the right. Ask your teacher for more information.
Worked Solution
Create a strategy

Look for characteristics of the data set like the value it is centered around, the range of the data, and the shape - skewed or symmetrical.

Apply the idea

The given boxplot shows symmetrical data. Only option B shows symmetrical data. The correct answer is option B.

Reflect and check

If more than one histogram were symmetrical, we could also check that the values range from 1 to 9 and that the data set is centered around 5.

Idea summary

We should expect that the shape of data would be the same whether it is represented in a boxplot or histogram.

The best display for a data set is one that reveals the information we want to share. Some displays hide key information like the individual data points, the total number of data points, or features like the shape, clusters, gaps, and spread.

As a starting place, consider:

  • If it is categorical, look at bar graphs, dot plots, or circle graphs. However, for a younger or diverse audience, a pictograph might be appropriate.
  • For numerical data, if there is a small quantity and range of data, try a dot plot.

  • If the data has a large range or quantity of data, try a histogram or boxplot.

  • Choose a boxplot if you only need to see an overview of center, spread and shape.

  • Choose a histogram if, in addition to center, spread, and shape, you want to know the size of the data set and view any gaps or clusters among various intervals.

  • Choose a stem-and-leaf plot if you need to be able to see the actual data values.
  • A circle graph could be used for grouped numerical data if highlighting the proportions in each interval is important.
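
The starting-place list above can be read as a rough decision procedure. The sketch below is one possible encoding, not a rule from the lesson. The function name `suggest_displays` and the cut-offs `small_count` and `small_range` are hypothetical, since the lesson does not say how small "small" is. The circle graph option for grouped numerical data is left out for brevity.

```python
def suggest_displays(data_type, count=0, value_range=0,
                     need_exact_values=False, need_gaps=False,
                     small_count=30, small_range=20):
    """Return candidate displays, loosely following the list above.

    small_count and small_range are hypothetical cut-offs for "small";
    the lesson does not give specific numbers.
    """
    if data_type == "categorical":
        return ["bar graph", "dot plot", "circle graph", "pictograph"]
    if need_exact_values:
        return ["stem-and-leaf plot"]
    if count <= small_count and value_range <= small_range:
        return ["dot plot"]
    if need_gaps:
        return ["histogram"]
    return ["boxplot", "histogram"]


# The quiz scores from Example 5: 15 values ranging from 20 to 93.
print(suggest_displays("numerical", count=15, value_range=73, need_gaps=True))
```

With these assumed cut-offs, the quiz data is sent to a histogram, which agrees with the reasoning in Example 5. The final choice still depends on the audience and on what the display needs to show.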

Outcomes

8.PS.2

The student will apply the data cycle (formulate questions; collect or acquire data; organize and represent data; and analyze data and communicate results) with a focus on boxplots.

8.PS.2a

Formulate questions that require the collection or acquisition of data with a focus on boxplots.

8.PS.2b

Determine the data needed to answer a formulated question and collect the data (or acquire existing data) using various methods (e.g., observations, measurement, surveys, experiments).

8.PS.2c

Determine how statistical bias might affect whether the data collected from the sample is representative of the larger population.

8.PS.2d

Organize and represent a numeric data set of no more than 20 items, using boxplots, with and without the use of technology.

8.PS.2e

Identify and describe the lower extreme (minimum), upper extreme (maximum), median, upper quartile, lower quartile, range, and interquartile range given a data set, represented by a boxplot.

8.PS.2g

Analyze data represented in a boxplot by making observations and drawing conclusions.

8.PS.2h

Compare and analyze two data sets represented in boxplots.

8.PS.2i

Given a contextual situation, justify which graphical representation (e.g., pictographs, bar graphs, line graphs, line plots/dot plots, stem-and-leaf plots, circle graphs, histograms, and boxplots) best represents the data.
