Australia, VIC
VCE 12 General 2023

2.02 Compare two sets of data

Lesson

Introduction

The table below shows the different displays that can be used depending on the type of response and explanatory variables.

Response variable | Explanatory variable | Display
Categorical | Categorical | Two-way frequency table, segmented bar chart
Numerical | Categorical (two categories only) | Back-to-back stem plot, parallel box plot, parallel dot plot
Numerical | Categorical | Parallel box plot, parallel dot plot
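
As a rough illustration, the table can be read as a lookup from variable types to suitable displays. The Python sketch below is not part of the course material: the function name `suggest_displays` and its parameters are made up for this example, and it only covers the three rows of the table.

```python
def suggest_displays(response: str, explanatory: str, categories: int = 2) -> list[str]:
    """Return the displays listed in the table for a pair of variable types.

    `response` and `explanatory` are "categorical" or "numerical";
    `categories` is the number of categories of the explanatory variable.
    """
    if explanatory != "categorical":
        raise ValueError("the table only covers a categorical explanatory variable")
    if response == "categorical":
        return ["two-way frequency table", "segmented bar chart"]
    if response == "numerical":
        displays = ["parallel box plot", "parallel dot plot"]
        if categories == 2:
            # A back-to-back stem plot has only two sides, so it needs exactly two groups.
            displays.insert(0, "back-to-back stem plot")
        return displays
    raise ValueError(f"unknown response type: {response!r}")


print(suggest_displays("numerical", "categorical", categories=2))
print(suggest_displays("numerical", "categorical", categories=3))
```

The second call leaves out the back-to-back stem plot, because that display cannot show more than two groups.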

There are many things to keep in mind when comparing two sets of data. A few of the most important questions to ask yourself are:

  • How do the spreads of data compare?

  • How do the skews compare? Is one set of data more symmetrical?

  • Is there a big difference in the medians?

Back-to-back stem plots

A back-to-back stem plot is very similar to a regular stem plot, in that the "stem" is used to group the scores and each "leaf" indicates the individual scores within each group.

In a back-to-back stem-and-leaf plot, however, two sets of data are displayed simultaneously. One set of data is displayed with its leaves on the left, and the other with its leaves on the right. The "leaf" values are still written in ascending order from the stem outwards.
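
To make the layout concrete, here is a minimal Python sketch that builds a text version of a back-to-back stem plot. It assumes whole-number scores with the units digit as the leaf; the function name `back_to_back_stem_plot` and the two small data sets are hypothetical and chosen only to show the layout.

```python
from collections import defaultdict


def back_to_back_stem_plot(left, right, left_label="Left", right_label="Right"):
    """Return the rows of a back-to-back stem plot as a list of strings.

    Stems are the tens (and higher) digits, leaves are the units digits.
    """
    left_leaves, right_leaves = defaultdict(list), defaultdict(list)
    for value in left:
        left_leaves[value // 10].append(value % 10)
    for value in right:
        right_leaves[value // 10].append(value % 10)

    lowest_stem = min(min(left), min(right)) // 10
    highest_stem = max(max(left), max(right)) // 10
    rows = [f"{left_label:>12} |    | {right_label}"]
    for stem in range(lowest_stem, highest_stem + 1):
        # Left leaves are written in descending order so that, read from the
        # stem outwards, they are ascending.
        left_part = " ".join(str(leaf) for leaf in sorted(left_leaves[stem], reverse=True))
        right_part = " ".join(str(leaf) for leaf in sorted(right_leaves[stem]))
        rows.append(f"{left_part:>12} | {stem:>2} | {right_part}")
    return rows


# Hypothetical scores, used only to illustrate the layout.
group_a = [12, 15, 21, 24, 24, 30]
group_b = [11, 18, 19, 27, 33, 35]
print("\n".join(back_to_back_stem_plot(group_a, group_b, "Group A", "Group B")))
```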

Examples

Example 1

The data below shows the results of a survey conducted on the price of concert tickets locally and the price of the same concerts at an international venue.

Local | Stem | International
7 5 2 2 | 6 | 0 5
9 6 5 4 0 | 7 | 2 3 8 8
9 6 5 3 0 | 8 | 2 3 7 8
8 7 4 3 1 | 9 | 0 1 6 7 9
5 | 10 | 0 2 3 5 8

\text{ Key: } 6|1|2 = \$ 16 \text{ and } \$ 12

a

What was the most expensive ticket price at the international venue?

Worked Solution
Create a strategy

Examine the International stem and leaf plot and determine the highest value recorded.

Apply the idea

The most expensive ticket price at the international venue is \$108.

b

What was the median ticket price at the international venue? Round your answer to two decimal places if needed.

Worked Solution
Create a strategy

There are 20 international prices, so the median is the average of the 10th and 11th scores. Examine the International side of the stem and leaf plot to find these two scores.

Apply the idea

On the International side of the stem and leaf plot, the 10th score is 88 and the 11th score is 90.

\begin{aligned}
\text{Median} &= \dfrac{88+90}{2} && \text{Add 88 and 90 and divide by 2} \\
&= \$89 && \text{Evaluate}
\end{aligned}

c

What percentage of local ticket prices were cheaper than the international median?

Worked Solution
Create a strategy

Count how many of the local prices are less than \$89 (international median price) and divide by the total number of prices.

Apply the idea

From the stem and leaf plot of local prices, there are 13 out of 20 ticket prices that are cheaper than \$89.

This means that the percentage of local ticket prices cheaper than the international median is given by:

\begin{aligned}
\text{Percentage} &= \dfrac{13}{20}\times 100\% && \text{Multiply } \dfrac{13}{20} \text{ by } 100\% \\
&= 65\% && \text{Evaluate}
\end{aligned}

d

At the international venue, what percentage of tickets cost between \$90 and \$110 (inclusive)?

Worked Solution
Create a strategy

Count how many of the international prices are between \$90 and \$110 and divide by the total number of prices.

Apply the idea

From the stem and leaf plot of international prices, there are 10 out of 20 ticket prices that are between \$90 and \$110.

This means that the percentage of tickets between \$90 and \$110 is given by:

\begin{aligned}
\text{Percentage} &= \dfrac{10}{20}\times 100\% && \text{Multiply } \dfrac{10}{20} \text{ by } 100\% \\
&= 50\% && \text{Evaluate}
\end{aligned}

e

At the local venue, what percentage of tickets cost between \$90 and \$100 (inclusive)?

Worked Solution
Create a strategy

Count how many of the local prices are between \$90 and \$100 and divide by the total number of prices.

Apply the idea

From the stem and leaf plot of local prices, there are 5 out of 20 ticket prices that are between \$90 and \$100.

This means that the percentage of tickets between \$90 and \$100 is given by:

\begin{aligned}
\text{Percentage} &= \dfrac{5}{20}\times 100\% && \text{Multiply } \dfrac{5}{20} \text{ by } 100\% \\
&= 25\% && \text{Evaluate}
\end{aligned}
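
The answers to Example 1 can be checked by listing the prices from the stem plot and letting a program do the counting. The following Python sketch is one way to do this; the helper `percentage` and the variable names are made up for this check.

```python
from statistics import median

# Prices read from the back-to-back stem plot in Example 1.
local = [62, 62, 65, 67, 70, 74, 75, 76, 79, 80,
         83, 85, 86, 89, 91, 93, 94, 97, 98, 105]
international = [60, 65, 72, 73, 78, 78, 82, 83, 87, 88,
                 90, 91, 96, 97, 99, 100, 102, 103, 105, 108]


def percentage(values, condition):
    """Percentage of values for which condition(value) is true."""
    return 100 * sum(1 for v in values if condition(v)) / len(values)


# Part a: most expensive international ticket.
highest_international = max(international)
# Part b: median international price (average of the 10th and 11th scores).
international_median = median(international)
# Part c: local prices strictly cheaper than the international median.
local_below_median = percentage(local, lambda p: p < international_median)
# Part d: international prices from $90 to $110 inclusive.
international_90_to_110 = percentage(international, lambda p: 90 <= p <= 110)
# Part e: local prices from $90 to $100 inclusive.
local_90_to_100 = percentage(local, lambda p: 90 <= p <= 100)

print(highest_international, international_median)
print(local_below_median, international_90_to_110, local_90_to_100)
```

Note that part c uses a strict inequality, so the local price of \$89 is not counted as cheaper than the median of \$89.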

Example 2

The back-to-back stem plots show the number of pieces of paper used over several days by Maximillian’s and Charlie’s students.

Maximillian's students | Stem | Charlie's students
7 | 0 | 7
3 | 1 | 1 2 3
8 | 2 | 8
4 3 | 3 | 2 3 4
7 6 5 | 4 | 9
3 2 | 5 | 2

Key: 6 \vert 1 \vert 2 = 16 \text{ and }12

Which of the following statements are true?

I. Maximillian's students did not use 7 pieces of paper on any day.

II. Charlie's median is higher than Maximillian’s median.

III. The median is greater than the mean in both groups.

A
I and II
B
II and III
C
None of the statements are correct
D
III only
E
II only
F
I only
Worked Solution
Create a strategy

Examine both stem and leaf plots and assess the validity of each statement.

Apply the idea

Statement I: The stem and leaf plot for Maximillian's students has a leaf of 7 on the stem 0, so they did use 7 pieces of paper on one of the days. This means that statement I is incorrect.

Statement II: Both groups have 10 data points. For Charlie's students, the median lies between 28 and 32 as these are the 5th and 6th data points, respectively. For Maximillian's students, the median lies between 34 and 45 as these are the 5th and 6th data points, respectively. Calculating the median for each group, we have:

\begin{aligned}
\text{Maximillian's median} &= \dfrac{34+45}{2} \\
&= 39.5 \\
\text{Charlie's median} &= \dfrac{28+32}{2} \\
&= 30
\end{aligned}

This means that Maximillian's median is higher than Charlie's median, so statement II is incorrect.

Statement III: Calculating the mean for each group, we have:

\begin{aligned}
\text{Maximillian's mean} &= \dfrac{7+13+28+33+34+45+46+47+52+53}{10} \\
&= 35.8 \\
\text{Charlie's mean} &= \dfrac{7+11+12+13+28+32+33+34+49+52}{10} \\
&= 27.1
\end{aligned}

Comparing each group's mean with its median (39.5 against 35.8 for Maximillian, and 30 against 27.1 for Charlie), the median is greater than the mean in both groups. This means that statement III is correct.

So, the correct answer is Option D.
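
The three statements can also be checked numerically. The sketch below uses Python's standard `statistics` module on the scores read from the stem plot; the variable names are made up for this check.

```python
from statistics import mean, median

# Scores read from the back-to-back stem plot in Example 2.
maximillian = [7, 13, 28, 33, 34, 45, 46, 47, 52, 53]
charlie = [7, 11, 12, 13, 28, 32, 33, 34, 49, 52]

# Statement I: Maximillian's students never used 7 pieces of paper.
statement_1 = 7 not in maximillian
# Statement II: Charlie's median is higher than Maximillian's median.
statement_2 = median(charlie) > median(maximillian)
# Statement III: the median is greater than the mean in both groups.
statement_3 = all(median(group) > mean(group) for group in (maximillian, charlie))

print(statement_1, statement_2, statement_3)
```

Only the third value is true, which agrees with Option D.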

Idea summary

A back-to-back stem plot is very similar to a regular stem plot, in that the "stem" is used to group the scores and each "leaf" indicates the individual scores within each group.

Parallel box plots

Parallel box plots are used to compare two or more sets of data visually. Remember that a box plot is a visual display of the information in a five number summary. As such, these values are the important parts to compare:

  • Minimum

  • Q1 (lower quartile)

  • Median

  • Q3 (upper quartile)

  • Maximum

Parallel box plots are presented parallel to each other, along the same horizontal scale for comparison. Since they are on the same scale, a visual comparison is fairly straightforward. It is important to clearly label each box plot.
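
The five number summary behind each box plot can be computed from the raw scores. The Python sketch below is one possible approach, not a prescribed method: it assumes that Q1 and Q3 are the medians of the lower and upper halves of the sorted data (leaving out the middle score when the number of scores is odd), and the function name `five_number_summary` is made up for this example. The data are the paper-usage scores from Example 2.

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for at least two scores.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    leaving out the middle score when the number of scores is odd.
    """
    scores = sorted(data)
    half = len(scores) // 2
    lower = scores[:half]      # scores below the median position
    upper = scores[-half:]     # scores above the median position
    return (scores[0], median(lower), median(scores), median(upper), scores[-1])


# Scores from Example 2, reused here only as sample data.
maximillian = [7, 13, 28, 33, 34, 45, 46, 47, 52, 53]
charlie = [7, 11, 12, 13, 28, 32, 33, 34, 49, 52]

print("Maximillian:", five_number_summary(maximillian))
print("Charlie:", five_number_summary(charlie))
```

Other quartile conventions exist, so software and calculators may report slightly different values of Q1 and Q3 for the same data.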

Here is an example:

[Figure: two parallel box plots, the first for under 30s and the second for over 30s.]

Looking at the parallel box plots, we can see that overall the under 30s were faster at completing the task. Both the under 30s box plot and the over 30s box plot are slightly negatively skewed. Over 75\% of the under 30s completed the task in under 22 seconds, which is the median time taken by the over 30s. 100\% of the under 30s had finished the task before 75\% of the over 30s had completed it. Overall the under 30s performed better and had a smaller spread of scores. There was a larger spread within the over 30s group, with a range of 24 seconds compared to 20 seconds for the under 30s.

Examples

Example 3

The box plots show the monthly profits (in thousands of dollars) of two derivatives traders over a year:

[Figure: parallel box plots titled "Ned's monthly profit" and "Tobias' monthly profit", each on a number line marked in steps of 5 up to 60.]

a

Who made a higher median monthly profit?

A
Ned
B
Tobias
Worked Solution
Create a strategy

For each box plot, read off the value on the number line that lines up with the vertical line inside the box (the median), then choose the box plot with the higher median.

Apply the idea

On Ned's box plot, the median monthly profit is 32 (in thousands of dollars). On Tobias' box plot, the median monthly profit is 33 (in thousands of dollars).

This means Tobias has a higher median monthly profit. So, the correct answer is Option B.

b

Whose profits had a higher interquartile range?

A
Tobias
B
Ned
Worked Solution
Create a strategy

Compare the lengths of the two boxes. The interquartile range of a data set is represented by the length of its box, so the box plot with the longer box has the higher interquartile range.

Apply the idea

Comparing the two box plots, the box for Ned's monthly profit is longer than the box for Tobias' monthly profit. This means that Ned's monthly profit has the higher interquartile range.

So, the correct answer is Option B.

c

Whose profits had a higher range?

A
Ned
B
Tobias
Worked Solution
Create a strategy

For each box plot, subtract the minimum (the endpoint of the left whisker) from the maximum (the endpoint of the right whisker), then choose the larger difference.

Apply the idea

The box plot of Ned's monthly profit has a minimum value of 15 and a maximum value of 55. The box plot of Tobias' monthly profit has a minimum value of 15 and a maximum value of 50.

\begin{aligned}
\text{Ned's range} &= 55-15 && \text{Subtract 15 from 55} \\
&= 40 && \text{Evaluate} \\
\text{Tobias' range} &= 50-15 && \text{Subtract 15 from 50} \\
&= 35 && \text{Evaluate}
\end{aligned}

This means that Ned's profits had the higher range. So, the correct answer is Option A.

d

How much more did Ned make in his most profitable month than Tobias did in his most profitable month?

Worked Solution
Create a strategy

The profit made by a trader in their most profitable month is represented by the endpoint of the right whisker. Subtract the endpoint of Tobias' right whisker from the endpoint of Ned's right whisker.

Apply the idea

The endpoint of the right whisker on Ned's box plot is 55 and the endpoint of the right whisker on Tobias' box plot is 50. Subtracting these endpoints, we have:

\begin{aligned}
\text{Difference in profit} &= 55-50 && \text{Subtract 50 from 55} \\
&= 5 \text{ thousand dollars} && \text{Evaluate}
\end{aligned}

Example 4

The box plots below represent the daily sales made by Carl and Angelina over the course of one month.

[Figure: parallel box plots titled "Angelina's Sales" and "Carl's Sales", each on a number line from 0 to 70 marked in steps of 10.]

a

What is the range in Angelina's sales?

Worked Solution
Create a strategy

Find the difference between the highest and smallest scores in the data set of Angelina's sales.

Apply the idea

Based on the box plot of Angelina's sales, the smallest score is 2 and the highest score is 51.

\begin{aligned}
\text{Range} &= 51-2 && \text{Subtract the smallest from the highest} \\
&= 49 && \text{Evaluate}
\end{aligned}

b

What is the range in Carl's sales?

Worked Solution
Create a strategy

Find the difference between the highest and smallest scores in the data set of Carl's sales.

Apply the idea

Based on the box plot of Carl's sales, the smallest score is 14 and the highest score is 64.

\begin{aligned}
\text{Range} &= 64-14 && \text{Subtract the smallest from the highest} \\
&= 50 && \text{Evaluate}
\end{aligned}

c

By how much did Carl's median sales exceed Angelina's?

Worked Solution
Create a strategy

Find the value of each median, then find the difference between these values.

Apply the idea

\begin{aligned}
\text{Angelina's median} &= 30 && \text{Find the score under the middle line} \\
\text{Carl's median} &= 42 && \text{Find the score under the middle line} \\
\text{Difference} &= 42-30 && \text{Subtract the medians} \\
&= 12 && \text{Evaluate}
\end{aligned}

d

Considering the middle 50\% of sales for both sales people, whose sales were more consistent?

Worked Solution
Create a strategy

Compare the interquartile ranges.

Apply the idea

The interquartile range will tell us how consistent the middle 50\% of scores are.

\begin{aligned}
\text{Angelina's IQR} &= 42-16 && \text{Subtract } Q_1 \text{ from } Q_3 \\
&= 26 && \text{Evaluate} \\
\text{Carl's IQR} &= 50-30 && \text{Subtract } Q_1 \text{ from } Q_3 \\
&= 20 && \text{Evaluate}
\end{aligned}

Carl has the smaller IQR, so his sales are more consistent.

e

Which salesperson had a more successful sales month?

Worked Solution
Create a strategy

Compare the medians and interquartile ranges.

Apply the idea

\begin{aligned}
\text{Angelina's median} &= 30 && \text{Read the score at the middle vertical line} \\
\text{Carl's median} &= 42 && \text{Read the score at the middle vertical line}
\end{aligned}

Carl has the higher median, so his typical daily sales were higher. He also has the smaller interquartile range (20 compared with Angelina's 26), so his sales were more consistent.

Reflect and check

By comparing the box plots, we can also see that Carl's lower quartile is equal to Angelina's median. This means that 75\% of Carl's sales are higher than 50\% of Angelina's, which confirms that he had a more successful month.
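
The comparisons in Example 4 all come from the two five number summaries. As a sketch, they can be stored in a small Python structure and compared directly; the class `FiveNumberSummary` and its property names are made up for this example, and the values are the ones read from the box plots in the worked solutions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FiveNumberSummary:
    minimum: float
    q1: float
    median: float
    q3: float
    maximum: float

    @property
    def data_range(self) -> float:
        """Range: maximum minus minimum."""
        return self.maximum - self.minimum

    @property
    def iqr(self) -> float:
        """Interquartile range: Q3 minus Q1."""
        return self.q3 - self.q1


# Values read from the box plots in Example 4.
angelina = FiveNumberSummary(2, 16, 30, 42, 51)
carl = FiveNumberSummary(14, 30, 42, 50, 64)

print("Ranges:", angelina.data_range, carl.data_range)          # parts a and b
print("Median difference:", carl.median - angelina.median)      # part c
print("IQRs:", angelina.iqr, carl.iqr)                          # part d
print("Carl's Q1 equals Angelina's median:", carl.q1 == angelina.median)
```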

Idea summary

Parallel box plots are used to compare two or more sets of data visually. These box plots are presented parallel to each other along the same number line using the same scale.

Parallel dot plots

Parallel dot plots are another way to compare two or more sets of data. They must be plotted against the same scale using the same units. This makes the comparison between the data sets easy and ensures it isn't misleading. When creating a parallel dot plot, it's important to take the time to make sure everything is lined up correctly.
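
To show what "lined up correctly" means in practice, here is a minimal Python sketch that draws text dot plots for several groups against one shared scale. The function name `parallel_dot_plot` and the plant heights are hypothetical and are used only to illustrate the layout.

```python
from collections import Counter


def parallel_dot_plot(groups, low, high):
    """Return text rows for dot plots of several groups on one shared scale.

    `groups` maps a label to a non-empty list of whole-number scores
    between `low` and `high` (inclusive).
    """
    width = len(str(high)) + 1  # column width for each value on the scale
    rows = []
    for label, scores in groups.items():
        counts = Counter(scores)
        tallest = max(counts.values())
        rows.append(label)
        # Build each stack of dots from the top row down to the axis.
        for level in range(tallest, 0, -1):
            rows.append("".join(
                ("o" if counts[value] >= level else " ").rjust(width)
                for value in range(low, high + 1)))
        # Every group gets the same axis, so the columns line up.
        rows.append("".join(str(value).rjust(width) for value in range(low, high + 1)))
    return rows


# Hypothetical heights in centimetres, used only to illustrate the layout.
heights = {
    "Group A": [9, 10, 10, 11, 12, 12],
    "Group B": [9, 9, 10, 11, 11, 11],
}
print("\n".join(parallel_dot_plot(heights, low=9, high=12)))
```

Because both groups are drawn with the same `low` and `high`, a dot in the same column always represents the same value in each plot.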

Examples

Example 5

A class completed 40 questions for homework. The time needed for boys and girls to finish them was collected, and the data was presented as a parallel dot plot:

[Figure: parallel dot plots of Boys Time and Girls Time, in minutes.]

a

Comparing boys and girls, which gender had the highest median time?

A
Girls
B
Both medians are the same
C
Boys
Worked Solution
Create a strategy

To find the median for each set, count the number of dots in the set, then count up halfway to find the dot in the centre.

Apply the idea

Looking at the Boys Time dot plot, we can see that there are 15 dots in total, so the middle value will be the 8th dot, counting from one end. For the Girls Time dot plot, there are 17 dots in total, so the middle value will be the 9th dot, counting from one end.

Counting up on the Boys Time dot plot, we can see that the median time is 36 minutes (the 8th dot from either end). For the Girls Time dot plot, the median time is 34 minutes (the 9th dot from either end).

So, the boys have the highest median time and the correct answer is Option C.
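
The positions of the middle dots used above follow from the usual rule for an odd number of scores n: the median is the score in position (n+1)/2. Applied to the two dot plots:

```latex
\begin{aligned}
\text{Median position} &= \dfrac{n+1}{2} \\
\text{Boys: } \dfrac{15+1}{2} &= 8 \\
\text{Girls: } \dfrac{17+1}{2} &= 9
\end{aligned}
```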

b

Which gender had the largest range?

A
Boys
B
Girls
C
Both ranges are the same
Worked Solution
Create a strategy

Find the difference between the lowest and highest values for each dot plot, then choose the dot plot with the larger difference.

Apply the idea

For the Boys Time dot plot, the highest value is 40 and the smallest value is 24. For the Girls Time dot plot, the highest value is 44 and the smallest value is 28.

\begin{aligned}
\text{Boys' range} &= 40-24 && \text{Subtract 24 from 40} \\
&= 16 && \text{Evaluate} \\
\text{Girls' range} &= 44-28 && \text{Subtract 28 from 44} \\
&= 16 && \text{Evaluate}
\end{aligned}

This means that both ranges for the two genders are the same. So, the correct answer is Option C.

c

Which group has the highest valued mode?

A
Boys
B
Girls
C
Both modes are the same
Worked Solution
Create a strategy

Determine the value in each set that occurs most often and choose which has higher value.

Apply the idea

Among the boys, the most common time is 40 minutes. As for the girls, the most common time is 44 minutes.

So, the girls have the highest valued mode and the correct answer is Option B.

Example 6

Isabelle did an experiment to see how well plants grow in different conditions.

She had 8 plants grow in the sunshine, and 8 that grew in the shade. She measured how tall they grew in centimetres after 2 months, and recorded the information as a parallel dot plot.

a

Which group of plants had a higher range of heights?

A
The plants that grew in shade
B
The plants that grew in sunshine
C
The two groups of plants had the same height range
Worked Solution
Create a strategy

Find the difference between the lowest and highest values for each dot plot, then choose the dot plot with the larger difference.

Apply the idea

For both Sunshine and Shade dot plots, their highest value is 12 and the smallest value is 9.

\begin{aligned}
\text{Sunshine's range} &= 12-9 && \text{Subtract 9 from 12} \\
&= 3 && \text{Evaluate} \\
\text{Shade's range} &= 12-9 && \text{Subtract 9 from 12} \\
&= 3 && \text{Evaluate}
\end{aligned}

This means that the two groups of plants had the same height range. So, the correct answer is Option C.

b

Which dot plot shows a positive skew?

A
Shade
B
Neither
C
Sunshine
Worked Solution
Create a strategy

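A positively skewed data set has most of its scores at the lower end of the scale, with a tail towards the higher values. Check which of the dot plots has this shape.

Apply the idea

Comparing the two dot plots, the Shade dot plot has more low scores than the Sunshine dot plot: 4 of the 8 plants that grew in the shade have a height of 9 cm. With its scores clustered at the lower end, the Shade dot plot shows a positive skew.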

So, the correct answer is Option A.

c

How much higher is the median height of plants grown in the sunshine than the median height of plants grown in the shade?

Worked Solution
Create a strategy

Find the two medians, and then calculate the difference between them.

Apply the idea

For the plants that grew in the sun, the median is the average of the 4th and 5th scores. This means we get the average of 11 and 12.

\begin{aligned}
\text{Median of sunshine} &= \dfrac{11+12}{2} && \text{Get the average of 11 and 12} \\
&= 11.5 \text{ cm} && \text{Evaluate}
\end{aligned}

For the plants that grew in the shade, the median height will also be the average of the 4th and 5th scores. This means we get the average of 9 and 10.

\begin{aligned}
\text{Median of shade} &= \dfrac{9+10}{2} && \text{Get the average of 9 and 10} \\
&= 9.5 \text{ cm} && \text{Evaluate}
\end{aligned}

Getting the height difference of the two medians, we have:

\begin{aligned}
\text{Median height difference} &= 11.5-9.5 && \text{Subtract 9.5 from 11.5} \\
&= 2 \text{ cm} && \text{Evaluate}
\end{aligned}

Idea summary

Parallel dot plots are another way to compare two or more sets of data. They must be plotted against the same scale using the same units.

Outcomes

U3.AoS1.8

two-way frequency tables, segmented bar charts, back-to-back stem plots, parallel boxplots, and scatterplots, and their application in the context of identifying and describing associations

U3.AoS1.20

construct parallel boxplots and use them to identify and describe associations between a numerical variable and a categorical variable
