10.05 Comparing data sets

Lesson

Introduction

There are many things to keep in mind when comparing two sets of data. A few of the most important questions to ask yourself are:

  • How do the spreads of data compare?

  • How do the skews compare? Is one set of data more symmetrical?

  • Is there a big difference in the medians?

Back-to-back stem-and-leaf plots

A back-to-back stem-and-leaf plot is very similar to a regular stem-and-leaf plot, in that the "stem" is used to group the scores and each "leaf" represents an individual score within its group.

Group A | Stem | Group B
    7 3 |  1   | 0 3 6
    5 0 |  2   | 1 6 7 8
  6 5 5 |  3   | 5 5 6
    1 1 |  4   | 1 1 5 6 9
  8 4 3 |  5   | 0 3 6 8

Key (Group A): 5\vert 2 = 25
Key (Group B): 2\vert 1 = 21

In a back-to-back stem-and-leaf plot, however, two sets of data are displayed simultaneously. One set of data is displayed with its leaves on the left, and the other with its leaves on the right. The "leaf" values are still written in ascending order from the stem outwards.

This allows us to compare the shape and location of the two distributions side by side.
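The layout described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `back_to_back` is made up for this example, and it assumes non-negative whole numbers whose last digit is the leaf. The two lists are the scores read from the Group A and Group B plot above.

```python
from collections import defaultdict


def back_to_back(left, right):
    """Build the rows of a back-to-back stem-and-leaf plot for whole numbers."""
    # Map each stem (tens and above) to a pair of leaf lists: (left, right).
    leaves = defaultdict(lambda: ([], []))
    for value in left:
        leaves[value // 10][0].append(value % 10)
    for value in right:
        leaves[value // 10][1].append(value % 10)

    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        left_leaves, right_leaves = leaves[stem]
        # Left leaves are printed in descending order so that they
        # ascend when read outwards from the stem.
        left_text = " ".join(str(leaf) for leaf in sorted(left_leaves, reverse=True))
        right_text = " ".join(str(leaf) for leaf in sorted(right_leaves))
        rows.append((left_text, stem, right_text))

    width = max(len(left_text) for left_text, _, _ in rows)
    return [f"{l:>{width}} | {s} | {r}".rstrip() for l, s, r in rows]


# Scores read from the Group A / Group B plot above.
group_a = [13, 17, 20, 25, 35, 35, 36, 41, 41, 53, 54, 58]
group_b = [10, 13, 16, 21, 26, 27, 28, 35, 35, 36, 41, 41,
           45, 46, 49, 50, 53, 56, 58]

print("\n".join(back_to_back(group_a, group_b)))
```

Sorting the left-hand leaves in reverse is the one step that differs from a regular stem-and-leaf plot, since those leaves are read from the stem towards the left.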

Examples

Example 1

The weights (in kilograms) of a group of men and women were recorded and presented in a back-to-back stem-and-leaf plot as shown.

    Women | Stem | Men
    7 6 3 |  5   |
8 6 3 1 1 |  6   | 2 3 3 8 9
      3 1 |  7   | 1 2 4 8
          |  8   | 3

Key (Women): 1\vert 6 = 61
Key (Men): 6\vert 2 = 62
a

What is the mean weight of the group of men?

Worked Solution
Create a strategy

Use the formula: \text{Mean}=\dfrac{\text{Sum of scores}}{\text{Number of scores}}

Apply the idea

We are focusing on the stems and the leaves under "Men" on the right side of the stem and leaf plot.

\begin{aligned}
\text{Mean} &= \dfrac{62+63+63+68+69+71+72+74+78+83}{10} && \text{Substitute the values} \\
&= \dfrac{703}{10} && \text{Evaluate the addition} \\
&= 70.3 \text{ kg} && \text{Evaluate the division}
\end{aligned}
b

What is the mean weight of the group of women?

Worked Solution
Create a strategy

Use the formula: \text{Mean}=\dfrac{\text{Sum of scores}}{\text{Number of scores}}

Apply the idea

We are focusing on the stems and the leaves under "Women" on the left side of the stem and leaf plot.

\begin{aligned}
\text{Mean} &= \dfrac{53+56+57+61+61+63+66+68+71+73}{10} && \text{Substitute the values} \\
&= \dfrac{629}{10} && \text{Evaluate the addition} \\
&= 62.9 \text{ kg} && \text{Evaluate the division}
\end{aligned}
c

Which group is heavier?

A. Women
B. Men
Worked Solution
Create a strategy

Compare the mean weights of the two groups.

Apply the idea

The group with the higher mean weight is, on average, heavier. The men had a mean weight of 70.3 \text{ kg}, which is higher than the women's mean of 62.9 \text{ kg}. So the correct answer is option B.
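The calculations in Example 1 can be checked with a short Python sketch. The dictionaries below copy the stems and leaves from the plot; the helper name `expand` is an assumption made for this illustration.

```python
from statistics import mean

# Stem -> leaves, copied from the plot in Example 1.
women = {5: [3, 6, 7], 6: [1, 1, 3, 6, 8], 7: [1, 3]}
men = {6: [2, 3, 3, 8, 9], 7: [1, 2, 4, 8], 8: [3]}


def expand(plot):
    """Turn stems and leaves back into the original two-digit scores."""
    return [10 * stem + leaf for stem, leaves in plot.items() for leaf in leaves]


men_mean = mean(expand(men))
women_mean = mean(expand(women))

# The group with the larger mean is heavier on average.
heavier = "Men" if men_mean > women_mean else "Women"
print(men_mean, women_mean, heavier)
```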

Idea summary

In a back-to-back stem-and-leaf plot two sets of data are displayed simultaneously. One set of data is displayed with its leaves on the left of the stems, and the other with its leaves on the right. The "leaf" values are still written in ascending order from the stem outwards.

Compare data sets

Consider the following histograms that show the height of students in two basketball teams. We know that one graph represents a team made up of Year 12 students and the other represents a Year 8 team. Which one corresponds to the Year 12 team?

Team A:

The image shows a histogram of Team A. Ask your teacher for more information.

Team B:

The image shows a histogram of Team B. Ask your teacher for more information.

We can still compare these distributions, even though there is clearly a different number of students in the teams, because we are only interested in the shape and location of the data.

In our comparison we want to mention the most significant differences, and also describe relevant characteristics that are the same, or similar.

In this case we can observe these important similarities and differences:

  • The modal class for Team A of 170-175 cm is much higher than the modal class 150-155 cm for Team B.

  • If we ignore the outlier values in the 195-200 class for Team A, then the range of both distributions is similar, at 35 cm for Team A and 30 cm for Team B.

  • Student heights have greater spread overall for Team A.

  • The heights for Team A appear to be concentrated around the modal class, so we can say that they are clustered at 170-180 cm.

Based on these observations, we could confidently say that Team A is the team of Year 12 students. In this case, the decision is clear because of the difference in the height for the modal class, which we would expect to be significantly higher for the older students.

It is important to be able to compare data sets because it helps us make conclusions or judgements about the data. For example, suppose Jim scores \dfrac{5}{10} in a geography test and \dfrac{6}{10} in a history test. Based on those marks alone, it makes sense to say that he did better in history.

But what if everyone else in his geography class scored \dfrac{4}{10}, while everyone else in his history class scored \dfrac{8}{10}? Now we know that Jim had the highest score in the class in geography, and the lowest score in the class in history. With this extra information, it makes more sense to say that he did better in geography.
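One way to make this comparison concrete is to measure each mark against the mean mark of the rest of the class. The sketch below does that in Python. The class size of 19 classmates is a hypothetical value (the lesson does not give one), and the function name `margin` is made up for this illustration.

```python
from statistics import mean

CLASSMATES = 19  # hypothetical: the lesson does not state the class size


def margin(own_score, classmate_scores):
    """Return how far a score sits above (+) or below (-) the classmates' mean."""
    return own_score - mean(classmate_scores)


# Jim's marks and his classmates' marks, as described in the text.
geography = margin(5, [4] * CLASSMATES)
history = margin(6, [8] * CLASSMATES)

print(geography, history)
```

A positive margin in geography and a negative margin in history matches the conclusion that Jim did better in geography relative to his class.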

By comparing the measures of central tendency in a data set (the mean, median and mode), as well as measures of spread (the range and interquartile range), we can make comparisons between different groups and draw conclusions about our data.
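As a rough sketch of how these measures can be computed side by side, the Python below summarises the men's and women's weights from Example 1. The function name `summarise` is an assumption for this example. The quartiles are taken as the medians of the lower and upper halves of the sorted data, which is one common convention; other conventions can give slightly different interquartile ranges.

```python
from statistics import mean, median, multimode


def summarise(data):
    """Return centre and spread measures for a list of at least two numbers."""
    ordered = sorted(data)
    half = len(ordered) // 2
    # Quartiles as medians of the lower and upper halves (the middle value
    # is left out when the count is odd). Other conventions exist.
    q1 = median(ordered[:half])
    q3 = median(ordered[-half:])
    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "modes": multimode(ordered),
        "range": ordered[-1] - ordered[0],
        "iqr": q3 - q1,
    }


# Weights in kilograms from Example 1.
men = [62, 63, 63, 68, 69, 71, 72, 74, 78, 83]
women = [53, 56, 57, 61, 61, 63, 66, 68, 71, 73]

for name, data in (("Men", men), ("Women", women)):
    print(name, summarise(data))
```

For this data the measures of centre differ between the groups while the measures of spread are nearly the same, so the two distributions have a similar shape but different locations.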

Examples

Example 2

A science class of 20 students was given two different 10-question True/False tests, one about dinosaurs and one about nanotechnology. The results for each topic test are shown below:

The image shows two histograms for the test results of Dinosaurs and Nanotechnology. Ask your teacher for more information.
a

Which topic did the class know more about?

Worked Solution
Create a strategy

Compare the scores for each histogram.

Apply the idea

The scores for the dinosaur topic test range from 4 to 7, while the scores for the nanotechnology topic test range from 7 to 10. Every nanotechnology score is at least as high as the highest dinosaur score, so the class knows more about nanotechnology.

b

Which statistical piece of evidence supports your answer?

A. The positive skew of the graph
B. The mean
C. The range
Worked Solution
Create a strategy

Compare the shape and range of each graph.

Apply the idea

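The graph for the nanotechnology topic has a negative skew, not a positive one, so option A is not correct. Both topics have the same range, which is 3, so the range cannot tell them apart and option C is not correct. The mean describes the typical score, and it is clearly higher for the nanotechnology test.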

The correct answer is option B.

c

Which statistic is the same for each topic?

A. The mode
B. The range
Worked Solution
Create a strategy

Evaluate the mode and range for each topic.

Apply the idea

The mode is the score with the highest frequency. The mode for the dinosaur topic test is 5, and the mode for the nanotechnology topic test is 9.

The range for the dinosaur topic test is 7-4=3, and the range for the nanotechnology topic test is 10-7=3.

The range is the same for both topics. So the correct answer is option B.
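The same check can be sketched in Python from a frequency table. The frequencies below are the ones that appear in the worked solutions for parts (d) and (e) of this example; the function name `mode_and_range` is an assumption for this illustration.

```python
# Score -> frequency for each topic test in Example 2.
dinosaurs = {4: 5, 5: 9, 6: 4, 7: 2}
nanotechnology = {7: 2, 8: 4, 9: 8, 10: 6}


def mode_and_range(frequencies):
    """Return (list of modes, range) for a score -> frequency table."""
    # Only scores that actually occurred count towards the range.
    scores = [score for score, count in frequencies.items() if count > 0]
    highest_count = max(frequencies.values())
    modes = [score for score in scores if frequencies[score] == highest_count]
    return modes, max(scores) - min(scores)


print(mode_and_range(dinosaurs))
print(mode_and_range(nanotechnology))
```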

d

Calculate the mean for the dinosaur topic test. Give your answer correct to one decimal place.

Worked Solution
Create a strategy

Use the formula: \text{Mean}=\dfrac{\text{Sum of scores}}{\text{Number of scores}}

Apply the idea
\begin{aligned}
\text{Sum} &= 5\times 4+9\times 5+4\times 6+2\times 7 && \text{Multiply each score by its frequency} \\
&= 103 && \text{Evaluate} \\
\text{Mean} &= \dfrac{103}{20} && \text{Use the formula} \\
&\approx 5.2 && \text{Evaluate the division}
\end{aligned}
e

Calculate the mean for the nanotechnology topic test. Give your answer correct to one decimal place.

Worked Solution
Create a strategy

Use the formula: \text{Mean}=\dfrac{\text{Sum of scores}}{\text{Number of scores}}

Apply the idea
\begin{aligned}
\text{Sum} &= 2\times 7+4\times 8+8\times 9+6\times 10 && \text{Multiply each score by its frequency} \\
&= 178 && \text{Evaluate} \\
\text{Mean} &= \dfrac{178}{20} && \text{Use the formula} \\
&\approx 8.9 && \text{Evaluate the division}
\end{aligned}
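The mean of a frequency table can also be computed with a short Python sketch. The function name `mean_from_frequencies` is made up for this example. It uses the `decimal` module so that a value such as 5.15 rounds up to 5.2, as in the worked solution, rather than depending on how binary floating point stores the number.

```python
from decimal import Decimal, ROUND_HALF_UP

# Score -> frequency for each topic test in Example 2.
dinosaurs = {4: 5, 5: 9, 6: 4, 7: 2}
nanotechnology = {7: 2, 8: 4, 9: 8, 10: 6}


def mean_from_frequencies(frequencies, places="0.1"):
    """Return the mean of a score -> frequency table, rounded half up."""
    # Sum of scores: each score multiplied by how often it occurs.
    total = sum(score * count for score, count in frequencies.items())
    # Number of scores: the sum of the frequencies.
    number = sum(frequencies.values())
    exact = Decimal(total) / Decimal(number)
    return exact.quantize(Decimal(places), rounding=ROUND_HALF_UP)


print(mean_from_frequencies(dinosaurs))
print(mean_from_frequencies(nanotechnology))
```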
Idea summary

By comparing the measures of central tendency in a data set (the mean, median and mode), as well as measures of spread (the range and interquartile range), we can make comparisons between different groups and draw conclusions about our data.

Outcomes

VCMSP323

Investigate reports of surveys in digital media and elsewhere for information on how data were obtained to estimate population means and medians.

VCMSP325

Construct back-to-back stem-and-leaf plots and histograms and describe data, using terms including ‘skewed’, ‘symmetric’ and ‘bi modal’.

VCMSP326

Compare data displays using mean, median and range to describe and interpret numerical data sets in terms of location (centre) and spread.
