
4.09 Compare data sets

Lesson

It is important to be able to compare data sets because doing so helps us draw conclusions or make judgements about the data. Some data sets are easily compared directly because they measure similar things, but sometimes we need to compare data sets that are quite different.

For example, suppose Jim scores $\frac{5}{10}$ in a geography test and $\frac{6}{10}$ in a history test. Based on those marks alone, it makes sense to say that he did better in history. But what if everyone else in his geography class scored $\frac{4}{10}$, while everyone else in his history class scored $\frac{8}{10}$? Now we know that Jim had the highest score in the class in geography, and the lowest score in the class in history. With this extra information, it makes more sense to say that he did better in geography.

By using the measures of central tendency of a data set (that is, the mean, median and mode), as well as measures of spread (such as the range, interquartile range and standard deviation), we can make clear comparisons and contrasts between different groups. We can also benefit from examining the shape of the distribution of two sets of data when comparing them.

Worked example

Example 1

The number of minutes spent exercising per day for $10$ days is recorded for two people who have just signed up for a new gym membership.

Person A: $45,50,50,55,55,60,60,65,65,65$

Person B: $20,30,45,55,60,60,65,70,70,70$

(a) Calculate the mean, median, mode, sample standard deviation and range for Person A.

Think: The data set is already in order, which makes finding the median and range much easier. To find the mean, we sum the scores and divide by $10$ (the number of scores). The mode is the most frequently occurring score. To compute the sample standard deviation, we use a calculator.

Do: We see that:

Mean: $57$
Median: $57.5$
Mode: $65$
Standard deviation: $7.15$ ($2$ d.p.)
Range: $20$
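
A calculator hides the steps behind the sample standard deviation. As a rough sketch, the same value can be worked out in Python from the definition, dividing the sum of squared deviations by one less than the number of scores. The function name `sample_std_dev` is illustrative only and is not part of the lesson.

```python
import math
import statistics

# Person A's daily exercise minutes, copied from the example.
person_a = [45, 50, 50, 55, 55, 60, 60, 65, 65, 65]

def sample_std_dev(data):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]
    return math.sqrt(sum(squared_deviations) / (n - 1))

by_hand = sample_std_dev(person_a)
from_library = statistics.stdev(person_a)  # the library also divides by n - 1

print(round(by_hand, 2))
print(round(from_library, 2))
```

Both values should agree with the calculator result in the table when rounded to two decimal places.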

 

(b) Calculate the mean, median, mode, sample standard deviation and range for Person B.

Think: We will perform the same calculations as before, but using the second data set. Note that this data set is also already ordered.

Do: We see that:

Mean: $54.5$
Median: $60$
Mode: $70$
Standard deviation: $17.55$ ($2$ d.p.)
Range: $50$
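
The two sets of results can also be checked with software instead of a calculator. The sketch below uses Python's standard `statistics` module. The helper name `summarise` is an assumption made for illustration.

```python
import statistics

# Daily exercise minutes for each person, copied from the example.
person_a = [45, 50, 50, 55, 55, 60, 60, 65, 65, 65]
person_b = [20, 30, 45, 55, 60, 60, 65, 70, 70, 70]

def summarise(data):
    """Return the five measures asked for in parts (a) and (b)."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "sample_sd": round(statistics.stdev(data), 2),  # n - 1 divisor
        "range": max(data) - min(data),
    }

summary_a = summarise(person_a)
summary_b = summarise(person_b)

for name, summary in [("Person A", summary_a), ("Person B", summary_b)]:
    print(name, summary)
```

Putting both summaries side by side makes the differences in centre and spread easier to see before answering the comparison questions.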

 

(c) Which person is the most consistent with their exercise, and why?

Think: The person who is most consistent will have scores that are closer together. The two measures of spread that we have found are the range and standard deviation, so we should compare these to see who is most consistent.

Do: Person A has a much smaller range and standard deviation than Person B. In fact, both measures for Person A are less than half those of Person B. We can conclude that Person A is more consistent.

(d) Which person seems to train more overall, and why?

Think: To determine who seems to train more overall, we should consider which measures of central tendency are appropriate for comparison.

Do: The mode and median for Person B are both larger than for Person A.

However, the mode and median ignore the actual values of most of the data set.

If we want to see who trained more overall, why not add up all the minutes spent at the gym and compare the totals?

The mean effectively does this, and then divides each total by $10$ (the number of scores).

Person A's mean of $57$ minutes is higher than Person B's mean of $54.5$ minutes, so Person A trains more overall.

Reflect: Sometimes two data sets that we want to compare have different sizes. If we want to know which data set typically "performed" better, we can't just add up all the scores, because the data set with more scores will have an advantage. Instead, we divide each sum by the number of scores in that data set. This gives us a fair comparison of how well each data set performed on average.
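
A small sketch of this idea in Python is given below. The two lists of scores are hypothetical, invented purely for illustration, and the names `class_x` and `class_y` do not come from the lesson.

```python
# Hypothetical scores, invented for illustration (not from the lesson).
class_x = [8, 9, 7]            # 3 scores
class_y = [5, 6, 5, 6, 5, 6]   # 6 scores

total_x, total_y = sum(class_x), sum(class_y)

# Dividing each total by its own number of scores gives the mean.
mean_x = total_x / len(class_x)
mean_y = total_y / len(class_y)

print(total_x, total_y)  # the totals favour the larger data set
print(mean_x, mean_y)    # the means compare typical scores fairly
```

Here the larger data set has the bigger total even though its individual scores are lower, which is why the totals alone would be misleading.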

There are times when the median or mode is the better measure of central tendency to compare. This is more appropriate when outliers are involved, or when the actual values of the scores are less important.
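
To see why outliers matter, the sketch below adds one extreme value to Person A's data from Example 1 and measures how far the mean and the median each move. The extra value of 300 minutes is hypothetical and was chosen only to illustrate the effect.

```python
import statistics

# Person A's minutes from Example 1.
person_a = [45, 50, 50, 55, 55, 60, 60, 65, 65, 65]

# Hypothetical extra day, invented for illustration: one unusually long session.
with_outlier = person_a + [300]

# How far each measure of centre moves once the outlier is included.
mean_shift = statistics.mean(with_outlier) - statistics.mean(person_a)
median_shift = statistics.median(with_outlier) - statistics.median(person_a)

print(round(mean_shift, 2))
print(median_shift)
```

The mean moves much further than the median, so the median gives a steadier picture of a typical day when an outlier is present.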

Practice questions

Question 1

The beaks of two groups of birds are measured, in mm, to determine whether they might be of the same species.

Length of beaks of two groups of birds (in mm)
Group 1: $33$, $39$, $31$, $27$, $22$, $37$, $30$, $24$, $24$, $28$
Group 2: $29$, $44$, $45$, $34$, $31$, $44$, $44$, $33$, $37$, $34$
  1. Calculate the range for Group 1.

  2. Calculate the range for Group 2.

  3. Calculate the mean for Group 1. Give your answer as a decimal.

  4. Calculate the mean for Group 2. Give your answer as a decimal.

  5. Choose the most appropriate statement that describes the set of data.

    A. Although the ranges are similar, the mean values are significantly different indicating that these two groups of birds are of the same species.

    B. Although the ranges are similar, the mean values are significantly different indicating that these two groups of birds are not of the same species.

    C. Although the mean values are similar, the ranges are significantly different indicating that these two groups of birds are not of the same species.

    D. Although the mean values are similar, the ranges are significantly different indicating that these two groups of birds are of the same species.

Question 2

The box plots drawn below show the number of repetitions of a $70$ kg bar that two weightlifters can lift. They both record their repetitions over $30$ days.

  1. Which weightlifter has the more consistent results?

    A. Weightlifter A.

    B. Weightlifter B.
  2. What statistical evidence supports your answer?

    A. The mean.

    B. The range.

    C. The mode.

    D. The graph is positively skewed.
  3. Which statistic is the same for each weightlifter?

    A. The median.

    B. The mean.

    C. The mode.
  4. Which weightlifter can do the most repetitions of $70$ kg?

    A. Weightlifter A.

    B. Weightlifter B.

 

Back-to-back stem plots

We have seen that parallel box plots are a useful way to compare numerical data sets, as they display the two sets on the same scale. We can make a similar comparison with a back-to-back stem plot. The advantage of a back-to-back stem plot is that the original data is retained, so we can calculate the mean, mode and other statistics exactly. A disadvantage is that it is only suitable for small to medium data sets, and we can only compare two sets of data at a time.

Two sets of data can be displayed side-by-side using a back-to-back stem plot. In the example below, the pulse rates of $18$ students were recorded before and after exercise.

Reading a back-to-back stem plot is very similar to reading a regular stem plot.

Referring to the example above:

  • The central column displays the stems, with the leaf values on each side.
  • The values on the left are the pulse rates of the students before exercise, while the values on the right are their pulse rates after exercise.
  • In this example, the fourth row of the plot, $4$ $3$ $0$ $\mid$ $8$ $\mid$ $2$ $2$ $6$, displays pulse rates of $80$, $83$ and $84$ before exercise and pulse rates of $82$, $82$ and $86$ after exercise. They are not necessarily the pulse rates of the same students.
  • On both sides of the stem column, the leaves are displayed in ascending order with the lowest value closest to the stem.

To create a stem plot, it is usually easier to arrange all of the data values in ascending order before placing them in the plot.
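
The same procedure can be sketched in Python: sort each data set, split every value into a stem (tens) and a leaf (units), then print the left-hand leaves in reverse so that the smallest leaf sits next to the stem. The function name `back_to_back_stem_plot` is an assumption for illustration, and the sketch only handles non-negative whole numbers.

```python
from collections import defaultdict

def back_to_back_stem_plot(left, right):
    """Return the rows of a back-to-back stem plot as strings.

    Assumes non-negative whole numbers, with tens as stems and units as leaves.
    """
    left_leaves, right_leaves = defaultdict(list), defaultdict(list)
    # Sorting first means each stem's leaves are collected in ascending order.
    for value in sorted(left):
        left_leaves[value // 10].append(value % 10)
    for value in sorted(right):
        right_leaves[value // 10].append(value % 10)

    # Include every stem from the smallest to the largest, even if empty.
    lowest = min(min(left), min(right)) // 10
    highest = max(max(left), max(right)) // 10
    stems = range(lowest, highest + 1)

    # Left leaves are reversed so the smallest leaf is closest to the stem.
    left_texts = {s: " ".join(str(leaf) for leaf in reversed(left_leaves[s]))
                  for s in stems}
    width = max(len(text) for text in left_texts.values())

    rows = []
    for s in stems:
        right_text = " ".join(str(leaf) for leaf in right_leaves[s])
        rows.append(f"{left_texts[s]:>{width}} | {s} | {right_text}")
    return rows

# The six pulse rates read from the fourth row of the example plot.
before = [83, 80, 84]
after = [86, 82, 82]
rows = back_to_back_stem_plot(before, after)
for row in rows:
    print(row)
```

With these six values the function produces a single row for the stem $8$, matching the row read from the example plot.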

 

Practice questions

Question 3

The data below shows the results of a survey conducted on the price of concert tickets locally and the price of the same concerts at an international venue.

Local
Stem $\mid$ Leaf
$6$ $\mid$ $0$ $4$ $6$ $7$
$7$ $\mid$ $3$ $5$ $6$ $6$ $7$
$8$ $\mid$ $2$ $4$ $4$ $5$ $7$
$9$ $\mid$ $1$ $4$ $6$ $7$ $9$
$10$ $\mid$ $4$

International
Stem $\mid$ Leaf
$6$ $\mid$ $0$ $7$
$7$ $\mid$ $0$ $0$ $3$ $4$
$8$ $\mid$ $0$ $5$ $6$ $6$
$9$ $\mid$ $1$ $1$ $3$ $4$ $6$
$10$ $\mid$ $1$ $4$ $4$ $5$ $6$

Key: $1\mid2=12$
  1. What was the most expensive ticket price, in dollars, at the international venue?

  2. What was the median ticket price at the international venue? Leave your answer to two decimal places if needed.

  3. What percentage of local ticket prices were cheaper than the international median?

  4. At the international venue, what percentage of tickets cost between $\$90$ and $\$110$ (inclusive)?

  5. At the local venue, what percentage of tickets cost between $\$90$ and $\$100$ (inclusive)?

Question 4

10 participants had their pulse measured before and after exercise with results shown in the stem-and-leaf plot below.

Key: 6 | 1 | 2 $=$ 12 and 16
  1. What is the mode pulse rate after exercise?

  2. How many modes are there for the pulse rate before exercise?

  3. What is the range of pulse rates before exercise?

  4. What is the range of pulse rates after exercise?

  5. Calculate the mean pulse rate before exercise.

  6. What is the mean pulse rate after exercise?

  7. What can you conclude from the measures of centre and spread that you have just calculated?

    A. The range of pulse rates decreases after exercise.

    B. The range of pulse rates and the mean pulse rate increase after exercise.

    C. The range of pulse rates increasing after exercise shows that some people are fitter than others.

    D. The mode pulse rate is the best comparison of pulse rates before and after exercise.

Outcomes

3.3.2.3

compare parallel box plots and back-to-back stem plots for different datasets [complex]

3.3.2.4

compare the characteristics of the shape of histograms using symmetry, skewness and bimodality, where applicable [complex]
