
7.07 Comparing data sets

Lesson

At the start of this chapter we explored the statistical investigation process in which we seek to answer a statistical question. Often we need to make comparisons between different data sets to answer the questions, such as when we need to make a choice between two or more groups (or populations).

For example, if we are looking for the best tree to plant in the school courtyard for shade, we might compare two similar species of tree to determine "Which species of tree grows the fastest?". This is a statistical question and we can collect data by measuring the growth rates of many trees in a nursery.

If it turns out that all of the individual trees of one species grow faster than all of the individuals in the other species then it will be an easy decision. However, that is not always the case. Each individual tree grows at a different rate, and the measurements from the two populations can overlap.

This is when we need to use statistical methods to analyse and compare data sets.

By comparing the measures of central tendency of each data set (that is, the mean, median and mode), as well as the measures of spread (range, interquartile range and standard deviation), we can make comparisons between different groups and draw conclusions about our data.

Worked example

The number of minutes spent exercising per day for $10$ days is recorded for two people who have just signed up for a new gym membership.

Person A: $45,50,50,55,55,60,60,65,65,65$

Person B: $20,30,45,55,60,60,65,70,70,70$

(a) Calculate the mean, median, mode, sample standard deviation and range for Person A.

Think: The data set is already in order, which makes finding the median and range much easier. To find the mean we sum the scores and divide by $10$ (the number of scores), and the mode is the most frequently occurring score. To compute the sample standard deviation, we refer to our calculator.

Do: We see that:

Mean: $57$
Median: $57.5$
Mode: $65$
Standard deviation: $7.15$ (2 d.p.)
Range: $20$
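
As a check on the calculator value, the sample standard deviation can also be worked by hand. This sketch assumes the calculator uses the usual sample formula with divisor $n-1$:

$$s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}$$

For Person A, $n=10$ and $\bar{x}=57$, so the deviations from the mean are $-12,-7,-7,-2,-2,3,3,8,8,8$ and their squares sum to $460$:

$$s=\sqrt{\frac{460}{9}}\approx 7.15$$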

 

(b) Calculate the mean, median, mode, sample standard deviation and range for Person B.

Think: We will perform the same calculations as before, but using the second data set. Note that this data set is also already ordered.

Do: We see that:

Mean: $54.5$
Median: $60$
Mode: $70$
Standard deviation: $17.55$ (2 d.p.)
Range: $50$
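
The values in parts (a) and (b) can be cross-checked in software. The following is a minimal Python sketch, assuming the standard library `statistics` module stands in for the calculator. The helper name `summarise` is our own and is not part of the lesson.

```python
import statistics

# Daily exercise minutes copied from the worked example.
person_a = [45, 50, 50, 55, 55, 60, 60, 65, 65, 65]
person_b = [20, 30, 45, 55, 60, 60, 65, 70, 70, 70]


def summarise(scores):
    """Return the five summary statistics used in this worked example."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "mode": statistics.mode(scores),
        # stdev() divides by n - 1, i.e. it is the sample standard deviation.
        "stdev": round(statistics.stdev(scores), 2),
        "range": max(scores) - min(scores),
    }


summary_a = summarise(person_a)
summary_b = summarise(person_b)

for name, summary in (("Person A", summary_a), ("Person B", summary_b)):
    print(name, summary)
```

The printed summaries should agree with the values listed in the two Do steps. Note that `statistics.pstdev` would give the population standard deviation instead, which is a different (smaller) number.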

 

(c) Which person is the most consistent with their exercise, and why?

Think: The person who is most consistent will have scores that are closer together. The two measures of spread that we have found are the range and standard deviation, so we should compare these to see who is most consistent.

Do: Person A has a much smaller range and standard deviation than Person B. In fact, both measures for Person A are less than half those of Person B. We can conclude that Person A is more consistent.
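
The "less than half" claim can be checked by taking ratios of the two measures of spread. This is a small Python sketch under the same assumptions as the worked example (sample standard deviation, data as given); the function and variable names are illustrative only.

```python
import statistics

person_a = [45, 50, 50, 55, 55, 60, 60, 65, 65, 65]
person_b = [20, 30, 45, 55, 60, 60, 65, 70, 70, 70]


def spread(scores):
    """Return (range, sample standard deviation) for a list of scores."""
    return max(scores) - min(scores), statistics.stdev(scores)


range_a, sd_a = spread(person_a)
range_b, sd_b = spread(person_b)

# A ratio below 0.5 means Person A's spread is less than half of Person B's.
range_ratio = range_a / range_b
sd_ratio = sd_a / sd_b

print(f"range ratio (A/B): {range_ratio:.2f}")
print(f"standard deviation ratio (A/B): {sd_ratio:.2f}")
```

Both ratios come out below $0.5$, which supports the conclusion that Person A is the more consistent of the two.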

(d) Which person seems to train more overall, and why?

Think: To determine who seems to train more overall, we should consider which measures of central tendency of the two sets are appropriate for comparison.

Do: The mode and median for Person B are both larger than for Person A.

However, the mode and median ignore the actual values of most of the scores in the data set.

If we wanted to see who trained more overall, why not add all the minutes spent at the gym and compare those?

The mean effectively does this, and then divides the total by $10$ (the number of scores).

Person A has the larger mean ($57$ minutes compared with $54.5$ minutes), so Person A trains more overall.

Reflect: Sometimes the two data sets that we want to compare have different sizes. If we want to know which data set typically "performed" better, we can't just add up all the scores, because the data set with more scores will have an advantage. Instead, we divide each sum by the number of scores in that data set. This gives us a fair comparison of how well each data set performed on average.
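
To illustrate why totals can mislead when the data sets differ in size, here is a short Python sketch. The two exercise logs are hypothetical and were made up for this illustration; they do not come from the lesson.

```python
import statistics

# Hypothetical exercise logs (minutes per day) of different lengths.
short_log = [60, 65, 70, 60, 65]                     # 5 days
long_log = [30, 35, 40, 30, 35, 40, 30, 35, 40, 35]  # 10 days

total_short, total_long = sum(short_log), sum(long_log)
mean_short = statistics.mean(short_log)
mean_long = statistics.mean(long_log)

# The longer log wins on total minutes simply because it has more entries...
print("totals:", total_short, total_long)
# ...but dividing by the number of scores shows who trains more per day.
print("means:", mean_short, mean_long)
```

In this made-up case the longer log has the larger total, while the shorter log has the larger mean, so the two comparisons point in opposite directions.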

There are times when the mode and median are the better measures of central tendency to compare. This will be more appropriate when outliers are involved, or if the actual values of the scores are less important.
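
The effect of an outlier on the mean and median can be seen in a quick experiment. In this Python sketch we take Person A's data and replace one score with a hypothetical extreme value of $300$ minutes; that value is an assumption made purely for illustration.

```python
import statistics

person_a = [45, 50, 50, 55, 55, 60, 60, 65, 65, 65]
# Hypothetical: the last score of 65 is replaced by an outlier of 300.
with_outlier = person_a[:-1] + [300]

mean_shift = statistics.mean(with_outlier) - statistics.mean(person_a)
median_shift = statistics.median(with_outlier) - statistics.median(person_a)

print("change in mean:", mean_shift)
print("change in median:", median_shift)
```

The single outlier moves the mean by more than $20$ minutes, while the median does not change at all, which is why the median is often preferred when outliers are present.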

Practice questions

Question 1

The beaks of two groups of birds are measured, in mm, to determine whether they might be of the same species.

Length of beaks of two groups of birds (in mm)

Group 1: $33,39,31,27,22,37,30,24,24,28$

Group 2: $29,44,45,34,31,44,44,33,37,34$

  1. Calculate the range for Group 1.

  2. Calculate the range for Group 2.

  3. Calculate the mean for Group 1. Give your answer as a decimal.

  4. Calculate the mean for Group 2. Give your answer as a decimal.

  5. Choose the most appropriate statement that describes the set of data.

    A. Although the ranges are similar, the mean values are significantly different indicating that these two groups of birds are of the same species.

    B. Although the ranges are similar, the mean values are significantly different indicating that these two groups of birds are not of the same species.

    C. Although the mean values are similar, the ranges are significantly different indicating that these two groups of birds are not of the same species.

    D. Although the mean values are similar, the ranges are significantly different indicating that these two groups of birds are of the same species.

Question 2

The box plots drawn below show the number of repetitions of a $70$ kg bar that two weightlifters can lift. They both record their repetitions over $30$ days.

  1. Which weightlifter has the more consistent results?

    A. Weightlifter A.

    B. Weightlifter B.

  2. What statistical evidence supports your answer?

    A. The mean.

    B. The range.

    C. The mode.

    D. The graph is positively skewed.

  3. Which statistic is the same for each weightlifter?

    A. The median.

    B. The mean.

    C. The mode.

  4. Which weightlifter can do the most repetitions of $70$ kg?

    A. Weightlifter A.

    B. Weightlifter B.

Question 3

Student X scored $86,83,86,88,98$ and Student Y scored $61,83,50,85,83$ across $5$ exams.

  1. Find the mean score of Student X, writing your answer as a decimal.

  2. Find the mean score of Student Y, writing your answer as a decimal.

  3. Find the standard deviation of the scores for Student X, correct to two decimal places.

  4. Find the standard deviation of the scores for Student Y, correct to two decimal places.

  5. Which student performed better?

    A. Student X

    B. Student Y

  6. Which student performed more consistently?

    A. Student X

    B. Student Y

 

Back-to-back stem plots

We have seen that parallel box plots are a useful way to compare numerical data sets, as they display both sets on the same scale. We can create a similar comparison by creating a back-to-back stem plot. The advantage of a back-to-back stem plot is that the original data is retained and we can calculate the mean, mode and other statistics exactly. A disadvantage is that it is only suitable for small to medium data sets and we can only compare two sets of data at a time.

Two sets of data can be displayed side-by-side using a back-to-back stem plot. In the example below, the pulse rates of $18$ students were recorded before and after exercise.

Reading a back-to-back stem plot is very similar to reading a regular stem plot.

Referring to the example above:

  • The central column displays the stems, with the leaf values on each side.
  • The values on the left are the pulse rates of the students before exercise, while the values on the right are their pulse rates after exercise.
  • In this example, the fourth row of the plot, $4\ 3\ 0 \mid 8 \mid 2\ 2\ 6$, displays pulse rates of $80$, $83$ and $84$ before exercise and pulse rates of $82$, $82$ and $86$ after exercise. They are not necessarily the pulse rates of the same students.
  • On both sides of the stem column, the leaves are displayed in ascending order with the lowest value closest to the stem.

To create a stem plot, it is usually easier to arrange all of the data values in ascending order before entering them in the plot.
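
The construction described above can be sketched in Python. The helper `back_to_back` is a hypothetical name of our own, and it assumes non-negative whole numbers with the last digit used as the leaf. The sample data are made up, apart from the values in the $80$s row, which reuse the pulse rates quoted in the example.

```python
def back_to_back(left, right):
    """Return the text rows of a back-to-back stem plot (leaf = last digit)."""
    stems = sorted({v // 10 for v in left} | {v // 10 for v in right})
    rows = []
    # Include every stem between the smallest and largest, even if it is empty.
    for stem in range(stems[0], stems[-1] + 1):
        left_leaves = sorted(v % 10 for v in left if v // 10 == stem)
        right_leaves = sorted(v % 10 for v in right if v // 10 == stem)
        rows.append((
            # Reverse the left side so the lowest leaf sits closest to the stem.
            " ".join(str(d) for d in reversed(left_leaves)),
            str(stem),
            " ".join(str(d) for d in right_leaves),
        ))
    left_width = max(len(row[0]) for row in rows)
    stem_width = max(len(row[1]) for row in rows)
    return [
        f"{left_text:>{left_width}} | {stem_text:>{stem_width}} | {right_text}".rstrip()
        for left_text, stem_text, right_text in rows
    ]


# Illustrative pulse rates (mostly hypothetical).
before = [68, 72, 75, 80, 83, 84, 91]
after = [79, 82, 82, 86, 95, 98, 104]

plot_lines = back_to_back(before, after)
print("\n".join(plot_lines))
```

Sorting the leaves inside the function means the raw data can be passed in any order, which mirrors the advice to order the values before placing them in the plot.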

 

Practice questions

Question 4

The data below shows the results of a survey conducted on the price of concert tickets locally and the price of the same concerts at an international venue.

Local

Stem | Leaf
6 | 0 4 6 7
7 | 3 5 6 6 7
8 | 2 4 4 5 7
9 | 1 4 6 7 9
10 | 4

International

Stem | Leaf
6 | 0 7
7 | 0 0 3 4
8 | 0 5 6 6
9 | 1 1 3 4 6
10 | 1 4 4 5 6

Key: $1 \mid 2 = 12$

  1. What was the most expensive ticket price, in dollars, at the international venue?

  2. What was the median ticket price at the international venue? Give your answer to two decimal places if needed.

  3. What percentage of local ticket prices were cheaper than the international median?

  4. At the international venue, what percentage of tickets cost between $\$90$ and $\$110$ (inclusive)?

  5. At the local venue, what percentage of tickets cost between $\$90$ and $\$100$ (inclusive)?

Question 5

10 participants had their pulse measured before and after exercise with results shown in the stem-and-leaf plot below.

Key: 6 | 1 | 2 = 12 and 16

  1. What is the mode pulse rate after exercise?


  2. How many modes are there for the pulse rate before exercise?


  3. What is the range of pulse rates before exercise?

  4. What is the range of pulse rates after exercise?

  5. Calculate the mean pulse rate before exercise.

  6. What is the mean pulse rate after exercise?

  7. What can you conclude from the measures of centre and spread that you have just calculated?

    A. The range of pulse rates decreases after exercise.

    B. The range of pulse rates and the mean pulse rate increase after exercise.

    C. The range of pulse rates increasing after exercise shows that some people are fitter than others.

    D. The mode pulse rate is the best comparison of pulse rates before and after exercise.

Outcomes

ACMGM032

compare groups on a single numerical variable using medians, means, IQRs, ranges or standard deviations, as appropriate; interpret the differences observed in the context of the data; and report the findings in a systematic and concise manner
