
11.06 Comparing data distributions

Lesson

It is important to be able to compare data sets because it helps us draw conclusions or make judgements about the data. Some data sets are easily compared directly because they measure similar things, but sometimes we need to compare data sets that are quite different.

For example, suppose Jim scores $\frac{5}{10}$ in a geography test and $\frac{6}{10}$ in a history test. Based on those marks alone, it makes sense to say that he did better in history. But what if everyone else in his geography class scored $\frac{4}{10}$, while everyone else in his history class scored $\frac{8}{10}$? Now we know that Jim had the highest score in the class in geography, and the lowest score in the class in history. With this extra information, it makes more sense to say that he did better in geography.
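The comparison in Jim's example can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the helper name `relative_score` are made up for this sketch, and the marks are the ones quoted in the example.

```python
# Marks out of 10 from the example: Jim's mark and the mark that
# everyone else in the same class received.
scores = {
    "geography": {"jim": 5, "others": 4},
    "history": {"jim": 6, "others": 8},
}


def relative_score(subject):
    """Return how far Jim's mark is above (+) or below (-) his classmates' mark."""
    marks = scores[subject]
    return marks["jim"] - marks["others"]


for subject in scores:
    print(subject, relative_score(subject))
```

A positive result means Jim scored above his classmates in that subject, and a negative result means he scored below them.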

By using the measures of central tendency of a data set (that is, the mean, median and mode), as well as measures of spread (such as the range or standard deviation), we can make comparisons between different groups.

Let's look at an example of data that directly compares similar variables.

 

Worked example

Example 1

Study the bar graph below, which shows the changes in tourism rates in different cities during $2011$ and $2012$, then answer the following questions.


(a) Which city had the highest percentage of tourism in $2011$?

Think: The blue lines represent the $2011$ tourism rates.

Do: Paris has the tallest blue line ($80\%$), so Paris had the highest percentage of tourism in $2011$.

(b) Which city had the lowest percentage of tourism in $2012$?

Think: The red lines represent the $2012$ tourism rates.

Do: Rome has the shortest red line (approximately $25\%$), so Rome had the lowest tourism rate in $2012$.

(c) Which city had the highest percentage of tourism in a single year?

Think: The taller the column, the higher the tourism rate.

Do: New York had the highest percentage of tourism in a single year ($90\%$ in $2012$).

(d) Which cities had the lowest percentage of tourism in a single year?

Think: The shorter the column, the lower the tourism rate.

Do: Tokyo and Rome had the lowest tourism rates (Tokyo had a rate of $25\%$ in $2011$ and Rome had the same rate in $2012$).

(e) How much higher is Paris's percentage of tourism in $2011$ than that of London in $2012$?

Think: We want to find the difference between the two percentages.

Do: In $2011$, Paris's rate was $80\%$.

In $2012$, London's rate was also $80\%$.

Paris's $2011$ rate is $0\%$ higher than London's $2012$ rate. In other words, the two rates are exactly the same.

(f) How much higher is Paris's maximum percentage of tourism over the two years than that of Istanbul?

Think: What are the highest rates for both cities?

Do: Paris's highest rate was $80\%$.

Istanbul's highest rate was $60\%$.

The difference is $80\%-60\%=20\%$, so Paris's highest tourism rate was $20$ percentage points higher than that of Istanbul.

 

Practice question

Question 1

The ages of employees at two competing fast food restaurants on a Saturday night are recorded. Some statistics are given in the following table:

                   Mean   Median   Range
Berger's Burgers   $18$   $17$     $6$
Fry's Fries        $18$   $19$     $2$
  1. If the data for each restaurant was represented using a column graph, what would the likely shape of the column graph for Berger's Burgers be?

    A. Negatively skewed
    B. Symmetrical
    C. Positively skewed
  2. Which restaurant likely has the oldest employee on the night the data is recorded?

    A. Berger's Burgers
    B. Fry's Fries
  3. Which restaurant has the most employees of a similar age?

    A. Berger's Burgers
    B. Fry's Fries
  4. Which statistic supports your answer to part 3?

    A. Median
    B. Mean
    C. Range
  5. Which restaurant has an older workforce?

    A. Fry's Fries
    B. Berger's Burgers
  6. Which statistic supports your answer to part 5?

    A. Mean
    B. Range
    C. Median

 

Considering the shape of data

Calculating the mean, median, mode and range tells us a lot about a single data set, and these statistics are also very useful for comparing and contrasting two different data sets.

We can also benefit from examining the shape of each distribution when comparing two sets of data.

 

Worked example

Example 2

The number of minutes spent exercising per day for $10$ days is recorded for two people who have just signed up for a new gym membership.

Person A:  $45,50,50,55,55,60,60,65,65,65$

Person B:  $20,30,45,55,60,60,65,70,70,70$

(a) Calculate the mean, median, mode and range for Person A.

Think: The data set is already in order, which makes finding the median and range much easier. To find the mean we sum the scores and divide by $10$ (the number of scores). The mode is the most frequently occurring score.

Do: We see that:

mode $=65$ (it occurs three times)

median $=\frac{55+60}{2}=57.5$ (the average of the $5$th and $6$th scores)

mean $=\frac{570}{10}=57$

range $=65-45=20$

(b) Calculate the mean, median, mode and range for Person B.

Think: We will perform the same calculations as before, but using the second data set. Note that this data set is also already ordered.

Do: We see that:

mode $=70$ (it occurs three times)

median $=\frac{60+60}{2}=60$ (the average of the $5$th and $6$th scores)

mean $=\frac{545}{10}=54.5$

range $=70-20=50$
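The values in parts (a) and (b) can be checked with a short Python sketch built on the standard `statistics` module. The helper name `summarise` and the variable names are made up for this illustration, and `multimode` needs Python 3.8 or later.

```python
from statistics import mean, median, multimode

# Minutes of exercise per day over 10 days, copied from Example 2.
person_a = [45, 50, 50, 55, 55, 60, 60, 65, 65, 65]
person_b = [20, 30, 45, 55, 60, 60, 65, 70, 70, 70]


def summarise(data):
    """Return the four summary statistics used in parts (a) and (b)."""
    return {
        "mean": mean(data),
        "median": median(data),
        # multimode returns a list because a data set can have several modes.
        "mode": multimode(data),
        "range": max(data) - min(data),
    }


summary_a = summarise(person_a)
summary_b = summarise(person_b)
print("Person A:", summary_a)
print("Person B:", summary_b)
```

Keeping the four statistics together in one dictionary per person makes it easy to compare the two people measure by measure.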

(c) Which person is the most consistent with their exercise, and why?

Think: The person who is most consistent will have scores that are closer together. The only measure of spread that we have found here is the range, so we should compare ranges to see who is most consistent.

Do: Person A has a much smaller range than Person B, so we can conclude that Person A is more consistent.
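The range uses only the smallest and largest scores. The standard deviation, mentioned earlier in this lesson as another measure of spread, uses every score. The sketch below is an optional check rather than part of the worked solution, and it assumes the population standard deviation (`statistics.pstdev`), treating the $10$ recorded days as the whole data set.

```python
from statistics import pstdev

# Minutes of exercise per day over 10 days, copied from Example 2.
person_a = [45, 50, 50, 55, 55, 60, 60, 65, 65, 65]
person_b = [20, 30, 45, 55, 60, 60, 65, 70, 70, 70]

# Assumption: population standard deviation. A smaller value means
# the scores sit closer to the mean, i.e. the person is more consistent.
spread_a = pstdev(person_a)
spread_b = pstdev(person_b)

more_consistent = "Person A" if spread_a < spread_b else "Person B"
print(round(spread_a, 2), round(spread_b, 2), more_consistent)
```

The standard deviation for Person A is less than half that for Person B, which agrees with the conclusion drawn from the range.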

(d) Which person seems to train more overall, and why?

Think: To determine who seems to train more overall, we should compare the measures of central tendency of the two sets.

Do: The mode and median for Person B are both larger than for Person A.

While the mean for Person B is slightly lower than for Person A, this is due to the negative skew of Person B's data: the two low scores ($20$ and $30$) pull the mean down.

Overall, the larger mode and median for Person B indicate that they exercise for longer overall.
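One rough way to see the effect of skew is to compare the mean with the median: when a few low scores drag the mean well below the median, the data is likely to be negatively skewed. The sketch below applies this rule of thumb to both people. It is a heuristic check only, not a formal measure of skewness, and the helper name `mean_minus_median` is made up for this illustration.

```python
from statistics import mean, median

# Minutes of exercise per day over 10 days, copied from Example 2.
person_a = [45, 50, 50, 55, 55, 60, 60, 65, 65, 65]
person_b = [20, 30, 45, 55, 60, 60, 65, 70, 70, 70]


def mean_minus_median(data):
    """Rule of thumb: a clearly negative gap suggests negative skew,
    a clearly positive gap suggests positive skew."""
    return mean(data) - median(data)


gap_a = mean_minus_median(person_a)
gap_b = mean_minus_median(person_b)
print("Person A gap:", gap_a)
print("Person B gap:", gap_b)
```

The gap for Person B is much further below zero than the gap for Person A, which is why the median and mode are the fairer measures of how much Person B typically trains.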

 

Practice questions

Question 2

The beaks of two groups of birds are measured, in mm, to determine whether they might be of the same species.

Length of beaks of two groups of birds (in mm)
Group 1   $33$   $39$   $31$   $27$   $22$   $37$   $30$   $24$   $24$   $28$
Group 2   $29$   $44$   $45$   $34$   $31$   $44$   $44$   $33$   $37$   $34$
  1. Calculate the range for Group 1.

  2. Calculate the range for Group 2.

  3. Calculate the mean for Group 1. Give your answer as a decimal.

  4. Calculate the mean for Group 2. Give your answer as a decimal.

  5. Choose the most appropriate statement that describes the two sets of data.

    A. Although the ranges are similar, the mean values are significantly different, indicating that these two groups of birds are of the same species.

    B. Although the ranges are similar, the mean values are significantly different, indicating that these two groups of birds are not of the same species.

    C. Although the mean values are similar, the ranges are significantly different, indicating that these two groups of birds are not of the same species.

    D. Although the mean values are similar, the ranges are significantly different, indicating that these two groups of birds are of the same species.

Question 3

Derek planted some tomato seeds and two seeds sprouted. He puts Seedling A outside in the sun and Seedling B inside in the kitchen. The heights of the seedlings over time, in cm, are graphed below:

[Graph not shown: heights of Seedling A and Seedling B, in cm, by week]

  1. How many centimetres did Seedling A grow in the second week?

  2. Complete the table below showing the increases in growth for both seedlings:

                 Week 1   Week 2   Week 3   Week 4   Week 5   Week 6   Week 7
    Seedling A   $3$ cm   $2$ cm   ___ cm   ___ cm   $2$ cm   ___ cm   ___ cm
    Seedling B   ___ cm   ___ cm   ___ cm   ___ cm   ___ cm   ___ cm   ___ cm
  3. Which seedling appears to be growing more slowly?

    A. Seedling A
    B. Seedling B
  4. Which conditions seem to give optimum growth for a tomato seedling?

    A. Growing inside in the kitchen
    B. Growing outside in the sun

 

Outcomes

MS11-7

develops and carries out simple statistical processes to answer questions posed
