Standard Level

12.08 Compare data sets

Lesson

It is important to be able to compare data sets because it helps us draw conclusions or make judgements about the data. Some sets of data are easily compared directly because they measure similar things, but sometimes we may need to compare data sets that are quite different.

For example, suppose Jim scores $\frac{5}{10}$ in a geography test and $\frac{6}{10}$ in a history test. Based on those marks alone, it makes sense to say that he did better in history. But what if everyone else in his geography class scored $\frac{4}{10}$, while everyone else in his history class scored $\frac{8}{10}$? Now we know that Jim had the highest score in the class in geography, and the lowest score in the class in history. With this extra information, it makes more sense to say that he did better in geography.
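The comparison in Jim's example can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the helper name `difference_from_class` are assumptions made for this sketch, and the marks are the ones quoted above.

```python
# Jim's mark and the mark everyone else in each class scored (all out of 10),
# taken from the example above.
scores = {
    "geography": {"jim": 5, "others": 4},
    "history": {"jim": 6, "others": 8},
}


def difference_from_class(subject):
    """Return how many marks Jim scored above (+) or below (-) his classmates."""
    marks = scores[subject]
    return marks["jim"] - marks["others"]


for subject in scores:
    print(subject, difference_from_class(subject))
# geography 1
# history -2
```

Jim's raw mark is higher in history, but relative to his classmates he is one mark ahead in geography and two marks behind in history.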

By using the measures of central tendency in a data set (that is, the mean, median and mode), as well as measures of spread (such as the range or standard deviation), we can make comparisons between different groups.

Let's look at an example of data that directly compares similar variables.

 

Worked example

Example 1

Study the bar graph below, which shows the changes in tourism rates in different cities during $2011$ and $2012$, then answer the following questions.


(a) Which city had the highest percentage of tourism in $2011$?

Think: The blue columns represent the $2011$ tourism rates.

Do: Paris has the tallest blue column ($80\%$), so Paris had the highest percentage of tourism in $2011$.

(b) Which city had the lowest percentage of tourism in $2012$?

Think: The red columns represent the $2012$ tourism rates.

Do: Rome has the shortest red column (approximately $25\%$), so Rome had the lowest tourism rate in $2012$.

(c) Which city had the highest percentage of tourism in a single year?

Think: The taller the column, the higher the tourism rate.

Do: New York had the highest percentage of tourism in a single year ($90\%$ in $2012$).

(d) Which cities had the lowest percentage of tourism in a single year?

Think: The shorter the column, the lower the tourism rate.

Do: Tokyo and Rome had the lowest tourism rates (Tokyo had a rate of $25\%$ in $2011$ and Rome had the same rate in $2012$).

(e) How much higher is Paris's percentage of tourism in $2011$ than that of London in $2012$?

Think: We want to find the difference between the two percentages.

Do: In $2011$, Paris's rate was $80\%$.

In $2012$, London's rate was also $80\%$.

Paris's $2011$ rate is $0\%$ higher than London's $2012$ rate. In other words, they're exactly the same!

(f) How much higher is Paris's maximum percentage of tourism over the two years than that of Istanbul?

Think: What are the highest rates for both cities?

Do: Paris's highest rate was $80\%$.

Istanbul's highest rate was $60\%$.

So we can see that Paris's highest tourism rate was $20$ percentage points higher than that of Istanbul.

 

Practice question

Question 1

The ages of employees at two competing fast food restaurants on a Saturday night are recorded. Some statistics are given in the following table:

| | Mean | Median | Range |
|---|---|---|---|
| Berger's Burgers | $18$ | $17$ | $6$ |
| Fry's Fries | $18$ | $19$ | $2$ |

  1. If the data for each restaurant was represented using a column graph, what would the likely shape of the column graph for Berger's Burgers be?

    A. Negatively skewed
    B. Symmetrical
    C. Positively skewed
  2. Which restaurant likely has the oldest employee on the night the data is recorded?

    A. Berger's Burgers
    B. Fry's Fries
  3. Which restaurant has the most employees of a similar age?

    A. Berger's Burgers
    B. Fry's Fries
  4. Which statistical piece of evidence supports your answer to part 3?

    A. Median
    B. Mean
    C. Range
  5. Which restaurant has an older workforce?

    A. Fry's Fries
    B. Berger's Burgers
  6. Which statistical piece of evidence supports your answer to part 5?

    A. Mean
    B. Range
    C. Median

 

Considering the shape of data

Calculating the mean, median, mode and range can tell us a lot about a single data set, and these calculations are also very powerful when comparing and contrasting two different data sets.

We can also benefit from examining the shape of the distribution of two sets of data when comparing them.

 

Worked example

Example 2

The number of minutes spent exercising per day for $10$ days is recorded for two people who have just signed up for a new gym membership.

Person A:  $45,50,50,55,55,60,60,65,65,65$

Person B:  $20,30,45,55,60,60,65,70,70,70$

(a) Calculate the mean, median, mode and range for Person A.

Think: The data set is already in order, which makes finding the median and range much easier. To find the mean we sum the scores and divide the total by $10$ (the number of scores). The mode is the most frequently occurring score.

Do: We see that:

mode $=65$

median $=57.5$

mean $=57$

range $=20$

(b) Calculate the mean, median, mode and range for Person B.

Think: We will perform the same calculations as before, but using the second data set. Note that this data set is also already ordered.

Do: We see that:

mode $=70$

median $=60$

mean $=54.5$

range $=50$
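The calculations in parts (a) and (b) can be checked with a short Python sketch using the standard `statistics` module. The helper name `summarise` is an assumption made for this illustration, and `multimode` needs Python 3.8 or later.

```python
from statistics import mean, median, multimode

# Minutes of exercise per day, copied from the example above.
person_a = [45, 50, 50, 55, 55, 60, 60, 65, 65, 65]
person_b = [20, 30, 45, 55, 60, 60, 65, 70, 70, 70]


def summarise(data):
    """Return the mean, median, mode(s) and range of a list of numbers."""
    return {
        "mean": mean(data),
        "median": median(data),
        "modes": multimode(data),  # a list, in case several scores tie
        "range": max(data) - min(data),
    }


print("Person A:", summarise(person_a))
print("Person B:", summarise(person_b))
```

The printed values should agree with the worked answers: Person A has mean $57$, median $57.5$, mode $65$ and range $20$, while Person B has mean $54.5$, median $60$, mode $70$ and range $50$.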

(c) Which person is the most consistent with their exercise, and why?

Think: The person who is most consistent will have scores that are closer together. The only measure of spread that we have found here is the range, so we should compare ranges to see who is most consistent.

Do: Person A has a much smaller range than Person B, so we can conclude that Person A is more consistent.

(d) Which person seems to train more overall, and why?

Think: To determine who seems to train more overall, we should compare the measures of central tendency of the two sets.

Do: The mode and median for Person B are both larger than for Person A.

While the mean for Person B is slightly lower than for Person A, this is due to the negative skew of Person B's data: the two short sessions of $20$ and $30$ minutes pull the mean down.

The larger mode and median for Person B indicate that they exercise for longer overall.

 

Practice questions

Question 2

The beaks of two groups of birds are measured, in mm, to determine whether they might be of the same species.

Length of beaks of two groups of birds (in mm)

Group 1:  $33,39,31,27,22,37,30,24,24,28$

Group 2:  $29,44,45,34,31,44,44,33,37,34$

  1. Calculate the range for Group 1.

  2. Calculate the range for Group 2.

  3. Calculate the mean for Group 1. Give your answer as a decimal.

  4. Calculate the mean for Group 2. Give your answer as a decimal.

  5. Choose the most appropriate statement that describes the set of data.

    A. Although the ranges are similar, the mean values are significantly different, indicating that these two groups of birds are of the same species.
    B. Although the ranges are similar, the mean values are significantly different, indicating that these two groups of birds are not of the same species.
    C. Although the mean values are similar, the ranges are significantly different, indicating that these two groups of birds are not of the same species.
    D. Although the mean values are similar, the ranges are significantly different, indicating that these two groups of birds are of the same species.

Question 3

The box plots drawn below show the number of repetitions of a $70$ kg bar that two weightlifters can lift. They both record their repetitions over $30$ days.

  1. Which weightlifter has the more consistent results?

    A. Weightlifter A.
    B. Weightlifter B.
  2. What statistical evidence supports your answer?

    A. The mean.
    B. The range.
    C. The mode.
    D. The graph is positively skewed.
  3. Which statistic is the same for each weightlifter?

    A. The median.
    B. The mean.
    C. The mode.
  4. Which weightlifter can do the most repetitions of $70$ kg?

    A. Weightlifter A.
    B. Weightlifter B.

Question 4

Derek planted some tomato seeds and two seeds sprouted. He puts Seedling A outside in the sun and Seedling B inside in the kitchen. The heights of the seedlings over time, in cm, are graphed below:

  1. How many centimetres did Seedling A grow in the second week?

  2. Complete the table below showing the increases in growth for both seedlings:

    | | Week 1 | Week 2 | Week 3 | Week 4 | Week 5 | Week 6 | Week 7 |
    |---|---|---|---|---|---|---|---|
    | Seedling A | $3$ cm | $2$ cm | ____ cm | ____ cm | $2$ cm | ____ cm | ____ cm |
    | Seedling B | ____ cm | ____ cm | ____ cm | ____ cm | ____ cm | ____ cm | ____ cm |

  3. Which seedling appears to be growing more slowly?

    A. Seedling A.
    B. Seedling B.
  4. Which conditions seem to give optimum growth for a tomato seedling?

    A. Growing inside in the kitchen.
    B. Growing outside in the sun.

 

Parallel box plots

Parallel box plots are used to compare two sets of data visually. When comparing box plots, the values in the five number summary are the key features to compare. Remember that the five number summary includes:

  • the lowest data point
  • $Q_1$Q1
  • the median
  • $Q_3$Q3
  • the highest data point
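As a rough illustration, the five number summary can be computed with a short Python function. The function name `five_number_summary` is an assumption made for this sketch, and so is the quartile rule: here $Q_1$ and $Q_3$ are taken as the medians of the lower and upper halves of the ordered data, leaving out the middle score when the number of scores is odd. Other quartile conventions exist and can give slightly different values.

```python
from statistics import median


def five_number_summary(data):
    """Return (lowest, Q1, median, Q3, highest) for at least two scores.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    excluding the middle score when the number of scores is odd.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Skip the middle score for an odd number of scores.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


# Illustrative data: Person A's exercise minutes from Example 2.
person_a = [45, 50, 50, 55, 55, 60, 60, 65, 65, 65]
print(five_number_summary(person_a))  # (45, 50, 57.5, 65, 65)
```

For this data the upper quartile equals the highest score, so the box plot would have no upper whisker.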

Just like when we look at back-to-back stem plots, we can compare the spread of data in two box plots. We call these parallel box plots because they are drawn parallel to each other along the same number line. They must therefore use the same scale so that a visual comparison is straightforward.

It is important to clearly label each box plot. Here we have plotted two sets of data, comparing the time it took two different groups of people to complete an online task. 

Time taken to complete an online task

 

Notice that overall the under $30$s were faster at completing the task. Both the under $30$s box plot and the over $30$s box plot are slightly negatively skewed. Over $75\%$ of the under $30$s completed the task in under $22$ seconds, which is the median time taken by the over $30$s. $100\%$ of the under $30$s had finished the task before $75\%$ of the over $30$s had completed it.
Overall the under $30$s performed better and had a smaller spread of scores. The over $30$s group was more variable, with a range of $24$ seconds compared to $20$ seconds for the under $30$s.

 

Key comparisons

There are many things to keep in mind when comparing two sets of data, and a good place to start is often to compare the information in the five number summaries for each set. A few of the most important questions to ask yourself are:

  • How do the spreads of data compare?
  • How do the skews compare? Is one set of data more symmetrical? 
  • Is there a big difference in the medians?
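These questions can be explored numerically once the five number summaries are known. The sketch below is only an illustration: the helper name `describe` is made up for this example, and the two summaries are those of Person A and Person B from Example 2, assuming $Q_1$ and $Q_3$ are the medians of the lower and upper halves of each ordered data set. Comparing the distance from the median to each end is a rough hint about skew, not a formal test.

```python
def describe(summary):
    """Pull comparison figures out of (lowest, Q1, median, Q3, highest)."""
    lowest, q1, med, q3, highest = summary
    return {
        "median": med,
        "range": highest - lowest,
        "iqr": q3 - q1,
        # A longer stretch below the median than above hints at negative skew.
        "below_median": med - lowest,
        "above_median": highest - med,
    }


# Five number summaries derived from the Example 2 data (assumed quartile rule).
person_a = (45, 50, 57.5, 65, 65)
person_b = (20, 45, 60, 70, 70)

for name, summary in [("Person A", person_a), ("Person B", person_b)]:
    print(name, describe(summary))
```

Under these assumptions Person B has the larger range ($50$ compared to $20$) and interquartile range ($25$ compared to $15$), a slightly higher median, and a much longer stretch of data below the median than above it.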

 

Worked example

Example 3

The box plots show the distances, in centimetres, jumped by two high jumpers.

(a) Who has a higher median jump?

Think: The median is shown by the line in the middle of the box. Whose median line has a higher value?

Do: The middle line for John is at $120$ cm, while the middle line for Bill is at $110$ cm. So John has a higher median jump.

(b) Who made the highest jump?

Think: The highest jump is the value furthest to the right for each person.

Do: Notice that Bill doesn't have an upper whisker, so his highest jump was $120$ cm, the same as his upper quartile height. On the other hand, John's highest jump was $150$ cm, so John had the highest jump overall.

(c) Who made the lowest jump?

Think: The lowest jump is the value furthest to the left for each person.

Do: Both John and Bill had a lowest jump of $60$ cm.

 

Practice questions

Question 5

The box plots show the monthly profits (in thousands of dollars) of two financial traders over a year.

[Parallel box plots for Ned and Tobias, each drawn on an axis from 5 to 60 (monthly profit in thousands of dollars)]

  1. Who made a higher median monthly profit?

    A. Ned
    B. Tobias
  2. Whose profits had a higher interquartile range?

    A. Tobias
    B. Ned
  3. Whose profits had a higher range?

    A. Ned
    B. Tobias
  4. How much more did Ned make in his most profitable month than Tobias did in his most profitable month?

Question 6

The two box plots below show the data collected by the manufacturers on the life-span of light bulbs, measured in thousands of hours.

  1. Complete the following table using the two box plots. Write each answer in terms of hours.

    | | Manufacturer A | Manufacturer B |
    |---|---|---|
    | Median | ____ | $5000$ |
    | Lower Quartile | ____ | ____ |
    | Upper Quartile | $4500$ | ____ |
    | Range | ____ | $6500$ |
    | Interquartile Range | ____ | ____ |

  2. Which manufacturer produces light bulbs with the best lifespan?

    A. Manufacturer A.
    B. Manufacturer B.

Question 7

The box plots below represent the daily sales made by Carl and Angelina over the course of one month.

[Parallel box plots of Angelina's Sales and Carl's Sales, each drawn on an axis from 0 to 70]

  1. What is the range in Angelina's sales?

  2. What is the range in Carl’s sales?

  3. By how much did Carl’s median sales exceed Angelina's?

  4. Considering the middle $50\%$ of sales for both salespeople, whose sales were more consistent?

    A. Carl
    B. Angelina
  5. Which salesperson had a more successful sales month?

    A. Angelina
    B. Carl

We have seen that parallel box plots are a useful way to compare numerical data sets, as they show the two sets on the same scale. We can create a similar comparison by creating a back-to-back stem plot. Comparing the shape of histograms also provides a visual way to compare two or more sets of data.

 
