
6.05 Comparisons of data sets

Lesson

Comparisons using summary statistics and frequency displays

Comparing data sets helps us draw conclusions and make judgments about the data. For example, say Jim got $\frac{5}{10}$ in a geography test and $\frac{6}{10}$ in a history test. Which test did he do better in? Based on those marks alone, it makes sense to say he did better in history.

But what if everyone else in his class got $\frac{4}{10}$ in geography and $\frac{8}{10}$ in history? If Jim had the greatest score in the class in geography and the least score in the class in history, does it really make sense to say he did better in history?

By using measures of center and measures of spread, we can make comparisons between different groups. It also helps to examine the shape of each distribution when comparing two sets of data.
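As a rough illustration of comparing center and spread, the sketch below summarizes two groups with Python's standard `statistics` module. The scores and the helper name `summarize` are made up for this example and are not taken from the lesson.

```python
from statistics import mean, median


def summarize(data):
    """Return the mean, median and range of a list of numbers."""
    return {
        "mean": mean(data),
        "median": median(data),
        "range": max(data) - min(data),
    }


# Hypothetical scores out of 10 for two groups (illustrative only).
group_a = [4, 5, 5, 6, 10]
group_b = [6, 6, 6, 6, 6]

summary_a = summarize(group_a)
summary_b = summarize(group_b)

print("Group A:", summary_a)
print("Group B:", summary_b)
```

In this made-up data both groups have the same mean, but the ranges differ, so a measure of center alone would hide how differently the scores are spread.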

Practice questions

Question 1

The weights (in kilograms) of a group of men and a group of women were recorded and presented in the stem-and-leaf plots shown.

Men
Stem | Leaf
$6$ | $5$ $6$ $7$ $8$ $9$ $5$ $6$ $7$ $8$ $9$
$7$ | $0$ $1$ $1$ $3$ $3$ $4$ $8$ $9$ $9$ $9$

Women
Stem | Leaf
$5$ | $1$ $2$ $2$ $2$ $3$ $4$ $6$ $7$ $8$ $9$ $9$ $9$ $9$
$6$ | $0$ $0$ $1$ $2$ $2$ $3$ $4$

Key: $4 \mid 2 = 42$ kg
  1. What is the mean weight of the group of men? Express your answer in decimal form.

  2. What is the mean weight of the group of women? Express your answer in decimal form.

  3. Which group is heavier?

    A. Women
    B. Men

Question 2

Marge grows two different types of bean plants. She records the number of beans that she picks from each plant each day for $10$ days. Her records show:

Plant $A$: $10, 4, 4, 5, 7, 10, 3, 3, 9, 10$

Plant $B$: $8, 7, 5, 5, 9, 7, 8, 7, 5, 6$

  1. What is the mean number of beans picked per day for Plant $A$? Give your answer to one decimal place if needed.

  2. What is the mean number of beans picked per day for Plant $B$?

  3. What is the range for Plant $A$?

  4. What is the range for Plant $B$?

  5. Which plant produces more beans on average?

    A. Plant B
    B. Plant A
  6. Which plant has a more consistent yield of beans?

    A. Plant A
    B. Plant B

Question 3

Two English classes, each with $15$ students, sit a $10$-question multiple-choice test in which every question has four possible answers (only one of which is correct). Their class results, out of $10$, are below:

Scores out of $10$
Class 1: $3, 2, 3, 3, 4, 5, 1, 1, 1, 4, 2, 2, 3, 3, 2$
Class 2: $8, 9, 9, 8, 8, 6, 8, 10, 6, 8, 8, 9, 6, 9, 9$
  1. Calculate the mean (correct to one decimal place), median, mode and range for Class 1:

    Mean = ____
    Median = ____
    Mode = ____
    Range = ____
  2. Calculate the mean (correct to one decimal place), median, mode and range for Class 2:

    Mean = ____
    Median = ____
    Mode = ____
    Range = ____
  3. Which class was more likely to have studied for their test?

    A. Class 1
    B. Class 2
  4. Which statistical pieces of evidence support your answer? Choose all appropriate answers.

    A. The mean.
    B. The mode.
    C. The range.
    D. The median.

 

Using parallel box plots

Parallel box plots are used to compare two sets of data visually. When comparing box plots, the $5$ key data points are the important parts to compare. This five-number summary gives you:

  • the least data point
  • the lower quartile
  • the median
  • the upper quartile and
  • the greatest data point
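The five values in this list can be computed directly from raw data. The sketch below is one way to do it in Python. It assumes the quartiles are the medians of the lower and upper halves of the sorted data, leaving out the middle value when the count is odd. Other quartile conventions exist and can give slightly different results. The function name and the sample times are hypothetical.

```python
from statistics import median


def five_number_summary(data):
    """Return (least, lower quartile, median, upper quartile, greatest).

    Assumption: each quartile is the median of one half of the sorted
    data, with the middle value left out when the count is odd.
    Needs at least two data points.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower_half = values[:half]
    # Skip the middle value for an odd count so both halves are equal in size.
    upper_half = values[half + 1:] if n % 2 else values[half:]
    return (
        values[0],
        median(lower_half),
        median(values),
        median(upper_half),
        values[-1],
    )


# Hypothetical completion times in seconds (illustrative only).
times = [12, 15, 16, 18, 19, 21, 22, 25, 30]
summary = five_number_summary(times)
print(summary)
```

A box plot drawn from this summary would have whiskers ending at the first and last values, the box edges at the two quartiles, and the line inside the box at the median.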

Just as with back-to-back stem-and-leaf plots, we can compare the spread of data in two box plots. We call these parallel box plots because they are drawn parallel to each other along the same number line. They must therefore be on the same scale, which makes a visual comparison fairly straightforward.

It is important to clearly label each box plot. Here we have plotted two sets of data, comparing the time it took two different groups of people to complete an online task. 

 

You can see that with a lower median of $18$ seconds compared to $22$ seconds, the under $30$s were faster overall at completing the task. Both the under $30$s box plot and the over $30$s box plot are slightly negatively skewed. Over $75\%$ of the under $30$s completed the task in under $22$ seconds, which is the median time taken by the over $30$s. $100\%$ of the under $30$s had finished the task before $75\%$ of the over $30$s had completed it.
Overall the under $30$s performed better and had a smaller spread of times. There was a larger spread within the over $30$ group, with an IQR of $11$ seconds compared to $8$ seconds for the under $30$s.

Key Comparisons

When comparing two sets of data, we can compare the $5$ key points, as shown above. There are key questions you should ask yourself:

  • How do the spreads of data compare?
  • How do the skews compare? Is one set of data more symmetrical?
  • Is there a big difference in the medians?
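The questions about spread and medians can be answered numerically once the quartiles are known. The sketch below is a minimal example in Python. The helper names `quartiles` and `compare` and both data sets are hypothetical, and the quartiles are assumed to be the medians of the lower and upper halves of the sorted data. It does not attempt to judge skew, which is easier to read from the plots themselves.

```python
from statistics import median


def quartiles(data):
    """Return (lower quartile, median, upper quartile).

    Assumption: each quartile is the median of one half of the sorted
    data, with the middle value left out when the count is odd.
    """
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]
    upper = values[half + 1:] if len(values) % 2 else values[half:]
    return median(lower), median(values), median(upper)


def compare(name_a, data_a, name_b, data_b):
    """Collect the median, interquartile range and range of two data sets."""
    report = {}
    for name, data in ((name_a, data_a), (name_b, data_b)):
        q1, med, q3 = quartiles(data)
        report[name] = {
            "median": med,
            "iqr": q3 - q1,
            "range": max(data) - min(data),
        }
    return report


# Hypothetical data sets (illustrative only).
group_1 = [10, 12, 14, 15, 17, 18, 20, 21]
group_2 = [8, 13, 16, 19, 22, 25, 29, 34]

report = compare("group 1", group_1, "group 2", group_2)
for name, stats in report.items():
    print(name, stats)
```

Here the second made-up group has both a higher median and a larger interquartile range, so its values are typically greater but less consistent.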

Worked example

The box plots show the heights, in centimeters, jumped by two high jumpers.

a) Who has a higher median jump?

Think: The median is shown by the line in the middle of the box. Whose median line has a higher value?

Do: John

b) Who made the greatest jump?

Think: The greatest jump is the end of the whisker for each jumper. Bill doesn't have an upper whisker as his greatest jump was the same as the upper quartile height. Whose jump was the greatest?

Do: John 

c) Who made the least jump?

Think: The least jump is shown on each box plot by the lower whisker. 

Do: Both John and Bill had a least jump of $60$ cm.

Practice questions

Question 4

The box plots show the monthly profits (in thousands of dollars) of two derivatives traders over a year.

[Figure: parallel box plots of monthly profits for Ned and Tobias, each on an axis from 5 to 60 in steps of 5]

  1. Who made a higher median monthly profit?

    A. Ned
    B. Tobias
  2. Whose profits had a higher interquartile range?

    A. Tobias
    B. Ned
  3. Whose profits had a higher range?

    A. Ned
    B. Tobias
  4. How much more did Ned make in his most profitable month than Tobias did in his most profitable month?

Question 5

The two box plots below show the data collected by two manufacturers on the lifespan of their light bulbs, measured in thousands of hours.

  1. Complete the following table using the two box plots. Write each answer in terms of hours.

                           Manufacturer A   Manufacturer B
    Median                 ____             ____
    Lower Quartile         ____             ____
    Upper Quartile         ____             ____
    Range                  ____             ____
    Interquartile Range    ____             ____
  2. Hence, which manufacturer produces light bulbs with the best lifespan?

    A. Manufacturer A.
    B. Manufacturer B.

Question 6

The box plots below represent the daily sales made by Carl and Angelina over the course of one month.

[Figure: parallel box plots of Angelina's Sales and Carl's Sales, each on an axis from 0 to 70 in steps of 10]
  1. What is the range in Angelina's sales?

  2. What is the range in Carl’s sales?

  3. By how much did Carl’s median sales exceed Angelina's?

  4. Considering the middle $50\%$ of sales for both salespeople, whose sales were more consistent?

    A. Carl
    B. Angelina
  5. Which salesperson had a more successful sales month?

    A. Angelina
    B. Carl

Outcomes

S-ID.1

Represent data with plots on the real number line (dot plots, histograms, and box plots).

S-ID.2

Use statistics appropriate to the shape of the data distribution to compare center (median, mean) and spread (interquartile range, standard deviation) of two or more different data sets.

S-ID.3

Interpret differences in shape, center, and spread in the context of the data sets, accounting for possible effects of extreme data points (outliers).
