topic badge

7.07 Comparing data sets

Lesson

At the start of this chapter we explored the statistical investigation process in which we seek to answer a statistical question. Often we need to make comparisons between different data sets to answer the questions, such as when we need to make a choice between two or more groups (or populations).

For example, if we are looking for the best tree to plant in the school courtyard for shade, we might compare two similar species of tree to determine "Which species of tree grows the fastest?". This is a statistical question and we can collect data by measuring the growth rates of many trees in a nursery.

If it turns out that all of the individual trees of one species grow faster than all of the individuals in the other species then it will be an easy decision. However, that is not always the case–each individual tree grows at a different rate, and there can be overlap between the measurements from both populations.

This is when we need to use statistical methods to analyse and compare data sets.

By comparing the means of central tendency in a data set (that is, the mean, median and mode), as well as measures of spread (range, interquartile range and standard deviation), we can make comparisons between different groups and draw conclusions about our data.

Practice questions

Question 1

Marge grows two different types of bean plants. She records the number of beans that she picks from each plant for $10$10 days. Her records show:

Plant $A$A: $10,4,4,5,7,10,3,3,9,10$10,4,4,5,7,10,3,3,9,10

Plant $B$B: $8,7,5,5,9,7,8,7,5,6$8,7,5,5,9,7,8,7,5,6

  1. What is the mean number of beans picked per day for Plant $A$A? Leave your answer to one decimal place if needed.

  2. What is the mean number of beans picked per day for Plant $B$B?

  3. What is the range for Plant $A$A?

  4. What is the range for Plant $B$B?

  5. Which plant produces more beans on average?

    Plant B

    A

    Plant A

    B
  6. Which plant has a more consistent yield of beans?

    Plant A

    A

    Plant B

    B

Question 2

The residents of two blocks of townhouses were asked the number of pets they own. The frequency of various responses are presented in the dot plots.

According to the data, which of the following statements are true?

  1. Pet ownership is a little lower in block $A$A.

    True

    A

    False

    B
  2. In block $A$A, most households have zero or one pet.

    True

    A

    False

    B
  3. In block $B$B, most households have three or more pets.

    True

    A

    False

    B
  4. In block $A$A, pet ownership is skewed negatively.

    True

    A

    False

    B
  5. Pet ownership ranges from $0$0 to $3$3 pets in block $A$A.

    True

    A

    False

    B
  6. There is more variability in the block $B$B distribution.

    True

    A

    False

    B
  7. Both sets of scores have an outlier.

    True

    A

    False

    B

Question 3

Student X scored $86,83,86,88,98$86,83,86,88,98 and

Student Y scored $61,83,50,85,83$61,83,50,85,83 across 5 exams.

  1. Find the mean score of Student X, writing your answer as a decimal.

  2. Find the mean score of Student Y

  3. Find the standard deviation of the scores for Student X, correct to two decimal places.

  4. Find the standard deviation of the scores for Student Y, correct to two decimal places.

  5. Which student performed better?

    Student X

    A

    Student Y

    B
  6. Which student performed more consistently?

    Student X

    A

    Student Y

    B

 

Comparing histograms

Histograms, and similar graphs (such as column graphs, dot plots and stem and leaf plots) are popular ways to display data because they give a detailed picture of the distribution of data.

However, with histograms it is not always easy to compare the particular statistical values for our data sets. The numeric characteristics that we can readily see in a histogram are limited to following:

  • mode (or modal class)
  • minimum and maximum values
  • spread (indicated by the range)

If the histogram is constructed for grouped data, the minimum, maximum and range can only be estimated because we don't know exactly which values are represented within each class interval.

Furthermore, it is not easy to identify the median or mean or interquartile range from a histogram by inspection, although often it is possible to see approximately where these values would lie. When necessary, we can calculate an estimated value for the mean and standard deviation of the data represented by a histogram.

On the other hand, the histogram does provide excellent insight into non-numeric characteristics of the data which can be important for comparison, including:

  • symmetry and skew
  • modality, including the location and frequency of the mode(s) or modal class(es)
  • size and location of clusters
  • gaps, size and location of gaps
 

Key comparisons for histograms

Many of the comparisons that are described here for histograms can also be used with similar statistical graphs such as column graphs, dot plots, stem and leaf plots and even frequency tables.

If outliers are identified in the histogram, then it is important that these are mentioned in the comparison, with an explanation of how they are handled. In particular, we should state if outliers are included or excluded from comparisons, and the effect that this has on our analysis.

Exploration

Consider the following histograms that show the height of students in two basketball teams. We know that one graph represents a team made up of Year $12$12 students and the other represents a Year $8$8 team. Which one corresponds to the year 12 team?

Team $A$A

Team $B$B

We can still compare these distributions, even though there is clearly a different number of students in the teams, because we are only interested in the shape and location of the data.

In our comparison we want to mention the most significant differences, and also describe relevant characteristics that are the same, or similar.

In this case we can observe these important similarities and differences:

  • both distributions are approximately symmetrical, and uni-modal.
  • the modal class for Team A of $170-175$170175 cm is much higher than the modal class $150-155$150155 cm for Team B.
  • if we ignore the outlier values in the $195-200$195200 class for Team A, then the range of both distributions is similar, at $35$35 cm for Team A and $30$30 cm for Team B.
  • student heights have greater spread overall for Team A
  • the heights for Team A appear to be concentrated around the modal class so we can say that the a clustered at $170-180$170180 cm.

Based on these observations, we could confidently say that Team A is the team of Year $12$12 students. In this case, the decision is clear because of the difference in the height for the modal class, which we would expect to be significantly higher for the older students.

 

Connecting column graphs and box plots

We have already seen how data can be displayed in histograms and in box plots. These two displays are great for being able to identify key features of the shape of the data, as well as the range and in the case of the box plot the inter-quartile range and the median.

We should expect then that the shape of the data would be the same whether it is represented in a polygon, box plot or histogram. Remember that the shape of data can be symmetric, left skewed or right skewed.

Symmetric

 

 

Positive (right) skewed

 

Negative (left) skewed

Looking at the diagrams above, we can see the similarities in the representations.

We can see the skewed tails, where the bulk of the data sits and general shape. These are some of the features we can use to match column graphs and box-and-whisker plots. We can also look at the data range.

 

Worked example

example 1

Match the box plots and histograms together.

To identify matching data start by identifying tails (left or right) and symmetric type data.

  • We can see, that A and 3 have right tails, and thus are both right skewed. So they are a match.
  • C and 2 have left tails, and thus are both left skewed and so are a match.
  • Which leaves B and 1, which are both symmetric data.

 

Practice questions

question 4

Match the column graph shown here to the correct box plot.

510102030405060708090

A bar graph is shown with x-axis ranging from 0 to 90 marked in intervals of 10, and y-axis ranging from 0 to 10 marked in major intervals of 5 and minor intervals of 1. At x=10, the height of the bar is 3. At x=20, the height of the bar is 7. At x=30, the height of the bar is 9. At x=40, the height of the bar is 3. At x=50, the height of the bar is 2. At x=60, the height of the bar is 0. At x=70, the height of the bar is 1. At x=80, the height of the bar is 1. At x=90, the height of the bar is 1.
  1. 10
    20
    30
    40
    50
    60
    70
    80
    90

    A box plot is shown with horizontal axis ranging from 10 to 90 marked in major intervals of 10 and minor intervals of 5. The left whisker starts at 30, the left side of the box is at 40, the vertical line inside the box is at 50, the right side of the box is at 60, and the right whisker ends at 70.
    A

    10
    20
    30
    40
    50
    60
    70
    80
    90

    A box plot is shown with horizontal axis ranging from 10 to 90 marked in major intervals of 10 and minor intervals of 5. The left whisker starts at 10, the left side of the box is at 20, the vertical line inside the box is at 30, the right side of the box is at 40, and the right whisker ends at 90.
    B

    10
    20
    30
    40
    50
    60
    70
    80
    90

    A box plot is shown with horizontal axis ranging from 10 to 90 marked in major intervals of 10 and minor intervals of 5. The left whisker starts at 10, the left side of the box is at 60, the vertical line inside the box is at 70, the right side of the box is at 80, and the right whisker ends at 90.
    C

    10
    20
    30
    40
    50
    60
    70
    80
    90

    A box plot is shown with horizontal axis ranging from 10 to 90 marked in major intervals of 10 and minor intervals of 5. The left whisker starts at 10, the left side of the box is at 20, the vertical line inside the box is at 50, the right side of the box is at 80, and the right whisker ends at 90.
    D

question 5

Match the box plot shown to the correct column graph.

10
20
30
40
50
60
70
80
90

  1. 510102030405060708090

    A

    510102030405060708090

    B

    510102030405060708090

    C

    510102030405060708090

    D

question 6

Consider the following pairs of histograms and box plots:

  1. Which two of these histograms and box plots are correctly paired?

    A
    B
    C
    D
  2. In part (a) we determined that the following histogram/box plot were an incorrect match:

    Which two of the options correctly describe why?

    The box plot has a long tail to the right which indicates positive skew, while the histogram does not appear to be skewed.

    A

    The data on the histogram is widely spread, while the box plot indicates that the data is mostly located around the median.

    B

    The median for the histogram is roughly in the middle, while the median of the box plot is located further to the left.

    C

 

Comparing parallel box plots

Parallel box plots are used to compare two (or more) sets of data visually. When comparing box plots, the $5$5 key numbers are going to be the important parts to consider. The $5$5 number summary will give us the:

  • minimum value
  • lower quartile
  • median
  • upper quartile
  • maximum value

We can also derive other statistics such as the range and inter-quartile range, and make visual observations of symmetry and skew that we can consider in our comparison.

The term parallel is used because the box plots are presented parallel to each other along the same number line for comparison. They must therefore be in the same scale, so a visual comparison is fairly straightforward.

Exploration

Here we have two sets of data, comparing the time it took two different groups of people to complete an online task. It is important to clearly label each box plot.

If we want to choose the best group to complete the task, based only on time (in real life other factors, such as accuracy might be more important), we could consider the following observations:

Note that lower numbers mean that the task was completed faster, so lower is better.

  • the minimum, lower quartile, median, upper quartile and maximum were all lower for the under $30$30s group;
  • the range is lower for the under $30$30s ($20$20 seconds) than the over $30$30s ($24$24 seconds);
  • the interquartile range is lower for the under $30$30s ($8$8 seconds) than the over $30$30s ($9$9 seconds);
  • at least $75%$75% of the under $30$30s completed the task in under $22$22 seconds, which is the median time for the over $30$30s;
  • $100%$100% of the under $30$30s completed the task before the slowest $25%$25% of the over $30$30s.

Every one of these measures is in favour of the under $30$30s so, overall, we can conclude that the under $30$30s performed better.

 

Key comparisons for box plots

When comparing two sets of data we can compare the $5$5 key points as shown above. There are key questions we should ask:

  • How do the spreads of data compare?
  • How do the skews compare? Is one set of data more symmetrical?
  • Is there a big difference in the medians?
  • Can we see regions on one boxplot that extend past the comparable region on the other?

We should always consider what factors are more important for the given situation. In some cases we might be able to make judgements by simply comparing the median value; sometimes the minimum or the maximum value is the critical measurement. In other situations the consistency will be more important than extreme values so we should consider measures of spread to make judgements.

If outliers are identified in the box plots, then it is important that these are mentioned in the comparison, with an explanation of how they are handled. That is, we should state if outliers are included or excluded from comparisons, and the effect that this has.

Exploration

The box plots show the distances, in centimetres, jumped by two high jumpers.

From these box plots, we can make comparison statements such as:

  • Both Bill and Jim have the same minimum jump of $60$60 cm.
  • The median height for John's jump is equal to the maximum height for Bill. So the best $50%$50% of John's jumps are higher than Bill's best jump.
  • Bill is more consistent than John, with the top $50%$50% of Bill's jumps limited to an interval of just $10$10 cm, compared to an interval of $30$30 cm for the top $50%$50% of John's jumps. Furthermore, the range of $60$60 cm for Bill compared to $90$90 cm for John. Both Bill and John have the same interquartile range of $60$60 cm.

Based on this comparison, if we had to choose one of these high jumpers for the school athletics team, we would most likely choose John. In this case, the maximum height that John can achieve is more important than the consistency (but lower height) of Bill's jumps.

 

QUESTION 7

The box plots show the monthly profits (in thousands of dollars) of two derivatives traders over a year.

Ned

5
10
15
20
25
30
35
40
45
50
55
60

Tobias

5
10
15
20
25
30
35
40
45
50
55
60

  1. Who made a higher median monthly profit?

    Ned

    A

    Tobias

    B
  2. Whose profits had a higher interquartile range?

    Tobias

    A

    Ned

    B
  3. Whose profits had a higher range?

    Ned

    A

    Tobias

    B
  4. How much more did Ned make in his most profitable month than Tobias did in his most profitable month?

QUESTION 8

The two box plots below show the data collected by the manufacturers on the life-span of light bulbs, measured in thousands of hours.

  1. Complete the following table using the two box plots. Write each answer in terms of hours.

      Manufacturer A Manufacturer B
    Median $\editable{}$ $5000$5000
    Lower Quartile $\editable{}$ $\editable{}$
    Upper Quartile $4500$4500 $\editable{}$
    Range $\editable{}$ $6500$6500
    Interquartile Range $\editable{}$ $\editable{}$
  2. Hence, which manufacturer produces light bulbs with the best lifespan?

    Manufacturer A.

    A

    Manufacturer B.

    B

QUESTION 9

The box plots below represent the daily sales made by Carl and Angelina over the course of one month.

0
10
20
30
40
50
60
70
Angelina's Sales
0
10
20
30
40
50
60
70
Carl's Sales
  1. What is the range in Angelina's sales?

  2. What is the range in Carl’s sales?

  3. By how much did Carl’s median sales exceed Angelina's?

  4. Considering the middle $50%$50% of sales for both sales people, whose sales were more consistent?

    Carl

    A

    Angelina

    B
  5. Which salesperson had a more successful sales month?

    Angelina

    A

    Carl

    B

Outcomes

2.1.11

compare groups on a single numerical variable using medians, means, IQRs, ranges or standard deviations, and as appropriate; interpret the differences observed in the context of the data and report the findings in a systematic and concise manner

What is Mathspace

About Mathspace