VCE 11 General 2023

1.09 Compare sets of data

Lesson

Compare data sets

Often, it is necessary to compare different data sets in order to answer a question or to choose between two or more groups (or populations).

For example, if looking for the best tree to plant in the school courtyard for shade, it might be useful to compare two similar species of tree to determine "Which species of tree grows the fastest?". This is a statistical question and data can be collected by measuring the growth rates of many trees in a nursery.

If it turns out that all of the individual trees of one species grow faster than all of the individuals in the other species then it will be an easy decision. However, that is not always the case - each individual tree grows at a different rate, and there can be overlap between the measurements from both populations.

This is when statistical methods can be used to analyse and compare data sets.

By comparing the measures of central tendency in a data set (that is, the mean, median and mode), as well as measures of spread (range, interquartile range and standard deviation), it is possible to make comparisons between different groups and draw conclusions about the data.

Examples

Example 1

Student X scored 86, 83, 86, 88, 98 and Student Y scored 61, 83, 50, 85, 83 across 5 exams.

a

Find the mean score of Student X, writing your answer as a decimal.

Worked Solution
Create a strategy

To find the mean, use the formula: \text{Mean}=\dfrac{\text{Sum of scores}}{\text{Number of scores}}

Apply the idea

To calculate the mean we add all the scores of Student X and divide by 5.

\begin{aligned}
\text{Mean} &= \dfrac{86+83+86+88+98}{5} \\
&= 88.2
\end{aligned}
b

Find the mean score of Student Y, writing your answer as a decimal.

Worked Solution
Create a strategy

To find the mean, use the formula: \text{Mean}=\dfrac{\text{Sum of scores}}{\text{Number of scores}}

Apply the idea

To calculate the mean we add all the scores of Student Y and divide by 5.

\begin{aligned}
\text{Mean} &= \dfrac{61+83+50+85+83}{5} \\
&= 72.4
\end{aligned}
c

Find the standard deviation of the scores for Student X, correct to two decimal places.

Worked Solution
Create a strategy

Use your calculator in Statistics mode to find the standard deviation.

Apply the idea

\text{Standard deviation}=5.15
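
The calculator result can be checked by hand. The working below is a sketch that assumes the population standard deviation (dividing by the number of scores, 5), which is the version that reproduces 5.15. The symbol \sigma is notation introduced here for illustration and does not appear in the lesson.

```latex
\begin{aligned}
\text{Deviations from the mean } 88.2 &: \ -2.2,\ -5.2,\ -2.2,\ -0.2,\ 9.8 \\
\text{Sum of squared deviations} &= 4.84 + 27.04 + 4.84 + 0.04 + 96.04 = 132.8 \\
\sigma &= \sqrt{\dfrac{132.8}{5}} = \sqrt{26.56} \approx 5.15
\end{aligned}
```

Dividing by 4 instead of 5 gives the sample standard deviation, \sqrt{33.2} \approx 5.76, so it is worth checking which of the two values the calculator is reporting.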

d

Find the standard deviation of the scores for Student Y, correct to two decimal places.

Worked Solution
Create a strategy

Use your calculator in Statistics mode to find the standard deviation.

Apply the idea

\text{Standard deviation}=14.25

e

Which student performed better?

A
Student Y
B
Student X
Worked Solution
Create a strategy

Choose the student with the higher mean score.

Apply the idea

Student X performed better because their mean score of 88.2 is higher than Student Y's mean score of 72.4. So, the correct answer is Option B.

f

Which student performed more consistently?

A
Student X
B
Student Y
Worked Solution
Create a strategy

Choose the student with the lower standard deviation.

Apply the idea

Student X performed more consistently because their standard deviation of 5.15 is lower than Student Y's 14.25, which means that their scores are more closely clustered together. So, the correct answer is Option A.

Idea summary

When comparing data sets:

  • To determine which data set scored higher, we compare the measures of centre.

  • To determine which data set is more consistent, we compare the measures of spread.
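
The comparison in Example 1 can also be reproduced in a few lines of Python. This is a minimal sketch using the standard library's `statistics` module. It assumes the population standard deviation (`pstdev`), since that is the version that matches the values 5.15 and 14.25 in the example. The variable names are illustrative only.

```python
from statistics import mean, pstdev

# Exam scores from Example 1.
scores = {
    "Student X": [86, 83, 86, 88, 98],
    "Student Y": [61, 83, 50, 85, 83],
}

# Measure of centre (mean) and measure of spread (population standard
# deviation, an assumption chosen to match the example) for each student.
summary = {
    name: (round(mean(values), 2), round(pstdev(values), 2))
    for name, values in scores.items()
}

for name, (centre, spread) in summary.items():
    print(f"{name}: mean = {centre}, standard deviation = {spread}")

# Higher centre -> scored higher; lower spread -> more consistent.
higher_scorer = max(summary, key=lambda name: summary[name][0])
more_consistent = min(summary, key=lambda name: summary[name][1])
print("Higher mean:", higher_scorer)
print("More consistent:", more_consistent)
```

Swapping `pstdev` for `stdev` gives the sample standard deviation instead. The values change, but Student X still has the smaller spread.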

Compare histograms

Histograms and similar graphs (such as column graphs, dot plots and stem and leaf plots) are popular ways to display data because they give a detailed picture of the distribution of the data.

However, with histograms it is not always easy to compare the particular statistical values for our data sets. The numeric characteristics most easily identifiable in a histogram are limited to the following:

  • mode (or modal class)

  • minimum and maximum values

  • spread (indicated by the range)

Since histograms are constructed from class intervals, the minimum, maximum and range can only be estimated because we don't know exactly which values are represented within each class interval.

Furthermore, it is not easy to identify the median, mean or interquartile range from a histogram by inspection, although it is often possible to see approximately where these values would lie. When necessary, it is possible to calculate an estimated value for the mean and standard deviation of the data represented by a histogram.
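
One common way to make such an estimate is to treat every value in a class interval as if it sat at the class midpoint. The sketch below illustrates that idea. The class intervals and frequencies are hypothetical, made up for this illustration and not read from any histogram in this lesson, and the standard deviation is the population version (dividing by the total frequency).

```python
from math import sqrt

# Hypothetical class intervals (in cm) and frequencies, for illustration only.
classes = [(150, 155), (155, 160), (160, 165), (165, 170)]
frequencies = [2, 5, 4, 1]

# Represent each class by its midpoint.
midpoints = [(lower + upper) / 2 for lower, upper in classes]
n = sum(frequencies)

# Frequency-weighted mean of the midpoints.
estimated_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Frequency-weighted population variance about the estimated mean.
estimated_variance = sum(
    f * (m - estimated_mean) ** 2 for f, m in zip(frequencies, midpoints)
) / n
estimated_sd = sqrt(estimated_variance)

print(f"Estimated mean: {estimated_mean:.2f} cm")
print(f"Estimated standard deviation: {estimated_sd:.2f} cm")
```

Because the exact values inside each class are unknown, both results are estimates only.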

On the other hand, the histogram does provide excellent insight into non-numeric characteristics of the data which can be important for comparison, including:

  • symmetry and skew

  • modality, including the location and frequency of the mode(s) or modal class(es)

  • size and location of clusters

  • size and location of gaps

Key comparisons for histograms:

Many of the comparisons that are described here for histograms can also be used with similar statistical graphs such as column graphs, dot plots, stem and leaf plots and even frequency tables.

If outliers are identified in the histogram, then it is important that these are mentioned in the comparison, with an explanation of how they are handled. In particular, a statement should be made if outliers are included or excluded from comparisons, and the effect that this has on the analysis.

Consider the following histograms that show the height of students in two basketball teams. We know that one graph represents a team made up of Year 12 students and the other represents a Year 8 team. Which one corresponds to the Year 12 team?

Team A:

The image shows a histogram of Team A. Ask your teacher for more information.

Team B:

The image shows a histogram of Team B. Ask your teacher for more information.

We can still compare these distributions, even though there is clearly a different number of students in the teams, because we are only interested in the shape and location of the data.

In our comparison we want to mention the most significant differences, and also describe relevant characteristics that are the same, or similar.

In this case we can observe these important similarities and differences:

  • Both distributions are approximately symmetrical, and uni-modal.

  • The modal class for Team A of 170-175 \text{ cm} is much higher than the modal class 150-155 \text{ cm} for Team B.

  • If we ignore the outlier values in the 195-200 class for Team A, then the range of both distributions is similar, at 35 \text{ cm} for Team A and 30 \text{ cm} for Team B.

  • Student heights have greater spread overall for Team A.

  • The heights for Team A appear to be concentrated around the modal class, so we can say that they are clustered at 170-180\text{ cm}.

Based on these observations, we could confidently say that Team A is the team of Year 12 students. In this case, the decision is clear because of the difference in the height for the modal class, which we would expect to be significantly higher for the older students.

Idea summary

The numeric characteristics most easily identifiable in a histogram are limited to the following:

  • mode (or modal class)

  • minimum and maximum values

  • spread (indicated by the range)

On the other hand, the histogram does provide excellent insight into non-numeric characteristics of the data which can be important for comparison, including:

  • symmetry and skew

  • modality, including the location and frequency of the mode(s) or modal class(es)

  • size and location of clusters

  • size and location of gaps

Histograms and box plots

Recall that data can be displayed in histograms and in box plots. Both displays make it easy to identify key features of the shape of the data, as well as the range. The box plot also shows the interquartile range and the median.

The shape of the data will be the same whether it is represented in a polygon, box plot or histogram. Remember that the shape of data can be symmetric, left skewed or right skewed.

Symmetric

A histogram and a box plot showing most of the data around the median. Ask your teacher for more information.

Positive (right) skewed

A histogram and a box plot showing most of its data to the left. Ask your teacher for more information.

Negative (left) skewed

A histogram and a box plot showing most of its data to the right. Ask your teacher for more information.

Looking at the diagrams above, notice the similarities in the representations.

Both representations show the direction of the tail, where the bulk of the data sits, and the general shape of the distribution. These are some of the features that can be used to match histograms and box-and-whisker plots. The data range can also be considered.

Examples

Example 2

Which two of these histograms and box plots are correctly paired?

A
A histogram and box plot that are both negatively skewed. Ask your teacher for more information.
B
A histogram and box plot that are both symmetrical. Ask your teacher for more information.
C
A histogram that is symmetrical and box plot that is positively skewed. Ask your teacher for more information.
D
A histogram that is symmetrical and box plot that is negatively skewed. Ask your teacher for more information.
Worked Solution
Create a strategy

Look for characteristics of skewed and symmetric distributions of data.

Apply the idea

In Option A, both the histogram and box plot are negatively skewed.

In Option B, both the histogram and box plot are symmetrical.

In Option C, the histogram is roughly symmetrical, but the box plot is positively skewed.

In Option D, the histogram is roughly symmetrical, but the box plot is negatively skewed.

The matching pairs are options A and B.

Idea summary

We can compare the shape of the data in histograms and box plots by checking for symmetry and skew.

  • For symmetric data, both graphs should be symmetrical about the median.

  • For positively skewed data, the histogram would have most of the data to the left and a shape with the tail pointing right. The box plot would have the box to the left and a long right whisker.

  • For negatively skewed data, the histogram would have most of the data to the right and a shape with the tail pointing left. The box plot would have the box to the right and a long left whisker.
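
The rules above can be turned into a rough check. The function below is a hypothetical helper, not a standard test for skew. It assumes that the shape can be judged by how far the median sits from each end of the data, with an arbitrary tolerance of 10% of the range for deciding what counts as approximately symmetric.

```python
def describe_shape(minimum, median, maximum, tolerance=0.1):
    """Rough shape description from the minimum, median and maximum.

    Hypothetical helper for illustration; the tolerance is an arbitrary choice.
    """
    left = median - minimum    # distance from the median to the left end
    right = maximum - median   # distance from the median to the right end
    total = maximum - minimum
    if total == 0 or abs(right - left) <= tolerance * total:
        return "approximately symmetric"
    # A longer right side means the tail points right (positive skew).
    return "positively skewed" if right > left else "negatively skewed"


print(describe_shape(0, 5, 10))
print(describe_shape(0, 2, 10))
print(describe_shape(0, 8, 10))
```

A check like this supports visual inspection of the graph but does not replace it, because it ignores features such as gaps, clusters and outliers.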

Parallel box plots

Parallel box plots are used to compare two sets of data visually. When comparing box plots, the five-number summary is the most important thing to consider. It consists of:

  • Minimum

  • Lower quartile \left(Q_1\right)

  • Median

  • Upper quartile \left(Q_3\right)

  • Maximum

Other statistics can be derived such as the range and inter-quartile range, and visual observations can be made of symmetry and skew that should be considered in any comparisons.
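
As an illustration, a five-number summary can be computed from raw data as sketched below. The function name is hypothetical. The quartiles are found as the medians of the lower and upper halves of the ordered data, leaving out the middle value when the number of values is odd. This is one common convention, and other software or calculators may use a slightly different rule. The data are Student X's scores from Example 1.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two values.

    Quartiles are the medians of the lower and upper halves of the sorted
    data; the middle value is excluded from both halves when n is odd.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return (
        ordered[0],
        median(lower_half),
        median(ordered),
        median(upper_half),
        ordered[-1],
    )


minimum, q1, med, q3, maximum = five_number_summary([86, 83, 86, 88, 98])
print("Five-number summary:", minimum, q1, med, q3, maximum)
print("Range:", maximum - minimum)
print("Interquartile range:", q3 - q1)
```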

The term parallel is used because the box plots are presented parallel to each other along the same number line for comparison. They must therefore be drawn on the same scale, which makes a visual comparison fairly straightforward.

Here we have two sets of data, comparing the time it took two different groups of people to complete an online task. It is important to clearly label each box plot.

Two box plots showing two different groups, under thirties and over thirties. Ask your teacher for more information.

If we want to choose the best group to complete the task based only on time (in real life, other factors such as accuracy might be more important), we could consider the following observations:

Note that lower numbers mean that the task was completed faster, so lower is better.

  • The minimum, lower quartile, median, upper quartile and maximum were all lower for the under 30s group;

  • The range is lower for the under 30s (20 seconds) than the over 30s (24 seconds);

  • The interquartile range is lower for the under 30s (8 seconds) than the over 30s (9 seconds);

  • At least 75\% of the under 30s completed the task in under 22 seconds, which is the median time for the over 30s;

  • 100\% of the under 30s completed the task before the slowest 25\% of the over 30s.

Every one of these measures is in favour of the under 30s so, overall, we can conclude that the under 30s performed better.

Key comparisons for box plots:

When comparing two sets of data, compare the 5 key points as shown above. There are key questions that should be asked:

  • How do the spreads of data compare?

  • How do the skews compare? Is one set of data more symmetrical?

  • Is there a big difference in the medians?

  • Can we see regions on one box plot that extend past the comparable region on the other?

Always consider what factors are more important for the given situation. In some cases, it might be necessary to make judgements by simply comparing the median value; sometimes the minimum or the maximum value is the critical measurement. In other situations, the consistency will be more important than extreme values so it is also worth considering measures of spread to make judgements.

If outliers are identified in the box plots, then it is important that these are mentioned in the comparison, with an explanation of how they are handled. That is, there should be a statement if outliers are included or excluded from comparisons, and the effect that this has.

Examples

Example 3

The box plots below represent the daily sales made by Carl and Angelina over the course of one month.

Parallel box plots of Angelina's sales and Carl's sales, each drawn on a number line from 0 to 70. Ask your teacher for more information.
a

What is the range in Angelina's sales?

Worked Solution
Create a strategy

Find the difference between the highest and smallest scores in the data set of Angelina's sales.

Apply the idea

Based on the box plot of Angelina's sales, the smallest score is 2 and the highest score is 51.

\begin{aligned}
\text{Range} &= 51 - 2 && \text{Subtract the smallest from the highest} \\
&= 49 && \text{Evaluate}
\end{aligned}
b

What is the range in Carl's sales?

Worked Solution
Create a strategy

Find the difference between the highest and smallest scores in the data set of Carl's sales.

Apply the idea

Based on the box plot of Carl's sales, the smallest score is 14 and the highest score is 64.

\begin{aligned}
\text{Range} &= 64 - 14 && \text{Subtract the smallest from the highest} \\
&= 50 && \text{Evaluate}
\end{aligned}
c

By how much did Carl's median sales exceed Angelina's?

Worked Solution
Create a strategy

Find the value of each median, then find the difference between these values.

Apply the idea
\begin{aligned}
\text{Angelina's median} &= 30 && \text{Find the score under the middle line} \\
\text{Carl's median} &= 42 && \text{Find the score under the middle line} \\
\text{Difference} &= 42 - 30 && \text{Subtract the medians} \\
&= 12 && \text{Evaluate}
\end{aligned}
d

Considering the middle 50\% of sales for both sales people, whose sales were more consistent?

Worked Solution
Create a strategy

Compare the interquartile ranges.

Apply the idea

The interquartile range will tell us how consistent the middle 50\% of scores are.

\begin{aligned}
\text{Angelina's IQR} &= 42 - 16 && \text{Subtract } Q_1 \text{ from } Q_3 \\
&= 26 && \text{Evaluate} \\
\text{Carl's IQR} &= 50 - 30 && \text{Subtract } Q_1 \text{ from } Q_3 \\
&= 20 && \text{Evaluate}
\end{aligned}

Carl has the smaller \text{IQR}, so his sales are more consistent.

e

Which salesperson had a more successful sales month?

Worked Solution
Create a strategy

Compare the medians and interquartile ranges.

Apply the idea
\begin{aligned}
\text{Angelina's median} &= 30 && \text{Score the middle vertical line is on} \\
\text{Carl's median} &= 42 && \text{Score the middle vertical line is on}
\end{aligned}

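Carl has the higher median, so his typical daily sales were higher. He also has the smaller interquartile range (20 compared with 26), so his sales were more consistent. Overall, Carl had the more successful sales month.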

Reflect and check

By comparing the box plots, we can also see that Carl's lower quartile is equal to Angelina's median, both being 30. This means that about 75\% of Carl's daily sales were at or above a level that only about 50\% of Angelina's reached, which confirms that he had the more successful month.
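
The calculations in Example 3 can be collected in a short script. This is a sketch only. The class and attribute names are hypothetical, and the five values for each salesperson are the ones read from the box plots in the worked solutions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FiveNumberSummary:
    """Five-number summary read from a box plot (illustrative helper)."""
    minimum: float
    q1: float
    median: float
    q3: float
    maximum: float

    @property
    def data_range(self) -> float:
        # Range = maximum - minimum
        return self.maximum - self.minimum

    @property
    def iqr(self) -> float:
        # Interquartile range = Q3 - Q1
        return self.q3 - self.q1


# Values taken from the worked solutions in Example 3.
angelina = FiveNumberSummary(minimum=2, q1=16, median=30, q3=42, maximum=51)
carl = FiveNumberSummary(minimum=14, q1=30, median=42, q3=50, maximum=64)

for name, sales in (("Angelina", angelina), ("Carl", carl)):
    print(f"{name}: range = {sales.data_range}, IQR = {sales.iqr}, median = {sales.median}")

print("Difference in medians:", carl.median - angelina.median)
print("Carl's Q1 equals Angelina's median:", carl.q1 == angelina.median)
```

Keeping the five values together for each group makes it easy to compare centre (median) and spread (range and IQR) side by side, which is the same comparison that parallel box plots show visually.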

Idea summary

Parallel box plots are used to compare two or more sets of data visually. These box plots are presented parallel to each other along the same number line using the same scale.

When comparing two box plots, compare the 5 key points as shown above. There are key questions that should be asked:

  • How do the spreads of data compare?

  • How do the skews compare? Is one set of data more symmetrical?

  • Is there a big difference in the medians?

  • Can we see regions on one box plot that extend past the comparable region on the other?

Outcomes

U1.AoS1.4

mean \bar{x} and sample standard deviation s

U1.AoS1.5

construct and interpret graphical displays of data, and describe the distributions of the variables involved and interpret in the context of the data

U1.AoS1.6

calculate the values of appropriate summary statistics to represent the centre and spread of the distribution of a numerical variable and interpret in the context of the data

U1.AoS1.7

construct and use parallel boxplots or back-to-back stem plots (as appropriate) to compare the distribution of a numerical variable across two or more groups in terms of centre (median), spread (range and IQR) and outliers, interpreting any observed differences in the context of the data
