Often, it is necessary to compare different data sets to answer a question, or to choose between two or more groups (or populations).
For example, when looking for the best tree to plant in the school courtyard for shade, it might be useful to compare two similar species of tree and ask "Which species of tree grows the fastest?" This is a statistical question, and data can be collected by measuring the growth rates of many trees in a nursery.
If it turns out that all of the individual trees of one species grow faster than all of the individuals of the other species, then it will be an easy decision. However, that is not always the case: each individual tree grows at a different rate, and the measurements from the two populations can overlap.
This is when statistical methods can be used to analyse and compare data sets.
By comparing the measures of central tendency of each data set (that is, the mean, median and mode), as well as the measures of spread (range, interquartile range and standard deviation), it is possible to make comparisons between different groups and draw conclusions about the data.
Student X scored $86,83,86,88,98$ and
Student Y scored $61,83,50,85,83$ across 5 exams.
Find the mean score of Student X, writing your answer as a decimal.
Find the mean score of Student Y.
Find the standard deviation of the scores for Student X, correct to two decimal places.
Find the standard deviation of the scores for Student Y, correct to two decimal places.
Which student performed better?
Student X
Student Y
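The calculations asked for in this question can be checked with a short script. This is only a sketch: the question does not say whether the population or the sample standard deviation is intended, so both are computed, and the helper name `summarise` is made up for illustration.

```python
from statistics import mean, pstdev, stdev

# Exam scores exactly as given in the question.
student_x = [86, 83, 86, 88, 98]
student_y = [61, 83, 50, 85, 83]


def summarise(scores):
    """Return the mean and both common versions of the standard deviation.

    Assumption: the question does not state which standard deviation
    is wanted, so both are reported, rounded to two decimal places.
    """
    return {
        "mean": mean(scores),
        "population_sd": round(pstdev(scores), 2),  # divides by n
        "sample_sd": round(stdev(scores), 2),       # divides by n - 1
    }


for name, scores in (("Student X", student_x), ("Student Y", student_y)):
    print(name, summarise(scores))
```

Whichever version of the standard deviation is used, Student X has both the higher mean and the smaller spread for these scores.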
Histograms and similar graphs (such as column graphs, dot plots and stem-and-leaf plots) are popular ways to display data because they give a detailed picture of the distribution of the data.
However, with histograms it is not always easy to compare the particular statistical values for our data sets. The numeric characteristics most easily identifiable in a histogram are limited to the following:
Since histograms are constructed from class intervals, the minimum, maximum and range can only be estimated because we don't know exactly which values are represented within each class interval.
Furthermore, it is not easy to identify the median or mean or interquartile range from a histogram by inspection, although often it is possible to see approximately where these values would lie. When necessary, it is possible to calculate an estimated value for the mean and standard deviation of the data represented by a histogram.
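One common way to estimate the mean and standard deviation from a histogram is to treat every value in a class interval as if it sat at the midpoint of that interval. The sketch below follows that approach. The class intervals and frequencies are hypothetical, the function name `grouped_mean_sd` is made up for illustration, and the population form of the standard deviation (dividing by the total frequency) is assumed.

```python
from math import sqrt


def grouped_mean_sd(classes):
    """Estimate the mean and standard deviation from grouped data.

    classes is a list of (lower_bound, upper_bound, frequency) tuples.
    Each class is represented by its midpoint, so the results are
    estimates only. Assumption: population form (divide by n).
    """
    n = sum(freq for _, _, freq in classes)
    midpoints = [((lower + upper) / 2, freq) for lower, upper, freq in classes]
    est_mean = sum(mid * freq for mid, freq in midpoints) / n
    est_variance = sum(freq * (mid - est_mean) ** 2 for mid, freq in midpoints) / n
    return est_mean, sqrt(est_variance)


# Hypothetical histogram of heights in cm (not data from this lesson).
hypothetical_classes = [
    (140, 150, 2),
    (150, 160, 5),
    (160, 170, 8),
    (170, 180, 4),
    (180, 190, 1),
]

est_mean, est_sd = grouped_mean_sd(hypothetical_classes)
print(f"estimated mean: {est_mean:.1f}, estimated SD: {est_sd:.2f}")
```

Because the exact values inside each class are unknown, narrower class intervals generally give estimates closer to the true mean and standard deviation.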
On the other hand, the histogram does provide excellent insight into non-numeric characteristics of the data which can be important for comparison, including:
Many of the comparisons that are described here for histograms can also be used with similar statistical graphs such as column graphs, dot plots, stem and leaf plots and even frequency tables.
If outliers are identified in the histogram, then it is important that these are mentioned in the comparison, with an explanation of how they are handled. In particular, a statement should be made if outliers are included or excluded from comparisons, and the effect that this has on the analysis.
Consider the following histograms that show the height of students in two basketball teams. We know that one graph represents a team made up of Year $12$ students and the other represents a Year $8$ team. Which one corresponds to the Year $12$ team?
We can still compare these distributions, even though there is clearly a different number of students in the teams, because we are only interested in the shape and location of the data.
In our comparison we want to mention the most significant differences, and also describe relevant characteristics that are the same, or similar.
In this case we can observe these important similarities and differences:
Based on these observations, we could confidently say that Team $A$ is the team of Year $12$ students. In this case, the decision is clear because of the difference in the height for the modal class, which we would expect to be significantly higher for the older students.
Recall that data can be displayed in histograms and in box plots. Both displays make it easy to identify key features of the shape of the data and its range. A box plot also shows the interquartile range and the median.
The shape of the data will be the same whether it is represented in a polygon, box plot or histogram. Remember that the shape of data can be symmetric, left skewed or right skewed.
Looking at the diagrams above, notice the similarities in the representations.
Both representations show the same skewed tail, the same location for the bulk of the data, and the same general shape. These are some of the features that can be used to match histograms and box-and-whisker plots. The range of the data can also be considered.
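Skew can also be judged numerically from the quartiles that a box plot displays. The sketch below uses the quartile skewness coefficient (sometimes called Bowley's coefficient), which is not part of this lesson and is offered only as one possible check. The function name `describe_skew` and the `tolerance` cut-off of 0.1 are arbitrary assumptions.

```python
def describe_skew(q1, median, q3, tolerance=0.1):
    """Classify skew from the three quartiles of a box plot.

    Uses the quartile skewness coefficient
        (Q3 + Q1 - 2 * median) / (Q3 - Q1),
    which is positive when the median sits closer to Q1 (longer right
    side of the box) and negative when it sits closer to Q3.
    Assumption: `tolerance` is an arbitrary cut-off for "roughly symmetric".
    """
    iqr = q3 - q1
    if iqr == 0:
        return "undetermined"
    coefficient = (q3 + q1 - 2 * median) / iqr
    if coefficient > tolerance:
        return "right (positive) skew"
    if coefficient < -tolerance:
        return "left (negative) skew"
    return "roughly symmetric"


# Hypothetical quartiles read from three different box plots.
print(describe_skew(10, 12, 20))
print(describe_skew(10, 18, 20))
print(describe_skew(10, 15, 20))
```

This check only looks at the box. The lengths of the whiskers should still be inspected, because a long whisker on one side also indicates a tail in that direction.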
Match the box plots and histograms together.
Think: To identify matching displays, start by identifying which data sets have a tail (to the left or to the right) and which are symmetric.
Do:
Consider the following pairs of histograms and box plots:
Which two of these histograms and box plots are correctly paired?
In part (a) we determined that the following histogram and box plot were incorrectly matched:
Which two of the options correctly describe why?
The box plot has a long tail to the right which indicates positive skew, while the histogram does not appear to be skewed.
The data on the histogram is widely spread, while the box plot indicates that the data is mostly located around the median.
The median for the histogram is roughly in the middle, while the median of the box plot is located further to the left.
Parallel box plots are used to compare two (or more) sets of data visually. When comparing box plots, the $5$ key numbers are going to be the important parts to consider. The $5$ number summary will give:
Other statistics can be derived, such as the range and interquartile range, and visual observations of symmetry and skew can be made. These should all be considered in any comparison.
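The numbers behind a pair of parallel box plots can be computed directly from the data. The sketch below reuses the exam scores of Student X and Student Y from earlier in this lesson. It assumes the "median of each half" method for the quartiles (the overall median is left out of both halves when the number of values is odd). Other quartile conventions exist and can give slightly different values. The function names are made up for illustration.

```python
from statistics import median


def five_number_summary(data):
    """Return min, Q1, median, Q3 and max for a list of at least 2 values.

    Assumption: quartiles are the medians of the lower and upper halves,
    excluding the overall median when the number of values is odd.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower_half = values[:half]
    upper_half = values[half + 1:] if n % 2 else values[half:]
    return {
        "min": values[0],
        "q1": median(lower_half),
        "median": median(values),
        "q3": median(upper_half),
        "max": values[-1],
    }


def spread(summary):
    """Derive the range and interquartile range from a five-number summary."""
    return {
        "range": summary["max"] - summary["min"],
        "iqr": summary["q3"] - summary["q1"],
    }


# Exam scores from the earlier question in this lesson.
groups = {
    "Student X": [86, 83, 86, 88, 98],
    "Student Y": [61, 83, 50, 85, 83],
}

for name, scores in groups.items():
    summary = five_number_summary(scores)
    print(name, summary, spread(summary))
```

Printing the summaries one under the other mirrors what parallel box plots show on a shared scale: the medians can be compared for location, and the range and interquartile range for consistency.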
The term parallel is used because the box plots are presented parallel to each other along the same number line for comparison. They must therefore be drawn on the same scale, which makes a visual comparison fairly straightforward.
Here we have two sets of data, comparing the time it took two different groups of people to complete an online task. It is important to clearly label each box plot.
If we want to choose the best group to complete the task, based only on time (in real life other factors, such as accuracy, might be more important), we could consider the following observations:
Note that lower numbers mean that the task was completed faster, so lower is better.
Every one of these measures is in favour of the under $30$s so, overall, we can conclude that the under $30$s performed better.
When comparing two sets of data, compare the $5$ key points as shown above. There are key questions that should be asked:
Always consider what factors are more important for the given situation. In some cases, it might be necessary to make judgements by simply comparing the median value; sometimes the minimum or the maximum value is the critical measurement. In other situations, the consistency will be more important than extreme values so it is also worth considering measures of spread to make judgements.
If outliers are identified in the box plots, then it is important that these are mentioned in the comparison, with an explanation of how they are handled. That is, there should be a statement if outliers are included or excluded from comparisons, and the effect that this has.
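One widely used convention, which this lesson does not state explicitly, is to flag a value as an outlier when it lies more than $1.5$ interquartile ranges below the lower quartile or above the upper quartile. The sketch below assumes that rule, assumes the "median of each half" method for the quartiles, and uses hypothetical task times. The function name `find_outliers` is made up for illustration.

```python
from statistics import mean, median


def find_outliers(data, k=1.5):
    """Return the values lying outside the fences Q1 - k*IQR and Q3 + k*IQR.

    Assumptions: k = 1.5 is the usual convention, and the quartiles are
    the medians of the lower and upper halves of the sorted data.
    Needs at least 2 values.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    q1 = median(values[:half])
    q3 = median(values[half + 1:] if n % 2 else values[half:])
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]


# Hypothetical task completion times in minutes (not data from this lesson).
times = [12, 14, 15, 15, 16, 17, 18, 19, 20, 45]

outliers = find_outliers(times)
kept = [t for t in times if t not in outliers]

# Report the statistics both ways so the effect of the outlier is explicit.
print("outliers:", outliers)
print("with outliers:    mean", round(mean(times), 2), "median", median(times))
print("without outliers: mean", round(mean(kept), 2), "median", median(kept))
```

Reporting the statistics both with and without the flagged values makes the effect of the outliers on the comparison visible, which is the kind of statement the paragraph above asks for.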
The box plots show the heights, in centimetres, jumped by two high jumpers.
From these box plots, comparison statements can be made, such as:
Based on this comparison, if we had to choose one of these high jumpers for the school athletics team, we would most likely choose John. In this case, the maximum height that John can achieve is more important than the consistency (but lower height) of Bill's jumps.
The box plots below represent the daily sales made by Carl and Angelina over the course of one month.
[Box plot: Angelina's Sales, axis from 0 to 70 in steps of 10]
[Box plot: Carl's Sales, axis from 0 to 70 in steps of 10]
What is the range in Angelina's sales?
What is the range in Carl’s sales?
By how much did Carl’s median sales exceed Angelina's?
Considering the middle $50\%$ of sales for both sales people, whose sales were more consistent?
Carl
Angelina
Which salesperson had a more successful sales month?
Angelina
Carl