At the start of this chapter we explored the statistical investigation process in which we seek to answer a statistical question. Often we need to make comparisons between different data sets to answer the questions, such as when we need to make a choice between two or more groups (or populations).
For example, if we are looking for the best tree to plant in the school courtyard for shade, we might compare two similar species of tree to determine "Which species of tree grows the fastest?". This is a statistical question and we can collect data by measuring the growth rates of many trees in a nursery.
If it turns out that all of the individual trees of one species grow faster than all of the individuals in the other species then it will be an easy decision. However, that is not always the case: each individual tree grows at a different rate, and the measurements from the two populations can overlap.
This is when we need to use statistical methods to analyse and compare data sets.
By comparing the measures of central tendency in a data set (that is, the mean, median and mode), as well as measures of spread (range, interquartile range and standard deviation), we can make comparisons between different groups and draw conclusions about our data.
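As a rough illustration, the measures listed above can be computed with Python's standard `statistics` module. This is only a sketch: the growth figures are made up for illustration, `summarise` is a hypothetical helper name, and the quartiles use the module's `"inclusive"` method, which is one of several conventions and may differ slightly from a hand calculation.

```python
from statistics import mean, median, multimode, pstdev, quantiles

def summarise(data):
    """Return measures of centre and spread for a list of numbers."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "mean": mean(data),
        "median": median(data),
        "modes": multimode(data),        # a list, since data can have several modes
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        "std_dev": pstdev(data),         # population standard deviation
    }

# Hypothetical yearly growth (cm) for five trees of each species (made-up values).
species_1 = [30, 32, 35, 35, 38]
species_2 = [25, 30, 36, 40, 44]

for name, growth in (("Species 1", species_1), ("Species 2", species_2)):
    print(name, summarise(growth))
```

In this made-up sample the two means are close, but the second species has a much larger range and standard deviation, which is the kind of difference in spread that a comparison should mention.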
Marge grows two different types of bean plants. She records the number of beans that she picks from each plant for $10$ days. Her records show:
Plant $A$: $10,4,4,5,7,10,3,3,9,10$
Plant $B$: $8,7,5,5,9,7,8,7,5,6$
What is the mean number of beans picked per day for Plant $A$? Leave your answer to one decimal place if needed.
What is the mean number of beans picked per day for Plant $B$?
What is the range for Plant $A$?
What is the range for Plant $B$?
Which plant produces more beans on average?
Plant B
Plant A
Which plant has a more consistent yield of beans?
Plant A
Plant B
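One way to check the calculations for Marge's bean plants is sketched below. It uses the records given in the question and takes the range as the measure of consistency, since that is the measure of spread the question asks for. The helper name `mean_and_range` is hypothetical, and ties between the two plants are not handled.

```python
from statistics import mean

# Daily bean counts from Marge's records.
plant_a = [10, 4, 4, 5, 7, 10, 3, 3, 9, 10]
plant_b = [8, 7, 5, 5, 9, 7, 8, 7, 5, 6]

def mean_and_range(counts):
    """Return (mean rounded to one decimal place, range) for a list of counts."""
    return round(mean(counts), 1), max(counts) - min(counts)

mean_a, range_a = mean_and_range(plant_a)
mean_b, range_b = mean_and_range(plant_b)

# A higher mean means more beans on average; a smaller range means a more consistent yield.
higher_average = "Plant A" if mean_a > mean_b else "Plant B"
more_consistent = "Plant A" if range_a < range_b else "Plant B"

print(mean_a, range_a, mean_b, range_b, higher_average, more_consistent)
```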
The residents of two blocks of townhouses were asked the number of pets they own. The frequencies of the various responses are presented in the dot plots.
According to the data, which of the following statements are true?
Pet ownership is a little lower in block $A$.
True
False
In block $A$, most households have zero or one pet.
True
False
In block $B$, most households have three or more pets.
True
False
In block $A$, pet ownership is skewed negatively.
True
False
Pet ownership ranges from $0$ to $3$ pets in block $A$.
True
False
There is more variability in the block $B$ distribution.
True
False
Both sets of scores have an outlier.
True
False
Student X scored $86,83,86,88,98$ and
Student Y scored $61,83,50,85,83$ across $5$ exams.
Find the mean score of Student X, writing your answer as a decimal.
Find the mean score of Student Y
Find the standard deviation of the scores for Student X, correct to two decimal places.
Find the standard deviation of the scores for Student Y, correct to two decimal places.
Which student performed better?
Student X
Student Y
Student X
Student Y
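The mean and standard deviation of each student's scores can be checked with a short sketch like the one below. The question does not say whether the population or the sample standard deviation is intended, so both are shown; which one matches the expected answer is an assumption left to the reader. The helper name `describe` is hypothetical.

```python
from statistics import mean, pstdev, stdev

# Exam scores from the question.
student_x = [86, 83, 86, 88, 98]
student_y = [61, 83, 50, 85, 83]

def describe(scores):
    """Return (mean, population std dev, sample std dev), rounded to 2 d.p."""
    return (
        mean(scores),
        round(pstdev(scores), 2),   # divides by n
        round(stdev(scores), 2),    # divides by n - 1
    )

print("Student X:", describe(student_x))
print("Student Y:", describe(student_y))
```

With either version of the standard deviation, Student X has both the higher mean and the smaller spread.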
Histograms and similar graphs (such as column graphs, dot plots and stem and leaf plots) are popular ways to display data because they give a detailed picture of the distribution of data.
However, with histograms it is not always easy to compare the particular statistical values for our data sets. The numeric characteristics that we can readily see in a histogram are limited to the following:
If the histogram is constructed for grouped data, the minimum, maximum and range can only be estimated because we don't know exactly which values are represented within each class interval.
Furthermore, it is not easy to identify the median or mean or interquartile range from a histogram by inspection, although often it is possible to see approximately where these values would lie. When necessary, we can calculate an estimated value for the mean and standard deviation of the data represented by a histogram.
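A common way to make this estimate is to treat every value in a class as if it sat at the class midpoint. The sketch below assumes that approach; the class intervals and frequencies are made up for illustration, and `grouped_estimates` is a hypothetical helper name.

```python
from math import sqrt

def grouped_estimates(classes):
    """Estimate the mean and population standard deviation of grouped data.

    `classes` is a list of (lower, upper, frequency) tuples. Each class is
    represented by its midpoint, so the results are estimates only.
    """
    total = sum(freq for _, _, freq in classes)
    midpoints = [((lower + upper) / 2, freq) for lower, upper, freq in classes]
    est_mean = sum(mid * freq for mid, freq in midpoints) / total
    variance = sum(freq * (mid - est_mean) ** 2 for mid, freq in midpoints) / total
    return est_mean, sqrt(variance)

# Hypothetical height classes in cm: (lower bound, upper bound, frequency).
heights = [(150, 160, 2), (160, 170, 5), (170, 180, 3)]

est_mean, est_sd = grouped_estimates(heights)
print(est_mean, est_sd)
```

Because the exact values inside each class are unknown, the true mean and standard deviation could differ from these estimates.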
On the other hand, the histogram does provide excellent insight into non-numeric characteristics of the data which can be important for comparison, including:
Many of the comparisons that are described here for histograms can also be used with similar statistical graphs such as column graphs, dot plots, stem and leaf plots and even frequency tables.
If outliers are identified in the histogram, then it is important that these are mentioned in the comparison, with an explanation of how they are handled. In particular, we should state if outliers are included or excluded from comparisons, and the effect that this has on our analysis.
Consider the following histograms that show the height of students in two basketball teams. We know that one graph represents a team made up of Year $12$ students and the other represents a Year $8$ team. Which one corresponds to the Year $12$ team?
We can still compare these distributions, even though there is clearly a different number of students in the teams, because we are only interested in the shape and location of the data.
In our comparison we want to mention the most significant differences, and also describe relevant characteristics that are the same, or similar.
In this case we can observe these important similarities and differences:
Based on these observations, we could confidently say that Team A is the team of Year $12$ students. In this case, the decision is clear because of the difference in the height for the modal class, which we would expect to be significantly higher for the older students.
We have already seen how data can be displayed in histograms and in box plots. Both displays make it easy to identify key features of the shape of the data and its range; a box plot also shows the median and the interquartile range.
We should expect then that the shape of the data would be the same whether it is represented in a polygon, box plot or histogram. Remember that the shape of data can be symmetric, left skewed or right skewed.
Looking at the diagrams above, we can see the similarities in the representations.
We can see the skewed tails, where the bulk of the data sits, and the general shape. These are some of the features we can use to match column graphs and box-and-whisker plots. We can also look at the data range.
To match displays, start by identifying whether each one has a tail (to the left or to the right) or is roughly symmetric.
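The idea of looking for a tail can be sketched as a rough numerical check on a box plot: compare how far the minimum and the maximum each sit from the median. This is only a heuristic, not a formal test of skewness; the function name `describe_shape` and the tolerance of `0.2` are assumptions chosen for illustration.

```python
def describe_shape(minimum, med, maximum, tolerance=0.2):
    """Give a rough shape label from the minimum, median and maximum."""
    spread = maximum - minimum
    if spread == 0:
        return "no spread"
    lower = med - minimum    # distance from the minimum up to the median
    upper = maximum - med    # distance from the median up to the maximum
    balance = (upper - lower) / spread   # between -1 and 1
    if balance > tolerance:
        return "right (positive) skew"   # longer tail on the right
    if balance < -tolerance:
        return "left (negative) skew"    # longer tail on the left
    return "roughly symmetric"

print(describe_shape(0, 2, 10))   # median near the minimum
print(describe_shape(0, 8, 10))   # median near the maximum
print(describe_shape(0, 5, 10))   # median in the middle
```

A check like this only narrows down the candidates; the position of the box and the overall range still need to be compared by eye.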
Match the column graph shown here to the correct box plot.
Match the box plot shown to the correct column graph.
Consider the following pairs of histograms and box plots:
Which two of these histograms and box plots are correctly paired?
In part (a) we determined that the following histogram/box plot were an incorrect match:
Which two of the options correctly describe why?
The box plot has a long tail to the right which indicates positive skew, while the histogram does not appear to be skewed.
The data on the histogram is widely spread, while the box plot indicates that the data is mostly located around the median.
The median for the histogram is roughly in the middle, while the median of the box plot is located further to the left.
Parallel box plots are used to compare two (or more) sets of data visually. When comparing box plots, the $5$ key numbers are going to be the important parts to consider. The $5$ number summary will give us the:
We can also derive other statistics such as the range and interquartile range, and make visual observations of symmetry and skew that we can consider in our comparison.
The term parallel is used because the box plots are presented parallel to each other along the same number line for comparison. They must therefore be in the same scale, so a visual comparison is fairly straightforward.
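The numbers behind a pair of parallel box plots can be computed as in the sketch below, which reuses Marge's bean counts from earlier in this section. It assumes the "median of each half" method for the quartiles (the middle value is left out of both halves when the number of values is odd); other quartile conventions can give slightly different results. The helper name `five_number_summary` is hypothetical, and it needs at least two values.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]
    # Skip the middle value when the number of values is odd.
    upper = values[half:] if len(values) % 2 == 0 else values[half + 1:]
    return values[0], median(lower), median(values), median(upper), values[-1]

# Marge's bean counts from earlier in the section.
plant_a = [10, 4, 4, 5, 7, 10, 3, 3, 9, 10]
plant_b = [8, 7, 5, 5, 9, 7, 8, 7, 5, 6]

for name, counts in (("Plant A", plant_a), ("Plant B", plant_b)):
    minimum, q1, med, q3, maximum = five_number_summary(counts)
    print(name, (minimum, q1, med, q3, maximum),
          "range:", maximum - minimum, "IQR:", q3 - q1)
```

Drawing both summaries against the same number line is what makes the box plots parallel and directly comparable.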
Here we have two sets of data, comparing the time it took two different groups of people to complete an online task. It is important to clearly label each box plot.
If we want to choose the best group to complete the task, based only on time (in real life other factors, such as accuracy, might be more important), we could consider the following observations:
Note that lower numbers mean that the task was completed faster, so lower is better.
Every one of these measures is in favour of the under $30$s so, overall, we can conclude that the under $30$s performed better.
When comparing two sets of data we can compare the $5$ key points as shown above. There are key questions we should ask:
We should always consider what factors are more important for the given situation. In some cases we might be able to make judgements by simply comparing the median value; sometimes the minimum or the maximum value is the critical measurement. In other situations the consistency will be more important than extreme values so we should consider measures of spread to make judgements.
If outliers are identified in the box plots, then it is important that these are mentioned in the comparison, with an explanation of how they are handled. That is, we should state if outliers are included or excluded from comparisons, and the effect that this has.
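To see the effect of including or excluding outliers, the sketch below flags values that lie more than $1.5$ times the interquartile range beyond the quartiles. That rule is a common convention rather than something defined in this section, so treat it as an assumption. The task times are made up for illustration, `split_outliers` is a hypothetical helper name, and the quartiles use the `"inclusive"` method of Python's `statistics` module.

```python
from statistics import mean, median, quantiles

def split_outliers(data, k=1.5):
    """Split data into (kept values, outliers) using fences k * IQR from the quartiles."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    fence = k * (q3 - q1)
    low, high = q1 - fence, q3 + fence
    kept = [x for x in data if low <= x <= high]
    outliers = [x for x in data if x < low or x > high]
    return kept, outliers

# Hypothetical task times in minutes (made-up values).
times = [12, 14, 15, 15, 16, 17, 18, 45]
kept, outliers = split_outliers(times)

print("outliers:", outliers)
print("mean with / without:", mean(times), round(mean(kept), 2))
print("median with / without:", median(times), median(kept))
```

In this made-up data the single outlier shifts the mean far more than the median, which is why a comparison should state whether outliers were included.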
The box plots show the heights, in centimetres, jumped by two high jumpers.
From these box plots, we can make comparison statements such as:
Based on this comparison, if we had to choose one of these high jumpers for the school athletics team, we would most likely choose John. In this case, the maximum height that John can achieve is more important than the consistency (but lower height) of Bill's jumps.
The box plots show the monthly profits (in thousands of dollars) of two financial traders over a year.
[Box plots of monthly profits for Ned and Tobias, each drawn on an axis from 5 to 60 (thousands of dollars)]
Who made a higher median monthly profit?
Ned
Tobias
Whose profits had a higher interquartile range?
Tobias
Ned
Whose profits had a higher range?
Ned
Tobias
How much more did Ned make in his most profitable month than Tobias did in his most profitable month?
The two box plots below show the data collected by the manufacturers on the life-span of light bulbs, measured in thousands of hours.
Complete the following table using the two box plots. Write each answer in terms of hours.
| | Manufacturer A | Manufacturer B |
|---|---|---|
| Median | $\editable{}$ | $5000$ |
| Lower Quartile | $\editable{}$ | $\editable{}$ |
| Upper Quartile | $4500$ | $\editable{}$ |
| Range | $\editable{}$ | $6500$ |
| Interquartile Range | $\editable{}$ | $\editable{}$ |
Which manufacturer produces light bulbs with the best lifespan?
Manufacturer A.
Manufacturer B.
The box plots below represent the daily sales made by Carl and Angelina over the course of one month.
[Box plots of daily sales for Angelina and Carl, each drawn on an axis from 0 to 70]
What is the range in Angelina's sales?
What is the range in Carl’s sales?
By how much did Carl’s median sales exceed Angelina's?
Considering the middle $50\%$ of sales for both salespeople, whose sales were more consistent?
Carl
Angelina
Which salesperson had a more successful sales month?
Angelina
Carl