We previously looked at the quartiles of a data set, and found the first quartile, the median, and the third quartile. Remember that the quartiles give some basic insight into the internal spread of the data, whereas the range only uses the difference between the two extreme data points, the maximum and the minimum. We can use the quartiles in combination with the two extremes of a data set to simplify the data into a five number summary: the minimum, the lower quartile, the median, the upper quartile, and the maximum.
The five numbers from the five number summary break up a set of scores into four parts, with $25\%$ of the scores in each part. Have a look at the diagram here:
So knowing these five key numbers can help us identify regions, such as the top $25\%$, $50\%$, and $75\%$ of the scores.
The table shows the number of points scored by a basketball team in each game of their previous season.
$59$ | $67$ | $73$ | $82$ | $91$ | $58$ | $79$ | $88$ |
$69$ | $84$ | $55$ | $80$ | $98$ | $64$ | $82$ | |
Sort the data in ascending order.
State the maximum value of the set.
State the minimum value of the set.
Find the median value.
Find the lower quartile.
Find the upper quartile.
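The steps above can be sketched in Python as a way to check your working. This is a minimal sketch, assuming the median-of-halves convention for quartiles (the median itself is left out of both halves when the number of scores is odd). The function name `five_number_summary` is a hypothetical helper, not part of any library.

```python
from statistics import median

def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumes at least two scores and the median-of-halves convention.
    """
    data = sorted(scores)              # ascending order
    n = len(data)
    half = n // 2
    lower = data[:half]                # scores below the median position
    upper = data[half + n % 2:]        # scores above the median position
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Points scored by the basketball team in each game.
points = [59, 67, 73, 82, 91, 58, 79, 88, 69, 84, 55, 80, 98, 64, 82]

print(sorted(points))
print(five_number_summary(points))
```

Other quartile conventions exist, so a different tool may report slightly different quartiles for some data sets.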
Creating a box plot:
For the box plot above, find the:
(a) Range
Think: The range is the difference between the highest score and the lowest score. That is, the difference between the scores at the ends of the whiskers.
Do: For this data set, the range is $18-3=15$.
(b) Median
Think: The median is shown by the line inside the rectangular box.
Do: For this data set, the median line is at the score $10$.
(c) Interquartile range (IQR)
Think: The IQR is the difference between the upper quartile and the lower quartile.
Do: For this set, the lower quartile (at the left end of the box) is $8$, while the upper quartile (at the right end of the box) is $15$. This means that the IQR is $15-8=7$.
(d) What percentage of scores are in the range $8$ to $18$ inclusive?
Think: $8$ is the first quartile and $18$ is the maximum value. Each of the four parts of the box plot contains $25\%$ of the data, and the interval from the first quartile to the maximum covers three of those parts.
Do: $3\times25\%=75\%$ of the data lies between these values.
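The calculations in parts (a) to (d) can be collected into a short Python sketch. It uses only the five values read from the box plot in this example. The names `summary`, `ORDER` and `percent_between` are hypothetical, chosen for illustration.

```python
# Five number summary read from the box plot in the worked example.
summary = {"min": 3, "q1": 8, "median": 10, "q3": 15, "max": 18}

# The five markers in order along the number line.
ORDER = ["min", "q1", "median", "q3", "max"]

def percent_between(start, end):
    """Approximate percentage of scores between two markers.

    Each gap between neighbouring markers holds about 25% of the scores.
    """
    return 25 * (ORDER.index(end) - ORDER.index(start))

data_range = summary["max"] - summary["min"]   # (a) range
iqr = summary["q3"] - summary["q1"]            # (c) interquartile range

print(data_range, iqr, percent_between("q1", "max"))
```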
Sometimes a data set contains unusually high or low values. These unusual values are called outliers. They may arise from data collection errors or from the natural variation of the data.
We often want to identify the outlier values, and see the characteristics of the data without the effect of the outliers. In this case, we can construct a modified box plot to show the outlier values separately.
We will look more closely at identifying and displaying outliers in sets of data in our next lesson on describing distributions.
Parallel box plots are used to compare two or more sets of data visually. These box plots are presented parallel to each other along the same number line, so they are drawn on the same scale and a visual comparison is fairly straightforward. It is important to clearly label each box plot.
Key Comparisons:
When comparing two sets of data we can compare the location of each value of the five number summary. We can also ask ourselves the following questions:
The parallel box plot below shows two sets of data, comparing the time it took two different groups of people to complete an online task.
(a) Which group was generally faster?
Think: Which box plot has its main values further to the left? Is this consistent for all of the values in the five number summary? Are the differences significant? In particular note the difference in the median.
Do: We can see that overall the under $30$s were faster at completing the task. Each of the numbers in the five number summary is smaller for the under $30$s, and their median is $4$ seconds lower than that of the over $30$s. More than $75\%$ of the under $30$s completed the task in under $22$ seconds, which is the median time taken by the over $30$s. In addition, $100\%$ of the under $30$s had finished the task before $75\%$ of the over $30$s had completed it.
(b) Which group had more consistent completion times?
Think: For consistency note the difference in range and interquartile range. Recall, the smaller a measure of spread the more consistent the scores are.
Do: Overall, the under $30$s had a smaller spread of times. There was greater spread within the over $30$s group, with a range of $24$ seconds compared to $20$ seconds for the under $30$s. The interquartile range was also $3$ seconds smaller for the under $30$s group.
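The same comparison can be made numerically once the five number summary of each group is known. The sketch below is illustrative only: the two summaries are hypothetical values, not readings from the plot, chosen so that the differences agree with the answers above. The helper names `spread` and `compare` are also assumptions.

```python
# Hypothetical five number summaries (min, Q1, median, Q3, max) in seconds.
# These are NOT read from the plot; they are illustrative values only.
under_30 = (8, 14, 18, 20, 28)
over_30 = (12, 20, 22, 29, 36)

def spread(summary):
    """Return the range and interquartile range of a five number summary."""
    minimum, q1, _, q3, maximum = summary
    return {"range": maximum - minimum, "iqr": q3 - q1}

def compare(a, b):
    """Return b's median, range and IQR minus a's.

    A positive value means group a has the smaller measure.
    """
    spread_a, spread_b = spread(a), spread(b)
    return {
        "median": b[2] - a[2],
        "range": spread_b["range"] - spread_a["range"],
        "iqr": spread_b["iqr"] - spread_a["iqr"],
    }

print(compare(under_30, over_30))
```

A positive median difference points to the first group being faster, while positive range and IQR differences point to it being more consistent.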
The box plots below represent the daily sales made by Carl and Angelina over the course of one month.
[Box plot: Angelina's Sales, number line from 0 to 70]
[Box plot: Carl's Sales, number line from 0 to 70]
What is the range in Angelina's sales?
What is the range in Carl’s sales?
By how much did Carl’s median sales exceed Angelina's?
Considering the middle $50\%$ of sales for both salespeople, whose sales were more consistent?
Carl
Angelina
Which salesperson had a more successful sales month?
Angelina
Carl
Answer the following questions using the given frequency table.
Score | Frequency |
---|---|
$15$ | $13$ |
$16$ | $9$ |
$17$ | $23$ |
$18$ | $19$ |
$19$ | $8$ |
$20$ | $13$ |
Complete the five number summary using a CAS calculator.
Minimum: ____
Lower quartile: ____
Median: ____
Upper quartile: ____
Maximum: ____
Calculate the interquartile range.
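If a CAS calculator is not to hand, the same kind of calculation can be sketched in Python by expanding the frequency table into a list of individual scores. This is a minimal sketch, assuming the median-of-halves convention for quartiles. A calculator that uses a different convention may report slightly different quartiles. The variable names are illustrative.

```python
from statistics import median

# Score -> frequency, copied from the table above.
frequency_table = {15: 13, 16: 9, 17: 23, 18: 19, 19: 8, 20: 13}

# Expand the table into a sorted list with each score repeated by its frequency.
scores = [score
          for score, freq in sorted(frequency_table.items())
          for _ in range(freq)]

n = len(scores)
half = n // 2
lower = scores[:half]              # scores below the median position
upper = scores[half + n % 2:]      # scores above the median position

summary = (scores[0], median(lower), median(scores), median(upper), scores[-1])
iqr = summary[3] - summary[1]

print(summary)
print(iqr)
```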
The salaries earned by employees at a software company are shown in the histogram below.
Use your CAS calculator to construct a box plot, using the class centres.
Calculate the interquartile range.
Using the box plot, approximately what percentage of salaries lie in the range $\$90000$ to $\$100000$?
Complete the following statement.
The highest $25\%$ of salaries lie between \$____ and \$____ inclusive.