When we are trying to understand what our data is telling us, we usually find statistics that tell us the location of the data (such as the mean or median) as well as measures of spread, such as the range.
To get a better picture of the distribution of a data set, with a concise set of values, we often use the five number summary.
The five number summary is made up of the minimum and maximum values, the median, and two other values, known as upper and lower quartiles.
We are familiar with the median as the middle value in a data set when the values are arranged in order. The median is a useful statistic that tells us the location of the data.
Quartiles are values at particular locations in the data set – similar to the median, but instead of dividing a data set into halves, they divide a data set into quarters.
$1$ | $3$ | $4$ | $7$ | $11$ | $12$ | $14$ | $19$ |
First locate the median, between the $4$th and $5$th values:
 | | | | Median | | | | |
 | | | | $\downarrow$ | | | | |
$1$ | $3$ | $4$ | $7$ | | $11$ | $12$ | $14$ | $19$ |
Now there are $4$ values in each half of the data set, so we split each half in two to find the quartiles. The lower quartile is between the $2$nd and $3$rd values, with two values of the lower half on either side of it. Similarly, the upper quartile is between the $6$th and $7$th values:
 | | lower quartile | | | Median | | | upper quartile | | |
 | | $\downarrow$ | | | $\downarrow$ | | | $\downarrow$ | | |
$1$ | $3$ | | $4$ | $7$ | | $11$ | $12$ | | $14$ | $19$ |
We can see that the intervals between the quartiles each contain two values–one quarter of the total number of values in the data set.
The lower quartile is also called the first quartile, or $Q_1$. It is the middle value between the minimum value and the median. To calculate the lower quartile, we identify the scores less than the median (which we call the lower half). Then we determine the middle value of this lower half.
The median is also known as the second quartile, or $Q_2$. As we have already learnt, it is the middle value in the sorted data set.
The median is the $\frac{n+1}{2}$th value in the sorted data set, where $n$ is the number of values in the data set.
The upper quartile is also called the third quartile, or $Q_3$. It is the middle value between the median and the maximum value. The upper quartile can be found by identifying the scores in the upper half (above the median). Then we determine the middle value of this upper half.
The range is the difference between the maximum value and minimum value in the data set.
The interquartile range, or IQR, is the difference between the upper quartile and the lower quartile. Half of the values in the data set lie within the interquartile range.
The interquartile range is a useful measure of the spread of data because, unlike the range, it is not affected by unusually large or small values.
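The definitions above can be sketched in code. The following is a minimal Python sketch, assuming the median-of-halves convention described here (when the number of values is odd, the median itself is left out of both halves) and at least two values in the data set. The function name `five_number_summary` is our own label, not a library function.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]        # values below the median position
    upper_half = data[n - half:]    # values above the median position
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])


# The eight-value data set used earlier in this lesson.
data = [1, 3, 4, 7, 11, 12, 14, 19]
minimum, q1, med, q3, maximum = five_number_summary(data)

value_range = maximum - minimum   # difference between maximum and minimum
iqr = q3 - q1                     # difference between upper and lower quartiles

print(minimum, q1, med, q3, maximum)
print(value_range, iqr)
```

For this data set the quartiles fall between values, so the sketch averages the two neighbouring values, just as we would by hand.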
The five number summary is the set of values made up of the minimum, the lower quartile ($Q_1$), the median, the upper quartile ($Q_3$) and the maximum.
These values break our data set into four parts as shown in this diagram
Knowing the five number summary can help us identify key regions of the data set.
The individual values required for a five number summary are readily obtained using the Statistics mode on a CAS calculator or a spreadsheet. They can also be calculated by hand.
If we enter all of the data in one column (in this example column A) we can use these formulas to help us do the calculations quickly.
Statistic | Formula |
---|---|
Minimum | =MIN(A:A) |
$Q_1$ | =QUARTILE(A:A,1) |
Median | =MEDIAN(A:A) |
$Q_3$ | =QUARTILE(A:A,3) |
Maximum | =MAX(A:A) |
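Technology does not always agree exactly with the by-hand method, because several conventions for locating quartiles are in use. As a hedged illustration, Python's `statistics.quantiles` has an `inclusive` method that, to our understanding, follows the same interpolation rule as the spreadsheet `QUARTILE` function. That equivalence is an assumption worth checking on your own software. The data set is the eight-value example from earlier in this lesson.

```python
from statistics import quantiles

data = [1, 3, 4, 7, 11, 12, 14, 19]

# method="inclusive" treats the smallest and largest values as the 0th and
# 100th percentiles and interpolates linearly between sorted values.
# It is assumed here to mirror the spreadsheet QUARTILE convention.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)
```

By hand, splitting the same data into halves gives $3.5$ and $13$ for the quartiles, while this interpolation rule gives $3.75$ and $12.5$. Small differences like this between technology and a hand calculation are not necessarily errors.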
Otherwise, we can do it by hand by finding the median of the whole set and then the median of the two half sets that the median divides it into.
Determine the five number summary and interquartile range for this data set:
$-2,10,-1,6,9,6,-6,1,7$.
If we are doing this by hand, we should first arrange the values in ascending order. If we are using technology, we can enter the data as is, with one entry per row.
In order: $-6,-2,-1,1,6,6,7,9,10$
Broken into two lists with the median:
Lower half | Median | Upper half |
---|---|---|
$-6,-2,-1,1$ | $6$ | $6,7,9,10$ |
The five number summary:
Minimum | Lower quartile | Median | Upper quartile | Maximum |
---|---|---|---|---|
$-6$ | $-1.5$ | $6$ | $8$ | $10$ |
Note that, in this example, neither quartile is a value from the data set because the positions of the quartiles fall between values.
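As a worked check, each quartile is the median of its half. Each half has four values, so the quartile is the mean of the two middle values of that half:

$Q_1=\frac{-2+(-1)}{2}=-1.5$

$Q_3=\frac{7+9}{2}=8$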
The interquartile range is the difference between the upper quartile and the lower quartile:
Interquartile range | $=$ | $8-(-1.5)$ |
 | $=$ | $9.5$ |
Use class centres to determine the five number summary and interquartile range for the data represented by the histogram:
The histogram data can be represented by the frequency table below:
Class | Class centre | Frequency |
---|---|---|
$30-<40$ | $35$ | $5$ |
$40-<50$ | $45$ | $5$ |
$50-<60$ | $55$ | $7$ |
$60-<70$ | $65$ | $1$ |
$70-<80$ | $75$ | $3$ |
Notice that there are $21$ items in this list. This means that the $11$th item is the median, $Q_1$ is the value halfway between the $5$th and $6$th items, and $Q_3$ is the value halfway between the $16$th and $17$th items.
So, the five number summary is set out below:
Minimum | Lower quartile | Median | Upper quartile | Maximum |
---|---|---|---|---|
$35$ | $40$ | $55$ | $55$ | $75$ |
In this case, the median and the upper quartile have the same value.
The interquartile range is the difference between the upper quartile and the lower quartile:
Interquartile range | $=$ | $55-40$ |
 | $=$ | $15$ |
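The same result can be checked by expanding the frequency table into a full list of class centres. This is a minimal Python sketch using the median-of-halves method from this lesson. The variable names are our own, and the class centres stand in for the actual data values, which the histogram does not show.

```python
from statistics import median

# Class centres and frequencies read from the frequency table above.
class_centres = [35, 45, 55, 65, 75]
frequencies = [5, 5, 7, 1, 3]

# Repeat each class centre as many times as its frequency.
data = []
for centre, frequency in zip(class_centres, frequencies):
    data.extend([centre] * frequency)

n = len(data)                   # 21 values, already in ascending order
half = n // 2
q1 = median(data[:half])        # median of the 10 values below the median
q2 = median(data)               # the 11th value
q3 = median(data[n - half:])    # median of the 10 values above the median
iqr = q3 - q1

print(data[0], q1, q2, q3, data[-1], iqr)
```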
The table shows the number of points scored by a basketball team in each game of their previous season.
$59$ | $67$ | $73$ | $82$ | $91$ | $58$ | $79$ | $88$ |
$69$ | $84$ | $55$ | $80$ | $98$ | $64$ | $82$ |
Sort the data in ascending order.
State the maximum value of the set.
State the minimum value of the set.
Find the median value.
Find the lower quartile.
Find the upper quartile.
Answer the following questions using the given frequency table.
Score | Frequency |
---|---|
$15$ | $13$ |
$16$ | $9$ |
$17$ | $23$ |
$18$ | $19$ |
$19$ | $8$ |
$20$ | $13$ |
Complete the five number summary using a CAS calculator.
Minimum: $\editable{}$
Lower quartile: $\editable{}$
Median: $\editable{}$
Upper quartile: $\editable{}$
Maximum: $\editable{}$
Calculate the interquartile range.
To gain a place in the main race of a car rally, teams must compete in a qualifying round. The median time in the qualifying round determines the cut off time to make it through to the main race. Below are some results from the qualifying round.
$75\%$ of teams finished in $159$ minutes or less.
$25\%$ of teams finished in $132$ minutes or less.
$25\%$ of teams finished with a time between $132$ and $142$ minutes.
Determine the cut off time required in the first round to make it through to the main race.
Determine the interquartile range in the qualifying round.
In the qualifying round, the ground was wet, while in the main race, the ground was dry. To make the times more comparable, the finishing time of each team from the qualifying round is reduced by $5$ minutes. What would be the new median time from the qualifying round?
We start with a number line that covers the full range of values in our data set.
We then plot the values from the five number summary above the number line, and connect them in a certain way to create a box plot. Here is an example:
The two vertical edges of the box show the lower and upper quartiles of the data set. The left-hand side of the box is $Q_1$ and the right-hand side of the box is $Q_3$. The vertical line inside the box shows the median.
Then there are two lines that extend from the box outwards. The endpoint of the left line is at the minimum value, while the endpoint of the right line is at the maximum value.
The box plot must be drawn parallel to a number line so that the values for the five number summary can be easily read from the graph.
The example above represents the five number summary set out below:
Minimum | Lower quartile | Median | Upper quartile | Maximum |
---|---|---|---|---|
$18$ | $51$ | $68$ | $87$ | $100$ |
The interquartile range (IQR) is the difference between the upper quartile and the lower quartile.
For this example, the IQR is $87-51=36$.
Since the marks of the box plot represent quartiles, each region represents $25\%$ of the values in the data set. Hence, in this example, we can make statements such as:
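The four regions, and the kind of statement each one supports, can be listed with a short Python sketch. The variable names are our own and the wording of each printed statement is only illustrative.

```python
# Five number summary read from the box plot example above.
summary = [18, 51, 68, 87, 100]
labels = ["lowest", "second", "third", "highest"]

# Pair each value with the next one to get the four regions of the box plot.
regions = list(zip(summary, summary[1:]))

for label, (low, high) in zip(labels, regions):
    print(f"About 25% of the values lie between {low} and {high} (the {label} quarter).")

# The box itself spans the middle two regions, from Q1 to Q3.
iqr = summary[3] - summary[1]
print(f"The middle 50% of the values span {iqr} units.")
```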
The box plot below shows the age at which a group of people got their driving licences.
What is the oldest age at which someone got their licence?
What is the youngest age at which someone got their licence?
What percentage of people were aged from $18$ to $22$?
$10\%$
$25\%$
$50\%$
The middle $50\%$ of responders were within how many years of one another?
$9$
$6$
$7$
$8$
In which quartile are the ages least spread out?
$4$th
$1$st
$3$rd
$2$nd
The bottom $50\%$ of responders were within how many years of one another?
$5$
$4$
$6$
Sometimes a data set contains unusually high or low values. These unusual values are called outliers and may arise from data collection errors or from natural variation in the data.
We often want to identify the outlier values, and see the characteristics of the data without the effect of the outliers. In this case, we can construct a modified box plot to show the outlier values separately.
In later grades we will look at how to calculate the upper and lower bounds for outliers.
Answer the following questions using the given grouped frequency table.
Class | Class centre | Frequency |
---|---|---|
$40\le x<45$ | $42.5$ | $3$ |
$45\le x<50$ | $47.5$ | $4$ |
$50\le x<55$ | $52.5$ | $7$ |
$55\le x<60$ | $57.5$ | $3$ |
$60\le x<65$ | $62.5$ | $3$ |
$65\le x<70$ | $67.5$ | $9$ |
$70\le x<75$ | $72.5$ | $4$ |
$75\le x<80$ | $77.5$ | $5$ |
Complete the five number summary using a CAS calculator.
Minimum: $\editable{}$
Lower quartile: $\editable{}$
Median: $\editable{}$
Upper quartile: $\editable{}$
Maximum: $\editable{}$
Calculate the interquartile range.
The salaries earned by employees at a software company are given in the histogram below.
Use your CAS calculator to construct a box plot, using the class centres.
Calculate the interquartile range.
Using the box plot, approximately what percentage of salaries lie in the range $\$90000$ to $\$100000$?
Complete the following statement.
The highest $25\%$ of salaries lie between $\$$ $\editable{}$ and $\$$ $\editable{}$ inclusive.
Parallel box plots are used to compare two sets of data visually.
We call these parallel box plots because they are presented parallel to each other along the same number line. Because they share the same scale, a visual comparison is fairly straightforward.
It is important to clearly label each box plot. Here we have plotted two sets of data, comparing the time it took two different groups of people to complete an online task.
We will see in later lessons that this format is very useful for comparing the characteristics of two (or more) data sets.
The heights (in metres) of the boys and girls in a class of $30$ students were recorded. The results are given in the table below.
Boys: | $1.65$ | $1.66$ | $1.67$ | $1.68$ | $1.63$ | $1.62$ | $1.61$ | $1.60$ | $1.75$ | $1.76$ | $1.77$ | $1.78$ | $1.73$ | $1.72$ | $1.71$ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Girls: | $1.55$ | $1.56$ | $1.57$ | $1.58$ | $1.53$ | $1.52$ | $1.51$ | $1.50$ | $1.69$ | $1.70$ | $1.71$ | $1.72$ | $1.67$ | $1.66$ | $1.65$ |
Complete the table for the given data of the heights of boys in the class.
Minimum | $\editable{}$ |
---|---|
First quartile | $\editable{}$ |
Median | $\editable{}$ |
Third quartile | $\editable{}$ |
Maximum | $\editable{}$ |
Complete the table for the given data of the heights of girls in the class.
Minimum | $\editable{}$ |
---|---|
First quartile | $\editable{}$ |
Median | $\editable{}$ |
Third quartile | $\editable{}$ |
Maximum | $\editable{}$ |
Draw a parallel box plot for this data.