Box plots are a great way of displaying numerical data, as they clearly show all of the quartiles in a data set. Let's take a look at the features of box plots.
To create a box plot it is easiest to first calculate the five number summary for the data set.
A five number summary consists of the:
Minimum value
Lower quartile (Q_1)
Median
Upper quartile (Q_3)
Maximum value
Once we have these five values we can plot them on a number line and create our box plot.
The diagram below shows a nice summary of all this information:
The two vertical edges of the box show the quartiles of the data set. The left-hand side of the box is the lower quartile (Q_1) and the right-hand side of the box is the upper quartile (Q_3). The vertical line inside the box shows the median (the middle score) of the data.
Then there are two lines that extend from the box outwards. The endpoint of the left line is at the minimum score, while the endpoint of the right line is at the maximum score.
For the following box plot:
Find the lowest score.
Find the highest score.
Find the range.
Find the median.
Find the interquartile range (\text{IQR}).
Using the following box plot:
What percentage of scores lie between:
10.9 and 11.2
10.8 and 10.9
11.1 and 11.3
10.9 and 11.3
10.8 and 11.2
In which quartile (or quartiles) is the data the most spread out?
Consider the following data set: 20, \, 36, \, 52, \, 56, \, 24, \, 16, \, 40, \, 4, \, 28
Complete the table for the given data:
Minimum | \quad \quad |
---|---|
Lower quartile | \quad \quad |
Median | \quad \quad |
Upper quartile | \quad \quad |
Maximum | \quad \quad |
Construct a box plot for the data.
The features of a box-and-whisker plot are shown below:
A list of the minimum, lower quartile, median, upper quartile, and maximum values is often called the five number summary.
One quartile represents 25\% of the data set.
Creating a box-and-whisker plot:
Put the data in ascending order (from smallest to largest).
Find the median (middle value) of the data.
To divide the data into quarters, find the median (middle value) between the minimum value and the median, as well as between the median and the maximum value.
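The steps above can be sketched in Python. This is a minimal illustration rather than a definitive implementation: the function name `five_number_summary` and the sample scores are made up for this example, and it assumes the data has at least two values. When the number of values is odd, the median itself is left out of both halves before the quartiles are found. Other quartile conventions exist and can give slightly different answers.

```python
from statistics import median

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves method."""
    values = sorted(data)                 # step 1: put the data in ascending order
    n = len(values)
    half = n // 2
    lower_half = values[:half]            # scores below the middle position
    # For an odd count, skip the middle score so it is in neither half.
    upper_half = values[half + 1:] if n % 2 else values[half:]
    return (
        values[0],                        # minimum
        median(lower_half),               # lower quartile (Q1)
        median(values),                   # median
        median(upper_half),               # upper quartile (Q3)
        values[-1],                       # maximum
    )

scores = [7, 1, 3, 9, 5, 11, 13]          # illustrative data, not from the lesson
print(five_number_summary(scores))        # (1, 3, 7, 11, 13)
```

Once these five values are known, they can be marked on a number line to draw the box and its two whiskers.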
We can now see that data can be displayed in both histograms and box plots. Both displays make it easy to identify key features of the shape of the data, as well as the range. The box plot also shows the IQR and the median.
We should expect then that the shape of the data would be the same whether it is represented in a box plot or histogram. Remember that the shape of data can be symmetric, positively skewed or negatively skewed.
Symmetric
Positive skew
Negative skew
Let's now match some histograms to their correct box plot representation.
Match each histogram to a box plot.
Consider the following pairs of histograms and box plots.
Which two of these histograms and box plots are correctly paired?
In part (a) we determined that the following histogram and box plot were incorrectly matched:
Which two of the options correctly describe why?
We can compare the shape of the data in histograms and box plots by checking for symmetry and skew.
For symmetric data, both graphs should be symmetrical about the median.
For positively skewed data, the histogram would have most of the data to the left and a shape with the tail pointing right. The box plot would have the box to the left and a long right whisker.
For negatively skewed data, the histogram would have most of the data to the right and a shape with the tail pointing left. The box plot would have the box to the right and a long left whisker.
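The whisker comparison described above can be turned into a rough check. The sketch below is only a heuristic under stated assumptions: the function name `describe_skew` and the `tolerance` parameter are invented for this example, and it looks at whisker lengths alone, so it will not replace inspecting the whole plot.

```python
def describe_skew(summary, tolerance=0):
    """Guess the shape of the data from a five number summary.

    summary is (minimum, Q1, median, Q3, maximum).
    tolerance is how different the whiskers may be and still count as symmetric.
    """
    minimum, q1, _, q3, maximum = summary
    left_whisker = q1 - minimum           # length of the left whisker
    right_whisker = maximum - q3          # length of the right whisker
    if abs(right_whisker - left_whisker) <= tolerance:
        return "symmetric"
    if right_whisker > left_whisker:
        return "positive skew"            # long right whisker, tail points right
    return "negative skew"                # long left whisker, tail points left

# Illustrative five number summaries, not taken from the lesson.
print(describe_skew((1, 2, 3, 5, 12)))    # positive skew
print(describe_skew((1, 8, 10, 11, 12)))  # negative skew
print(describe_skew((1, 3, 5, 7, 9)))     # symmetric
```

Real data rarely has whiskers of exactly equal length, so a small non-zero `tolerance` is usually a more sensible choice than the default.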
An outlier is an event that is very different from the norm and results in a score that is far away from the rest of the scores. For example, suppose there are 10 people in a long jump contest. Nine of those people managed a jump between 4 and 5 metres, while one person only jumped 1 metre. That one person is an outlier, as their jump is so much shorter than everyone else's.
To determine if a data value is an outlier, we use a rule that involves the interquartile range (IQR). This rule gives us the upper and lower fences of a box plot. The fences are the upper and lower boundaries for typical scores, and any score that lies outside the fences is classified as an outlier.
Calculating outliers:
A data point is classified as an outlier if it lies above the upper fence or below the lower fence. These are calculated as follows: \text{Lower fence} = \text{Lower quartile} -1.5 \times \text{Interquartile Range}\\ \text{Upper fence} = \text{Upper quartile} +1.5 \times \text{Interquartile Range}
Using the five number summary and the upper and lower fences, we can construct a box-and-whisker plot and identify any outliers.
The above diagram shows the construction of the box plots and how the upper and lower fences are constructed. Any data points outside of these are outliers.
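The fence rule can also be written as a short Python sketch. The function names `fences` and `find_outliers` and the sample data are assumptions made for this illustration. The quartiles are passed in directly and are assumed to have been found already, for example by the median-of-halves method.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) from the lower and upper quartiles."""
    iqr = q3 - q1                          # interquartile range
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data, q1, q3):
    """Return the scores that lie below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [x for x in data if x < lower or x > upper]

# Illustrative data, not from the lesson: Q1 = 12 and Q3 = 18, so IQR = 6.
data = [1, 12, 14, 15, 16, 18, 30]
print(fences(12, 18))                      # (3.0, 27.0)
print(find_outliers(data, 12, 18))         # [1, 30]
```

A score that sits exactly on a fence is not counted as an outlier in this sketch, since the rule asks for scores above the upper fence or below the lower fence.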
Consider the following set of data: 9,\,5,\,3,\,2,\,6,\,1
Complete the five-number summary for this data set.
Minimum | |
---|---|
Lower quartile | |
Median | |
Upper quartile | |
Maximum | |
Calculate the interquartile range.
Calculate the value of the lower fence.
Calculate the value of the upper fence.
Would the value -3 be considered an outlier?
Outliers can skew or change the shape of our data. If we compare two data sets that are identical apart from one outlier, we will see the following:
The set with an outlier will have a much larger range.
The median of each set will be close together, and might be the same.
The mean of each set will be noticeably different.
The mode of each set will be the same, since an outlier added to a data set will never be the most frequent score.
In particular, adding an outlier to a set of data will change the summary statistics as follows:
Adding a low outlier | Adding a high outlier |
---|---|
The range increases significantly. | The range increases significantly. |
The median could decrease a little. | The median could increase a little. |
The mean decreases significantly. | The mean increases significantly. |
The mode will not change. | The mode will not change. |
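
The effects in the table can be checked numerically. The sketch below uses Python's standard `statistics` module. The helper name `summarise` and the scores are hypothetical and chosen only to illustrate adding a single high outlier.

```python
from statistics import mean, median, mode

def summarise(data):
    """Return the mean, median, mode and range of a list of scores."""
    return {
        "mean": mean(data),
        "median": median(data),
        "mode": mode(data),
        "range": max(data) - min(data),
    }

base = [4, 5, 5, 6, 7]                     # hypothetical scores with no outlier
with_high_outlier = base + [30]            # the same scores plus one high outlier

print(summarise(base))
print(summarise(with_high_outlier))
```

For these scores the range grows from 3 to 26 and the mean rises from 5.4 to 9.5, while the median only moves from 5 to 5.5 and the mode stays at 5.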
Consider the following set of data: 53,\,46,\,25,\,50,\,30,\,30,\,40,\,30,\,47,\,109
Complete the following table of summary statistics.
Mean | \qquad \qquad |
---|---|
Median | \qquad \qquad |
Mode | \qquad \qquad |
Range | \qquad \qquad |
Which data value is an outlier?
Complete the following table of summary statistics after removing the outlier 109:
Mean | \qquad \qquad |
---|---|
Median | \qquad \qquad |
Mode | \qquad \qquad |
Range | \qquad \qquad |
Let A be the original data set and B be the data set without the outlier.
Complete the table using the symbols >,< and = to compare the statistics before and after removing the outlier.
 | With outlier | | Without outlier |
---|---|---|---|
Mean: | A | ⬚ | B |
Median: | A | ⬚ | B |
Mode: | A | ⬚ | B |
Range: | A | ⬚ | B |