6. Descriptive Statistics

Lesson

While measures of center summarize the middle of the data, measures of spread describe the variability within the data set. We want to quantify if all of the numbers are close together, far apart, or somewhere in between.

You should already be familiar with the range and mean absolute deviation (MAD). We will now look at interquartile range (IQR) and standard deviation as measures of spread, as well as how box plots can be used to display the spread.

In previous grades, we looked at quartiles and found that when we quarter a data set, we can determine the five number summary. These are the minimum value, the lower quartile $\left(Q_1\right)$, the median, the upper quartile $\left(Q_3\right)$ and the maximum value. So here is the five number summary:

**Minimum value:** The minimum value is the least score in a data set.

**Lower Quartile $\left(Q_1\right)$:** The lower quartile is also called the first quartile. It is the middle score between the least score and the median and it represents the $25$th percentile.

The lower quartile is the $\frac{n+1}{4}$th score, where $n$ is the total number of scores.

**Median:** The median is the middle score in a data set.

It is calculated as the $\frac{n+1}{2}$th score, where $n$ is the total number of scores.

**Upper Quartile $\left(Q_3\right)$:** The upper quartile is also called the third quartile. It is the middle score between the greatest score and the median and it represents the $75$th percentile.

The upper quartile is the $\frac{3\left(n+1\right)}{4}$th score, where $n$ is the total number of scores.

**Maximum value:** The maximum value is the greatest score in a data set.

The five numbers from the five number summary break up our set of scores into four parts. Have a look at the diagram here:

The table shows the number of points scored by a basketball team in each game of their previous season.

$59$ | $67$ | $73$ | $82$ | $91$ | $58$ | $79$ | $88$ |

$69$ | $84$ | $55$ | $80$ | $98$ | $64$ | $82$ |

Sort the data in ascending order.

State the maximum value of the set.

State the minimum value of the set.

Find the median value.

Find the lower quartile.

Find the upper quartile.
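The five number summary for the basketball scores above can be checked with a short Python sketch. The function name `five_number_summary` is hypothetical, and the sketch assumes one common classroom convention: each quartile is the median of the lower or upper half of the sorted scores, leaving out the middle score when the number of scores is odd. Statistical software often uses other conventions and may give slightly different quartiles.

```python
from statistics import median

def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum) for a list of at least two scores.

    Assumption: quartiles are the medians of the lower and upper halves,
    leaving out the middle score when the number of scores is odd.
    """
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]          # scores before the median position
    upper_half = ordered[half + n % 2:]  # skip the middle score when n is odd
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

# Points scored by the basketball team, as listed in the table above.
points = [59, 67, 73, 82, 91, 58, 79, 88, 69, 84, 55, 80, 98, 64, 82]
summary = five_number_summary(points)
print(summary)  # (55, 64, 79, 84, 98)
```

With $n=15$ scores, the position rules give the $4$th, $8$th and $12$th scores of the sorted list, which matches the quartiles and median returned here.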

The interquartile range (IQR) is the difference between the upper quartile $\left(Q_3\right)$ and the lower quartile $\left(Q_1\right)$. It is the range of the middle $50\%$ of the data. The IQR can be a better measure of spread than the range because it is not affected by outliers, which lie outside the middle $50\%$. It is also easy to read from a box plot.

Calculating the IQR

Interquartile Range | $=$ | Upper Quartile - Lower Quartile |

IQR | $=$ | $Q_3-Q_1$ |

Answer the following, given this set of scores:

$33,38,50,12,33,48,41$

Sort the scores in ascending order.

Find the number of scores.

Find the median.

Find the first quartile of the set of scores.

Find the third quartile of the set of scores.

Find the interquartile range.
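To see how an extreme value affects the range and the IQR, here is a minimal Python sketch using the seven scores above. The helper names (`quartiles`, `iqr`, `value_range`) and the extra score of $200$ are hypothetical, added only for illustration. As an assumption, the quartiles are taken as the medians of the lower and upper halves of the sorted scores.

```python
from statistics import median

def quartiles(scores):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: the middle score is left out of both halves when the
    number of scores is odd.
    """
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    return median(ordered[:half]), median(ordered[half + n % 2:])

def iqr(scores):
    """Interquartile range: Q3 - Q1."""
    q1, q3 = quartiles(scores)
    return q3 - q1

def value_range(scores):
    """Range: greatest score minus least score."""
    return max(scores) - min(scores)

scores = [33, 38, 50, 12, 33, 48, 41]
with_extreme = scores + [200]  # hypothetical extra score, not from the lesson

print(value_range(scores), iqr(scores))              # 38 15
print(value_range(with_extreme), iqr(with_extreme))  # 188 16.0
```

Adding the extreme score increases the range from $38$ to $188$, while the IQR only moves from $15$ to $16$.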

Box plots are a great way of displaying quantitative (numerical) data as they clearly show all the quartiles in a data set. Since statisticians are interested in what's "normal," they assume that most scores will be somewhere in the middle. As such, the box in box plots indicates the middle half of the scores. They are the best type of display if you are looking to highlight the spread of the data or if there are outliers.

- We start with a number line that covers the values in our data set using an appropriate scale
- Create the box using the lower quartile, median, and upper quartile.
- Draw the whiskers by extending horizontal lines from the box to the minimum and maximum

The diagram below shows a nice summary of all this information:

Each of the four sections of a box plot represents $25\%$ of the data set.

In other words, the least score to the lower quartile represents $25\%$ of the data, the lower quartile to the median represents another $25\%$, the median to the upper quartile is another $25\%$ and the upper quartile to the greatest score represents another $25\%$.

In measures of center, we looked at the idea that outliers are extreme values which do not fit within a data set. There is actually a mathematical formula which we can use to calculate if extreme values are outliers or not.

Once we have our five number summary, we can use this information to determine whether a data point can be considered an outlier.

To do this, we calculate the upper and lower bounds for outliers. Any data that is above the upper bound or below the lower bound will be considered an outlier.

Calculating the bounds

Lower outlier(s) $<Q_1-1.5\times IQR$

Upper outlier(s) $>Q_3+1.5\times IQR$
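The two bounds can be sketched in Python as follows. The function names `outlier_bounds` and `find_outliers`, the data set and its quartiles are all hypothetical and chosen only to illustrate the rule. The quartiles are passed in directly, so the sketch does not depend on any particular way of calculating them.

```python
def outlier_bounds(q1, q3):
    """Return (lower bound, upper bound) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(scores, q1, q3):
    """Return the scores that lie below the lower bound or above the upper bound."""
    lower, upper = outlier_bounds(q1, q3)
    return [x for x in scores if x < lower or x > upper]

# Hypothetical data set with lower quartile 12 and upper quartile 18.
scores = [1, 12, 14, 15, 17, 18, 40]
q1, q3 = 12, 18

print(outlier_bounds(q1, q3))         # (3.0, 27.0)
print(find_outliers(scores, q1, q3))  # [1, 40]
```

Here the IQR is $6$, so the bounds are $12-9=3$ and $18+9=27$, and the scores $1$ and $40$ fall outside them.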

If a value is an outlier, we can represent it visually in a box plot. We plot each outlier with a dot or an x and only extend the whisker to the nearest value that is not an outlier.

For the box plot shown below, find each of the following:

[Box plot on a number line from $0$ to $20$, axis labeled "score"]

- Least score
- Greatest score
- Range
- Median
- Interquartile range

You have been asked to represent this data in a box plot. Answer the following questions:

$20,36,52,56,24,16,40,4,28$

Complete the table for the given data:

Statistic | Value |
---|---|
Minimum | |
Lower Quartile | |
Median | |
Upper Quartile | |
Maximum | |

Construct a box plot for the data.

[Number line from $0$ to $60$, axis labeled "Data"]

Consider the following set of data:

$9,5,3,2,6,1$

Complete the five-number summary for this data set.

Statistic | Value |
---|---|
Minimum | |
Lower quartile | |
Median | |
Upper quartile | |
Maximum | |

Calculate the interquartile range.

Calculate the value of the lower fence.

Calculate the value of the upper fence.

Would the value $-3$ be considered an outlier?

A. No

B. Yes

Recall that the mean absolute deviation (MAD) is the average of the absolute differences between each value and the mean. We take each value, subtract the mean from it, take the absolute value and find the average of those. Using mathematical notation, where $n$ is the number of values in the data set:

$MAD$ | $=$ | $\frac{\left|x_1-\overline{x}\right|+\left|x_2-\overline{x}\right|+...+\left|x_n-\overline{x}\right|}{n}$ |

$MAD$ | $=$ | $\frac{1}{n}\sum_{i=1}^n\left|x_i-\overline{x}\right|$ |
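The MAD formula translates almost directly into code. Below is a minimal Python sketch. The function name `mean_absolute_deviation` and the four values are hypothetical, picked so that the arithmetic is easy to follow by hand.

```python
def mean_absolute_deviation(values):
    """Average of the absolute differences between each value and the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

# Hypothetical values: the mean is 5, the absolute differences are 3, 1, 1, 3.
print(mean_absolute_deviation([2, 4, 6, 8]))  # 2.0
```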

The standard deviation is similar, but instead of taking the absolute value of the differences, we take the square of the differences. Both taking the absolute value and squaring the difference ensure that no difference is counted as negative, so differences above and below the mean cannot cancel each other out. To compensate for squaring, we take the square root of it all at the end.

When we are working with a sample data set, not a population, we divide by $n-1$, not $n$. See if you can find out why by doing some research.

Using mathematical notation, where $n$ is the number of values in the data set:

$s_x$ | $=$ | $\sqrt{\frac{\left(x_1-\overline{x}\right)^2+\left(x_2-\overline{x}\right)^2+...+\left(x_n-\overline{x}\right)^2}{n-1}}$ |

$s_x$ | $=$ | $\sqrt{\frac{1}{n-1}\sum_{i=1}^n\left(x_i-\overline{x}\right)^2}$ |

Steps to calculate the standard deviation

- Calculate the sample mean. $\overline{x}=\frac{1}{n}\sum_{i=1}^n x_i$
- Find the difference from the mean for each score. $x_i-\overline{x}$
- Square each of the differences. $\left(x_i-\overline{x}\right)^2$
- Sum the squared differences. $\sum_{i=1}^n\left(x_i-\overline{x}\right)^2$
- Divide the sum by one less than the number of scores. $\frac{1}{n-1}\sum_{i=1}^n\left(x_i-\overline{x}\right)^2$
- Take the square root. $s_x=\sqrt{\frac{1}{n-1}\sum_{i=1}^n\left(x_i-\overline{x}\right)^2}$

Find the following based on this set of scores:

$19,18,14,19,10$

Find the mean.

Complete the following table.

Score ($x$) | $(x-\text{mean})$ | $(x-\text{mean})^2$ |
---|---|---|
$19$ | | |
$18$ | | |
$14$ | | |
$19$ | | |
$10$ | | |

Thus, find the sample standard deviation, correct to 2 decimal places.

Find the range of the set of scores.
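The six steps for calculating the standard deviation, listed earlier, can be traced in a short Python sketch and applied to the five scores above to check the answers. The function name `sample_standard_deviation` is hypothetical. The sketch divides by $n-1$, so it assumes the scores are a sample rather than a whole population.

```python
import math

def sample_standard_deviation(scores):
    """Sample standard deviation, following the six steps one at a time."""
    n = len(scores)
    mean = sum(scores) / n                        # 1. sample mean
    differences = [x - mean for x in scores]      # 2. difference from the mean
    squared = [d ** 2 for d in differences]       # 3. square each difference
    total = sum(squared)                          # 4. sum the squared differences
    variance = total / (n - 1)                    # 5. divide by one less than n
    return math.sqrt(variance)                    # 6. take the square root

scores = [19, 18, 14, 19, 10]
print(sum(scores) / len(scores))                    # 16.0
print(round(sample_standard_deviation(scores), 2))  # 3.94
print(max(scores) - min(scores))                    # 9
```

The squared differences are $9,4,4,9,36$, which sum to $62$. Dividing by $4$ gives $15.5$, and the square root of $15.5$ is about $3.94$.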

To determine which statistics to use, we will look at the presence of outliers and the distribution (more on that in 9.04 Data distributions).

Remember!

Data distribution | Evenly spread with no outliers | Skewed to one side or with outliers |
---|---|---|
Measure of center | Mean | Median |
Measure of spread | Standard deviation | Interquartile range |

Use statistics appropriate to the shape of the data distribution to compare center (median, mean) and spread (interquartile range, standard deviation) of two or more different data sets.