Measures of central tendency, or measures of location, refer to statistical quantities that tell us where the middle of the scores is (the average). There are 3 of these measures: mean, median and mode. They can all be referred to as "averages" of the data set.
The mean is what we typically consider to be the average of all the scores.
If the data is given as single scores, you calculate the mean by adding up all the scores, then dividing the total by the number of scores.
If the data is given in a frequency table, multiply each score by its frequency, add up these products, then divide by the total number of data points (the sum of the frequencies).
If the data is grouped, then first find the midpoint of each class interval, sum the midpoints multiplied by the frequency and then divide by the total of the frequency column.
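As a rough sketch, the first two cases (single scores and a frequency table) might look like this in Python. The data values and the names `scores` and `freq_table` are made up for illustration and do not come from the text.

```python
# Illustrative data only; these values are not from the text.
scores = [4, 7, 7, 9, 13]

# Single scores: add up all the scores, then divide by the number of scores.
mean_single = sum(scores) / len(scores)

# Frequency table stored as {score: frequency} (hypothetical values).
freq_table = {1: 2, 2: 5, 3: 3}

# Multiply each score by its frequency, sum the products,
# then divide by the total number of data points.
total = sum(score * freq for score, freq in freq_table.items())
n = sum(freq_table.values())
mean_freq = total / n

print(mean_single)  # 8.0
print(mean_freq)    # 2.1
```

For grouped data the same calculation applies, with each class midpoint standing in for the score.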
The median is the middle score in a data set when the scores are arranged in numerical order.
There are two ways you can find the median:
Write the numbers in the data set in ascending order, then find the middle score by crossing out a number at each end until you are left with one in the middle.
Calculate what score would be in the middle using the formula: \text{middle term}= \dfrac{n+1}{2}, then count up in ascending order until you reach the score that is that term.
Note: If the data is grouped using class intervals you must add up the frequency column until you come to the interval where the middle score must lie. The interval will be called the median class.
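The position formula can be sketched in Python as follows. The function name `median_by_position` and the sample lists are illustrative. For an even number of scores the position ends in .5, and this sketch assumes the usual convention of averaging the two scores on either side, which the text above does not spell out.

```python
def median_by_position(scores):
    """Find the median using the (n + 1) / 2 position formula.

    Assumes a non-empty list of numbers.
    """
    ordered = sorted(scores)        # ascending order
    n = len(ordered)
    position = (n + 1) / 2          # 1-based position of the middle term

    if position.is_integer():
        # Odd n: the middle term is a single score.
        return ordered[int(position) - 1]

    # Even n: the position falls halfway between two scores,
    # so take the average of those two neighbours (assumed convention).
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2


print(median_by_position([3, 1, 2]))     # 2
print(median_by_position([4, 1, 3, 2]))  # 2.5
```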
The mode is the most frequently occurring score.
To find the mode, count which score you see most frequently in your data set. If the data is in a frequency table, the score with the highest frequency is the mode. If the data is grouped then the modal class will be the class interval with the highest frequency.
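Counting frequencies by hand can be mirrored with a short Python sketch. The function name `modes` and the sample data are illustrative. Returning every score that ties for the highest frequency is an assumption of this sketch, since the text above only describes a single mode.

```python
from collections import Counter


def modes(scores):
    """Return the score(s) with the highest frequency, in ascending order."""
    counts = Counter(scores)             # tally how often each score occurs
    highest = max(counts.values())       # the largest frequency
    return sorted(score for score, freq in counts.items() if freq == highest)


print(modes([2, 3, 3, 5, 7, 3, 5]))  # [3]
print(modes([1, 1, 2, 2, 3]))        # [1, 2]
```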
Consider the table below.
Score | Frequency |
---|---|
1-4 | 2 |
5-8 | 7 |
9-12 | 15 |
13-16 | 5 |
17-20 | 1 |
Use the midpoint of each class interval to determine an estimate for the mean of the sample distribution. Round your answer to one decimal place.
Which is the modal group?
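One way to check both parts of this example is with a short Python sketch. The class intervals and frequencies are taken from the table above; the variable names are illustrative.

```python
# Class intervals and frequencies from the table: (lower, upper, frequency).
classes = [(1, 4, 2), (5, 8, 7), (9, 12, 15), (13, 16, 5), (17, 20, 1)]

# Total of the frequency column.
n = sum(freq for _, _, freq in classes)

# Midpoint of each class multiplied by its frequency, then summed.
weighted_sum = sum((low + high) / 2 * freq for low, high, freq in classes)

# Estimate of the mean from the grouped data.
mean_estimate = weighted_sum / n

# The modal group is the class interval with the highest frequency.
modal_class = max(classes, key=lambda c: c[2])

print(round(mean_estimate, 1))               # 10.0
print(f"{modal_class[0]}-{modal_class[1]}")  # 9-12
```

The result is only an estimate because each score is treated as if it sat at the midpoint of its class.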
Measures of centre tell us the location of data.
Mean: the sum of the values divided by the number of values.
Median: the middle value when the values are sorted.
Mode: the value that occurs most often.
The range, interquartile range, variance and standard deviation are all measures of spread. They tell us about how spread out the scores are.
The range is the difference between the highest score and the lowest score.
To calculate the range, you need to subtract the lowest score from the highest score.
The interquartile range (IQR) gives a measure of spread of the middle 50\% of the data set.
The interquartile range often gives a better indication of the internal spread than the range does, as it is less affected by individual scores that are unusually high or low, which are the outliers.
Remember to make sure the data set is ordered before finding the quartiles or the median.
To calculate the interquartile range, subtract the first quartile from the third quartile. That is, \text{IQR} = Q_3 - Q_1
Consider the following set of scores: 33,\,38,\,50,\,12,\,33,\,48,\,41
Sort the scores in ascending order.
Find the number of scores.
Find the median.
Find the first quartile of the set of scores.
Find the third quartile of the set of scores.
Find the interquartile range.
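The steps of this example can be sketched in Python. The helper names `median` and `quartiles` are illustrative. The sketch assumes the common school convention of finding each quartile as the median of the lower or upper half of the ordered scores, leaving the overall median out of both halves when the number of scores is odd. Other quartile conventions exist and can give slightly different values.

```python
def median(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(scores):
    """Return (Q1, median, Q3); assumes at least two scores."""
    ordered = sorted(scores)
    half = len(ordered) // 2
    lower_half = ordered[:half]    # scores below the median position
    upper_half = ordered[-half:]   # scores above the median position
    return median(lower_half), median(ordered), median(upper_half)


scores = [33, 38, 50, 12, 33, 48, 41]
q1, med, q3 = quartiles(scores)

print(sorted(scores))  # [12, 33, 33, 38, 41, 48, 50]
print(len(scores))     # 7
print(q1, med, q3)     # 33 38 48
print(q3 - q1)         # 15
```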
The column graph shows the number of pets that each student in a class owns.
Find the first quartile of the set of scores.
Find the third quartile of the set of scores.
Find the interquartile range.
The range is the difference between the highest score and the lowest score.
To calculate the interquartile range, subtract the first quartile from the third quartile: \text{IQR} = Q_3 - Q_1
Standard deviation is a measure of spread that gives a meaningful estimate of the variability in a data set. While the quartiles give a measure of spread about the median, the standard deviation gives a measure of spread with respect to the mean. It can be thought of as a typical distance of the data points from the mean (formally, the square root of the average squared distance from the mean). A small standard deviation indicates that most scores are close to the mean, while a large standard deviation indicates that the scores are more spread out from the mean.
We can calculate the standard deviation for a population or a sample.
The symbols used are:
Population standard deviation: \sigma (lower-case sigma)
Sample standard deviation: s
In statistics mode on a calculator, the following symbols might be used:
Population standard deviation: \sigma_n (lower-case sigma)
Sample standard deviation: \sigma_{n-1}
Standard deviation is a very powerful way of comparing the spread of different data sets, particularly if there are different means and population numbers.
Standard deviation can be calculated using a formula. However, as this process is time-consuming, we will use a calculator to find the standard deviation. Ensure the calculator settings are correct for the data given. This is particularly important when changing between data in a simple list and data in a frequency table. For questions in the exercise set, assume standard deviation refers to the population standard deviation unless otherwise stated.
Find the population standard deviation of the following set of scores by using the statistics mode on the calculator: 8, \, 20, \, 16, \, 9, \, 9, \, 15, \, 5, \, 17, \, 19, \, 6
Round your answer to two decimal places.
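As a cross-check on the calculator, the same value can be computed in Python. This sketch uses the standard population formula (the square root of the mean squared distance from the mean) and compares it with `statistics.pstdev` from the standard library. The variable names are illustrative.

```python
import math
import statistics

scores = [8, 20, 16, 9, 9, 15, 5, 17, 19, 6]

# Population standard deviation from first principles:
# the square root of the average squared distance from the mean.
mean = sum(scores) / len(scores)
variance = sum((x - mean) ** 2 for x in scores) / len(scores)
sigma = math.sqrt(variance)

print(round(sigma, 2))                      # 5.3
print(round(statistics.pstdev(scores), 2))  # 5.3  (population, divides by n)
print(round(statistics.stdev(scores), 2))   # 5.58 (sample, divides by n - 1)
```

The last line shows why the calculator setting matters: the sample standard deviation divides by n - 1 rather than n, so it gives a larger value for the same scores.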
The table shows the number of goals scored by a football team in each game of the year.
\text{Score } (x) | \text{Frequency } (f) |
---|---|
0 | 3 |
1 | 1 |
2 | 5 |
3 | 1 |
4 | 5 |
5 | 5 |
In how many games were 0 goals scored?
Determine the median number of goals scored. Leave your answer to one decimal place if necessary.
Calculate the mean number of goals scored each game. Leave your answer to two decimal places if necessary.
Use your calculator to find the population standard deviation. Leave your answer to two decimal places if necessary.
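The answers to this example can be checked with a short Python sketch. It expands the frequency table into one entry per game and then uses the standard library `statistics` module. The names `goals` and `games` are illustrative.

```python
import statistics

# Frequency table from the example: {goals scored: number of games}.
goals = {0: 3, 1: 1, 2: 5, 3: 1, 4: 5, 5: 5}

# Expand the table into a list with one entry per game.
games = [score for score, freq in goals.items() for _ in range(freq)]

print(goals[0])                            # 3 games with 0 goals
print(statistics.median(games))            # 3.5
print(round(statistics.mean(games), 2))    # 2.95
print(round(statistics.pstdev(games), 2))  # 1.75
```

With 20 games the median lies between the 10th and 11th scores (3 and 4), which is why it is 3.5.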
Standard deviation measures how far the data typically varies from the mean. It takes every data point into account, and so is significantly impacted by outliers.
For each measure of spread:
A larger value indicates a wider spread (more variable) data set.
A smaller value indicates a more tightly packed (less variable) data set.
An outlier is an event that is very different from the norm and results in a score that is far away from the rest of the scores.
A data point is classified as an outlier if it lies more than 1.5 interquartile ranges above the upper quartile or more than 1.5 interquartile ranges below the lower quartile:
Below Q_1-1.5\times \text{IQR} or above Q_3+1.5\times \text{IQR}.
A five number summary consists of the:
Minimum value
Lower quartile (Q_1)
Median
Upper quartile (Q_3)
Maximum value
Using the five number summary we can construct a box and whisker plot.
The two vertical edges of the box show the quartiles of the data range. The left hand side of the box is the lower quartile (Q_1) and the right hand side of the box is the upper quartile (Q_3). The vertical line inside the box shows the median (the middle score) of the data.
Then there are two lines that extend from the box outwards. The endpoint of the left line is at the minimum score, while the endpoint of the right line is at the maximum score.
VO_2 Max is a measure of how efficiently your body uses oxygen during exercise. The more physically fit you are, the higher your VO_2 Max. Here are some people’s results when their VO_2 Max was measured: 46, \, 27, \, 32, \, 46, \, 30, \, 25, \, 41, \, 24, \, 26, \, 29, \, 21, \, 21, \, 26, \, 47, \, 21, \, 30, \, 41, \, 26, \, 28, \, 26, \, 76
Sort the values into ascending order.
Determine the median VO_2 Max.
Determine the upper quartile value. Leave your answer as a decimal if necessary.
Determine the lower quartile value. Leave your answer as a decimal if necessary.
Calculate 1.5 \times IQR, where IQR is the interquartile range. Leave your answer as a decimal if necessary.
An outlier is a score that is more than 1.5 \times \text{IQR} above the upper quartile or more than 1.5 \times \text{IQR} below the lower quartile. State the outlier.
Here is a box and whisker plot for the data.
An average untrained healthy person has a VO_2 Max between 30 and 40. The majority of this group of people are likely to:
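The five number summary and the outlier check for the VO_2 Max data can be sketched in Python. The helper `median` and the variable names are illustrative. As an assumption, each quartile is taken as the median of the lower or upper half of the ordered data, with the overall median left out of both halves because the number of values is odd.

```python
def median(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


vo2 = [46, 27, 32, 46, 30, 25, 41, 24, 26, 29, 21,
       21, 26, 47, 21, 30, 41, 26, 28, 26, 76]

ordered = sorted(vo2)
half = len(ordered) // 2

q1 = median(ordered[:half])    # lower quartile
q2 = median(ordered)           # median
q3 = median(ordered[-half:])   # upper quartile
iqr = q3 - q1

# Scores beyond these limits are classified as outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [x for x in ordered if x < lower_limit or x > upper_limit]

print((ordered[0], q1, q2, q3, ordered[-1]))  # (21, 25.5, 28, 41.0, 76)
print(1.5 * iqr)                              # 23.25
print(outliers)                               # [76]
```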
An outlier is an event that is very different from the norm and results in a score that is far away from the rest of the scores.
A data point is classified as an outlier if it is below Q_1-1.5\times \text{IQR} or above Q_3+1.5\times \text{IQR}.
The five number summary is represented on a box plot:
Measures of central tendency and measures of spread can be very powerful in comparing and contrasting two different data sets.
We also can benefit from examining the shape of the distribution of two sets of data when comparing them.
Data may be described as symmetrical or asymmetrical.
There are many cases where the data tends to be around a central value with no bias left or right. In such a case, roughly 50 \% of scores will be above the mean and 50 \% of scores will be below the mean. In other words, the mean and median roughly coincide.
In the distribution above, the peak of the data represents the mean, the median and the mode (taken as the centre of the modal class): all these measures of central tendency are equal for this symmetrical distribution.
A uniform distribution is a symmetrical distribution where each outcome is equally likely, so the frequency should be the same for each outcome. For example, when rolling a die the outcomes are equally likely. We might get an irregular column graph if only a small number of rolls were performed, but if we continued to roll the die the distribution would approach a uniform distribution like that shown below.
If a data set is asymmetrical instead (i.e. it isn't symmetrical), it may be described as skewed.
A data set that has positive skew (sometimes called a 'right skew') has a longer tail of values to the right of the data set. The mass of the distribution is concentrated on the left of the figure.
A positively skewed graph looks something like this:
A data set that has negative skew (sometimes called a 'left skew') has a longer tail of values to the left of the data set. The mass of the distribution is concentrated on the right of the figure.
A negatively skewed graph looks something like this:
In a set of data, a cluster occurs when a large number of the scores are grouped together within a small range. Clustering may occur at a single location or several locations. For example, annual wages for a factory may cluster around \$40\,000 for unskilled factory workers, \$55\,000 for tradespersons and \$70\,000 for management. The data may also have clear gaps where values are either very uncommon or not possible in the data set. If the data has two clear peaks then the shape is called bimodal.
State whether the scores in each histogram are positively skewed, negatively skewed or symmetrical (approximately).
The beaks of two groups of birds are measured, in mm, to determine whether they might be of the same species.
\text{Group } 1 | 33 | 39 | 31 | 27 | 22 | 37 | 30 | 24 | 24 | 28 |
---|---|---|---|---|---|---|---|---|---|---|
\text{Group } 2 | 29 | 44 | 45 | 34 | 31 | 44 | 44 | 33 | 37 | 34 |
Calculate the range for Group 1.
Calculate the range for Group 2.
Calculate the mean for Group 1. Give your answer as a decimal.
Calculate the mean for Group 2. Give your answer as a decimal.
Choose the most appropriate statement that describes the set of data.
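The range and mean for each group can be checked with a short Python sketch. The measurements are taken from the table above; the function name `summarise` is illustrative.

```python
# Beak lengths in mm, from the table above.
group_1 = [33, 39, 31, 27, 22, 37, 30, 24, 24, 28]
group_2 = [29, 44, 45, 34, 31, 44, 44, 33, 37, 34]


def summarise(beaks):
    """Return the range (a measure of spread) and the mean (a measure of centre)."""
    return {
        "range": max(beaks) - min(beaks),
        "mean": sum(beaks) / len(beaks),
    }


print(summarise(group_1))  # {'range': 17, 'mean': 29.5}
print(summarise(group_2))  # {'range': 16, 'mean': 37.5}
```

The ranges are similar, while the means differ by 8 mm, so it is the measure of centre that separates the two groups here.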
The box plots drawn below show the number of repetitions of a 70 kg bar that two weightlifters can lift. They both record their repetitions over 30 days.
Which weightlifter has the more consistent results?
What statistical evidence supports your answer?
Which statistic is the same for each weightlifter?
Which weightlifter can do the most repetitions of 70 kg?
When comparing data sets:
To determine which data set scored higher, we compare the measures of centre.
To determine which data set is more consistent, we compare the measures of spread.