Sometimes we want to talk about a data set without having to refer to every single result. In other words, we want to summarise the data set to learn more about it and make comparisons.
In the last lesson we introduced the mode, the most frequently occurring score. In this lesson we will learn about three more ways we can summarise numerical data sets.
The mean of a data set is an average score.
Three friends are planning a trip to Alice Springs. They plan to fly there, and discover that the airline imposes a weight limit on their luggage of $20$ kg per person. On the night before the flight they weigh their luggage and find that their luggage weights form this data set:
$17,18,22$
One of them has packed too much. They decide to share their luggage around so that they all carry the same amount. How much does each person carry now?
Thinking about it using more mathematical language, we are sharing the total luggage equally among three groups. As a mathematical expression, we find:
$\frac{17+18+22}{3}=\frac{57}{3}=19$
Each person carries $19$ kg. This amount is the mean of the data set.
If we replace every number in a numerical data set with the mean, the sum of the numbers in the data set will be the same.
To calculate the mean, use the formula:
$\text{mean}=\frac{\text{sum of scores}}{\text{number of scores}}$
Find the mean of this data set:
$4,7,1,2,3$
Think: There are $5$ scores, so we should add these numbers all together and divide by $5$.
Do: $4+7+1+2+3=17$, and $17\div5=3.4$.
Reflect: Even though all the numbers in the data set are whole numbers, the mean is a decimal. If the data set was produced from a survey question "How many siblings do you have?", we would say the mean number of siblings was $3.4$, even though it isn't possible to have $0.4$ siblings! The mean is a way to summarise data; it is not part of the data set itself.
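The formula can be sketched in a few lines of Python. This is only an illustration: the function name `mean` and the variable `luggage` are chosen here and are not part of the lesson. It reuses the two data sets above and also checks the earlier claim that replacing every score with the mean leaves the total unchanged.

```python
def mean(scores):
    """Return the sum of the scores divided by the number of scores."""
    if not scores:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(scores) / len(scores)


luggage = [17, 18, 22]
print(mean(luggage))          # 19.0
print(mean([4, 7, 1, 2, 3]))  # 3.4

# Replacing every score with the mean leaves the total unchanged.
print(sum(luggage), mean(luggage) * len(luggage))  # 57 57.0
```

Note that the result is a decimal even when every score is a whole number, which matches the reflection above.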
The median of a data set is another kind of average.
Seven people were asked about their weekly income, and their responses form this data set:
$\$300,\$400,\$400,\$430,\$470,\$490,\$2900$
The mean of this data set is $\frac{\$5390}{7}=\$770$, but this amount doesn't represent the data set very well. Six out of seven people earn much less than this.
Instead we can select the median, which is the middle score. We remove the biggest and the smallest scores:
$\$400,\$400,\$430,\$470,\$490$
Then the next biggest and the next smallest:
$\$400,\$430,\$470$
Then the next biggest and the next smallest:
$\$430$
There is only one number left, and this is the median, so for this data set the median is $\$430$. This weekly income is much closer to the other scores in the data set, and summarises the set better.
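To see how strongly the single large income pulls the mean, here is a small sketch using Python's standard `statistics` module. The variable names are illustrative, and dropping the largest income is an extra comparison added here, not a step from the lesson.

```python
import statistics

incomes = [300, 400, 400, 430, 470, 490, 2900]

print(statistics.mean(incomes))    # 770
print(statistics.median(incomes))  # 430

# Drop the single very large income and compare again.
without_largest = incomes[:-1]
print(statistics.mean(without_largest))    # 415
print(statistics.median(without_largest))  # 415.0
```

Removing one score moves the mean from $770$ to $415$, while the median only moves from $430$ to $415$.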
It helped that this set was already in order, and that there were an odd number of scores. What happens when this isn't the case?
Six people were asked to count the number of advertisements they saw while browsing the internet for an hour, and their responses form this data set:
$96,39,0,40,33,27$
To find the median let's do the same thing we did before: we remove the biggest and the smallest scores:
$39,40,33,27$
And the next biggest and the next smallest:
$39,33$
Now that we are down to two scores, we find the number directly in between them. We can add the numbers together and divide by $2$, just like finding a mean:
$\frac{39+33}{2}=\frac{72}{2}=36$
The median number of advertisements that the six people saw was $36$. This means that $50\%$ of people saw more than $36$, and $50\%$ saw less than $36$.
Finding the median of an ordered data set is much easier - if the set is scrambled up, you may want to rewrite it in order first.
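The "remove the biggest and the smallest" procedure can be written out directly. The sketch below is one possible way to do it; the function name `median_by_trimming` is made up for this example. It works on unordered data because it searches for the largest and smallest score at each step.

```python
def median_by_trimming(scores):
    """Find the median by repeatedly removing the largest and smallest score."""
    if not scores:
        raise ValueError("the median of an empty data set is undefined")
    remaining = list(scores)  # copy, so the original list is left untouched
    while len(remaining) > 2:
        remaining.remove(max(remaining))
        remaining.remove(min(remaining))
    # One score left: that score is the median.
    # Two scores left: the median is the value halfway between them.
    return sum(remaining) / len(remaining)


print(median_by_trimming([300, 400, 400, 430, 470, 490, 2900]))  # 430.0
print(median_by_trimming([96, 39, 0, 40, 33, 27]))               # 36.0
```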
The median of a numerical data set is the "middle" score, and its definition changes depending on the number of scores in the data set.
If there are an odd number of scores, the median will be the middle score.
If there are an even number of scores, the median will be the number in between the middle two scores, and half the scores will be greater than the median, and half will be less than the median.
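The two cases in this definition translate into a short function that orders the scores first and then picks out the middle. This is a sketch with an illustrative function name; Python's built-in `statistics.median` does the same job.

```python
def median(scores):
    """Return the middle score, or the value halfway between the middle two."""
    if not scores:
        raise ValueError("the median of an empty data set is undefined")
    ordered = sorted(scores)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of scores: a single middle score exists.
        return ordered[middle]
    # Even number of scores: average the two middle scores.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([300, 400, 400, 430, 470, 490, 2900]))  # 430
print(median([96, 39, 0, 40, 33, 27]))               # 36.0
```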
The range is the difference between the highest and the lowest score in a data set. Unlike the mean and the median, the range doesn't measure the centre of the data; instead it measures how spread out the data is.
Two bus drivers, Kenji and Björn, track how many passengers board their buses each day for a week. Their results are displayed in this table:
| | Mon | Tue | Wed | Thu | Fri |
|---|---|---|---|---|---|
| Kenji | $10$ | $13$ | $14$ | $16$ | $11$ |
| Björn | $2$ | $27$ | $13$ | $5$ | $17$ |
Both data sets have the same median and the same mean, but the sets are quite different. To calculate the range, we start by finding the highest and lowest number of passengers for each driver:
| | Highest | Lowest |
|---|---|---|
| Kenji | $16$ | $10$ |
| Björn | $27$ | $2$ |
Now we subtract the lowest from the highest to find the difference, which is the range:
| | Range |
|---|---|
| Kenji | $16-10=6$ |
| Björn | $27-2=25$ |
Notice how Kenji's range is quite small, at least compared to Björn's. We might say that Kenji's route is more predictable, and that Björn's route is much more variable.
We can see that the range does not say anything about the size of the scores, just their spread.
The range of a numerical data set is the difference between the highest and the lowest score.
$\text{Range}=\text{Highest score}-\text{Lowest score}$
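As a quick check of the bus driver example, the sketch below computes the mean, median and range for each driver. The function name `data_range` and the dictionary `passengers` are illustrative choices; the mean and median come from Python's standard `statistics` module.

```python
import statistics


def data_range(scores):
    """Return the highest score minus the lowest score."""
    return max(scores) - min(scores)


passengers = {
    "Kenji": [10, 13, 14, 16, 11],
    "Björn": [2, 27, 13, 5, 17],
}

for driver, counts in passengers.items():
    print(driver, statistics.mean(counts), statistics.median(counts), data_range(counts))
# Kenji 12.8 13 6
# Björn 12.8 13 25
```

The two drivers share a mean of $12.8$ and a median of $13$, so only the range separates them.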
Find the mean of the following scores:
$6,14,10,13,5,9,14,15$
Give your answer as a decimal.
Find the median of the following scores:
$3,18,10,19,12,5,6,20,7$
Find the range of the following scores:
$10,16,6,18,17,11,9,15,14$
We can find the mode, mean, median and range from a frequency table. These will be the same as the mode, mean, median and range from a list of data but we can use the frequency table to make it quicker.
Find the mode, mean, median and range of the following data.
| Score ($x$) | Frequency ($f$) |
|---|---|
| $1$ | $6$ |
| $2$ | $9$ |
| $3$ | $1$ |
| $4$ | $6$ |
| $5$ | $8$ |
| $6$ | $6$ |
| $7$ | $6$ |
| $8$ | $2$ |
| $9$ | $8$ |
The mode is the score with the highest frequency. Looking at the frequency table, the score $2$ has a frequency of $9$ and all of the other scores have a lower frequency. So the mode is $2$.
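Reading the mode off a frequency table amounts to finding the largest frequency. In the sketch below the table is stored as a dictionary from score to frequency, which is an assumption made for this example; the code returns a list so that a tie between scores would also be reported.

```python
# Frequency table from the example: score -> frequency.
frequency = {1: 6, 2: 9, 3: 1, 4: 6, 5: 8, 6: 6, 7: 6, 8: 2, 9: 8}

highest_frequency = max(frequency.values())
modes = [score for score, f in frequency.items() if f == highest_frequency]

print(highest_frequency)  # 9
print(modes)              # [2]
```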
To find the mean we add together all of the scores. Since most scores occur more than once, we can save time by multiplying the scores by the frequencies. Notice that we've assigned the score the pronumeral $x$ and the frequency the pronumeral $f$. We want to find the product $xf$ for each score.

| Score ($x$) | Frequency ($f$) | $xf$ |
|---|---|---|
| $1$ | $6$ | $6$ |
| $2$ | $9$ | $18$ |
| $3$ | $1$ | $3$ |
| $4$ | $6$ | $24$ |
| $5$ | $8$ | $40$ |
| $6$ | $6$ | $36$ |
| $7$ | $6$ | $42$ |
| $8$ | $2$ | $16$ |
| $9$ | $8$ | $72$ |
Now if we add up the $xf$ column, we will get the sum of all of the scores, and if we add up the frequency column we will get the total number of scores. Dividing the first sum by the second will give us the mean.
$\text{mean}=\frac{\text{Sum of all scores}}{\text{Total number of scores}}=\frac{6+18+3+24+40+36+42+16+72}{6+9+1+6+8+6+6+2+8}$ (using the definition of the mean)
$=\frac{257}{52}$ (evaluating the sums)
$\approx4.94$ (evaluating the quotient, rounded to two decimal places)
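The two column sums are easy to check with a short script. The sketch below stores the frequency table as a dictionary from score to frequency (an assumption made for this example) and recomputes both sums before dividing.

```python
# Frequency table from the example: score -> frequency.
frequency = {1: 6, 2: 9, 3: 1, 4: 6, 5: 8, 6: 6, 7: 6, 8: 2, 9: 8}

sum_of_scores = sum(x * f for x, f in frequency.items())  # total of the xf column
number_of_scores = sum(frequency.values())                # total of the f column
mean = sum_of_scores / number_of_scores

print(sum_of_scores, number_of_scores, round(mean, 2))  # 257 52 4.94
```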
To find the median, we can find the cumulative frequency for each score. The cumulative frequency is the sum of the frequencies of the score and each of the scores below it. The cumulative frequency of the first row will be the frequency of that row. For each subsequent row, add the frequency to the cumulative frequency of the row before it.

| Score ($x$) | Frequency ($f$) | Cumulative frequency |
|---|---|---|
| $1$ | $6$ | $6$ |
| $2$ | $9$ | $15$ |
| $3$ | $1$ | $16$ |
| $4$ | $6$ | $22$ |
| $5$ | $8$ | $30$ |
| $6$ | $6$ | $36$ |
| $7$ | $6$ | $42$ |
| $8$ | $2$ | $44$ |
| $9$ | $8$ | $52$ |

The final row has a cumulative frequency of $52$, so there are $52$ scores in total. This means that the median will be the mean of the $26$th and $27$th scores in order.
Looking at the cumulative frequency table, there are $22$ scores less than or equal to $4$ and $30$ scores less than or equal to $5$. This means that the $26$th and $27$th scores are both $5$, so the median is $5$.
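The cumulative frequency method can be sketched as follows. The helper `score_at` is a hypothetical name introduced for this example: it walks down the cumulative frequency column until it reaches the row that contains a given position in the ordered data.

```python
from itertools import accumulate

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
frequencies = [6, 9, 1, 6, 8, 6, 6, 2, 8]

# Running total of the frequency column.
cumulative = list(accumulate(frequencies))
print(cumulative)  # [6, 15, 16, 22, 30, 36, 42, 44, 52]


def score_at(position):
    """Return the score at a 1-based position in the ordered data."""
    for score, running_total in zip(scores, cumulative):
        if position <= running_total:
            return score
    raise ValueError("position is beyond the end of the data")


total = cumulative[-1]
if total % 2 == 1:
    # Odd number of scores: a single middle position.
    median = score_at((total + 1) // 2)
else:
    # Even number of scores: average the two middle positions.
    median = (score_at(total // 2) + score_at(total // 2 + 1)) / 2

print(score_at(26), score_at(27), median)  # 5 5 5.0
```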
Finally, we can find the range just by looking at the score column. The highest score is $9$ and the lowest is $1$, so the range will be $9-1=8$.
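One way to confirm that a frequency table gives the same summaries as a list of data is to expand the table back into the full list and summarise that list directly. The sketch below does this with Python's standard `statistics` module; storing the table as a dictionary is an assumption made for this example.

```python
import statistics

# Frequency table from the example: score -> frequency.
frequency = {1: 6, 2: 9, 3: 1, 4: 6, 5: 8, 6: 6, 7: 6, 8: 2, 9: 8}

# Write each score out as many times as its frequency says.
data = [score for score, f in frequency.items() for _ in range(f)]

print(len(data))                        # 52
print(statistics.mode(data))            # 2
print(round(statistics.mean(data), 2))  # 4.94
print(statistics.median(data))          # 5.0
print(max(data) - min(data))            # 8
```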