
1.02 Describing data

Lesson

Measures of central tendency

Measures of central tendency, or measures of location, refer to statistical quantities that tell us where the middle of the scores is (the average). There are 3 of these measures: mean, median and mode. They can all be referred to as "averages" of the data set.

The mean is what we typically consider to be the average of all the scores.

If the data is given as single scores, you calculate the mean by adding up all the scores, then dividing the total by the number of scores.

If the data is given in a frequency table, multiply each score by its frequency, add these products, then divide by the total number of data points (the sum of the frequencies).

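If the data is grouped, first find the midpoint of each class interval, multiply each midpoint by its frequency, add these products, and then divide by the total of the frequency column. This gives an estimate of the mean, since the individual scores within each class are not known.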
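The three cases above can be summarised as formulas. This is a sketch in notation that the lesson itself does not define: \bar{x} for the mean, x for a score, n for the number of scores, f for a frequency and m for a class midpoint are all assumed symbols.

```latex
% Single scores
\bar{x} = \dfrac{\sum x}{n}

% Frequency table (each score x occurs f times)
\bar{x} = \dfrac{\sum f x}{\sum f}

% Grouped data (m is the midpoint of each class interval), an estimate only
\bar{x} \approx \dfrac{\sum f m}{\sum f}
```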

The median is the middle score in a data set when the scores are arranged in numerical order.

There are two ways you can find the median:

  1. Write the numbers in the data set in ascending order, then find the middle score by crossing out a number at each end until you are left with one in the middle.

  2. Calculate what score would be in the middle using the formula: \text{middle term}= \dfrac{n+1}{2}, then count up in ascending order until you reach the score that is that term.

Note: If the data is grouped using class intervals you must add up the frequency column until you come to the interval where the middle score must lie. The interval will be called the median class.
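The second method can be sketched in Python. This is only an illustration: the function name and the two sample lists are made up for this sketch, and the handling of an even number of scores (averaging the two middle scores) follows the approach used later in Example 5.

```python
def median_by_position(scores):
    """Find the median by locating the (n + 1) / 2-th score in the sorted list."""
    ordered = sorted(scores)
    n = len(ordered)
    position = (n + 1) / 2              # 1-based position of the middle term
    if position.is_integer():           # odd n: there is a single middle score
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]  # even n: the position ends in .5,
    upper = ordered[int(position)]      # so average the two neighbouring scores
    return (lower + upper) / 2


# Illustrative data only (not taken from the lesson).
odd_example = [7, 3, 9, 1, 5]
even_example = [7, 3, 9, 1]

print(median_by_position(odd_example))   # 5
print(median_by_position(even_example))  # 5.0
```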

The mode is the most frequently occurring score.

To find the mode, count which score you see most frequently in your data set. If the data is in a frequency table, the score with the highest frequency is the mode. If the data is grouped then the modal class will be the class interval with the highest frequency.

Examples

Example 1

Consider the table below.

Score    Frequency
1-4      2
5-8      7
9-12     15
13-16    5
17-20    1
a

Use the midpoint of each class interval to determine an estimate for the mean of the sample distribution. Round your answer to one decimal place.

Worked Solution
Create a strategy

Determine the midpoint of each class interval by finding the average of its two middle scores, then find the mean using these midpoints.

Apply the idea

To find the mean we will need to know the number of scores. We can find this by adding the frequencies in the table.

We calculate the midpoint of each class interval by averaging its two middle scores, which gives the same result as adding the end points and dividing by 2.

Score    Frequency    Midpoint
1-4      2            \dfrac{2+3}{2}=2.5
5-8      7            \dfrac{6+7}{2}=6.5
9-12     15           \dfrac{10+11}{2}=10.5
13-16    5            \dfrac{14+15}{2}=14.5
17-20    1            \dfrac{18+19}{2}=18.5
Total:   30

Now we can calculate the mean by multiplying each midpoint by its frequency, adding these products, and dividing by the number of scores.

\text{Mean} = \dfrac{2.5\times 2+6.5\times 7+10.5\times 15+14.5\times 5+18.5\times 1}{30}
\approx 10.0
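As a check on the arithmetic, the same estimate can be sketched in Python. The variable names are illustrative; the class intervals and frequencies are those from the table in this example.

```python
# Class intervals (low, high) and their frequencies from the Example 1 table.
classes = [((1, 4), 2), ((5, 8), 7), ((9, 12), 15), ((13, 16), 5), ((17, 20), 1)]

# Number of scores: the sum of the frequency column.
total_frequency = sum(freq for _, freq in classes)

# Sum of (midpoint x frequency) over all classes.
weighted_sum = sum(((low + high) / 2) * freq for (low, high), freq in classes)

mean_estimate = weighted_sum / total_frequency

print(total_frequency)          # 30
print(round(mean_estimate, 1))  # 10.0
```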
b

Which is the modal group?

A. 1-4
B. 17-20
C. 13-16
D. 5-8
E. 9-12
Worked Solution
Create a strategy

Choose the class interval with the highest frequency.

Apply the idea

There are 15 scores between 9-12. This means that the modal group is 9-12 as it has the highest frequency.

So, the correct answer is Option E.
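Picking the modal class amounts to finding the largest frequency. A minimal Python sketch, using the table from this example with class labels stored as plain strings (a choice made only for this illustration):

```python
# Frequencies from the Example 1 table, keyed by class interval.
frequencies = {"1-4": 2, "5-8": 7, "9-12": 15, "13-16": 5, "17-20": 1}

# The modal class is the key whose frequency is largest.
modal_class = max(frequencies, key=frequencies.get)

print(modal_class)  # 9-12
```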

Idea summary

Measures of centre tell us the location of data.

  • mean - is the sum of values divided by the number of values.

  • median - is the middle value when the values are sorted.

  • mode - is the value that occurs most often.

Measures of spread

The range, interquartile range, variance and standard deviation are all measures of spread. They tell us about how spread out the scores are.

The range is the difference between the highest score and the lowest score.

To calculate the range, you need to subtract the lowest score from the highest score.

The interquartile range (IQR) gives a measure of spread of the middle 50\% of the data set.

The interquartile range often gives a better indication of the internal spread than the range does, as it is less affected by individual scores that are unusually high or low, which are the outliers.

Remember to make sure the data set is ordered before finding the quartiles or the median.

To calculate the interquartile range, subtract the first quartile from the third quartile. That is, \text{IQR} = Q_3 - Q_1

Examples

Example 2

Consider the following set of scores: 33,\,38,\,50,\,12,\,33,\,48,\,41

a

Sort the scores in ascending order.

Worked Solution
Create a strategy

Arrange the scores from smallest to largest.

Apply the idea

12,\,33,\,33,\,38,\,41,\,48,\,50

b

Find the number of scores.

Worked Solution
Create a strategy

Count the scores.

Apply the idea

\text{Number of scores} = 7

c

Find the median.

Worked Solution
Create a strategy

Choose the middle score in the ordered list.

Apply the idea

The ordered scores are: 12,\,33,\,33,\,38,\,41,\,48,\,50

We can see that the middle score is 38, so this is the median.

d

Find the first quartile of the set of scores.

Worked Solution
Create a strategy

Use the first half of the scores excluding the median.

Apply the idea

The first half of the scores are: 12,\,33,\,33

The median of this set is 33.

So, the first quartile of the original set of scores is 33.

e

Find the third quartile of the set of scores.

Worked Solution
Create a strategy

Use the second half of the scores excluding the median.

Apply the idea

The second half of the scores are: 41,\,48,\,50

The median of this set is 48.

So, the third quartile of the original set of scores is 48.

f

Find the interquartile range.

Worked Solution
Create a strategy

We can use the interquartile range formula: \text{IQR} = Q_{3} - Q_{1}

Apply the idea
\text{IQR} = 48-33 (Substitute the quartiles)
= 15 (Evaluate)
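Parts (a) to (f) can be reproduced with a short Python sketch. It assumes the convention used in this example, where the median is left out of both halves when the number of scores is odd; the function name is illustrative, and the sketch expects at least two scores.

```python
from statistics import median


def quartiles(scores):
    """Return (Q1, median, Q3), excluding the median from each half when n is odd."""
    ordered = sorted(scores)
    half = len(ordered) // 2       # size of each half
    lower_half = ordered[:half]    # scores below the median position
    upper_half = ordered[-half:]   # scores above the median position
    return median(lower_half), median(ordered), median(upper_half)


scores = [33, 38, 50, 12, 33, 48, 41]
q1, med, q3 = quartiles(scores)
iqr = q3 - q1

print(q1, med, q3, iqr)  # 33 38 48 15
```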

Example 3

The column graph shows the number of pets that each student in a class owns.

A column graph of the number of pets that each student in a class owns.
a

Find the first quartile of the set of scores.

Worked Solution
Create a strategy

Create a frequency table.

Apply the idea
No. Pets    Frequency
0           4
1           9
2           7
3           3
4           2
5           2
Total:      27

From the graph, we can create this frequency table.

By adding the frequencies we find that there are 27 scores.

The median will be the \dfrac{27+1}{2}=14th score.

So the first quartile will be the middle of the first 13 scores, which will be the \dfrac{13+1}{2}=7th score. We can see from the table that the 5th to 13th scores are all 1s, so the 7th score is a 1.

The first quartile of the set of scores is 1 pet.

b

Find the third quartile of the set of scores.

Worked Solution
Create a strategy

Use the frequency table from part (a).

Apply the idea
No. Pets    Frequency
0           4
1           9
2           7
3           3
4           2
5           2
Total:      27

Since the median is the \dfrac{27+1}{2}=14th score, the third quartile will be the middle of the last 13 scores, which will be the \dfrac{13+1}{2}=7th score from the end.

If we count from the highest score backwards, we can see that the 7th score from the end is a 3.

The third quartile of the set of scores is 3 pets.

c

Find the interquartile range.

Worked Solution
Create a strategy

We can use the interquartile range formula: \text{IQR} = Q_{3} - Q_{1}

Apply the idea
\text{IQR} = 3-1 (Substitute the quartiles)
= 2 (Evaluate)
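One way to check this example is to expand the frequency table into a full ordered list of scores and then take the quartiles as before. The sketch below assumes the same quartile convention as the worked solution (the median is excluded from each half); the variable names are illustrative.

```python
from statistics import median

# Frequency table read from the column graph: number of pets -> frequency.
pets_frequency = {0: 4, 1: 9, 2: 7, 3: 3, 4: 2, 5: 2}

# Expand the table into an ordered list with one entry per student.
scores = [pets for pets, freq in sorted(pets_frequency.items()) for _ in range(freq)]

n = len(scores)              # 27 scores in total
half = n // 2                # 13 scores on each side of the median
q1 = median(scores[:half])   # middle of the first 13 scores
q3 = median(scores[-half:])  # middle of the last 13 scores
iqr = q3 - q1

print(n, q1, q3, iqr)  # 27 1 3 2
```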
Idea summary

The range is the difference between the highest score and the lowest score.

To calculate the interquartile range:

\text{IQR} = Q_3 - Q_1
Q_3 is the upper quartile
Q_1 is the lower quartile

Standard deviation

Standard deviation is a measure of spread, which helps give us a meaningful estimate of the variability in a data set. While the quartiles gave us a measure of spread about the median, the standard deviation gives us a measure of spread with respect to the mean. It can be thought of as a typical distance of the data points from the mean; formally, it is the square root of the average squared distance of each data point from the mean. A small standard deviation indicates that most scores are close to the mean, while a large standard deviation indicates that the scores are more spread out away from the mean value.

We can calculate the standard deviation for a population or a sample.

The symbols used are:

\text{Population standard deviation} = \sigma (lower-case sigma)
\text{Sample standard deviation} = s

In statistics mode on a calculator, the following symbols might be used:

\text{Population standard deviation} = \sigma_n
\text{Sample standard deviation} = \sigma_{n-1}

Standard deviation is a very powerful way of comparing the spread of different data sets, particularly if there are different means and population numbers.

Standard deviation can be calculated using a formula. However, as this process is time-consuming, we will use a calculator to find the standard deviation. Ensure the calculator settings are correct for the data given; this is particularly important when changing between data in a simple list and data in a frequency table. For questions in the exercise set, assume standard deviation refers to the population standard deviation unless otherwise stated.
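For reference, the standard formulas that a calculator applies are sketched below. The lesson does not state them, so the symbols here are assumptions: \mu is the population mean, \bar{x} is the sample mean and n is the number of scores.

```latex
% Population standard deviation (calculator symbol \sigma_n)
\sigma = \sqrt{\dfrac{\sum (x-\mu)^2}{n}}

% Sample standard deviation (calculator symbol \sigma_{n-1})
s = \sqrt{\dfrac{\sum (x-\bar{x})^2}{n-1}}
```

The only difference is the divisor, n for a population and n-1 for a sample, which is why the two calculator buttons give slightly different answers for the same list of scores.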

Examples

Example 4

Find the population standard deviation of the following set of scores by using the statistics mode on the calculator: 8, \, 20, \, 16, \, 9, \, 9, \, 15, \, 5, \, 17, \, 19, \, 6

Round your answer to two decimal places.

Worked Solution
Create a strategy

Enter all the scores into your calculator using the statistics function to find the population standard deviation.

Apply the idea

\text{Standard deviation} =5.30
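If a calculator is not to hand, the same value can be checked with Python's standard `statistics` module, where `pstdev` is the population standard deviation and `stdev` is the sample standard deviation. This is only a cross-check, not the method the lesson asks for.

```python
from statistics import pstdev, stdev

scores = [8, 20, 16, 9, 9, 15, 5, 17, 19, 6]

population_sd = pstdev(scores)  # divides by n (calculator: sigma_n)
sample_sd = stdev(scores)       # divides by n - 1 (calculator: sigma_{n-1})

print(f"{population_sd:.2f}")  # 5.30
print(f"{sample_sd:.2f}")      # 5.58
```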

Example 5

The table shows the number of goals scored by a football team in each game of the year.

Score (x)    Frequency (f)
0            3
1            1
2            5
3            1
4            5
5            5
a

In how many games were 0 goals scored?

Worked Solution
Create a strategy

Find the frequency of score 0.

Apply the idea

Score 0 has a frequency of 3.

There are 3 games with score of 0.

b

Determine the median number of goals scored. Leave your answer to one decimal place if necessary.

Worked Solution
Create a strategy

To determine the median, take the average of the two middle scores if the total number of games is even, or take the middle score if the total number of games is odd.

Apply the idea

If we add the frequencies we get 20 pieces of data. So the median will be the average of the 10th and the 11th scores.

From the table we can find that the 10th score is a 3, and the 11th score is a 4.

\text{median} = \dfrac{3+4}{2} (Find the average)
= 3.5 (Evaluate)
c

Calculate the mean number of goals scored each game. Leave your answer to two decimal places if necessary.

Worked Solution
Create a strategy

Use the formula: \text{mean}=\dfrac{\text{sum of scores}}{\text{number of scores}}

Apply the idea

To find the sum of the scores we multiply each score by its frequency and add the products. To find the number of scores we add the frequencies.

\text{mean} = \dfrac{3\times0 + 1 \times 1 + 5 \times 2 + 1 \times 3 + 5\times 4 + 5\times5}{3+1+5+1+5+5}
= 2.95 (Evaluate)
d

Use your calculator to find the population standard deviation. Leave your answer to two decimal places if necessary.

Worked Solution
Create a strategy

After storing the data on your calculator, press the population standard deviation button, often denoted by \sigma_n.

Apply the idea

\sigma_n=1.75
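The answers to parts (b) to (d) can be cross-checked in Python by expanding the frequency table into one entry per game. The variable names are illustrative, and `pstdev` from the standard `statistics` module is used here in place of the calculator's \sigma_n button.

```python
from statistics import mean, median, pstdev

# Goals scored -> number of games, from the table in this example.
goals_frequency = {0: 3, 1: 1, 2: 5, 3: 1, 4: 5, 5: 5}

# One list entry per game, in ascending order of goals scored.
games = [goals for goals, freq in sorted(goals_frequency.items()) for _ in range(freq)]

print(len(games))               # 20
print(median(games))            # 3.5
print(round(mean(games), 2))    # 2.95
print(f"{pstdev(games):.2f}")   # 1.75
```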

Idea summary

Standard deviation measures how far, on average, each piece of data varies from the mean (formally, it is the square root of the average squared deviation). It takes every data point into account, and so is significantly impacted by outliers.

For each measure of spread:

  • A larger value indicates a wider spread (more variable) data set.

  • A smaller value indicates a more tightly packed (less variable) data set.

Outliers and box plots

An outlier is an event that is very different from the norm and results in a score that is far away from the rest of the scores.

A data point is classified as an outlier if it lies more than 1.5 interquartile ranges above the upper quartile or more than 1.5 interquartile ranges below the lower quartile:

That is, a score is an outlier if it is less than Q_1-1.5\times \text{IQR} or greater than Q_3+1.5\times \text{IQR}.

A five number summary consists of the:

  • Minimum value

  • Lower quartile (Q_1)

  • Median

  • Upper quartile (Q_3)

  • Maximum value

Using the five number summary we can construct a box and whisker plot.

A box and whisker plot with corresponding labels.

The two vertical edges of the box show the quartiles of the data range. The left hand side of the box is the lower quartile (Q_1) and the right hand side of the box is the upper quartile (Q_3). The vertical line inside the box shows the median (the middle score) of the data.

Then there are two lines that extend from the box outwards. The endpoint of the left line is at the minimum score, while the endpoint of the right line is at the maximum score.
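The five values that a box and whisker plot displays can be computed with a short Python sketch. The function name is illustrative, the scores are those of Example 2, and the quartiles follow this lesson's convention of excluding the median from each half when the number of scores is odd.

```python
from statistics import median


def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum) for a list of scores."""
    ordered = sorted(scores)
    half = len(ordered) // 2  # size of the lower and upper halves
    return (
        ordered[0],               # minimum: end of the left whisker
        median(ordered[:half]),   # Q1: left edge of the box
        median(ordered),          # median: line inside the box
        median(ordered[-half:]),  # Q3: right edge of the box
        ordered[-1],              # maximum: end of the right whisker
    )


summary = five_number_summary([33, 38, 50, 12, 33, 48, 41])
print(summary)  # (12, 33, 38, 48, 50)
```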

Examples

Example 6

VO_2 Max is a measure of how efficiently your body uses oxygen during exercise. The more physically fit you are, the higher your VO_2 Max. Here are some people’s results when their VO_2 Max was measured: 46, \, 27, \, 32, \, 46, \, 30, \, 25, \, 41, \, 24, \, 26, \, 29, \, 21, \, 21, \, 26, \, 47, \, 21, \, 30, \, 41, \, 26, \, 28, \, 26, \, 76

a

Sort the values into ascending order.

Worked Solution
Create a strategy

Arrange the values from smallest to largest numbers.

Apply the idea

21, \, 21, \, 21, \, 24, \, 25, \, 26, \, 26, \, 26, \, 26, \, 27, \, 28, \, 29, \, 30, \, 30, \, 32, \, 41, \, 41, \, 46, \, 46, \, 47, \, 76

b

Determine the median VO_2 Max.

Worked Solution
Create a strategy

To determine the median, take the average of the two middle scores if the total number of scores is even, or take the middle score if the total number of scores is odd.

Apply the idea

Since there is a total of 21 scores, the median is the 11th score which is 28.

c

Determine the upper quartile value. Leave your answer as a decimal if necessary.

Worked Solution
Create a strategy

Use the second half of the scores excluding the median.

Apply the idea

The upper half of the data is: 29, \, 30, \, 30, \, 32, \, 41, \, 41, \, 46, \, 46, \, 47, \, 76.

Since there are 10 scores the upper quartile would be the average of the 5th and 6th scores.

Q_3 = \dfrac{41+41}{2} (Average the two middle scores)
= 41 (Evaluate)
d

Determine the lower quartile value. Leave your answer as a decimal if necessary.

Worked Solution
Create a strategy

Use the first half of the scores excluding the median.

Apply the idea

The lower half of the data is: 21, \, 21, \, 21, \, 24, \, 25, \, 26, \, 26, \, 26, \, 26, \, 27.

Since there are 10 scores the lower quartile would be the average of the 5th and 6th scores.

Q_1 = \dfrac{25+26}{2} (Take the average of the middle scores)
= 25.5 (Evaluate)
e

Calculate 1.5 \times IQR, where IQR is the interquartile range. Leave your answer as a decimal if necessary.

Worked Solution
Create a strategy

Use the interquartile range formula: \text{IQR} = Q_{3} - Q_{1}, and the answers from parts (c) and (d).

Apply the idea
1.5 \times \text{IQR} = 1.5 \times (Q_3 - Q_1)
= 1.5 \times (41-25.5) (Substitute the values)
= 23.25 (Evaluate)
f

An outlier is a score that is more than 1.5 \times IQR above or below the Upper Quartile or Lower Quartile respectively. State the outlier.

Worked Solution
Create a strategy

Subtract 1.5 \times IQR from the Lower Quartile and add 1.5\times IQR to the Upper Quartile. Compare the answers to the list of scores.

Apply the idea
Q_1-1.5 \times \text{IQR} = 25.5-23.25 (Subtract the values)
= 2.25 (Evaluate)

There is no score lower than 2.25.

Q_3 + 1.5\times \text{IQR} = 41+23.25 (Add the values)
= 64.25 (Evaluate)

The score which is higher than 64.25 is 76. So it is the outlier.
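Parts (c) to (f) can be reproduced with the Python sketch below. The names `lower_fence` and `upper_fence` are labels chosen for this illustration (the lesson does not name these boundaries), and the quartiles use the same convention as the worked solution.

```python
from statistics import median

vo2_max = [46, 27, 32, 46, 30, 25, 41, 24, 26, 29, 21,
           21, 26, 47, 21, 30, 41, 26, 28, 26, 76]

ordered = sorted(vo2_max)
half = len(ordered) // 2      # 10 scores on each side of the median

q1 = median(ordered[:half])   # lower quartile
q3 = median(ordered[-half:])  # upper quartile
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr  # scores below this are outliers
upper_fence = q3 + 1.5 * iqr  # scores above this are outliers

outliers = [x for x in ordered if x < lower_fence or x > upper_fence]

print(q1, q3, 1.5 * iqr)         # 25.5 41.0 23.25
print(lower_fence, upper_fence)  # 2.25 64.25
print(outliers)                  # [76]
```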

g

Here is a box and whisker plot for the data.

A box and whisker plot of the VO_2 Max data, drawn on an axis marked from 20 to 80 in steps of 10.

An average untrained healthy person has a VO_2 Max between 30 and 40. The majority of this group of people are likely to:

A. do moderate amounts of exercise
B. be professional athletes
C. do none to moderate amounts of exercise
Worked Solution
Create a strategy

Consider where the values lie on the box plot.

Apply the idea

Half of the group from the box plot have VO_2 Max lower than 30, and 75\% of the group have VO_2 Max lower than approximately 40. So most of this group are as fit as an untrained healthy person or worse.

So the description "none to moderate amounts of exercise" is most appropriate for approximately 75\% of the people represented by the box plot.

So, the correct answer is Option C.

Idea summary

An outlier is an event that is very different from the norm and results in a score that is far away from the rest of the scores.

A data point is classified as an outlier if it is less than Q_1-1.5\times \text{IQR} or greater than Q_3+1.5\times \text{IQR}.

The five number summary is represented on a box plot:

A box and whisker plot with corresponding labels.

Shape of data

Measures of central tendency and measures of spread can be very powerful in comparing and contrasting two different data sets.

We also can benefit from examining the shape of the distribution of two sets of data when comparing them.

Data may be described as symmetrical or asymmetrical.

There are many cases where the data tends to be around a central value with no bias left or right. In such a case, roughly 50 \% of scores will be above the mean and 50 \% of scores will be below the mean. In other words, the mean and median roughly coincide.

A bell-shaped curve

The normal distribution is a common example of a symmetrical distribution of data. The normal distribution looks like this bell-shaped curve.

A symmetrical curve drawn over the histogram.

This picture shows how a data set that has an approximate normal distribution may appear in a histogram. The dark line shows the nice, symmetrical curve that can be drawn over the histogram that the data roughly follows.

In the distribution above, the peak of the data represents the mean, the median and the mode (taken as the centre of the modal class): all these measures of central tendency are equal for this symmetrical distribution.

A uniform distribution is a symmetrical distribution where each outcome is equally likely, so the frequency should be the same for each outcome. For example, when rolling a die the outcomes are equally likely. We might get an irregular column graph if only a small number of rolls were performed, but if we continued to roll the die the distribution would approach a uniform distribution like that shown below.

A bar graph showing a uniform distribution.

If a data set is asymmetrical instead (i.e. it isn't symmetrical), it may be described as skewed.

A data set that has positive skew (sometimes called a 'right skew') has a longer tail of values to the right of the data set. The mass of the distribution is concentrated on the left of the figure.

A positively skewed graph looks something like this:

General shape of positively skewed data with right side stretched out.

General shape shown over a histogram of positively skewed data.

A data set that has negative skew (sometimes called a 'left skew') has a longer tail of values to the left of the data set. The mass of the distribution is concentrated on the right of the figure.

A negatively skewed graph looks something like this:

General shape of negatively skewed data with left side stretched out.

General shape shown over a histogram of negatively skewed data.

In a set of data, a cluster occurs when a large number of the scores are grouped together within a small range. Clustering may occur at a single location or several locations. For example, annual wages for a factory may cluster around \$40\,000 for unskilled factory workers, \$55\,000 for tradespersons and \$70\,000 for management. The data may also have clear gaps where values are either very uncommon or not possible in the data set. If the data has two clear peaks then the shape is called bimodal.

Examples

Example 7

State whether the scores in each histogram are positively skewed, negatively skewed or symmetrical (approximately).

a
A histogram of scores from 8 to 17.
A. Positively skewed
B. Negatively skewed
C. Symmetrical
Worked Solution
Create a strategy

Check where the bulk of the data sits and look at the general shape of the distribution.

Apply the idea

The scores are roughly even in both the high and low end, so the distribution is symmetrical, option C.

b
A histogram with relatively low scores.
A. Positively skewed
B. Negatively skewed
C. Symmetrical
Worked Solution
Create a strategy

Check where the bulk of the data sits and look at the general shape of the distribution.

Apply the idea

Most of the scores on the histogram are relatively low, so the distribution is positively skewed, option A.

c
A histogram with relatively high scores.
A. Symmetrical
B. Negatively skewed
C. Positively skewed
Worked Solution
Create a strategy

Check where the bulk of the data sits and look at the general shape of the distribution.

Apply the idea

Most of the scores on the histogram are relatively high, so the distribution is negatively skewed, option B.

Example 8

The beaks of two groups of birds are measured, in mm, to determine whether they might be of the same species.

Length of beaks of two groups of birds (in mm)
Group 1    33  39  31  27  22  37  30  24  24  28
Group 2    29  44  45  34  31  44  44  33  37  34
a

Calculate the range for Group 1.

Worked Solution
Create a strategy

Find the difference between the highest score and lowest score.

Apply the idea

In Group 1, the lowest score is 22 and the highest score is 39.

\text{Range} = 39-22 (Subtract 22 from 39)
= 17 (Evaluate)
b

Calculate the range for Group 2.

Worked Solution
Create a strategy

Find the difference between the highest score and lowest score.

Apply the idea

In Group 2, the lowest score is 29 and the highest score is 45.

\text{Range} = 45-29 (Subtract 29 from 45)
= 16 (Evaluate)
c

Calculate the mean for Group 1. Give your answer as a decimal.

Worked Solution
Create a strategy

Find the average of the scores to find the mean.

Apply the idea

Calculating the mean, we add all the beak lengths of group 1 and divide this sum by the total number of beak measurements in group 1. Note that there are 10 measurements for group 1.

\text{Mean} = \dfrac{33+39+31+27+22+37+30+24+24+28}{10}
= 29.5
d

Calculate the mean for Group 2. Give your answer as a decimal.

Worked Solution
Create a strategy

Find the average of the scores to find the mean.

Apply the idea

Calculating the mean, we add all the beak lengths of group 2 and divide this sum by the total number of beak measurements in group 2. Note that there are 10 measurements for group 2.

\text{Mean} = \dfrac{29+44+45+34+31+44+44+33+37+34}{10}
= 37.5
e

Choose the most appropriate statement that describes the set of data.

A. Although the ranges are similar, the mean values are significantly different indicating that these two groups of birds are of the same species.
B. Although the ranges are similar, the mean values are significantly different indicating that these two groups of birds are not of the same species.
C. Although the mean values are similar, the range values are significantly different indicating that these two groups of birds are of the same species.
D. Although the mean values are similar, the range values are significantly different indicating that these two groups of birds are not of the same species.
Worked Solution
Create a strategy

Compare the calculated range and mean values from parts (a) to (d).

Apply the idea

We can summarise the previous results in this table.

           Range    Mean
Group 1    17       29.5
Group 2    16       37.5

The ranges are similar, but the mean values are significantly different indicating that these two groups of birds are not of the same species.

So, the correct answer is Option B.
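The comparison in parts (a) to (d) can be sketched in Python. The helper name `summarise` is illustrative; the beak lengths are those given in the table for this example.

```python
group_1 = [33, 39, 31, 27, 22, 37, 30, 24, 24, 28]
group_2 = [29, 44, 45, 34, 31, 44, 44, 33, 37, 34]


def summarise(scores):
    """Return (range, mean): one measure of spread and one measure of centre."""
    data_range = max(scores) - min(scores)
    mean = sum(scores) / len(scores)
    return data_range, mean


print(summarise(group_1))  # (17, 29.5)
print(summarise(group_2))  # (16, 37.5)
```

The two ranges differ by only 1 mm, while the means differ by 8 mm, which is the contrast that Option B describes.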

Example 9

The box plots drawn below show the number of repetitions of a 70 kg bar that two weightlifters can lift. They both record their repetitions over 30 days.

Box plots for two weightlifters A and B.
a

Which weightlifter has the more consistent results?

A. Weightlifter A
B. Weightlifter B
Worked Solution
Create a strategy

Choose the weightlifter whose scores are more clustered together.

Apply the idea

Based on the box plots, the weightlifter with the more consistent results is weightlifter A, because the scores on the box plot are more clustered together.

So, the correct answer is Option A.

b

What statistical evidence supports your answer?

A. The mean
B. The range
C. The mode
D. The graph is positively skewed
Worked Solution
Create a strategy

Choose the measure of spread.

Apply the idea

The range supports this answer since it is a measure of how clustered (or spread) a set of data is.

\text{Range A} = 12-6 (Subtract the minimum from the maximum)
= 6 (Evaluate)
\text{Range B} = 15-2 (Subtract the minimum from the maximum)
= 13 (Evaluate)

As expected, the range for weightlifter A is smaller than the range for weightlifter B, which confirms that weightlifter A has more consistent results.

So, the correct answer is Option B.

c

Which statistic is the same for each weightlifter?

A. The median
B. The mean
C. The mode
Worked Solution
Create a strategy

Look for similarities in the box plots.

Apply the idea

From the two box plots, the weightlifters have the same median. The correct answer is Option A.

d

Which weightlifter can do the most repetitions of 70 kg?

A. Weightlifter B
B. Weightlifter A
Worked Solution
Create a strategy

Choose the weightlifter with the highest maximum score.

Apply the idea

Comparing the box plots, Weightlifter B has a higher maximum.

This means that Weightlifter B can do the most repetitions of 70 kg. Option A is correct.

Idea summary

When comparing data sets:

  • To determine which data set scored higher, we compare the measures of centre.

  • To determine which data set is more consistent, we compare the measures of spread.

Outcomes

ACMGM048

review the statistical investigation process; for example, identifying a problem and posing a statistical question, collecting or obtaining data, analysing the data, interpreting and communicating the results
