Limitations of Measures of Centre and Spread

Lesson

We are often interested in the centre of a set of data in the sense that we would like to know where on the number line the data are mostly located. Together with the location of the centre, we may want to know how widely spread the data points are away from the centre.

The usual measures of centre are the arithmetic mean, the median and the mode. In applying these, we first examine the data, bearing in mind what our investigation is intended to show.  Often, we do this with the help of a visual display of some sort, to decide whether any of the standard measures really makes sense for the given data set and to decide which of them would be the most appropriate to use.

Measures of centre can fail to make sense and fail to give useful information in various ways. 

Consider the maternity wing of a hospital. In a given month, $25$ boys and $21$ girls were born. The mean of these two counts would not be a very meaningful number. A more useful statistic would be the proportion of each in the total number of births. In general, it makes little sense to find the average of the number of subjects that fall into a small number of categories.
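
The proportions suggested here can be computed directly from the two counts. The following is a minimal sketch in Python; the variable names are illustrative only.

```python
# Counts of births in the month, taken from the hospital example
births = {"boys": 25, "girls": 21}

total = sum(births.values())

# Proportion of each category in the total number of births
proportions = {group: count / total for group, count in births.items()}

for group, share in proportions.items():
    print(f"{group}: {share:.1%}")
```

Boys make up roughly 54% of the births and girls roughly 46%, which says more about the month than the average of $25$ and $21$ would.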

In general, we need data that can take a more nearly continuous range of values in order for an average to give useful information. If, for example, the birth weights of the boys and the girls born in the hospital were recorded, it would then be of interest to know the location of the centre of the spread of the weights.

Consider a set of data about sale prices of houses in a particular city over some span of time. An examination of the data is likely to reveal that a few houses sold for exceptionally high prices while most sold for prices nearer to the middle of the range. Here, our intuitive idea of the middle of the range is related to the median.

The median, in this case, gives a better idea of the typical house price than would the mean. This is because a small number of exceptionally high data values can affect the mean strongly, while the median depends only on the values in the middle of the ordered data and so is affected very little by extreme values.
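
The effect can be seen in a small sketch. The prices below are hypothetical (in thousands of dollars) and are not taken from any real data set; they only illustrate how a few very high sales pull the mean upward.

```python
from statistics import mean, median

# Hypothetical sale prices in thousands of dollars (illustrative only)
typical_sales = [410, 430, 450, 465, 480, 495, 520]

# The same sales plus two exceptionally expensive houses
with_extremes = typical_sales + [2400, 3900]

print("without extremes:", round(mean(typical_sales), 1), median(typical_sales))
print("with extremes:   ", round(mean(with_extremes), 1), median(with_extremes))
```

With these assumed figures the mean more than doubles once the two expensive houses are included, while the median moves only from 465 to 480.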

Another difficulty occurs when a data set contains low values and high values but fewer values in between. Neither the mean nor the median would give a satisfactory impression of the centre in this situation. We might describe the data set as being bimodal.

When this occurs we look for an explanation in the possibility that the data has been drawn, perhaps inadvertently, from two different populations. The weights of the babies in the maternity hospital example may be of this kind, the two populations being the males and the females.

Measures of spread also need to be used carefully. The idea of range seems broadly applicable but it conveys no information about how the data is distributed.

Often associated with the mean is the measure of spread called the standard deviation. Many useful computations are done conveniently using the standard deviation, yet it can be misleading in the same way that the mean can be misleading: when the data set contains exceptional values or has an asymmetrical distribution.

In other cases the interquartile range may give a better indication of the distribution of the data about the centre. We associate measures of spread that use the quartiles with the median. For example, a box-and-whisker plot makes use of the five-number summary: minimum, first quartile, median, third quartile, maximum.
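
A short sketch can compare how the two measures of spread react to one exceptional value. The scores are hypothetical, and the helper `interquartile_range` is an illustrative name. It assumes the convention in which the quartiles are the medians of the lower and upper halves of the ordered data, with the middle value left out when the number of values is odd. Other quartile conventions exist and can give slightly different results.

```python
from statistics import median, stdev

def interquartile_range(values):
    """IQR using medians of the lower and upper halves (assumed convention).

    Expects at least two values.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return median(upper_half) - median(lower_half)

# Hypothetical scores (illustrative only)
scores = [12, 14, 15, 15, 16, 17, 18, 19]
scores_with_outlier = scores + [60]

print(round(stdev(scores), 2), interquartile_range(scores))
print(round(stdev(scores_with_outlier), 2), interquartile_range(scores_with_outlier))
```

For these assumed scores the interquartile range changes from 3 to 4 when the value 60 is added, while the sample standard deviation becomes several times larger.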

Again, the five-number summary tends to hide the truth in the case of bimodal data. To use it effectively, we assume the existence of a single central location. Asymmetry in the distribution shows up as a difference between two distances: the distance from the first quartile to the median and the distance from the median to the third quartile.

Example

Examine the following data by constructing a box-plot display.

{21, 21, 23, 24, 24, 25, 30, 36, 45, 46, 46, 47, 47}

The five-number summary consists of the values 21 (minimum), 23.5 (Q1), 30 (median), 46 (Q3) and 47 (maximum).

We see that the distribution is positively skewed. More importantly, the whiskers are very short compared with the size of the box. This could indicate that the data set is bimodal and a closer inspection of it shows that this is indeed the case.
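
The summary for this example can be checked with a short Python sketch. It assumes the quartiles are found as the medians of the lower and upper halves of the ordered data, leaving out the middle value, which reproduces the values quoted in this example; the variable names are illustrative.

```python
from statistics import median

data = [21, 21, 23, 24, 24, 25, 30, 36, 45, 46, 46, 47, 47]

ordered = sorted(data)
half = len(ordered) // 2         # 6 values on each side of the median

q1 = median(ordered[:half])      # median of the lower half
q2 = median(ordered)             # overall median
q3 = median(ordered[-half:])     # median of the upper half

five_number = (ordered[0], q1, q2, q3, ordered[-1])
print(five_number)               # (21, 23.5, 30, 46.0, 47)

# Compare the whisker lengths with the length of the box
lower_whisker = q1 - ordered[0]  # 2.5
upper_whisker = ordered[-1] - q3 # 1.0
box_length = q3 - q1             # 22.5
print(lower_whisker, upper_whisker, box_length)
```

The box is many times longer than either whisker, which is the feature that hints at two separate clusters in the data.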

Worked Examples

QUESTION 1

A journalist wanted to report on road speed cameras being used as revenue raisers. She obtained data that showed the number of times $20$ speed cameras issued a fine to motorists in one month. The results were:

$101,102,115,115,121,124,127,128,130,130,143,143,146,162,162,163,178,183,194,977$

  1. Determine the mean number of times a speed camera issued a fine in that month. Give your answer correct to one decimal place.

  2. Determine the median number of times a speed camera issued a fine in that month. Give your answer correct to one decimal place.

  3. Which measure is most representative of the number of fines issued by each speed camera in one month?

    A. the mean

    B. the median
  4. Which score causes the mean to be much greater than the median?

  5. The journalist wants to give the impression that speed cameras are just being used to raise revenue. Which statement should she make?

    A. A sample of $20$ speed cameras found that the median number of fines in one month was $136.5$.

    B. A sample of $20$ speed cameras found that, on average, $182.2$ fines were issued in one month.
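
The two figures quoted in part 5 can be checked with a short Python sketch. The variable names are illustrative, and the last line is an extra comparison not asked for in the question: the mean once the largest value is set aside.

```python
from statistics import mean, median

# Number of fines issued by each of the 20 cameras (from the question)
fines = [101, 102, 115, 115, 121, 124, 127, 128, 130, 130,
         143, 143, 146, 162, 162, 163, 178, 183, 194, 977]

mean_fines = mean(fines)
median_fines = median(fines)

# Mean of the other 19 cameras, with the largest value removed
mean_without_largest = mean(sorted(fines)[:-1])

print(round(mean_fines, 1), median_fines, round(mean_without_largest, 1))
```

Removing the single largest value brings the mean down to about 140, close to the median, which shows how strongly one camera influences the average.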

QUESTION 2

Teachers at a school suspected that the participation rate of students aged 13 to 15 in extracurricular activities was low compared to other ages. They asked a sample of students if they had participated in any activities in the last year, and recorded their ages. The results are presented in the dot plot.

  1. Fill in the gaps to determine the mean age of participation among the sample of students.

    Mean $= \frac{\square\times10+\square\times11+\square\times12+\square\times13+\square\times14+\square\times15+\square\times16+\square\times17+\square\times18}{5+5+4+1+1+2+3+4+6}$
      $= \frac{50+55+48+13+14+30+48+68+108}{\square}$
      $= \frac{\square}{31}$
      $= 14$
  2. Determine the median age of participation among the sample of students.

  3. By looking at the dot plot, it is clear that:

    A. the mean and median are not reliable measures for the age of students who participate in activities

    B. the mean and median are reliable measures for the age of students who participate in activities
  4. Which of the following has caused the mean and median age to be unreliable measures for the age of students who participate in activities?

    A. the distribution of ages is too symmetrical

    B. too many people were included in the sample

    C. the distribution of ages is too spread out

    D. the majority of scores are away from the middle
  5. What other measure could show that the mean and median are not reflective of the ages of people in the sample?

    A. range

    B. standard deviation
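
The mean in part 1 is a weighted mean computed from a frequency table, and the same calculation can be sketched in Python. The frequencies below are read from the denominator of the working in part 1, matched in order to the ages 10 to 18; the variable names are illustrative.

```python
from statistics import median

# Age -> number of students, read from the working in part 1
age_counts = {10: 5, 11: 5, 12: 4, 13: 1, 14: 1, 15: 2, 16: 3, 17: 4, 18: 6}

total_students = sum(age_counts.values())

# Weighted mean: each age multiplied by its frequency, divided by the total
mean_age = sum(age * count for age, count in age_counts.items()) / total_students

# Expand the table into a list of individual ages to find the median
ages = [age for age, count in age_counts.items() for _ in range(count)]
median_age = median(ages)

# How many students are actually aged 13 to 15?
students_near_centre = sum(count for age, count in age_counts.items() if 13 <= age <= 15)

print(total_students, mean_age, median_age, students_near_centre)
```

Only 4 of the 31 students are aged 13 to 15, even though the mean and median both fall in that range.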

QUESTION 3

The column graph shows the total rainfall received during each month of the year.

Which of the following would be a reasonable measurement for the data?

    A. The median month in which rain fell

    B. The average number of months in a year

    C. The average monthly rainfall

    D. The average rainfall in January
