Statistics

Lesson

We are often interested in the *centre* of a set of data in the sense that we would like to know where on the number line the data are mostly located. Together with the location of the centre, we may want to know how widely spread the data points are away from the centre.

The usual measures of centre are the arithmetic mean, the median and the mode. In applying these, we first examine the data, bearing in mind what our investigation is intended to show. Often, we do this with the help of a visual display of some sort, to decide whether any of the standard measures really makes sense for the given data set and to decide which of them would be the most appropriate to use.

Measures of centre can fail to make sense and fail to give useful information in various ways.

Consider the maternity wing of a hospital. In a given month, $25$ boys and $21$ girls were born. The mean number of births in the two genders would not be a very meaningful number. A more useful statistic would be the proportions of each in the total number. In general, it usually makes little sense to find the average of the number of subjects that fall into a small number of categories.

In general, we need data that can take a more nearly continuous range of values in order for an average to give useful information. If, for example, the birth weights of the boys and the girls born in the hospital were recorded, it would then be of interest to know the location of the centre of the spread of the weights.

Consider a set of data about sale prices of houses in a particular city over some span of time. An examination of the data is likely to reveal that a few houses sold for exceptionally high prices while most sold for prices nearer to the middle of the range. Here, our intuitive idea of the *middle* of the range is related to the median.

The median, in this case, gives a better idea of the typical house price than would the mean. This is because a small number of exceptionally high data values can affect the mean strongly, while the median depends only on the middle of the ordered data and is largely unaffected by extreme values.

Another difficulty occurs when a data set contains low values and high values but fewer values in between. Neither the mean nor the median would give a satisfactory impression of the centre in this situation. We might describe the data set as being *bimodal*.

When this occurs we look for an explanation in the possibility that the data has been drawn, perhaps inadvertently, from two different populations. The weights of the babies in the maternity hospital example may be of this kind, the two populations being the males and the females.

Measures of spread also need to be used carefully. The idea of *range* seems broadly applicable but it conveys no information about how the data is distributed.

Often associated with the mean is the measure of spread called the *standard deviation*. Many useful computations are done conveniently using the standard deviation, yet it can be misleading in the same way that the mean can be misleading: that is, when the data set contains exceptional values or has an asymmetrical distribution.

In other cases the *interquartile range* may give a better indication of the distribution of the data about the centre. We associate measures of spread that use the quartiles with the median. For example, a box-and-whisker plot makes use of the five-number summary: minimum, first quartile, median, third quartile, maximum.

Again, the five-number summary tends to hide the truth in the case of bimodal data. To use it effectively, we assume the existence of a single central location. Asymmetry in the distribution is reflected in unequal distances between the first quartile and the median, and between the median and the third quartile.
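To illustrate the point that the standard deviation reacts to exceptional values far more strongly than the interquartile range does, here is a minimal Python sketch. The data values are hypothetical, invented only for this illustration, and the helper `interquartile_range` is our own name. It assumes the convention of finding quartiles as the medians of the lower and upper halves of the ordered data, leaving out the middle value when the number of values is odd; other quartile conventions give slightly different results.

```python
import statistics


def interquartile_range(values):
    """IQR as (median of upper half) - (median of lower half).

    Assumed convention: the middle value is left out of both halves
    when the number of values is odd.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[n - half:]
    return statistics.median(upper) - statistics.median(lower)


# Hypothetical data, invented for illustration only.
typical = [10, 12, 13, 14, 15, 16, 18]
with_outlier = typical + [90]  # the same data plus one exceptional value

for label, data in [("typical", typical), ("with outlier", with_outlier)]:
    sd = statistics.pstdev(data)  # population standard deviation
    iqr = interquartile_range(data)
    print(f"{label}: standard deviation = {sd:.1f}, IQR = {iqr}")
```

In this sketch the single exceptional value multiplies the standard deviation roughly tenfold, while the interquartile range changes only slightly.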

Examine the following data by constructing a box-plot display.

{21, 21, 23, 24, 24, 25, 30, 36, 45, 46, 46, 47, 47}

The five-number summary consists of the values 21 (minimum), 23.5 (Q1), 30 (median), 46 (Q3) and 47 (maximum).

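We see that the distribution is positively skewed: the median, 30, lies much closer to the first quartile than to the third. More importantly, the *whiskers* are very short compared with the size of the *box*. This could indicate that the data set is bimodal, and a closer inspection shows that this is indeed the case: the values cluster around 21 to 25 and around 45 to 47, with only 30 and 36 in between.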
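The five-number summary for this data set can be checked with a short Python sketch. The function name `five_number_summary` is our own, and the code assumes the median-of-halves convention for quartiles (the middle value is left out of both halves when the number of values is odd), which is the convention that reproduces the values 23.5 and 46 quoted in this lesson.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumed convention: Q1 and Q3 are the medians of the lower and upper
    halves, with the middle value left out when the count is odd.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[n - half:]
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )


data = [21, 21, 23, 24, 24, 25, 30, 36, 45, 46, 46, 47, 47]
minimum, q1, median, q3, maximum = five_number_summary(data)

print("five-number summary:", minimum, q1, median, q3, maximum)
print("box length (Q3 - Q1):", q3 - q1)
print("lower whisker (Q1 - min):", q1 - minimum)
print("upper whisker (max - Q3):", maximum - q3)
```

Comparing the printed box length with the two whisker lengths shows how short the whiskers are relative to the box.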

A journalist wanted to report on road speed cameras being used as revenue raisers. She obtained data that showed the number of times $20$ speed cameras issued a fine to motorists in one month. The results were:

$101,102,115,115,121,124,127,128,130,130,143,143,146,162,162,163,178,183,194,977$

Determine the mean number of times a speed camera issued a fine in that month. Give your answer correct to one decimal place.

Determine the median number of times a speed camera issued a fine in that month. Give your answer correct to one decimal place.

Which measure is most representative of the number of fines issued by each speed camera in one month?

A. the mean

B. the median

Which score causes the mean to be much greater than the median?

The journalist wants to give the impression that speed cameras are just being used to raise revenue. Which statement should she make?

A. A sample of $20$ speed cameras found that the median number of fines in one month was $136.5$.

B. A sample of $20$ speed cameras found that, on average, $182.2$ fines were issued in one month.
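The mean and median quoted in the speed camera statements can be checked with a few lines of Python. This is a sketch only: the variable names are our own, and the comparison with the largest value removed is added purely to illustrate how strongly a single extreme score can pull the mean.

```python
import statistics

# Number of fines issued by each of the 20 speed cameras in one month.
fines = [101, 102, 115, 115, 121, 124, 127, 128, 130, 130,
         143, 143, 146, 162, 162, 163, 178, 183, 194, 977]

mean_all = statistics.mean(fines)
median_all = statistics.median(fines)
print(f"all cameras: mean = {mean_all:.1f}, median = {median_all:.1f}")
# prints "all cameras: mean = 182.2, median = 136.5"

# Illustration only: repeat the calculation without the largest value.
without_largest = sorted(fines)[:-1]
mean_trimmed = statistics.mean(without_largest)
median_trimmed = statistics.median(without_largest)
print(f"without largest: mean = {mean_trimmed:.1f}, median = {median_trimmed:.1f}")
```

Removing one value moves the mean by a large amount and the median by only a few fines, which is why the two statements give such different impressions.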

Teachers at a school suspected that the participation rate of students aged 13 to 15 in extracurricular activities was low compared to other ages. They asked a sample of students if they had participated in any activities in the last year, and recorded their ages. The results are presented in the dot plot.

Fill in the gaps to determine the mean age of participation among the sample of students.

Mean $=$ $\frac{\editable{}\times10+\editable{}\times11+\editable{}\times12+\editable{}\times13+\editable{}\times14+\editable{}\times15+\editable{}\times16+\editable{}\times17+\editable{}\times18}{5+5+4+1+1+2+3+4+6}$ $=$ $\frac{50+55+48+13+14+30+48+68+108}{\editable{}}$ $=$ $\frac{\editable{}}{31}$ $=$ $14$

Determine the median age of participation among the sample of students.

By looking at the dot plot, it is clear that:

A. the mean and median are not reliable measures for the age of students who participate in activities

B. the mean and median are reliable measures for the age of students who participate in activities

Which of the following has caused the mean and median age to be unreliable measures for the age of students who participate in activities?

A. the distribution of ages is too symmetrical

B. too many people were included in the sample

C. the distribution of ages is too spread out

D. the majority of scores are away from the middle

What other measure could show that the mean and median are not reflective of the ages of people in the sample?

A. range

B. standard deviation
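The frequency-weighted mean in the dot plot question can be sketched in Python. The dot plot itself is not reproduced here, so the frequencies below are an assumption: they are read from the denominator $5+5+4+1+1+2+3+4+6$ of the worked fraction, taken in order for ages 10 to 18. The name `age_counts` is our own.

```python
import statistics

# Assumed frequencies, read from the denominator of the worked fraction
# (ages 10 to 18 in order). The dot plot itself is not available here.
age_counts = {10: 5, 11: 5, 12: 4, 13: 1, 14: 1, 15: 2, 16: 3, 17: 4, 18: 6}

total_students = sum(age_counts.values())
weighted_sum = sum(age * count for age, count in age_counts.items())
mean_age = weighted_sum / total_students

# Expand the table into one entry per student to find the median.
ages = [age for age, count in age_counts.items() for _ in range(count)]
median_age = statistics.median(ages)

# How many students are actually in the 13 to 15 age group?
middle_group = sum(count for age, count in age_counts.items() if 13 <= age <= 15)

print(f"students: {total_students}, sum of ages: {weighted_sum}")
print(f"mean age: {mean_age}, median age: {median_age}")
print(f"students aged 13 to 15: {middle_group} of {total_students}")
```

Comparing the last line of output with the mean and median shows whether the measures of centre fall where most of the students actually are.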

The column graph shows the total rainfall received during each month of the year.

Which of the following would be a reasonable measurement for the data?

A. The median month in which rain fell

B. The average number of months in a year

C. The average monthly rainfall

D. The average rainfall in January