# Statistics on grouped data

This data set represents observations on $320$ experimental subjects. The task confronting the researcher is to make sense of the results. That is, the researcher wishes to discover the useful information that may be concealed within the numbers.

Using spreadsheet software, it is not difficult to calculate the following statistics:

• mean: $45$
• median: $45$
• Q1: $36$ and Q3: $55$

One simple interpretation of these statistics is that $25\%$ of the observations are at or below $36$ and $75\%$ of the observations are at or below $55$. In effect, together with the median, the quartiles split the data into four subsets.

The computer also tells us the minimum, $11$, and the maximum, $72$. With this $5$-number summary, a boxplot can be constructed as an early step towards making the data intelligible.
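As a rough sketch of how a $5$-number summary might be computed outside a spreadsheet, the Python below uses the standard library's `statistics.quantiles`. The nine-value list `sample` and the helper name `five_number_summary` are illustrative assumptions, not the $320$ observations from the lesson, and the `"inclusive"` quartile method is only one of several conventions in use.

```python
import statistics

# Hypothetical nine-value sample, chosen only for illustration.
# It is NOT the 320-observation data set discussed in the lesson.
sample = [11, 20, 36, 40, 45, 50, 55, 60, 72]


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(data)
    # method="inclusive" interpolates between ranks and treats the data
    # as the whole population; other conventions can give slightly
    # different quartiles for the same data.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, q2, q3, ordered[-1]


summary = five_number_summary(sample)
print(summary)
```

Spreadsheet packages and statistics libraries do not all use the same quartile rule, so small differences between tools are to be expected.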

This gives some idea of the location and spread of the data points but it should be borne in mind that there are still many ways in which the data could be distributed. The same boxplot would arise if the numbers were fairly evenly spread or if there were clusters around the first and third quartiles or with many other arrangements. A picture can be helpful but it may also be misleading.

## There is another way!

Another way of grouping the data is to allocate the numbers to several evenly spaced classes that together cover the range. In general, a decision is made about the number of classes into which the data is to be split and a count is made of the number of data points occurring in each of the classes.

In this way, a frequency table is constructed. It shows how frequently the observations fall into each of the classes. The information in a frequency table can then be displayed as a histogram.
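The counting step can be sketched in a few lines of Python. The function name `frequency_table`, the ten-value `sample` list and the choice to treat each class as including its lower boundary but not its upper boundary are all assumptions made for illustration.

```python
def frequency_table(data, start, width, n_classes):
    """Count how many values fall in each of n_classes evenly spaced classes.

    Class k covers start + k*width up to (but not including)
    start + (k+1)*width. This boundary convention is an assumption.
    """
    counts = [0] * n_classes
    for x in data:
        k = int((x - start) // width)  # index of the class containing x
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} lies outside the classes")
        counts[k] += 1
    return [
        (start + k * width, start + (k + 1) * width, counts[k])
        for k in range(n_classes)
    ]


# Hypothetical data, used only to exercise the function.
sample = [12, 18, 27, 31, 33, 38, 41, 44, 52, 69]
table = frequency_table(sample, start=10, width=15, n_classes=5)
for lower, upper, count in table:
    print(f"{lower}-{upper}: {count}")
```

Each row of the result is one class with its frequency, which is exactly the information a histogram column displays.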

In the following examples, there are two different histograms generated from the original set of data. In the first, a frequency table with classes of width $15$ units was used. In the second, the class width is $5$ units.

Although these two histograms come from the same data, they appear to be different. The second suggests that the data may be bimodal, that is, having two peaks or modal classes. This property can indicate that data from two distinct populations (males and females, for example) are combined in the data set. Again, it is important to recognise that a visual representation of this kind can be used either to reveal or to hide facts about the data. When exploring a data set, it is a good idea to examine it in several different ways, always relating the summary statistics, frequency tables and other representations back to the original raw data.
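The effect of class width can be reproduced with a small text histogram. In this sketch, the `values` list is hypothetical data constructed to have two clusters, and the helper names `class_counts` and `text_histogram` are illustrative. With width $15$ the two clusters merge into neighbouring tall columns, while with width $5$ two separate peaks appear.

```python
from collections import Counter

# Hypothetical data with clusters in the low 30s and the low 50s.
values = [22, 31, 32, 33, 33, 34, 38, 41, 47, 51, 52, 52, 53, 54, 58, 66]


def class_counts(data, start, width):
    """Map the lower bound of each class to the number of values in it.

    Classes include their lower bound and exclude their upper bound
    (an assumed convention).
    """
    return Counter(start + width * ((x - start) // width) for x in data)


def text_histogram(data, start, stop, width):
    """Return one line per class, with one '#' per observation."""
    counts = class_counts(data, start, width)
    lines = []
    for lower in range(start, stop, width):
        lines.append(f"{lower:>2}-{lower + width:<2} {'#' * counts[lower]}")
    return "\n".join(lines)


wide = class_counts(values, 10, 15)
narrow = class_counts(values, 10, 5)

print(text_histogram(values, 10, 70, 15))
print()
print(text_histogram(values, 10, 70, 5))
```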

Before digital technology made it possible to carry out statistical calculations on large data sets with relative ease, methods were in use for estimating statistics including the mean and the median from a frequency table. Although superseded by modern technology, these methods are still taught in schools and the exercises that follow this chapter relate to them.

Consider the $10$-$25$ class in the first of the above histograms. The height of the column indicates that there are $18$ observations in the class. However, the exact values of those observations are unknown. The strategy is to assume that the average of the observations in the class is equal to the central value of the class, namely $\frac{10+25}{2}=17.5$. If this number is now multiplied by the number of observations in the class, $18$, an approximation for the total of the observations in the class is obtained. So, in the first class, the total is approximately $17.5\times18=315$.

In the $25$-$40$ class there are $112$ observations with a class centre of $32.5$, making an approximate class total of $32.5\times112=3640$.

Similarly, in the next three classes, there are approximate totals of $5652.5$, $4375$ and $387.5$. Summing the estimates for all the classes gives $14370$. Since there are $320$ data points, an estimate for the mean is $\frac{14370}{320}\approx44.9$. This is quite close to the exact mean of $45$, and thus the technique was a handy one when no other shortcuts were available.
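The arithmetic of this estimate can be retraced in Python. The sketch below computes the first two class totals from the stated frequencies and simply reuses the three totals quoted in the text for the remaining classes. The helper name `class_total` is an illustrative assumption.

```python
def class_total(lower, upper, frequency):
    """Approximate the sum of a class's observations as midpoint x frequency."""
    return (lower + upper) / 2 * frequency


# The two classes whose frequencies are stated in the text.
totals = [class_total(10, 25, 18), class_total(25, 40, 112)]

# Approximate totals quoted in the text for the remaining three classes.
totals += [5652.5, 4375, 387.5]

n_observations = 320  # number of data points given in the lesson
estimated_mean = sum(totals) / n_observations
print(round(estimated_mean, 1))
```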

If the data were arranged in order of increasing size, the median would be the average of the middle two numbers, that is, the $160$th and $161$st numbers in the list. These observations fall in the third frequency table class, which includes the observations numbered $131$ through to $249$. If it is assumed that these $119$ observations are evenly spread through the interval $40$-$55$ of length $15$, the observations must be about $\frac{15}{119}\approx0.1261$ units apart. The $130$th observation can be taken to lie at about the boundary between the $2$nd and $3$rd classes, namely $40$, which is midway between the two class centres. Thus, the $160$th observation would be approximately $40+\left(160-130\right)\times0.1261\approx40+3.8=43.8$. This compares quite well with the true median of $45$. Formulas that encapsulate similar calculations for the median are available but are not needed in practice.
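The interpolation can be written as a short function. In this sketch, the name `interpolate_rank` and its parameters are illustrative assumptions. The number of observations in the class is obtained by counting the ranks $131$ to $249$ inclusive, and the observations are assumed to be evenly spaced across the class.

```python
def interpolate_rank(rank, class_lower, class_width, first_rank, last_rank):
    """Estimate the value of the observation with the given rank.

    Assumes the observations ranked first_rank..last_rank are evenly
    spaced across a class that starts at class_lower.
    """
    count = last_rank - first_rank + 1         # observations in the class
    spacing = class_width / count              # assumed gap between neighbours
    ranks_into_class = rank - (first_rank - 1) # steps past the class boundary
    return class_lower + ranks_into_class * spacing


estimate = interpolate_rank(
    160, class_lower=40, class_width=15, first_rank=131, last_rank=249
)
print(round(estimate, 1))
```

Applying the same function to rank $161$ and averaging the two results would follow the definition of the median more closely, but the difference is small compared with the error introduced by the even-spacing assumption.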

#### Worked Examples

##### Question 1

A survey was conducted which asked 30 people how many books they had read in the past month.

| Number of books read | Frequency |
|---|---|
| $1$-$5$ | $2$ |
| $6$-$10$ | $11$ |
| $11$-$15$ | $15$ |
| $16$-$20$ | $2$ |

1. Based on the frequency table provided, choose all correct statements from the list below.

A. $11$ people have read between $6$ and $10$ books in the past month.

B. $28$ people have read at most $15$ books in the past month.

C. We cannot determine from the table how many people have read exactly $12$ books.

D. We can determine that $2$ people have read exactly $5$ books in the past month.

##### Question 2

Consider the table below.

| Score | Frequency |
|---|---|
| $1$ - $4$ | $2$ |
| $5$ - $8$ | $7$ |
| $9$ - $12$ | $15$ |
| $13$ - $16$ | $5$ |
| $17$ - $20$ | $1$ |

1. Use the midpoint of each class interval to determine an estimate for the mean of the following sample distribution. Round your answer to one decimal place.

2. Which is the modal group?

A. $1$ - $4$

B. $17$ - $20$

C. $13$ - $16$

D. $5$ - $8$

E. $9$ - $12$

##### Question 3

Consider the scores that a statistician collected below.

| Row | Scores |
|---|---|
| $1$ | $28$, $31$, $33$, $35$, $34$, $33$, $32$, $31$, $36$, $37$ |
| $2$ | $40$, $38$, $39$, $40$, $36$, $37$, $38$, $43$, $44$, $45$ |
| $3$ | $42$, $41$, $44$, $41$, $42$, $45$, $43$, $42$, $43$, $41$ |
| $4$ | $49$, $48$, $47$, $46$, $50$, $50$, $48$, $49$, $47$, $46$ |
| $5$ | $25$, $24$, $22$, $23$, $27$, $28$, $26$, $29$, $29$, $26$ |

1. Find the mean of each row and leave your final answer in the empty boxes below. Your answers should be in decimal form.

Mean of Row $1$ $=$ $\editable{}$

Mean of Row $2$ $=$ $\editable{}$

Mean of Row $3$ $=$ $\editable{}$

Mean of Row $4$ $=$ $\editable{}$

Mean of Row $5$ $=$ $\editable{}$

2. Calculate the mean of the 50 scores shown.

3. Rearrange the above 50 scores as grouped data by adding the frequency to each class interval in the table below:

| Class Intervals | Frequency |
|---|---|
| $21$ - $25$ | $\editable{}$ |
| $26$ - $30$ | $\editable{}$ |
| $31$ - $35$ | $\editable{}$ |
| $36$ - $40$ | $\editable{}$ |
| $41$ - $45$ | $\editable{}$ |
| $46$ - $50$ | $\editable{}$ |

4. Use the frequency table from part 3 and the midpoint of each class interval to estimate the mean of all the scores.

| Class Intervals | Frequency |
|---|---|
| $21$ - $25$ | $4$ |
| $26$ - $30$ | $7$ |
| $31$ - $35$ | $7$ |
| $36$ - $40$ | $9$ |
| $41$ - $45$ | $13$ |
| $46$ - $50$ | $10$ |
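One way to check work on this question is to let Python recount the classes and compare the grouped estimate with the mean of the raw scores. The sketch below uses the $50$ scores and the completed frequency table from the question. The variable names are illustrative, and because the scores are whole numbers each class is assumed to include both of its end values.

```python
# The 50 scores from the question, in the order given.
scores = [
    28, 31, 33, 35, 34, 33, 32, 31, 36, 37,
    40, 38, 39, 40, 36, 37, 38, 43, 44, 45,
    42, 41, 44, 41, 42, 45, 43, 42, 43, 41,
    49, 48, 47, 46, 50, 50, 48, 49, 47, 46,
    25, 24, 22, 23, 27, 28, 26, 29, 29, 26,
]

# (lower, upper, frequency) rows of the completed frequency table.
table = [
    (21, 25, 4), (26, 30, 7), (31, 35, 7),
    (36, 40, 9), (41, 45, 13), (46, 50, 10),
]

# Recount each class directly from the raw scores (both ends included).
recounted = [sum(lo <= s <= hi for s in scores) for lo, hi, _ in table]

# Mean of the raw scores.
exact_mean = sum(scores) / len(scores)

# Grouped estimate: sum of midpoint x frequency, divided by total frequency.
total_frequency = sum(f for _, _, f in table)
grouped_mean = sum((lo + hi) / 2 * f for lo, hi, f in table) / total_frequency

print(recounted)
print(exact_mean, grouped_mean)
```

The two means will generally differ slightly, since the grouped estimate replaces every score by the midpoint of its class.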