IB MYP 4 2021 Edition
10.02 Collecting and displaying data
Lesson

We get the best data from a census because it includes the entire population. However, it's not always possible to conduct a census, so we often get our data from surveys instead.

When we take a survey it is important that the results are representative of the population. This means that the results we get for any question in the survey should be close to what we would get if we asked the same question in a census. In particular, the mean, median, mode and range of the survey should be very close to those of the census (although getting exactly the same results is almost impossible).

If a survey is not representative, we call it biased. There are a number of potential sources of bias that we should avoid:

  • Consider who is being surveyed. If the people being surveyed do not resemble the population, the survey is likely to be biased. For example, surveying train travellers about their opinions on public transport will likely give very different results than a census of the entire population.
  • Also consider how many people are being surveyed. Asking one person's opinion will not tell you anything about anyone else's opinion. In general, the bigger the number of people being surveyed, the closer the results will be to a census.
  • Make sure that the questions being asked actually address the question at hand. For example, asking, "Do you approve of the current governing party?" does not give the same results as asking, "Will you vote for the current governing party in the next election?"
  • Avoid questions which use emotive language or might otherwise influence the results of the survey. For example, asking, "Do you watch the most popular sport, soccer?" is likely to give biased results, whereas asking, "Do you watch soccer?" is not. Questions like the first one are referred to as "leading questions" because they lead the person being surveyed towards a particular answer.
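To see how the choice of who is surveyed can change the result, here is a minimal sketch using a small made-up population. The ten people and their answers are invented purely for illustration; they are not real survey data.

```python
# Hypothetical population of 10 people (invented for illustration only).
# Each entry is (travels_by_train, approves_of_public_transport).
population = [
    (True, True), (True, True), (True, True), (True, False),
    (False, True), (False, False), (False, False), (False, False),
    (False, False), (False, False),
]


def approval_rate(people):
    """Fraction of the given people who approve of public transport."""
    return sum(1 for _, approves in people if approves) / len(people)


# A census asks everyone.
census_rate = approval_rate(population)

# A biased survey asks only the train travellers.
train_travellers = [person for person in population if person[0]]
survey_rate = approval_rate(train_travellers)

print(census_rate)  # 0.4
print(survey_rate)  # 0.75
```

In this made-up example the survey of train travellers suggests far more approval than the census does, because the people surveyed do not resemble the population.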

Once we have collected data we need to find a way to organise and display it.

 

Summarising data from a frequency table

We can find the mode, mean, median and range from a frequency table. These will be the same as the mode, mean, median and range from a list of data but we can use the frequency table to make it quicker.

Exploration

Find the mode, mean, median and range of the following data.

Score ($x$)   Frequency ($f$)
$1$           $6$
$2$           $9$
$3$           $1$
$4$           $6$
$5$           $8$
$6$           $6$
$7$           $6$
$8$           $2$
$9$           $8$

The mode is the score with the highest frequency. Looking at the frequency table, the score $2$ has a frequency of $9$ and all of the other scores have a lower frequency. So the mode is $2$.

To find the mean we add together all of the scores. Since each score occurs multiple times, we can save time by multiplying the scores by the frequencies. Notice that we've assigned the score the pronumeral $x$ and the frequency the pronumeral $f$. We want to find the product $xf$ for each score.

Score ($x$)   Frequency ($f$)   $xf$
$1$           $6$               $6$
$2$           $9$               $18$
$3$           $1$               $3$
$4$           $6$               $24$
$5$           $8$               $40$
$6$           $6$               $36$
$7$           $6$               $42$
$8$           $2$               $16$
$9$           $8$               $72$

Now if we add up the $xf$ column, we will get the sum of all of the scores, and if we add up the frequency column we will get the total number of scores. Dividing the two sums will give us the mean.

$\frac{\text{Sum of all scores}}{\text{Total number of scores}} = \frac{6+18+3+24+40+36+42+16+72}{6+9+1+6+8+6+6+2+8}$ (using the definition of the mean)

$= \frac{257}{52}$ (evaluating the sums)

$\approx 4.94$ (evaluating the quotient)
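The same calculation can be sketched in a few lines of Python. This is only an illustration of the method; the variable names are our own.

```python
# Frequency table from the exploration: score -> frequency.
freq = {1: 6, 2: 9, 3: 1, 4: 6, 5: 8, 6: 6, 7: 6, 8: 2, 9: 8}

# Sum of the xf column: each score multiplied by its frequency.
sum_of_scores = sum(x * f for x, f in freq.items())

# Sum of the f column: the total number of scores.
number_of_scores = sum(freq.values())

mean = sum_of_scores / number_of_scores

print(sum_of_scores, number_of_scores)  # 257 52
print(round(mean, 2))                   # 4.94
```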

To find the median, we can find the cumulative frequency for each score. The cumulative frequency is the sum of the frequencies of the score and each of the scores below it. The cumulative frequency of the first row will be the frequency of that row. For each subsequent row, add the frequency to the cumulative frequency of the row before it.

Score ($x$)   Frequency ($f$)   Cumulative frequency
$1$           $6$               $6$
$2$           $9$               $15$
$3$           $1$               $16$
$4$           $6$               $22$
$5$           $8$               $30$
$6$           $6$               $36$
$7$           $6$               $42$
$8$           $2$               $44$
$9$           $8$               $52$

The final row has a cumulative frequency of $52$, so there are $52$ scores in total. This means that the median will be the mean of the $26$th and $27$th scores in order.

Looking at the cumulative frequency table, there are $22$ scores less than or equal to $4$ and $30$ scores less than or equal to $5$. This means that the $26$th and $27$th scores are both $5$, so the median is $5$.
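As a rough sketch, the cumulative frequency method can be written in Python as follows. The helper name `score_at` is our own choice for this illustration.

```python
from itertools import accumulate

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
freqs = [6, 9, 1, 6, 8, 6, 6, 2, 8]

# Running total of the frequencies: [6, 15, 16, 22, 30, 36, 42, 44, 52]
cumulative = list(accumulate(freqs))


def score_at(position):
    """Return the score in the given (1-based) position of the ordered data."""
    for score, running_total in zip(scores, cumulative):
        if position <= running_total:
            return score
    raise ValueError("position is beyond the last score")


n = cumulative[-1]  # total number of scores
if n % 2 == 0:
    # Even number of scores: average the two middle scores.
    median = (score_at(n // 2) + score_at(n // 2 + 1)) / 2
else:
    # Odd number of scores: take the single middle score.
    median = score_at((n + 1) // 2)

print(median)  # 5.0
```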

Finally, we can find the range just by looking at the score column. The highest score is $9$ and the lowest is $1$, so the range will be $9-1=8$.

 

Grouped frequency tables

When the data are more spread out, sometimes it doesn't make sense to record the frequency for each separate result and instead we group results together to get a grouped frequency table.

 

Grouped frequency table

A grouped frequency table combines multiple results into a single group. We can find the frequency of a group by adding all the frequencies of the results contained in that group.

Exploration

A teacher wants to express the heights (in cm) of her students in a table using the following data points:

 

$189,154,146,162,165,156,192,175,167,174$

$161,153,184,177,155,192,169,166,148,170$

$168,151,186,152,195,169,143,164,170,177$

 

She realises that if each result has its own frequency then the table would have too many rows, so instead she grouped the results into sets of $10$ cm. As a result, her grouped frequency table looked like this:

Height (cm)   Frequency
$140-149$
$150-159$
$160-169$
$170-179$
$180-189$
$190-199$

To fill in the frequency for each group, the teacher counted the number of results that fell into the range of each group.

For example, the group $150-159$ would include the results:

$154,156,153,155,151,152$

Since there are $6$ results that fall into the range of this group, this group has a frequency of $6$.

Using this method, the teacher filled in the grouped frequency table to get:

Height (cm)   Frequency
$140-149$     $3$
$150-159$     $6$
$160-169$     $9$
$170-179$     $6$
$180-189$     $3$
$190-199$     $3$
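The teacher's counting can be sketched in Python as below. This is one possible way to do the grouping, assuming every group is $10$ cm wide and starts at a multiple of $10$, as in the table above.

```python
heights = [
    189, 154, 146, 162, 165, 156, 192, 175, 167, 174,
    161, 153, 184, 177, 155, 192, 169, 166, 148, 170,
    168, 151, 186, 152, 195, 169, 143, 164, 170, 177,
]

counts = {}
for height in heights:
    # Round down to the nearest multiple of 10 to find the start of the group.
    lower = (height // 10) * 10
    label = f"{lower}-{lower + 9}"
    counts[label] = counts.get(label, 0) + 1

for label in sorted(counts):
    print(label, counts[label])
# 140-149 3
# 150-159 6
# 160-169 9
# 170-179 6
# 180-189 3
# 190-199 3
```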

Looking at the table, she can see that the modal class is the group $160-169$, since it has the highest frequency.

By adding the frequencies in the bottom two rows she could also see that $6$ students were at least $180$ cm tall. There are $30$ students in the class in total, so she now knows that $\frac{6}{30}$ of her students, or one fifth of the class, are at least $180$ cm tall.

 

Modal class

The modal class in a grouped frequency table is the group that has the highest frequency.

If there are multiple groups that share the highest frequency then there will be more than one modal class.
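A short sketch of this definition in Python is shown below. The function name `modal_classes` and the second table (which has a tie) are made up for illustration; only the heights table comes from the exploration above.

```python
def modal_classes(table):
    """Return every group whose frequency equals the highest frequency."""
    highest = max(table.values())
    return [group for group, f in table.items() if f == highest]


# Grouped frequency table for the students' heights.
heights_table = {
    "140-149": 3, "150-159": 6, "160-169": 9,
    "170-179": 6, "180-189": 3, "190-199": 3,
}
print(modal_classes(heights_table))  # ['160-169']

# Hypothetical table where two groups share the highest frequency.
tied_table = {"0-9": 4, "10-19": 4, "20-29": 1}
print(modal_classes(tied_table))  # ['0-9', '10-19']
```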

 

As we can see, grouped frequency tables are useful when the data are more spread out. While the teacher could have obtained the same information from a normal frequency table, the grouping of the results condensed the data into an easier to interpret form.

However, the drawback of a grouped frequency table is that the data becomes less precise, since we have grouped multiple data points together rather than looking at them individually.

 

Summarising data from a grouped frequency table

When finding the mean and median of grouped data we want to first find the class centre of each group. The class centre is the mean of the highest and lowest possible scores in the group.

Exploration

Estimate the mean and median of the following data.

Group     Frequency ($f$)
$1-5$     $7$
$6-10$    $2$
$11-15$   $4$
$16-20$   $7$

First we find the class centre for each group. This is just the average of the endpoints of the group. For example, the first group is $1$ to $5$, so the class centre is $\frac{1+5}{2}=3$.

Group     Class Centre ($x$)   Frequency ($f$)
$1-5$     $3$                  $7$
$6-10$    $8$                  $2$
$11-15$   $13$                 $4$
$16-20$   $18$                 $7$

Notice that we've given the class centre the pronumeral $x$ this time. This is because we will use the class centre in the same way that we used the score for ungrouped data.

To find the mean, we want to make an $xf$ column again. In this case, $x$ is the class centre.

Group     Class Centre ($x$)   Frequency ($f$)   $xf$
$1-5$     $3$                  $7$               $21$
$6-10$    $8$                  $2$               $16$
$11-15$   $13$                 $4$               $52$
$16-20$   $18$                 $7$               $126$

Dividing the sum of the $xf$ column by the sum of the $f$ column gives us $\frac{21+16+52+126}{7+2+4+7}=\frac{215}{20}=10.75$.
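Here is a minimal Python sketch of the same estimate. The way the table is stored (a list of endpoint pairs with frequencies) is just one convenient choice.

```python
# Each entry is ((lowest score, highest score), frequency).
groups = [((1, 5), 7), ((6, 10), 2), ((11, 15), 4), ((16, 20), 7)]

# Class centre: the average of the endpoints of each group.
centres = [(low + high) / 2 for (low, high), _ in groups]

# Sum of the xf column, using the class centre as x.
sum_xf = sum(x * f for x, (_, f) in zip(centres, groups))

# Sum of the f column.
sum_f = sum(f for _, f in groups)

estimated_mean = sum_xf / sum_f

print(centres)         # [3.0, 8.0, 13.0, 18.0]
print(estimated_mean)  # 10.75
```

Because the class centres stand in for the actual scores, this value is an estimate of the mean rather than the exact mean.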

Similarly for the median we want to make a cumulative frequency table.

Group     Class Centre ($x$)   Frequency ($f$)   Cumulative frequency
$1-5$     $3$                  $7$               $7$
$6-10$    $8$                  $2$               $9$
$11-15$   $13$                 $4$               $13$
$16-20$   $18$                 $7$               $20$

Since there are $20$ scores, we look for the $10$th and $11$th scores, which are both in the group $11-15$. While we don't know the exact score of the median, we can use the class centre $13$ as our estimate for the median.
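The median estimate can be sketched in Python in the same way. The helper name `class_of` is our own. Averaging the class centres when the two middle scores fall in different groups is an assumption made for this sketch and is not part of the method described above; it does not come into play for this data.

```python
from itertools import accumulate

# Each entry is ((lowest score, highest score), frequency).
groups = [((1, 5), 7), ((6, 10), 2), ((11, 15), 4), ((16, 20), 7)]

# Running total of the frequencies: [7, 9, 13, 20]
cumulative = list(accumulate(f for _, f in groups))


def class_of(position):
    """Return the group containing the score in the given (1-based) position."""
    for (bounds, _), running_total in zip(groups, cumulative):
        if position <= running_total:
            return bounds
    raise ValueError("position is beyond the last score")


n = cumulative[-1]
positions = [n // 2, n // 2 + 1] if n % 2 == 0 else [(n + 1) // 2]

# Class centre of the group containing each middle score.
centres = [(low + high) / 2 for low, high in (class_of(p) for p in positions)]

# Assumption: if the middle scores were in different groups, average the centres.
median_estimate = sum(centres) / len(centres)

print(class_of(10), class_of(11))  # (11, 15) (11, 15)
print(median_estimate)             # 13.0
```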

 

Stem and leaf plot

A stem and leaf plot, or stem plot, is used for organising and displaying numerical data. It is appropriate for small to moderately sized data sets. 

In a stem and leaf plot, the units digit in each data value is split from the other digits, to become the 'leaf'. The remaining digits become the 'stem'.

The values in a stem and leaf plot are generally arranged in ascending order (from lowest to highest) from the centre out. This is called an ordered stem and leaf plot.

The data values $10,13,16,21,26,27,28,35,35,36,41,41,45,46,49,50,53,56,58$ are displayed in the stem and leaf plot below.

  • The stems are arranged in ascending order, to form a column, with the lowest value at the top 
  • The leaf values are arranged in ascending order from the stem out, in rows, next to their corresponding stem 
  • There are no commas or other symbols between the leaves, only a space between them 
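As an illustration, an ordered stem and leaf plot for these data values can be built with a short Python sketch. The `|` separator between stem and leaves is just a formatting choice for the printed output.

```python
data = [10, 13, 16, 21, 26, 27, 28, 35, 35, 36,
        41, 41, 45, 46, 49, 50, 53, 56, 58]

stems = {}
for value in sorted(data):
    # The units digit is the leaf; the remaining digits are the stem.
    stem, leaf = divmod(value, 10)
    stems.setdefault(stem, []).append(leaf)

# One row per stem, lowest stem at the top, leaves separated by spaces.
lines = [
    f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
    for stem, leaves in sorted(stems.items())
]
print("\n".join(lines))
# 1 | 0 3 6
# 2 | 1 6 7 8
# 3 | 5 5 6
# 4 | 1 1 5 6 9
# 5 | 0 3 6 8
```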

 

Back to back stem and leaf plot

When comparing two sets of data we can use a back-to-back stem-and-leaf plot seen below:

Both sides are read as the central "stem" number and then the "leaf" number. The first row of the stem-and-leaf plot reads as $13,17$ for Group A and $10,13,16$ for Group B.

Frequency histogram

Although a histogram looks similar to a bar chart, there are a number of important differences between them:

  • Histograms show the distribution of data values, whereas a bar chart is used to compare data values.
  • Histograms are used for numerical data, whereas bar charts are often used for categorical data.
  • A histogram has a numerical scale on both axes, while a bar chart only has a numerical scale on the vertical axis. 
  • The columns in a bar chart could be re-ordered, without affecting the representation of the data. In a histogram, each column corresponds with a range of values on a continuous scale, so the columns cannot be re-ordered.
 
[Figures: Histogram, Bar chart]

 

Key features of a frequency histogram:

  • The horizontal axis is a continuous numerical scale (like a number line). It represents numerical data, such as time, height, mass or temperature, and may be divided into class intervals.
  • The vertical axis is the frequency of each data value or class interval.
  • There are no gaps between the columns because the horizontal axis is a continuous scale. It is possible for a class interval to have a frequency of zero, but this is not the same as having gaps between each column.
  • It is good practice, when creating a histogram, to leave a half-column-width gap between the vertical axis and the first column.

Note: frequency histograms and polygons are usually used for continuous numerical data. However, you may be asked to draw them for discrete numerical data as well.

Displays for grouped data

Continuous numerical data, such as times, heights, weights or temperatures, are based on measurements, so any data value is possible within a large range of values. Because the range of values can be quite large it can be more practical and efficient to organise the raw data into groups or class intervals of equal range in the frequency distribution table. 

The class centre is the average of the endpoints of each interval.

For example, if the class interval is $45-50$, the class centre is calculated as follows:

Class centre $= \frac{45+50}{2}$
Class centre $= 47.5$
 

Because the class centre is an average of the endpoints, it is often used as a single value to represent the class interval.

As an example, the following frequency distribution table and histogram represent the times taken for $72$ runners to complete a ten kilometre race.

Class interval   Class Centre   Frequency
$45-50$          $47.5$         $9$
$50-55$          $52.5$         $7$
$55-60$          $57.5$         $20$
$60-65$          $62.5$         $30$
$65-70$          $67.5$         $6$
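As a quick check of the table, the class centres and the total number of runners can be recomputed with a short Python sketch. The variable names are our own.

```python
intervals = [(45, 50), (50, 55), (55, 60), (60, 65), (65, 70)]
frequencies = [9, 7, 20, 30, 6]

# Class centre: the average of the endpoints of each interval.
class_centres = [(low + high) / 2 for low, high in intervals]

# The frequencies should add up to the number of runners.
total_runners = sum(frequencies)

print(class_centres)  # [47.5, 52.5, 57.5, 62.5, 67.5]
print(total_runners)  # 72
```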

 

Remember!
  • Every data value must go into exactly one class interval
  • Each class interval must be the same size, e.g. $0-5$, $5-10$, $10-15$, ... or $10-20$, $20-30$, $30-40$, ...
  • The class centre is the average of the end points of the class interval

Practice questions

question 1

Consider the survey question and the sample and determine whether the outcomes are likely to be biased or not.

  1. Yvonne is asking people on her soccer team, "What's your favourite sport?"

    A. Biased results
    B. Not biased results
  2. Lachlan randomly selected people from his school to find about the school sports. He asked "What's your favourite school sport?"

    A. Biased results
    B. Not biased results
  3. Tricia randomly selected people from her school and asked, "The local AFL team is donating money to our school this term. What's your favourite sport?"

    A. Biased results
    B. Not biased results

question 2

This stem-and-leaf plot records the ages of customers at a beachside café last Sunday.

Stem   Leaf
$1$    $0\ 4\ 7$
$2$    $1\ 4\ 5\ 7$
$3$    $1\ 3\ 9$
$4$    $1\ 3\ 5\ 6\ 8\ 9$
$5$    $4\ 5\ 6\ 7\ 8\ 9$
$6$    $0\ 2\ 3\ 6$

Key: $5 \mid 2 = 52$
  1. Complete the frequency table for this data:

    Age       Frequency
    $10-19$   ____
    $20-29$   ____
    $30-39$   ____
    $40-49$   ____
    $50-59$   ____
    $60-69$   ____

question 3

Examine the histogram given and answer the following questions.

[Histogram: Score on the horizontal axis (0 to 4), Frequency on the vertical axis (marked at 10, 20 and 30)]

  1. Which number occurred most frequently?

  2. How many scores of $3$ were there?

  3. How many more scores of $1$ were there than scores of $3$?
