
10.01 Collecting and displaying data

Lesson

Surveys

We get the best data from a census because it includes the entire population. However, it's not always possible to conduct a census, so we often get our data from surveys instead.

When we take a survey, it is important that the results are representative of the population. This means that the answers to any survey question should be close to those we would get by asking the same question in a census. In particular, the mean, median, mode and range from the survey should be very close to those from the census (although getting exactly the same results is almost impossible).

If a survey is not representative, we call it biased. There are a number of potential sources of bias that we should avoid:

  • Consider who is being surveyed. If the people being surveyed do not resemble the population, the survey is likely to be biased. For example, surveying train travellers about their opinions on public transport will likely give very different results than a census of the entire population.

  • Also consider how many people are being surveyed. Asking one person's opinion will not tell you anything about anyone else's opinion. In general, the bigger the number of people being surveyed, the closer the results will be to a census.

  • Make sure that the questions being asked actually address the question at hand. For example, asking, "Do you approve of the current governing party?" does not give the same results as asking, "Will you vote for the current governing party in the next election?"

  • Avoid questions which use emotive language or might otherwise influence the results of the survey. For example, asking, "Do you watch the most popular sport, soccer?" is likely to bias the results, whereas asking, "Do you watch soccer?" is not. Questions like the first are referred to as "leading questions" because they lead the person being surveyed towards a particular answer.
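The effect of sample size described in the list above can be explored with a small simulation. This is only a sketch: the population values are made up for illustration, and `sample_mean` is a hypothetical helper that draws a simple random sample, which the lesson does not prescribe.

```python
import random
from statistics import mean


def sample_mean(population, sample_size, seed=0):
    """Mean of a simple random sample drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return mean(rng.sample(population, sample_size))


# Hypothetical population: made-up values for 1000 people (not real data).
population = [i % 11 for i in range(1000)]
census_mean = mean(population)  # a census uses every member of the population

# Compare surveys of increasing size with the census result.
for size in (1, 10, 100, 1000):
    print(size, sample_mean(population, size), census_mean)
```

When the sample contains the whole population, the survey is a census, so the two means agree exactly. Smaller samples will usually differ from the census by some amount.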

Once we have collected data we need to find a way to organise and display it.

Examples

Example 1

Consider the survey question and the sample and determine whether the outcomes are likely to be biased or not.

a

Yvonne is asking people on her soccer team, "What's your favourite sport?"

Worked Solution
Create a strategy

Consider the following:

  • Consider who is being surveyed.

  • Consider how many people are being surveyed.

  • Consider whether the question being asked actually addresses the question at hand.

  • Consider whether the question is leading.

Apply the idea

Since the question being asked is about the favourite sport, the soccer team will probably answer soccer since they play it. So the outcomes are likely to be biased.

b

Lachlan randomly selected people from his school to find out about school sports. He asked, "What's your favourite school sport?"

Worked Solution
Apply the idea

Lachlan randomly selected students, so the selection method does not introduce bias. The question is not leading. So the outcomes are not likely to be biased.

c

Tricia randomly selected people from her school and asked, "The local AFL team is donating money to our school this term. What's your favourite sport?"

Worked Solution
Apply the idea

The question uses leading language by stating the donation of the local AFL team, so people may feel pressured to choose AFL. So the outcomes are likely to be biased.

Idea summary

There are a number of potential sources of bias that we should avoid:

  • Consider who is being surveyed. If the people being surveyed do not resemble the population, the survey is likely to be biased.

  • Also consider how many people are being surveyed. Asking one person's opinion will not tell you anything about anyone else's opinion.

  • Make sure that the questions being asked actually address the question at hand.

  • Avoid questions which use emotive language or might otherwise influence the results of the survey.

Summarise data from a frequency table

We can find the mode, mean, median and range from a frequency table. These will be the same as the mode, mean, median and range from a list of data but we can use the frequency table to make it quicker.

The cumulative frequency is the sum of the frequencies of the score and each of the scores below it. The cumulative frequency of the first row will be the frequency of that row. For each subsequent row, add the frequency to the cumulative frequency of the row before it.
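As a rough sketch of these calculations, the function below builds the cumulative frequency column and the xf column (score times frequency) and uses them to find the mean, mode and range. The name `summarise` and the list-of-pairs layout are assumptions made for illustration; the scores and frequencies are those of the table in Example 2.

```python
from itertools import accumulate


def summarise(table):
    """Summarise a frequency table.

    table: list of (score, frequency) pairs, sorted by score,
    where every frequency is greater than zero.
    """
    scores = [score for score, _ in table]
    freqs = [freq for _, freq in table]
    xf = [score * freq for score, freq in table]  # the "xf" column
    highest = max(freqs)
    return {
        # running total of the frequencies, row by row
        "cumulative": list(accumulate(freqs)),
        # sum of xf divided by the number of scores
        "mean": sum(xf) / sum(freqs),
        # every score that shares the greatest frequency
        "modes": [score for score, freq in table if freq == highest],
        # highest score minus lowest score
        "range": scores[-1] - scores[0],
    }


table = [(23, 2), (24, 26), (25, 37), (26, 24), (27, 25)]
summary = summarise(table)
print(summary["cumulative"])  # [2, 28, 65, 89, 114]
```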

Examples

Example 2

Find the median from the frequency distribution table:

Score    Frequency
23       2
24       26
25       37
26       24
27       25
Worked Solution
Create a strategy

To find the median we need to add a cumulative frequency column.

Apply the idea

To find the median, we create a cumulative frequency column.

Score    Frequency    Cumulative frequency
23       2            2
24       26           2+26=28
25       37           28+37=65
26       24           65+24=89
27       25           89+25=114

Since there are 114 scores, the median will be the average of the 57th and 58th scores. Looking at the cumulative frequency table, there are 28 scores less than or equal to 24 and 65 scores less than or equal to 25. This means that the 57th and 58th scores are both 25, so the median is 25.

\text{Median}=25
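The same reasoning can be sketched in code. The function name `median_from_table` and the list-of-pairs layout are assumptions made for illustration. The idea is to find the first score whose cumulative frequency reaches the middle position or positions.

```python
from itertools import accumulate


def median_from_table(table):
    """Median of a frequency table given as (score, frequency) pairs sorted by score."""
    scores = [score for score, _ in table]
    cumulative = list(accumulate(freq for _, freq in table))
    n = cumulative[-1]  # total number of scores

    def score_at(position):
        # First score whose cumulative frequency reaches this 1-based position.
        for score, running_total in zip(scores, cumulative):
            if position <= running_total:
                return score

    if n % 2 == 1:
        return score_at((n + 1) // 2)
    # Even number of scores: average the two middle scores.
    return (score_at(n // 2) + score_at(n // 2 + 1)) / 2


table = [(23, 2), (24, 26), (25, 37), (26, 24), (27, 25)]
print(median_from_table(table))  # 25.0
```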

Idea summary

We can use the frequency table to find the mean, mode, median, and range of a data set.

The cumulative frequency is the sum of the frequencies of the score and each of the scores below it. Adding a cumulative frequency column to a frequency table is helpful for finding the median.

Adding an xf column to a frequency table is helpful for finding the mean.

Grouped frequency tables

When the data are more spread out, sometimes it doesn't make sense to record the frequency for each separate result and instead we group results together to get a grouped frequency table.

A grouped frequency table combines multiple results into a single group. We can find the frequency of a group by adding all the frequencies of the results contained in that group.

The modal class in a grouped frequency table is the group that has the greatest frequency. If there are multiple groups that share the greatest frequency then there will be more than one modal class.

Grouped frequency tables are useful when the data are more spread out. The same information could be obtained from an ungrouped frequency table, but grouping the results condenses the data into a form that is easier to interpret.

The drawback of a grouped frequency table is that the data becomes less precise, since we have grouped multiple data points together rather than looking at them individually.

When finding the mean and median of grouped data we want to first find the class centre of each group. The class centre is the mean of the highest and lowest possible scores in the group.

Examples

Example 3

Consider the table below.

Score    Frequency
1-4      2
5-8      7
9-12     15
13-16    5
17-20    1
a

Use the midpoint of each class interval to determine an estimate for the mean of the sample distribution above. Round your answer to one decimal place.

Worked Solution
Create a strategy

To find the midpoint of each class, find the average of the two end points.

Apply the idea

For the first class, we get the midpoint, or class centre, to be \dfrac{1+4}{2}=2.5. Using the same process for each class we get the following midpoints:

Score    Midpoint    Frequency
1-4      2.5         2
5-8      6.5         7
9-12     10.5        15
13-16    14.5        5
17-20    18.5        1

To find the mean, we multiply each midpoint by the frequency, add them together and divide by the number of scores. The number of scores can be found by adding the frequencies.

\begin{aligned}
\text{Number of scores} &= 2+7+15+5+1 && \text{Add the frequencies} \\
&= 30 && \text{Evaluate} \\
\text{Mean} &\approx \dfrac{2.5\times 2+6.5\times 7+10.5\times 15+14.5\times 5+18.5\times 1}{30} && \text{Estimate the average} \\
&\approx 10.0 && \text{Evaluate and round}
\end{aligned}
b

Find the modal class of the data.

Worked Solution
Create a strategy

Choose the class that has the highest frequency.

Apply the idea

Looking at the table, we can see that the modal class is the group 9-12, since it has the highest frequency.
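Both parts of this example can be checked with a short sketch. The function names and the `(lowest, highest, frequency)` layout are assumptions made for illustration; the numbers are those from the table in Example 3.

```python
def class_centre(low, high):
    """Mean of the lowest and highest possible scores in a group."""
    return (low + high) / 2


def estimate_mean(groups):
    """Estimate the mean of grouped data from (low, high, frequency) triples."""
    total = sum(freq for _, _, freq in groups)
    weighted = sum(class_centre(low, high) * freq for low, high, freq in groups)
    return weighted / total


def modal_classes(groups):
    """Every group that shares the greatest frequency."""
    highest = max(freq for _, _, freq in groups)
    return [(low, high) for low, high, freq in groups if freq == highest]


groups = [(1, 4, 2), (5, 8, 7), (9, 12, 15), (13, 16, 5), (17, 20, 1)]
print(round(estimate_mean(groups), 1))  # 10.0
print(modal_classes(groups))            # [(9, 12)]
```

The result is only an estimate of the mean, because each score is treated as if it sat at the centre of its class.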

Idea summary

A grouped frequency table combines multiple results into a single group. We can find the frequency of a group by adding all the frequencies of the results contained in that group.

The modal class in a grouped frequency table is the group that has the greatest frequency. If there are multiple groups that share the greatest frequency then there will be more than one modal class.

When estimating the mean and median of grouped data we use the class centre of each group.

The class centre is the mean of the highest and lowest possible scores in the group.

Stem-and-leaf plot

A stem-and-leaf plot, or stem plot, is used for organising and displaying numerical data. It is appropriate for small to moderately sized data sets.

In a stem-and-leaf plot, the units digit in each data value is split from the other digits, to become the 'leaf'. The remaining digits become the 'stem'.

The values in a stem-and-leaf plot are generally arranged in ascending order (from lowest to highest) from the centre out. This is called an ordered stem-and-leaf plot.

Stem | Leaf
1 | 0 3 6
2 | 1 6 7 8
3 | 5 5 6
4 | 1 1 5 6 9
5 | 0 3 6 8
Key 2\vert 1 = 21

The data values 10,\,13,\,16,\,21,\,26,\,27,\,28,\,35,\,35,\,36,\,41,\,41,\,45,\,46,\,49,\,50,\,53,\,56,\,58 are displayed in this stem-and-leaf plot.

  • The stems are arranged in ascending order, to form a column, with the lowest value at the top

  • The leaf values are arranged in ascending order from the stem out, in rows, next to their corresponding stem

  • There are no commas or other symbols between the leaves, only a space between them
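The layout rules above can be sketched in code. The function names are hypothetical, and the sketch assumes non-negative whole numbers. It also leaves out any stem that has no leaves, which is a simplification.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem to its ordered list of leaves (units digits)."""
    plot = defaultdict(list)
    for value in sorted(values):        # sorting gives an ordered plot
        stem, leaf = divmod(value, 10)  # split off the units digit
        plot[stem].append(leaf)
    return dict(plot)


def render(plot):
    """One row per stem, lowest stem first, leaves separated by spaces."""
    return "\n".join(
        f"{stem} | {' '.join(map(str, leaves))}"
        for stem, leaves in sorted(plot.items())
    )


data = [10, 13, 16, 21, 26, 27, 28, 35, 35, 36,
        41, 41, 45, 46, 49, 50, 53, 56, 58]
plot = stem_and_leaf(data)
print(render(plot))
```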

Group A | Stem | Group B
    7 3 | 1 | 0 3 6
    5 0 | 2 | 1 6 7 8
  6 5 5 | 3 | 5 5 6
    1 1 | 4 | 1 1 5 6 9
  8 4 3 | 5 | 0 3 6 8
Key 3\vert 1 = 13 units (Group A)    Key 1\vert 0 = 10 units (Group B)

When comparing two sets of data we can use a back-to-back stem-and-leaf plot, like the one shown above.

Both sides are read as the central "stem" number and then the "leaf" number. The first row of the stem-and-leaf plot reads as 13,\,17 for Group A and 10,\,13,\,16 for Group B.

Examples

Example 4

Stem | Leaf
1 | 0 4 7
2 | 1 4 5 7
3 | 1 3 9
4 | 1 3 5 6 8 9
5 | 4 5 6 7 8 9
6 | 0 2 3 6
Key 5\vert 2 = 52

This stem-and-leaf plot records the ages of customers at a beachside café last Sunday.

Complete the frequency table for this data:

Age      Frequency
10-19
20-29
30-39
40-49
50-59
60-69
Worked Solution
Create a strategy

Count the number of leaves in each row, and write this number in the frequency table.

Apply the idea
Age      Frequency
10-19    3
20-29    4
30-39    3
40-49    6
50-59    6
60-69    4

The leaves in the first row are: 0,\,4,\,7, which represent 10, 14, and 17. These ages are between 10 and 19, so we should write a 3 in the first row. We do the same for the other age groups.
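The counting step can also be sketched in code. The ages below are read from the stem-and-leaf plot in this example. The function name `decade_frequencies` is hypothetical, and the sketch assumes whole-number ages grouped in tens.

```python
from collections import Counter

# Ages read from the stem-and-leaf plot (stem = tens digit, leaf = units digit).
ages = [10, 14, 17,
        21, 24, 25, 27,
        31, 33, 39,
        41, 43, 45, 46, 48, 49,
        54, 55, 56, 57, 58, 59,
        60, 62, 63, 66]


def decade_frequencies(values):
    """Count how many values fall in each group 10-19, 20-29, and so on."""
    counts = Counter(value // 10 * 10 for value in values)  # lower end of each group
    return {f"{low}-{low + 9}": counts[low] for low in sorted(counts)}


for group, frequency in decade_frequencies(ages).items():
    print(group, frequency)
```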

Idea summary

In a stem-and-leaf plot, the units digit in each data value is split from the other digits, to become the 'leaf'. The remaining digits become the 'stem'.

The values in a stem-and-leaf plot are generally arranged in ascending order from the centre out. This is called an ordered stem-and-leaf plot.

Frequency histogram

Although a histogram looks similar to a bar chart, there are a number of important differences between them:

  • Histograms show the distribution of data values, whereas a bar chart is used to compare data values.

  • Histograms are used for numerical data, whereas bar charts are often used for categorical data.

  • A histogram has a numerical scale on both axes, while a bar chart only has a numerical scale on the vertical axis.

  • The columns in a bar chart could be re-ordered, without affecting the representation of the data. In a histogram, each column corresponds with a range of values on a continuous scale, so the columns cannot be re-ordered.

A histogram:

Figure: a histogram showing the distribution of running times for a 10 kilometre race.

A bar chart:

Figure: a bar chart showing the waste generated in Australia by industry.

Key features of a frequency histogram:

  • The horizontal axis is a continuous numerical scale (like a number line). It represents numerical data, such as time, height, mass or temperature, and may be divided into class intervals.

  • The vertical axis is the frequency of each data value or class interval.

  • There are no gaps between the columns because the horizontal axis is a continuous scale. It is possible for a class interval to have a frequency of zero, but this is not the same as having gaps between each column.

  • It is good practice, when creating a histogram, to leave a half-column-width gap between the vertical axis and the first column.

Note: frequency histograms and polygons are usually used for continuous numerical data. However, you may be asked to draw them for discrete numerical data as well.

Continuous numerical data, such as times, heights, weights or temperatures, are based on measurements, so any data value is possible within a large range of values. Because the range of values can be quite large it can be more practical and efficient to organise the raw data into groups or class intervals of equal range in the frequency distribution table.

The class centre is the average of the endpoints of each interval.

For example, if the class interval is 45-50, the class centre is calculated as follows:

\begin{aligned}
\text{Class centre} &= \dfrac{45+50}{2} \\
&= 47.5
\end{aligned}
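Organising raw measurements into class intervals of equal width can be sketched as follows. The times are made-up values for illustration, and the function name `grouped_table` is hypothetical. The boundary rule used here, where a value equal to an upper endpoint goes into the next interval, is one possible convention and an assumption of this sketch.

```python
import math


def grouped_table(values, start, width):
    """Rows of (lower, upper, class centre, frequency) for equal-width intervals.

    Each interval includes its lower endpoint and excludes its upper endpoint,
    so every value lands in exactly one interval.
    Assumes every value is at least `start`.
    """
    indices = [math.floor((value - start) / width) for value in values]
    rows = []
    for i in range(max(indices) + 1):  # keeps intervals with a frequency of zero
        lower = start + i * width
        upper = lower + width
        rows.append((lower, upper, (lower + upper) / 2, indices.count(i)))
    return rows


# Made-up running times in minutes (not data from the lesson).
times = [45.0, 47.2, 49.9, 50.0, 52.5, 61.3]
for row in grouped_table(times, start=45, width=5):
    print(row)
```

An interval with a frequency of zero still appears as a row, which matches the way a histogram keeps its place on the continuous scale.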

As an example, the following frequency distribution table and histogram represent the times taken for 72 runners to complete a ten kilometre race.

Figure: a frequency distribution table and histogram showing the running times for a 10 kilometre race.

Examples

Example 5

Consider the histogram given:

Figure: a histogram showing the distribution of scores.
a

Which number occurred most frequently?

Worked Solution
Create a strategy

The most frequently occurring score will have the tallest column.

Apply the idea

\text{Score}=2

b

How many scores of 1 were there?

Worked Solution
Create a strategy

Examine how far the column for 1 reaches up the vertical axis.

Apply the idea

\text{Frequency}=14

c

How many more scores of 1 were there than scores of 4?

Worked Solution
Create a strategy

Subtract the number of 4s from the number of 1s.

Apply the idea

By looking at the histogram, we can see that there were 4 scores of 4.

\begin{aligned}
\text{Difference} &= 14-4 && \text{Subtract the frequencies} \\
&= 10 && \text{Evaluate the subtraction}
\end{aligned}
Idea summary

Key features of a frequency histogram:

  • The horizontal axis is a continuous numerical scale (like a number line). It represents numerical data, such as time, height, mass or temperature, and may be divided into class intervals.

  • The vertical axis is the frequency of each data value or class interval.

  • There are no gaps between the columns because the horizontal axis is a continuous scale. It is possible for a class interval to have a frequency of zero, but this is not the same as having gaps between each column.

  • It is good practice, when creating a histogram, to leave a half-column-width gap between the vertical axis and the first column.

For grouped data:

  • Every data value must go into exactly one class interval.

  • Each class interval must be the same size, e.g. 1-5, 5-10, 10-15, \ldots or 10-20, 20-30, 30-40, \ldots

  • The class centre is the average of the end points of the class interval.

Outcomes

VCMSP324

Identify everyday questions and issues involving at least one numerical and at least one categorical variable, and collect data directly and from secondary sources.
