We get the best data from a census because it includes the entire population. However, it's not always possible to conduct a census, so we often get our data from surveys instead.
When we take a survey, it is important that the results are representative of the population. This means that the answers we get to any survey question should be close to the answers we would get by asking the same question in a census. In particular, the mean, median, mode and range of the survey data should be very close to those of the census (although getting exactly the same results is almost impossible).
If a survey is not representative, we call it biased. There are a number of potential sources of bias that we should avoid:
Consider who is being surveyed. If the people being surveyed do not resemble the population, the survey is likely to be biased. For example, surveying train travellers about their opinions on public transport will likely give very different results than a census of the entire population.
Also consider how many people are being surveyed. Asking one person's opinion will not tell you anything about anyone else's opinion. In general, the larger the number of people being surveyed, the closer the results are likely to be to those of a census.
Make sure that the questions being asked actually address the question at hand. For example, asking, "Do you approve of the current governing party?" does not give the same results as asking, "Will you vote for the current governing party in the next election?"
Avoid questions which use emotive language or might otherwise influence the results of the survey. For example, asking, "Do you watch the most popular sport, soccer?" is more likely to give biased results than asking, "Do you watch soccer?" Questions like the first one are referred to as "leading questions" because they lead the person being surveyed to a particular answer.
Once we have collected data we need to find a way to organise and display it.
Consider the survey question and the sample and determine whether the outcomes are likely to be biased or not.
Yvonne is asking people on her soccer team, "What's your favourite sport?"
Lachlan randomly selected people from his school to find out about school sports. He asked, "What's your favourite school sport?"
Tricia randomly selected people from her school and asked, "The local AFL team is donating money to our school this term. What's your favourite sport?"
We can find the mode, mean, median and range from a frequency table. These will be the same as the mode, mean, median and range from a list of data but we can use the frequency table to make it quicker.
The cumulative frequency is the sum of the frequencies of the score and each of the scores below it. The cumulative frequency of the first row will be the frequency of that row. For each subsequent row, add the frequency to the cumulative frequency of the row before it.
Find the median from the frequency distribution table:
Score | Frequency |
---|---|
23 | 2 |
24 | 26 |
25 | 37 |
26 | 24 |
27 | 25 |
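One way to carry out this calculation is sketched below in Python. It builds the cumulative frequency column for the table above and then looks up the middle positions in it. The variable names and the helper `score_at` are our own choices for illustration, not part of the lesson.

```python
from itertools import accumulate

# Frequency table from the example above: scores and their frequencies.
scores = [23, 24, 25, 26, 27]
frequencies = [2, 26, 37, 24, 25]

# Cumulative frequency: a running total of the frequencies, one per row.
cumulative = list(accumulate(frequencies))
total = cumulative[-1]


def score_at(position):
    """Return the score in the given 1-based position of the ordered data."""
    for score, running_total in zip(scores, cumulative):
        if position <= running_total:
            return score
    raise ValueError("position is outside the data set")


# With an even number of scores the median is the mean of the two middle
# scores; with an odd number both positions below are the same.
lower_middle = (total + 1) // 2
upper_middle = (total + 2) // 2
median = (score_at(lower_middle) + score_at(upper_middle)) / 2

print(cumulative)  # [2, 28, 65, 89, 114]
print(median)      # 25.0
```

There are 114 scores, so the median is the mean of the 57th and 58th scores. The cumulative frequency first reaches 57 in the row for 25, so both middle scores are 25.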
We can use the frequency table to find the mean, mode, median, and range of a data set.
The cumulative frequency is the sum of the frequencies of the score and each of the scores below it. Adding a cumulative frequency column to a frequency table is helpful for finding the median.
Adding an xf column to a frequency table is helpful for finding the mean.
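As a rough sketch of how the xf column works, the Python snippet below reuses the scores 23 to 27 and their frequencies from the median example earlier in this section. The variable names are our own, and the mode calculation assumes the table has a single mode.

```python
# Scores and frequencies from the earlier frequency distribution table.
scores = [23, 24, 25, 26, 27]
frequencies = [2, 26, 37, 24, 25]

# The xf column: each score multiplied by its frequency.
xf = [x * f for x, f in zip(scores, frequencies)]

# Mean = (sum of the xf column) / (sum of the frequency column).
total_frequency = sum(frequencies)
mean = sum(xf) / total_frequency

# Mode: the score with the greatest frequency (assumes a single mode).
mode = scores[frequencies.index(max(frequencies))]

# Range: highest score minus lowest score, ignoring scores that never occur.
observed = [x for x, f in zip(scores, frequencies) if f > 0]
data_range = max(observed) - min(observed)

print(xf)              # [46, 624, 925, 624, 675]
print(round(mean, 2))  # 25.39
print(mode)            # 25
print(data_range)      # 4
```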
When the data are more spread out, sometimes it doesn't make sense to record the frequency for each separate result and instead we group results together to get a grouped frequency table.
A grouped frequency table combines multiple results into a single group. We can find the frequency of a group by adding all the frequencies of the results contained in that group.
The modal class in a grouped frequency table is the group that has the greatest frequency. If there are multiple groups that share the greatest frequency then there will be more than one modal class.
Grouped frequency tables are useful when the data are more spread out. The same information could be obtained from an ordinary frequency table, but grouping the results condenses the data into a form that is easier to interpret.
The drawback of a grouped frequency table is that the data become less precise, since we have grouped multiple data points together rather than looking at them individually.
When finding the mean and median of grouped data we want to first find the class centre of each group. The class centre is the mean of the highest and lowest possible scores in the group.
Consider the table below.
Score | Frequency |
---|---|
1-4 | 2 |
5-8 | 7 |
9-12 | 15 |
13-16 | 5 |
17-20 | 1 |
Use the midpoint of each class interval to estimate the mean of the sample distribution. Round your answer to one decimal place.
Find the modal class of the data.
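The two questions above can be checked with a short Python sketch. It treats every score in a group as if it were equal to the class centre, which is why the result is only an estimate of the mean. The names used here are illustrative assumptions.

```python
# Grouped frequency table from above: (lowest score, highest score) per class.
classes = [(1, 4), (5, 8), (9, 12), (13, 16), (17, 20)]
frequencies = [2, 7, 15, 5, 1]

# Class centre: the mean of the lowest and highest possible scores in the group.
centres = [(low + high) / 2 for low, high in classes]

# Estimate the mean by treating every score in a group as its class centre.
total = sum(frequencies)
estimated_mean = sum(c * f for c, f in zip(centres, frequencies)) / total

# Modal class(es): every group whose frequency equals the greatest frequency.
greatest = max(frequencies)
modal_classes = [cls for cls, f in zip(classes, frequencies) if f == greatest]

print(centres)                   # [2.5, 6.5, 10.5, 14.5, 18.5]
print(round(estimated_mean, 1))  # 10.0
print(modal_classes)             # [(9, 12)]
```

The products of class centre and frequency add to 299 and there are 30 scores, so the estimated mean is 299 ÷ 30 ≈ 10.0. The modal class is 9-12, with a frequency of 15.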
When estimating the mean and median of grouped data we use the class centre of each group.
The class centre is the mean of the highest and lowest possible scores in the group.
A stem-and-leaf plot, or stem plot, is used for organising and displaying numerical data. It is appropriate for small to moderately sized data sets.
In a stem-and-leaf plot, the units digit in each data value is split from the other digits, to become the 'leaf'. The remaining digits become the 'stem'.
The values in a stem-and-leaf plot are generally arranged in ascending order (from lowest to highest) from the centre out. This is called an ordered stem-and-leaf plot.
Stem | Leaf |
---|---|
1 | 0 3 6 |
2 | 1 6 7 8 |
3 | 5 5 6 |
4 | 1 1 5 6 9 |
5 | 0 3 6 8 |
Key: 2 \| 1 = 21 |
The stems are arranged in ascending order, to form a column, with the lowest value at the top
The leaf values are arranged in ascending order from the stem out, in rows, next to their corresponding stem
There are no commas or other symbols between the leaves, only a space between them
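To illustrate how the stems and leaves are separated, here is a minimal Python sketch that rebuilds the plot above from its data values. It assumes non-negative whole numbers, and the function name `stem_and_leaf` is our own. Stems with no leaves are simply left out, whereas a hand-drawn plot would normally still list them.

```python
from collections import defaultdict

# Data values read back from the stem-and-leaf plot above.
data = [10, 13, 16, 21, 26, 27, 28, 35, 35, 36,
        41, 41, 45, 46, 49, 50, 53, 56, 58]


def stem_and_leaf(values):
    """Group non-negative whole numbers by stem, leaves in ascending order."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # leaf = units digit, stem = the rest
        plot[stem].append(leaf)
    return dict(plot)


plot = stem_and_leaf(data)
for stem in sorted(plot):
    # Leaves are separated by single spaces, with no commas.
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```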
Group A | Stem | Group B |
---|---|---|
7 3 | 1 | 0 3 6 |
5 0 | 2 | 1 6 7 8 |
6 5 5 | 3 | 5 5 6 |
1 1 | 4 | 1 1 5 6 9 |
8 4 3 | 5 | 0 3 6 8 |
Key: 3 \| 1 = 13 units | | Key: 1 \| 0 = 10 units |
Stem | Leaf |
---|---|
1 | 0 4 7 |
2 | 1 4 5 7 |
3 | 1 3 9 |
4 | 1 3 5 6 8 9 |
5 | 4 5 6 7 8 9 |
6 | 0 2 3 6 |
Key: 5 \| 2 = 52 |
Complete the frequency table for this data:
Age | Frequency |
---|---|
10-19 | |
20-29 | |
30-39 | |
40-49 | |
50-59 | |
60-69 | |
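Since each stem in this plot covers one class of width ten, the frequency of each class is just the number of leaves on that stem. The sketch below shows that counting step in Python. The dictionary layout is an assumption made for illustration.

```python
# Stem-and-leaf plot from above: stem -> list of leaves (key: 5 | 2 = 52).
plot = {
    1: [0, 4, 7],
    2: [1, 4, 5, 7],
    3: [1, 3, 9],
    4: [1, 3, 5, 6, 8, 9],
    5: [4, 5, 6, 7, 8, 9],
    6: [0, 2, 3, 6],
}

# Each stem covers one class of width ten, so its frequency is its leaf count.
frequency_table = {
    f"{stem * 10}-{stem * 10 + 9}": len(leaves) for stem, leaves in plot.items()
}

for age_class, frequency in frequency_table.items():
    print(age_class, frequency)
```

A quick check is that the frequencies add up to the total number of leaves in the plot, which is 26 here.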
Although a histogram looks similar to a bar chart, there are a number of important differences between them:
Histograms show the distribution of data values, whereas a bar chart is used to compare data values.
Histograms are used for numerical data, whereas bar charts are often used for categorical data.
A histogram has a numerical scale on both axes, while a bar chart only has a numerical scale on the vertical axis.
The columns in a bar chart could be re-ordered, without affecting the representation of the data. In a histogram, each column corresponds with a range of values on a continuous scale, so the columns cannot be re-ordered.
A Histogram:
A Bar chart:
Key features of a frequency histogram:
The horizontal axis is a continuous numerical scale (like a number line). It represents numerical data, such as time, height, mass or temperature, and may be divided into class intervals.
The vertical axis is the frequency of each data value or class interval.
There are no gaps between the columns because the horizontal axis is a continuous scale. It is possible for a class interval to have a frequency of zero, but this is not the same as having gaps between each column.
It is good practice, when creating a histogram, to leave a half-column-width gap between the vertical axis and the first column.
Note: frequency histograms and polygons are usually used for continuous numerical data; however, you may also be asked to draw them for discrete numerical data.
Continuous numerical data, such as times, heights, weights or temperatures, are based on measurements, so any data value is possible within a large range of values. Because the range of values can be quite large it can be more practical and efficient to organise the raw data into groups or class intervals of equal range in the frequency distribution table.
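A minimal Python sketch of this grouping step is shown below. The times are hypothetical values made up for illustration, and the starting value and interval width are assumptions. A value that falls exactly on a boundary is placed in the interval that starts at that boundary, so every value goes into exactly one class interval.

```python
import math

# Hypothetical times in minutes (illustrative only, not from the lesson).
times = [41.2, 43.8, 45.0, 47.5, 48.1, 49.9, 50.0, 52.3, 54.6, 58.7]

start = 40  # lower boundary of the first class interval (assumed)
width = 5   # every class interval has the same width (assumed)

# Count how many values fall into each interval. A value on a boundary is
# placed in the interval that starts there, so 45.0 is counted in 45-50.
counts = {}
for t in times:
    index = math.floor((t - start) / width)
    lower = start + index * width
    counts[lower] = counts.get(lower, 0) + 1

for lower in sorted(counts):
    print(f"{lower}-{lower + width}: {counts[lower]}")
```

For these values the intervals 40-45, 45-50, 50-55 and 55-60 have frequencies 2, 4, 3 and 1.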
The class centre is the average of the endpoints of each interval.
For example, if the class interval is 45-50, the class centre is calculated as follows:
$$\text{Class centre} = \dfrac{45+50}{2} = 47.5$$
As an example, the following frequency distribution table and histogram represent the times taken for 72 runners to complete a ten kilometre race.
Consider the histogram given:
Which number occurred most frequently?
How many scores of 1 were there?
How many more scores of 1 were there than scores of 4?
For grouped data:
Every data value must go into exactly one class interval.
Each class interval must be the same size, e.g. 1-5, 6-10, 11-15, ... for whole-number data, or 10-20, 20-30, 30-40, ... for continuous data, where a value on a shared boundary is counted in only one of the two intervals.
The class centre is the average of the endpoints of the class interval.