The growth in computerised technology has made it possible to analyse ever larger and more complex data sets, resulting in more efficient, responsive and adaptable processes. It has allowed for advances in a range of fields such as medicine, environmental science, transportation, manufacturing and logistics.
At the same time, data has also been used to influence opinion and create political divisions. It has intruded on personal privacy and created greater inequality. It is increasingly important for us to understand how data is collected and used, and be aware of the impact it may have on our lives and our future.
One decision that has to be made, before data collection begins, is whether to collect data from every member of a population, or to only collect data from a sample of members within that population.
In statistics, a population refers to every member within any particular group of interest. It could be the entire population of a country, all the students and staff of a school, all the frogs living in a wetland, all the trees in a forest, or all the cars in a parking lot.
A survey conducted on every member of a population is called a census. In Australia, a nation-wide census is conducted every five years by the Australian Bureau of Statistics (ABS). Data obtained from the census is used by the government to plan for the future direction of the country. It is needed for planning purposes for such things as the setting of electoral boundaries and the equitable distribution of resources. Apart from a count of people in each dwelling on census night, questions are asked of each household that are intended to inform public policy making.
Collecting data from every member of a population is the most accurate way of gathering information, but it is not always the most practical, and can be very expensive. For these reasons, data is often gathered from a smaller group, or sample, that can be used to estimate the characteristics of the wider population.
The size of the sample is an important consideration. If the sample is too large, it may be too expensive or time-consuming to collect the data. If it is too small, the sample may not be representative of the population.
State whether the following statement is true or false:
In a sample survey, information is obtained from the entire population.
True
False
State whether the following is an instance of a sample or a census:
A random selection of some people at a mall.
Sample
Census
A stock take of all the goods in store.
Sample
Census
A crash test of new cars just manufactured by a factory.
Sample
Census
Asking all the teachers at your school whether they approve of a new class timetable.
Sample
Census
Four common methods used to sample a population are simple random sampling, systematic sampling, self-selected sampling and stratified sampling.
Each method has advantages and disadvantages, depending on the situation.
In simple random sampling, a sample is formed by selecting members from the population at random, where each member of the population has an equal chance of being selected. In simple cases, a sample could be created by drawing names from a hat. For most samples, though, it is more common to use a random number generator.
Calculators and spreadsheet applications usually have a random number generator that can generate random decimals between $0$ and $1$.
These can be used to create random numbers within any range of values. For example, if we require a random number between:
Creating a simple random sample from a list is quite straightforward as the name suggests. However, disadvantages of this method can include the time or expense needed to gather the full list of a specific population and the bias that could occur when the sample set is not large enough to adequately represent the full population.
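As a rough illustration of drawing a simple random sample from a list, the sketch below uses Python's standard `random` module. The function name `simple_random_sample`, the population of people numbered $1$ to $20$, and the seed value are assumptions made for this example only.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Return sample_size distinct members, each equally likely to be chosen."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(population, sample_size)

# Hypothetical population: people numbered 1 to 20.
population = list(range(1, 21))
sample = simple_random_sample(population, 5, seed=1)
print(sample)
```

Because `sample` selects without replacement, no member can appear twice in the result.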
Sam wants to randomly select five people to use in a sample from a population of twenty. He begins by assigning each person in the population a number from $1$ to $20$. Using the random number generator on his calculator, he then generates five random numbers between $0$ and $1$:
$0.532$, $0.805$, $0.686$, $0.774$, $0.272$
Help Sam by converting these values to numbers between $1$ and $20$, so he can use them to select members for his sample.
Think: To get random numbers between $1$ and $20$, we multiply each value by $20$ and round up to the next whole number.
Do:
First random number $=0.532\times20$
$=10.64$
$=11$ (rounded up to the next whole number)
Repeating this for the remaining values gives us five random numbers between $1$ and $20$:
$11$, $17$, $14$, $16$, $6$
Anyone who was assigned these numbers in the population would be selected for the sample.
Reflect: In this case, we wanted five unique random numbers. If our calculator generated two numbers that were similar, we could end up with the same value when converted to a number between $1$ and $20$. In that situation, we would keep generating additional random numbers until we had five that were unique.
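The conversion in Sam's example can be checked with a short Python sketch. The helper names `to_whole_number` and `draw_unique` are hypothetical, chosen for this illustration; `draw_unique` shows one possible way to keep generating numbers until enough unique values have been found.

```python
import math
import random

def to_whole_number(decimal, upper):
    """Scale a decimal between 0 and 1 to a whole number from 1 to upper by rounding up."""
    return math.ceil(decimal * upper)

# The five decimals from Sam's calculator.
sam_decimals = [0.532, 0.805, 0.686, 0.774, 0.272]
sam_numbers = [to_whole_number(d, 20) for d in sam_decimals]
print(sam_numbers)  # [11, 17, 14, 16, 6]

def draw_unique(count, upper, rng):
    """Keep generating random numbers until count unique values from 1 to upper are found."""
    chosen = []
    while len(chosen) < count:
        number = to_whole_number(rng.random(), upper)
        # rng.random() can return exactly 0.0, which would give 0, so skip that
        # case as well as any number that has already been chosen.
        if number >= 1 and number not in chosen:
            chosen.append(number)
    return chosen

unique_numbers = draw_unique(5, 20, random.Random(0))
print(unique_numbers)
```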
Websites can also produce random numbers. Try the example above again, this time using this generator.
In systematic sampling, a sample is formed by choosing a random starting point, then selecting members from the population at regular intervals (e.g. every $5$th member). In other words, it uses a 'system' for selection. For example, we may choose every fifth name from a list or call every tenth business in the phone book. This method is often favoured by manufacturers for sampling products on a production line.
The image to the left shows every $3$rd person being picked.
Systematic sampling provides an efficient, organised way to select a sample. A possible issue arises when the sampling interval coincides with a pattern in the population, so that the sample is no longer representative. For example, if every $5$th chip packet on a conveyor belt is selected to check its weight and, by coincidence, a machine fault causes every $5$th packet to be underfilled, the sample could consist entirely of underfilled packets or miss them altogether.
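The chip packet problem can be illustrated with a small simulation. This is only a sketch: the production run of $100$ packets and the helper `every_kth` are assumptions made for the example.

```python
# Hypothetical production run of 100 packets in which every 5th one is underfilled.
packets = ["underfilled" if position % 5 == 0 else "ok" for position in range(1, 101)]

def every_kth(items, interval, start):
    """Select the items at positions start, start + interval, ... (counting from 1)."""
    return items[start - 1::interval]

aligned = every_kth(packets, 5, start=5)     # lands on every faulty packet
misaligned = every_kth(packets, 5, start=3)  # never sees a faulty packet

print(aligned.count("underfilled"), "of", len(aligned))        # 20 of 20
print(misaligned.count("underfilled"), "of", len(misaligned))  # 0 of 20
```

In this simulated run, $20\%$ of all packets are underfilled, yet one sample suggests every packet is faulty and the other suggests none are.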
A machine produces $400$ items a day. At what interval should an item be selected in order to obtain a systematic sample of $25$ items?
Think: To obtain $25$ samples at even intervals, what size will each interval be?
Do:
Interval required $=\frac{\text{population size}}{\text{sample size}}$
$=\frac{400}{25}$
$=16$
Therefore, every $16$th item should be selected.
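The interval calculation and the selection itself can be sketched in Python as follows. The function name `systematic_sample` and the seed are assumptions for this example, and the sketch assumes the population size is a whole multiple of the sample size, as it is here.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Pick a random start within the first interval, then every interval-th member."""
    interval = len(population) // sample_size  # 400 // 25 = 16 in this example
    rng = random.Random(seed)
    start = rng.randrange(interval)  # random starting index from 0 to interval - 1
    return [population[start + step * interval] for step in range(sample_size)]

# Items produced in a day, numbered 1 to 400.
items = list(range(1, 401))
chosen = systematic_sample(items, 25, seed=0)
print(chosen)
```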
In self-selected sampling, a sample is formed by members of the population who volunteer themselves for selection. This is a common sampling method in the field of medicine where volunteers may be asked to take part in a medical trial.
This method tends to be used in situations where it may be difficult to randomly select people, perhaps due to ethical or logistical reasons. However, because people who volunteer may differ from those who do not, a self-selected sample may not be truly representative of the wider population.
In stratified sampling, a sample is formed by dividing the population into subgroups (or strata) and then selecting a random sample proportionally from each subgroup. That is, if a subgroup makes up $25\%$ of the population it should also make up $25\%$ of the sample. This is particularly appropriate when there are clearly defined subgroups that are likely to have different opinions or traits, and we want to ensure each subgroup is fairly represented in the sample.
While stratified sampling can be more complex to perform, it can help ensure the sample is representative of the wider population. It can also provide useful statistics on the subgroups and highlight differences between them. A disadvantage is researchers must ensure every member of a population being studied can be classified into one, and only one, subgroup - so this method cannot be applied in all situations.
The number surveyed from a particular subgroup in a stratified sample can be calculated as follows:
$\text{Number of subgroup to survey}=\text{Proportion of population in subgroup}\times\text{Sample size}$
$=\frac{\text{Number with subgroup trait}}{\text{Population size}}\times\text{Sample size}$
This calculation should be rounded to the nearest integer, since we cannot survey part of a member of the population.
Martha wants to use a stratified sample of $50$ members to survey her school's population. There are $70$ teachers and $1100$ students at her school.
(a) How many teachers should be in the sample?
Think: The total school population is $1170$ people. If teachers make up $70$ out of $1170$, we need to find the same proportion of teachers in a sample of $50$ people.
Do:
Number of teachers in sample $=\frac{70}{1170}\times50$
$=2.991\ldots$
$=3$ (rounded to the nearest whole number)
(b) How many students should be in the sample?
Think: We use the same approach to find the number of students in the sample.
Do:
Number of students in sample $=\frac{1100}{1170}\times50$
$=47.008\ldots$
$=47$ (rounded to the nearest whole number)
Reflect: We need to check that the number of teachers and students add to the required sample size of $50$.
i.e. $3$ teachers $+$ $47$ students $=$ $50$.
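The same proportional calculation can be sketched in Python. The function name `stratified_allocation` and the dictionary layout are assumptions for this illustration, using the figures from Martha's school.

```python
def stratified_allocation(subgroup_sizes, sample_size):
    """Give each subgroup a share of the sample in proportion to its share of the population."""
    population_size = sum(subgroup_sizes.values())
    # Note: Python's round() sends exact halves to the nearest even integer,
    # which does not affect the values in this example.
    return {
        name: round(size / population_size * sample_size)
        for name, size in subgroup_sizes.items()
    }

school = {"teachers": 70, "students": 1100}
allocation = stratified_allocation(school, 50)
print(allocation)                # {'teachers': 3, 'students': 47}
print(sum(allocation.values()))  # 50
```

With other figures, rounding each subgroup separately may give a total that differs slightly from the required sample size, which is why the check in the Reflect step is worth doing.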
For a statistical survey the population is deemed to be all people in a city who play in any organised sporting competition.
Which three of the following are samples of that population?
$500$ spectators chosen from a weekend sports match.
The members of $3$ teams chosen from the local hockey tournament.
$100$ people chosen at random from a local park.
All students from a local school who compete in a school sports competition.
All the active members of a local football club.
Choosing every $5$th person on the class roll to take part in a survey is an example of:
Stratified Sampling
Random Sampling
Systematic Sampling
Users of a particular streaming service can be in one of four categories: Standard, Family, Premium or Business. The table shows the number of people in each category:
| Category | Number of People |
|---|---|
| Standard | $3500$ |
| Family | $1500$ |
| Premium | $2000$ |
| Business | $3000$ |
How many customers are there across all the categories?
If a stratified sample of $400$ is to be taken from the group, what proportion of people will be chosen?
For the sample to be stratified, how many Standard customers should be chosen?
For the sample to be stratified, how many Family customers should be chosen?
For the sample to be stratified, how many Premium customers should be chosen?
For the sample to be stratified, how many Business customers should be chosen?