The growth in computerised technology has made it possible to analyse ever larger and more complex data sets, resulting in more efficient, responsive and adaptable processes. It has allowed for advances in a range of fields such as medicine, environmental science, transportation, manufacturing and logistics.
At the same time, data has also been used to influence opinion and create political divisions. It has intruded on personal privacy and created greater inequality. It is increasingly important for us to understand how data is collected and used, and be aware of the impact it may have on our lives and our future.
Statistics is concerned with the collection and analysis of data. It normally follows a process:
One decision that has to be made, before data collection begins, is whether to collect data from every member of a population, or to only collect data from a sample of members within that population.
In statistics, a population refers to every member within any particular group of interest. It could be the entire population of a country, the population of a school, the number of frogs living in a wetland, the number of trees in a forest, or the number of cars in a parking lot.
A survey conducted on every member of a population is called a census. In Australia, a nation-wide census is conducted every five years by the Australian Bureau of Statistics (ABS). Data obtained from the census is used by the government to plan for the future direction of the country.
Collecting data from every member of a population is the most accurate way of gathering information, but it is not always the most practical, and can be very expensive. For these reasons, data is often gathered from a smaller group, or sample, that can be used to estimate the characteristics of the wider population.
The size of the sample is an important consideration. If the sample is too large, it may be too expensive or time-consuming to collect the data. If it is too small, the sample may not be representative of the population.
In many cases, a population will contain subgroups that each require proportional representation within the sample. As an example, consider a survey of a school where half the population are female, and gender is an important factor in the results of the survey. If the sample contains a higher proportion of males than females, then the sample is not representative of the population. We would say that the sample is biased rather than fair.
Bias in a sample means that certain characteristics of the population are over- or under-represented. One way to reduce bias is to use some form of randomisation in the sampling method.
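There are five main methods used to sample a population: simple random sampling, systematic sampling, convenience sampling, stratified sampling and quota sampling.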
Each method has advantages and disadvantages, depending on the situation.
In simple random sampling, a sample is formed by selecting members from the population at random, where each member of the population has an equal chance of being selected. In simple cases, a sample could be created by drawing names from a hat. For most samples though, it is more common to use a random number generator.
Calculators and spreadsheet applications usually have a random number generator that can generate random decimals between $0$ and $1$.
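These can be used to create random numbers within any range of values. For example, if we require a random whole number between $1$ and $20$, we can multiply the random decimal by $20$ and round up to the next whole number.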
While simple random sampling can be one of the cheapest and least time-consuming methods, it may not adequately represent every sub-group within larger populations.
Sam wants to randomly select five people to use in a sample from a population of twenty. He begins by assigning each person in the population a number from $1$ to $20$. Using the random number generator on his calculator, he then generates five random numbers between $0$ and $1$:
$0.532$, $0.805$, $0.686$, $0.774$, $0.272$
Help Sam by converting these values to numbers between $1$ and $20$, so he can use them to select members for his sample.
Solution
To get random numbers between $1$ and $20$, we multiply each value by $20$ and round up to the next whole number:
$$\begin{aligned}
\text{First random number} &= 0.532\times20 \\
&= 10.64 \\
&= 11 \quad \text{(rounded up to the next whole number)}
\end{aligned}$$
Repeating this for the remaining values gives us five random numbers between $1$ and $20$:
$11$, $17$, $14$, $16$, $6$
Anyone in the population who was assigned one of these numbers would be selected for the sample.
Note: in this case we wanted five unique random numbers. If our calculator generated two numbers that were similar, we could end up with the same value when converted to a number between $1$ and $20$. In that situation, we would keep generating additional random numbers until we had five that were unique.
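The conversion in Sam's example can also be sketched in code. The following is a minimal Python illustration, not part of the original lesson: the function names `to_whole_numbers` and `draw_unique` are hypothetical, and the second function assumes we simply keep drawing until enough distinct numbers have been found, as described in the note above.

```python
import math
import random


def to_whole_numbers(decimals, upper):
    """Scale random decimals between 0 and 1 up to whole numbers from 1 to `upper`.

    Each decimal is multiplied by `upper` and rounded up to the next whole number.
    """
    return [math.ceil(d * upper) for d in decimals]


def draw_unique(count, upper, rng=random):
    """Keep generating random numbers until `count` distinct values are found."""
    chosen = []
    while len(chosen) < count:
        number = math.ceil(rng.random() * upper)
        # rng.random() can return exactly 0, which would round up to 0,
        # so only accept values of at least 1 that have not appeared before.
        if number >= 1 and number not in chosen:
            chosen.append(number)
    return chosen


# Sam's five random decimals from the worked example.
sam_values = [0.532, 0.805, 0.686, 0.774, 0.272]
print(to_whole_numbers(sam_values, 20))

# A fresh sample of five distinct members from a population of twenty.
print(draw_unique(5, 20))
```

The first call reproduces the values found by hand, while the second gives a different sample each time it is run.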
In systematic sampling, a sample is formed by choosing a random starting point, then selecting members from the population at regular intervals (e.g. every $5$th member). In other words, it uses a 'system' for selection. This method is often favoured by manufacturers for sampling products on a production line.
While systematic sampling can be easier to carry out than simple random sampling, it can still under- or over-represent a subgroup within the population.
A machine produces $400$ items a day. At what interval should an item be selected in order to obtain a systematic sample of $25$ items?
Solution
$$\begin{aligned}
\text{Interval required} &= \frac{\text{population size}}{\text{sample size}} \\
&= \frac{400}{25} \\
&= 16
\end{aligned}$$
Therefore, every $16$th item should be selected.
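As a rough sketch of how this could be automated, the Python snippet below calculates the interval and then picks the items. The function name `systematic_sample` is hypothetical, and it assumes the items are numbered from $1$ in production order and that the starting point is chosen at random from within the first interval.

```python
import random


def systematic_sample(population_size, sample_size, rng=random):
    """Return the sampling interval and the item numbers selected.

    Assumes items are numbered 1, 2, 3, ... in order, and that the
    starting point is chosen at random from the first interval.
    """
    interval = population_size // sample_size  # whole-number interval
    start = rng.randint(1, interval)           # random starting point
    selected = [start + i * interval for i in range(sample_size)]
    return interval, selected


interval, items = systematic_sample(400, 25)
print(interval)  # the interval between selected items
print(items)     # the 25 item numbers chosen for the sample
```

Because the starting point is random, the list of item numbers changes between runs, but the gap between consecutive items stays the same.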
In convenience sampling, a sample is formed from members of the population who are conveniently located for the interviewer. For example, if surveying people at a shopping centre, convenience sampling would mean asking only the people who walk past and look interested in stopping to answer questions.
A disadvantage of this type of sampling is that it is unlikely to be truly representative of the wider population.
In stratified sampling, a sample is formed by selecting the same proportion of members from each subgroup (or stratum) within a population. This is particularly appropriate when there are clearly defined subgroups, and each one requires adequate representation.
For example, if a survey was being conducted on the student population of a school, and it was important to represent all year levels, a sample might be formed by randomly selecting $10\%$ of students from each year group.
While stratified sampling can be more expensive and time-consuming than other methods, it can also provide the best representation of the wider population.
Martha wants to use a stratified sample of $50$ members to survey her school's population. If there are $70$ teachers and $1100$ students at her school, how many teachers and how many students should be included in the sample?
Solution
$$\begin{aligned}
\text{Number of teachers in sample} &= \frac{70}{1170}\times50 \\
&= 2.991\ldots \\
&= 3 \quad \text{(rounded to the nearest whole number)}
\end{aligned}$$
$$\begin{aligned}
\text{Number of students in sample} &= \frac{1100}{1170}\times50 \\
&= 47.008\ldots \\
&= 47 \quad \text{(rounded to the nearest whole number)}
\end{aligned}$$
Note: we need to check that the numbers of teachers and students add to the required sample size of $50$, i.e. $3$ teachers $+$ $47$ students $=$ $50$.
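The same proportional calculation can be sketched in Python. The function name `stratified_allocation` and the dictionary layout are assumptions made for illustration. Note also that Python's built-in `round` sends exact halves to the nearest even number, which does not matter for Martha's figures but could for other data.

```python
def stratified_allocation(strata_sizes, sample_size):
    """Share out a sample across subgroups in proportion to their size.

    `strata_sizes` maps each subgroup name to the number of members in it.
    Each share is rounded to the nearest whole number.
    """
    total = sum(strata_sizes.values())
    return {
        name: round(size / total * sample_size)
        for name, size in strata_sizes.items()
    }


school = {"teachers": 70, "students": 1100}
allocation = stratified_allocation(school, 50)
print(allocation)

# Rounding can make the total drift away from the required sample size,
# so it is worth checking, just as in the note above.
print(sum(allocation.values()) == 50)
```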
Quota sampling is a non-random method of sampling which is similar to stratified sampling. The researchers determine how many members of each stratum are required in the sample and then fill this 'quota', often using convenience sampling. Once the quota for a particular stratum is filled, any further respondents from that stratum are disregarded.
For example, if researchers are interested in movie preferences for various age groups, their first question may be "What is your age?" If the person then states an age that falls in a group whose quota has already been filled, no more questions are asked of that person. This makes quota sampling a fairly time-efficient method of sampling.
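The quota-filling process described above can be sketched in Python. Everything in this example is hypothetical: the function name `fill_quotas`, the age groups, the quota sizes and the respondents are invented purely to show the idea.

```python
def fill_quotas(respondents, quotas):
    """Accept respondents in arrival order until each group's quota is full.

    `respondents` is a list of (name, group) pairs in the order they are met.
    `quotas` maps each group to the number of respondents required from it.
    """
    counts = {group: 0 for group in quotas}
    accepted = []
    for name, group in respondents:
        # Disregard anyone whose group is unknown or already full.
        if group in counts and counts[group] < quotas[group]:
            counts[group] += 1
            accepted.append(name)
    return accepted


# Hypothetical quotas and respondents, invented for illustration only.
quotas = {"under 18": 2, "18 to 40": 1, "over 40": 1}
respondents = [
    ("Ali", "under 18"),
    ("Bea", "18 to 40"),
    ("Cam", "under 18"),
    ("Dee", "under 18"),  # the "under 18" quota is already full here
    ("Eli", "over 40"),
]
print(fill_quotas(respondents, quotas))
```

Respondents are accepted in the order they happen to arrive, which is why quota sampling is not a random method.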
A school principal wants to estimate the number of students who ride a bicycle to school.
Which two samples would avoid introducing bias?
All students who are in the school band.
$8$ students in the hallway.
Ten students from each grade, chosen at random.
$130$ students during the lunch periods.
Out of $2160$ students in a school, $216$ were chosen at random and asked their favourite colour out of red, blue and yellow. Of these, $99$ chose red, $63$ chose blue and $54$ chose yellow.
One in every how many students at the school was sampled?
Estimate the total number of students in the whole school who prefer the colour red.
Estimate the total number of students in the whole school who prefer the colour blue.
Estimate the total number of students in the whole school who prefer the colour yellow.
Data can be collected in a variety of different ways:
The design of the data collection method and the way it is implemented can have a big impact on the quality of the data obtained. Below, we look at some of the considerations when using a questionnaire for collecting data, but similar considerations would apply for each of the other methods.
Collecting data from people in a sample or population often involves asking them questions and recording their responses. One of the most common survey methods is the questionnaire, where participants answer a set of questions on a printed or online form. Other survey methods, like personal interviews, can be used to collect data in a similar way.
The quality of the data collected depends a lot on the quality of the questions being asked. Some important considerations when designing questionnaires are:
The two main types of questions used in questionnaires are 'closed' and 'open'.
One of the main issues with questionnaires is knowing whether or not the participant has understood the question or answered the question accurately. There is also the issue of what to do if a question was not answered at all.
A commonly used method of estimating the population of a species is capture-recapture. As the name suggests, this method involves capturing individuals, releasing them, then capturing them again. Simple, right? In practice, it is slightly more complicated, and is based on ratios.
The idea is to capture some individuals, mark them in some way so you know you have already captured them, then release them back into the environment. After a little time has passed, we do the same thing again but this time we compare the number of marked individuals to the number captured. With this information we can estimate the total population!
Seems a bit of a stretch, doesn't it? Actually it boils down to a single ratio: the number of individuals we capture in one go compared to the total population. Let's introduce some notation: let $n$ be the number in the first capture, $N$ be the total population, $k$ be the number of marked individuals in the second capture and $K$ be the total number in the second capture. During the first capture we mark a certain proportion of the total population, and this proportion is $n:N$. During the second capture, given random conditions and no bias in the selection of individuals, we would expect the ratio $k:K$ of marked individuals to the number in the capture to be the same as $n:N$. That's it! With this assumption we can form an equation and solve for $N$.
$$\begin{aligned}
n:N &= k:K \\
\frac{N}{n} &= \frac{K}{k} \\
N &= \frac{nK}{k}
\end{aligned}$$
Although this method is quite straight-forward, in practice there are some drawbacks:
That said, there are ways around all these issues and ecologists still use this method every day. One way to get more accurate and reliable results is to do three or more captures, comparing the ratio of marked individuals multiple times. Can you think of any other solutions to the drawbacks above?
Idat wants to estimate the total population of skinks in his backyard. He performs a capture-recapture, marking $12$ individuals in the first capture and finding $3$ marked out of the $9$ individuals in the second capture. Estimate the total population.
Think: We have three of the variables used in the capture-recapture formula: the number in the first capture $n$, the number of marked individuals in the second capture $k$, and the number in the second capture $K$.
Do: Assign the variables correctly then, using the formula, substitute and evaluate $N$.
$$\begin{aligned}
N &= \frac{nK}{k} \\
&= \frac{12\times9}{3} \\
&= 36
\end{aligned}$$
Reflect: Would this number be typical of any backyard? What sort of factors could affect the population of skinks in a particular backyard?
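The capture-recapture calculation can be wrapped in a small Python function. This is a minimal sketch, with the hypothetical name `estimate_population`. It assumes a single recapture, and treats finding no marked individuals in the second capture as an error, because the ratio cannot be formed in that case.

```python
def estimate_population(first_capture, second_capture, marked_in_second):
    """Estimate the total population N using N = nK / k.

    first_capture    -- n, the number captured and marked the first time
    second_capture   -- K, the total number in the second capture
    marked_in_second -- k, the number of marked individuals in the second capture
    """
    if marked_in_second == 0:
        raise ValueError("No marked individuals were recaptured, so no estimate can be made.")
    return first_capture * second_capture / marked_in_second


# Idat's skinks: n = 12, K = 9, k = 3.
print(estimate_population(12, 9, 3))
```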