Whenever a researcher gathers data to answer a question of interest, there is some uncertainty in the result. One way to overcome the uncertainty would be to obtain data on every single member of the population. In the social sciences, this would mean conducting a census, which is usually too expensive. In the physical sciences, it would mean making infinitely many measurements of some quantity, which is impossible.

As a viable alternative, data is collected on a representative sample drawn from the population. Media organisations wishing to predict the outcome of an election survey a relatively small number of people from the whole population. They feel confident that the results from the sample survey reflect the results that would be obtained if the whole population were surveyed.

In the physical sciences, researchers make several measurements of the quantity they are investigating and assume that the true value is located somewhere within the range of results. For example, in his experiment in 1882 to measure the speed of light, Simon Newcomb made 66 repetitions of his experimental procedure.

In a survey, there may be both known and unknown sources of bias that could affect the result. To minimise these, researchers choose the participants in the survey randomly.

A random sample can be selected by an unpredictable process such as drawing names or numbers out of a hat. More professionally it is done with the help of a table of random numbers or with the random number generator on a scientific calculator.

Random numbers are generated on a calculator in such a way that every number between $0$ and $1$, up to say $10$ digits in length, is equally likely to occur. If the calculator is working correctly, then for any sub-interval chosen within the interval $\left[0,1\right]$, the proportion of generated numbers that fall within that sub-interval should be approximately equal to the length of the sub-interval relative to the length of $\left[0,1\right]$.
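The sub-interval property can be checked by simulation. The following is a minimal sketch in Python, assuming the language's built-in generator stands in for the calculator; the function name `fraction_in_subinterval`, the sub-interval from $0.2$ to $0.5$, the number of trials and the seed are all illustrative choices, not part of the original text.

```python
import random

def fraction_in_subinterval(a, b, trials=100_000, seed=1):
    """Return the fraction of `trials` uniform random numbers that land in [a, b)."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so the run is repeatable
    hits = sum(1 for _ in range(trials) if a <= rng.random() < b)
    return hits / trials

# The sub-interval from 0.2 to 0.5 has length 0.3, so roughly 30% of the
# generated numbers should fall inside it.
observed = fraction_in_subinterval(0.2, 0.5)
print(f"observed fraction: {observed:.3f}, sub-interval length: {0.5 - 0.2:.1f}")
```

The observed fraction will not equal the length exactly, but it should get closer as the number of trials increases.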

Examples

Example 1

Choose $25$ subjects for a survey from a population of $530$ people who work in a particular industry.

Number the people $1$ to $530$. A calculator typically selects random numbers in the range $0$ to $1$. So, if the random numbers were multiplied by $530$, the range would be from $0$ to $530$. Now, add $1$ to each random number and delete the digits after the decimal point. The generated numbers will match the numbering of the people. The first $25$ numbers generated in this way indicate who will be in the survey. If it should happen that the same random number occurs twice, simply skip it and take the next random number.
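The procedure in Example 1 can be sketched in Python as follows. This is only an illustration, assuming a generator that returns numbers from $0$ up to but not including $1$; the function name `choose_subjects` and the seed value are hypothetical.

```python
import random

def choose_subjects(population_size, sample_size, seed=None):
    """Pick `sample_size` distinct people numbered 1..population_size."""
    rng = random.Random(seed)
    chosen = []
    while len(chosen) < sample_size:
        r = rng.random()                       # random number with 0 <= r < 1
        number = int(r * population_size) + 1  # scale, add 1, drop the decimals
        if number not in chosen:               # skip a number that has already occurred
            chosen.append(number)
    return chosen

subjects = choose_subjects(530, 25, seed=2024)  # seed chosen arbitrarily for repeatability
print(subjects)
```

Python's standard library can also do this in one step with `random.sample(range(1, 531), 25)`, which returns $25$ distinct numbers from $1$ to $530$.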

It is also quite easy to generate lists or sequences of random numbers in a desired range using spreadsheet software.

Example 2

In ecological studies, a sampling technique is used to estimate the number of individuals in a population. Suppose a region is home to an unknown number of animals of a particular species. A researcher might capture some of the animals, tag them and then release them back into the environment. Some time later when the released animals can be assumed to have become well-mixed with the rest of the population, another sample of the animals is captured. Some of these are likely to be the previously tagged individuals.

The proportion of tagged individuals in the second sample is likely to be approximately the same as the proportion of the whole population that is tagged, that is, the size of the original sample divided by the size of the population.

Suppose the first sample was of $50$ fish caught in a lake. These were tagged and released. Some time later another $48$ fish were caught. Of these, four were found to be tagged. From this, it is inferred that $\frac{4}{48}=\frac{1}{12}$ of the fish in the lake are tagged. But it is known that there are $50$ tagged fish in the lake, so these $50$ fish make up about $\frac{1}{12}$ of the population. Therefore, there must be approximately $50\times12=600$ fish in the lake.
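The arithmetic of the fish example can be written as a short Python function. This is a sketch of the proportion argument above, not a full statistical treatment; the function and parameter names are hypothetical.

```python
def estimate_population(tagged_first, second_sample_size, tagged_in_second):
    """Estimate the population size from a tag-and-recapture experiment.

    Assumes the tagged fraction of the second sample matches the tagged
    fraction of the whole population:
        tagged_in_second / second_sample_size = tagged_first / population
    """
    if tagged_in_second == 0:
        raise ValueError("no tagged individuals were recaptured; no estimate is possible")
    return tagged_first * second_sample_size / tagged_in_second

# 50 fish tagged, 48 caught later, 4 of those carrying tags.
print(estimate_population(50, 48, 4))  # 600.0
```

The estimate is sensitive to the number of recaptured tags: with the same two sample sizes, finding $3$ or $5$ tagged fish instead of $4$ would give $800$ or $480$ respectively.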

This procedure would fail to give reliable results if the original sample were biased in some way and thus did not adequately represent the population; and it would fail if the released individuals were not well mixed with the population, leading to bias in the second sample.

More Worked Examples

QUESTION 1

QUESTION 2

The following table shows the gender of $50$ Year 12 students at a particular school.

Students $1$-$10$	m, f, m, f, f, m, m, m, f, m
Students $11$-$20$	m, f, m, m, f, m, f, m, f, f
Students $21$-$30$	f, f, m, m, f, m, m, f, f, f
Students $31$-$40$	m, f, m, f, f, f, m, m, m, f
Students $41$-$50$	f, m, f, m, m, f, m, f, m, f

What is the proportion of females in the sample? Express as a fraction in simplest form.
What proportion of the first $5$ students were female?
What proportion of the first $10$ students were female?
What proportion of the first $20$ students were female?
In a systematic sample, every second student is chosen, in the order that they appear, from the first twenty students. How many males will be chosen in the sample?
In a systematic sample, every third student is chosen, in the order that they appear, from the first forty students. How many females will be chosen in the sample?
The school has a population of $440$ students. If the proportion of males and females in the sample is indicative of the whole school, how many female students are there in the school?

QUESTION 3

An oil spill has spread over an area of $1650$ square kilometres. A team of biologists scan an area of $150$ square kilometres, and find $272$ dead marine animals. Find $y$, the estimated number of dead marine animals over the entire area of the oil spill.

Outcomes

12D.C.2.2

Explain the distinction between the terms population and sample, describe the characteristics of a good sample, explain why sampling is necessary, and describe and compare some sampling techniques
