Language and Use of Statistics

Lesson

Whenever a researcher gathers data for the purpose of finding an answer to a question of interest, there is some uncertainty in the result. One way to overcome the uncertainty would be to obtain data on every single member of the population. In the social sciences, this would mean conducting a census, which is usually too expensive. In the physical sciences, it would mean making infinitely many measurements of some quantity, which is impossible.

As a viable alternative, data are collected on a representative sample drawn from the population. Media organisations wishing to predict the outcome of an election survey a relatively small number of people and are confident that the results of the sample survey reflect those that would be obtained if the whole population were surveyed.

In the physical sciences, researchers make several measurements of the quantity they are investigating and assume that the true value is located somewhere within the range of results. For example, in his experiment in 1882 to measure the speed of light, Simon Newcomb made 66 repetitions of his experimental procedure.

In a survey, there may be both known and unknown sources of bias that could affect the result. To minimise these, researchers choose the participants in the survey *randomly*.

A random sample can be selected by an unpredictable process such as drawing names or numbers out of a hat. More professionally it is done with the help of a table of random numbers or with the random number generator on a scientific calculator.

Random numbers are generated on a calculator in such a way that every number between $0$ and $1$, up to say $10$ digits in length, is equally likely to occur. If the calculator is working correctly, then for any sub-interval chosen within the interval $\left[0,1\right]$, the proportion of generated numbers that fall within that sub-interval should be approximately equal to the length of the sub-interval relative to the length of $\left[0,1\right]$.
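The uniformity property described above can be checked by simulation. The following is a minimal sketch, assuming Python's built-in `random` module stands in for the calculator; the function name `fraction_in_subinterval`, the sub-interval from $0.2$ to $0.5$ and the number of draws are illustrative choices, not part of the lesson.

```python
import random

def fraction_in_subinterval(a, b, n_draws=100_000, seed=0):
    """Return the fraction of n_draws uniform random numbers that land in [a, b)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(n_draws) if a <= rng.random() < b)
    return hits / n_draws

# Hypothetical sub-interval [0.2, 0.5): its length is 0.3, so roughly
# 30% of the generated numbers should fall inside it.
observed = fraction_in_subinterval(0.2, 0.5)
expected = 0.5 - 0.2
print(f"observed {observed:.4f}, expected about {expected:.4f}")
```

The observed fraction will not match the length exactly, but it should get closer as the number of draws increases.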

Choose $25$ subjects for a survey from a population of $530$ people who work in a particular industry.

Number the people $1$ to $530$. A calculator typically selects random numbers in the range $0$ to $1$. So, if the random numbers were multiplied by $530$, the results would range from $0$ up to, but not including, $530$. Now, add $1$ to each result and delete the digits after the decimal point. The generated numbers will then be whole numbers from $1$ to $530$, matching the numbering of the people. The first $25$ numbers generated in this way indicate who will be in the survey. If it should happen that the same number occurs twice, simply skip the repeat and take the next random number.
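The procedure above can be sketched in code. This is an illustrative version only, assuming Python's `random.random()` plays the role of the calculator's random number key (it returns a value of at least $0$ and less than $1$); the function name `choose_subjects` and the seed are hypothetical.

```python
import random

def choose_subjects(population_size=530, sample_size=25, seed=None):
    """Pick distinct people numbered 1..population_size using scaled random numbers."""
    rng = random.Random(seed)
    chosen = []
    while len(chosen) < sample_size:
        r = rng.random()                       # random number in [0, 1)
        number = int(r * population_size) + 1  # scale, add 1, drop the decimals
        if number not in chosen:               # skip a number that has already occurred
            chosen.append(number)
    return chosen

subjects = choose_subjects(seed=1)
print(subjects)
```

In Python the same kind of selection without repeats can also be made in one call with `random.sample(range(1, 531), 25)`.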

It is also quite easy to generate lists or sequences of random numbers in a desired range using spreadsheet software.

In ecological studies, a sampling technique is used to estimate the number of individuals in a population. Suppose a region is home to an unknown number of animals of a particular species. A researcher might capture some of the animals, tag them and then release them back into the environment. Some time later when the released animals can be assumed to have become well-mixed with the rest of the population, another sample of the animals is captured. Some of these are likely to be the previously tagged individuals.

The proportion of tagged individuals in the second sample is likely to be approximately the same as the proportion of the whole population that is tagged, that is, the size of the first sample divided by the size of the population.

Suppose the first sample was of $50$ fish caught in a lake. These were tagged and released. Some time later another $48$ fish were caught. Of these, four were found to be tagged. From this, it is inferred that about $\frac{4}{48}=\frac{1}{12}$ of the fish in the lake are tagged. But it is known that there are $50$ tagged fish in the lake, so these $50$ fish must be about $\frac{1}{12}$ of the population. Therefore, there must be approximately $50\times12=600$ fish in the lake.
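The arithmetic in the fish example generalises to any capture and recapture counts. A minimal sketch follows, assuming the simple proportional reasoning used above (this estimate is commonly known as the Lincoln-Petersen estimator); the function and parameter names are illustrative.

```python
def estimate_population(tagged_in_first, second_sample_size, tagged_in_second):
    """Estimate population size from a capture-recapture study.

    Assumes tagged_in_second / second_sample_size is about the same as
    tagged_in_first / population, then solves for the population.
    """
    if tagged_in_second == 0:
        raise ValueError("no tagged individuals were recaptured, so no estimate is possible")
    return tagged_in_first * second_sample_size / tagged_in_second

# Figures from the lake example: 50 tagged, 48 caught later, 4 of them tagged.
print(estimate_population(50, 48, 4))  # 600.0
```

The result is only an estimate: a different second sample from the same lake would usually contain a slightly different number of tagged fish.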

This procedure would fail to give reliable results if the original sample were biased in some way and thus did not adequately represent the population; and it would fail if the released individuals were not well mixed with the population, leading to bias in the second sample.

A manager wants to randomly select products on an assembly line to test their quality. She generates a random number between $2$ and $10$, which tells her how many products to pass before picking up the next one. She then generates another random number and so on.

The first product she picks up is the first one on the assembly line.

She then generates the following numbers:

$10$, $11$, $7$, $14$, $4$

How many products did she test?

How many products did she pass before picking up the second product?

How many products did she pass between the third and fourth tests?

How many products were in front of the third one she tested?

The following table shows the gender of $50$ Year 12 students at a particular school.

| Students | Gender |
| --- | --- |
| $1$-$10$ | m, f, m, f, f, m, m, m, f, m |
| $11$-$20$ | m, f, m, m, f, m, f, m, f, f |
| $21$-$30$ | f, f, m, m, f, m, m, f, f, f |
| $31$-$40$ | m, f, m, f, f, f, m, m, m, f |
| $41$-$50$ | f, m, f, m, m, f, m, f, m, f |

What is the proportion of females in the sample? Express as a fraction in simplest form.

What proportion of the first $5$ students were female?

What proportion of the first $10$ students were female?

What proportion of the first $20$ students were female?

In a systematic sample, every second student is chosen, in the order that they appear, from the first twenty students. How many males will be chosen in the sample?

In a systematic sample, every third student is chosen, in the order that they appear, from the first forty students. How many females will be chosen in the sample?

The school has a population of $440$ students. If the proportion of males and females in the sample is indicative of the whole school, how many female students are there in the school?

An oil spill has spread over an area of $1650$ square kilometres. A team of biologists scans an area of $150$ square kilometres and finds $272$ dead marine animals. Find $y$, the estimated number of dead marine animals over the entire area of the oil spill.

Carry out investigations of phenomena, using the statistical enquiry cycle:

A. conducting surveys that require random sampling techniques, conducting experiments, and using existing data sets

B. evaluating the choice of measures for variables and the sampling and data collection methods used

C. using relevant contextual knowledge, exploratory data analysis, and statistical inference.

Design a questionnaire