10. Sampling and estimation

Lesson

In statistical experiments and surveys, we need to be precise in identifying the group of subjects being studied. In such studies, the population refers to the entire set of items, individuals or events that we intend to study.

To draw accurate conclusions about a population, such as finding the average height of students in Year 12 at a particular school, we could survey every member of the population. That is, we could measure the height of each Year 12 student who attends the school. The process of obtaining data from every member of the population is called a census.

In practice, it can be difficult and expensive to gather information about every subject in a population. For this reason, researchers often choose to restrict a study to a sample drawn from the population. If the sample is selected carefully, it can provide good information about the whole population without the expense of conducting a census.

A piece of numerical information about a population, such as the population mean, the population standard deviation or the proportion of the population displaying a given characteristic, is referred to as a population parameter. For example, the average weight of koalas living in a particular region is a population parameter: the population mean weight.

The same information obtained from a sample from the population is called a sample statistic or an estimator, since it is used to estimate the population parameter. The average weight of a random sample of koalas drawn from the population is an estimator for the average weight of koalas in the population as a whole.

Definitions

Population - the set of all eligible members of the group we intend to study

Sample - a subset of the population. We will use this subset to infer information about the greater population.

Census - the procedure of systematically acquiring and recording information from every member of a given population

In Australian history

On 6 November 1999, the Australian government held a referendum.

What was this referendum about? How was it conducted?

In November 2017 the Australian government held a plebiscite.

What was the plebiscite about? How was it conducted?

How is this referendum different to this particular plebiscite? Think about your response with reference to the sampling procedures used and the statistical concepts of a census versus a sample, and population parameters versus estimators.

For a statistical survey the population is deemed to be all people in a city who play in any organised sporting competition.

Which of the following are samples of that population?

Choose all appropriate answers.

A. $500$ spectators chosen from a weekend sports match.

B. The members of $3$ teams chosen from the local hockey tournament.

C. $100$ people chosen at random from a local park.

D. All students from a local school who compete in a school sports competition.

E. All the active members of a local football club.

For a sample to be useful for making inferences about a population, it needs to be sufficiently large and free of undue bias. The sample should be *representative* of the population, such that any subgroups present in the population are represented in the same proportion in the sample. Disproportionate representation of subgroups in sampling can lead to bias, that is, the tendency of a sample statistic to systematically over- or under-estimate a population parameter.

One way to try to ensure a sample is representative of the population is to take a simple random sample. A simple random sample involves a process where each individual has an equal chance of being selected for the sample. For example, if selecting a sample of $50$ students from a school, we could put all the students' names in a hat, mix well and then draw them out one at a time, or we could assign each student a number and use a random number generator to select the sample.
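The random number generator approach can be sketched in a few lines of Python. This is only an illustration: the school roll of $600$ numbered students, the helper name `simple_random_sample` and the seed are all assumptions made for the example.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Return k distinct members, each member having an equal chance of selection."""
    rng = random.Random(seed)
    # random.Random.sample draws without replacement.
    return rng.sample(population, k)


# Hypothetical school roll: students identified by the numbers 1 to 600.
students = list(range(1, 601))
sample = simple_random_sample(students, 50, seed=1)
print(sorted(sample))
```

Drawing without replacement mirrors pulling names from a hat: no student can be selected twice.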

When taking a sample, it is important to keep in mind the possibility of statistical bias. Our methodology needs to ensure that the sample is representative of the population we are interested in. Let's look at some common ways that bias may be introduced to a sample.

Selection bias typically occurs when certain individuals are more likely to be involved in the study than other individuals.

For example, suppose we randomly sample $1000$ households about their enjoyment of skiing. We conduct the survey by door-knocking households on the weekends of the three winter months of the year. Each door knocker records one of three letters for each household:

- "$Y$" for yes, I like skiing
- "$N$" for no, I don't like skiing
- "$U$" for unknown (the occupants were not home)

There is a possibility that the proportion of "$N$" responses, given by $p=\frac{N}{N+Y}$ (where $N$ and $Y$ here denote the number of each response), will be higher than expected simply because many of the regular skiers (expected to respond with "$Y$") are likely to be away at the snowfields. This is an example of selection bias.
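To see how large this effect could be, here is a rough sketch using made-up figures. Every number in it (the split of skiing and non-skiing households and the chances of finding each at home) is a hypothetical assumption chosen only for illustration.

```python
# All figures are hypothetical, chosen only to illustrate selection bias.
skiers = 300             # assumed households that like skiing
non_skiers = 700         # assumed households that do not like skiing
p_home_skier = 0.5       # assumed chance a skiing household is home in winter
p_home_non_skier = 0.9   # assumed chance a non-skiing household is home

# Expected number of each recorded response.
expected_Y = skiers * p_home_skier
expected_N = non_skiers * p_home_non_skier
expected_U = (skiers + non_skiers) - expected_Y - expected_N

true_p = non_skiers / (skiers + non_skiers)            # proportion in the population
observed_p = expected_N / (expected_N + expected_Y)    # p = N / (N + Y)

print(f"true proportion of N: {true_p:.3f}")
print(f"expected survey proportion of N: {observed_p:.3f}")
```

Under these assumptions the survey proportion of "$N$" responses sits above the true proportion, even though the households themselves were chosen at random.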

Exclusion bias occurs when a particular subset of a population is systematically omitted from a study. This is a type of selection bias. For example, suppose a random sample of household couples is surveyed to ascertain the number of children they have. If, in defining couples, we admit married couples only, then we are guilty of introducing exclusion bias: because we have systematically omitted couples living in a de facto relationship, our data might well become biased.

Self-selection bias occurs when the response is voluntary. An example would be a phone poll in a newspaper that asks readers to call in to voice their opinion. The resulting sample would not only be biased towards the newspaper's readership but would also over-represent individuals who have strong opinions.

Design flaw bias is unintentional bias caused by the actual measuring device used to collect the sample. In a quantitative experiment, a measuring device that has been incorrectly calibrated may systematically skew all results. In qualitative research, the scope for measurement bias is wider and much more subtle: it can arise from poorly designed questions, leading questions, interview technique, subjective response scales, the survey environment and the confidentiality of results.

Included under this heading is interviewer bias, which arises from the way the interviewer asks questions of the sample participants. In effect, interviewer bias is a design flaw. It can be avoided if the questions are carefully thought out, written down and distributed to interviewers who are themselves coached in how to deliver the questions verbally.

Also included under this heading is completion bias, which is caused by partially completed surveys. It comes under the broad heading of design flaw bias because, in most cases, partially completed survey returns arise from poorly considered surveys. Questions can be ambiguous, confusing or require long responses, and the participant can easily stop answering after a while or begin to answer carelessly. There are strategies to avoid this. For example, the number of questions could be minimised, and the questions could require simple but direct responses. Telling the participant the approximate time required to answer the questions is also a good idea.

Reporting bias is defined as selective revealing or suppression of information by the subjects or authors of a study. For example, subjects may like to present themselves in a favourable light and under-report information about drinking habits or unpopular beliefs, particularly if survey results are not confidential.

The term may also be used to refer to authors under-reporting unexpected or undesirable experimental results, attributing the results to sampling or measurement error, while being more accepting of expected or desirable results.

Reporting bias in the form of funding bias can occur when pressure, whether intended or otherwise, is put on researchers to reach conclusions that may be favourable to the funding body. For example, a cigarette company might spend money on researching the effects of smoking. It would clearly be in its favour to arrive at conclusions that dispute the health hazards of smoking.

Analytic bias is bias arising from the way sample data is analysed. It is important to realise that data is turned into information by the analysis. This leaves the possibility of bias coming from the analytic choices made by the researcher. As a simple example, the average house price in the sample set $\$200000$, $\$300000$ and $\$1000000$ could be stated as $\$300000$ (the median) or $\$500000$ (the mean), depending on the choice of statistical measure of centre used.
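The two figures in the house price example can be checked with a short sketch using Python's standard `statistics` module. The variable names are illustrative only.

```python
import statistics

# The three house prices from the example above, in dollars.
prices = [200_000, 300_000, 1_000_000]

mean_price = statistics.mean(prices)      # pulled upwards by the most expensive house
median_price = statistics.median(prices)  # the middle value

print(f"mean:   ${mean_price:,.0f}")
print(f"median: ${median_price:,.0f}")
```

Both values are legitimate "averages" of the same data, which is why the choice of measure should be stated alongside the result.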

The above list of biases is certainly not exhaustive; there are many others. What is important is that, when taking a random sample, we think about the real possibility of bias creeping into the analysis. We should also ensure that the sample estimators used are statistically unbiased.

A radio station conducts a poll asking its listeners to call in to say if they are for or against restrictions on scalpers selling tickets for gigs at a higher price.

Why is this not an appropriate way to conduct a poll? Select all that apply.

A. A large variety of people are likely to call.

B. A person can call more than once, so they could be counted more than once.

C. People with stronger views are more likely to call than those who don’t have a strong view.

D. Young people are more likely to call than elderly pensioners.

The reason we take samples of the population is to approximate, at a smaller scale, what is happening statistically in the larger population.

For example, if we were to take a random sample of $16$-year-old students in a particular school, measure their heights, and calculate the mean and standard deviation of our sample, our hope is that these would be statistically representative of the population of $16$-year-old students in the school as a whole. To be more certain, we might take another random sample and calculate the mean and standard deviation once again. The mean and standard deviation will vary between samples, but they give us estimates of the population parameters, which we can further assess.

The investigation for this chapter explores this idea further by taking samples from some common probability distributions we have studied, and comparing the graphical displays of our samples, along with a few key statistics, with the graphs and statistics of the distribution itself.

The key findings of the first part of the investigation are as follows:

- The larger the sample, the more likely the graph is a close approximation to the graph of the population distribution.
- The larger the sample, the more likely the mean and standard deviation are close to the parameters of the population distribution.
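These findings can be explored with a small simulation. The sketch below is not the investigation itself: it assumes Python's standard `random` module as the source of randomness, samples from a discrete uniform distribution on the integers $1$ to $7$, and uses an arbitrary seed and an illustrative helper named `sample_summary`.

```python
import random
import statistics


def sample_summary(n, rng):
    """Draw n values from the discrete uniform distribution on 1..7
    and return the sample mean and sample standard deviation."""
    draws = [rng.randint(1, 7) for _ in range(n)]
    # statistics.stdev uses the n - 1 divisor (sample standard deviation).
    return statistics.mean(draws), statistics.stdev(draws)


rng = random.Random(2024)  # arbitrary seed so the run can be repeated
for n in (25, 100, 10_000):
    mean, sd = sample_summary(n, rng)
    print(f"n={n:>6}: mean={mean:.2f}, sd={sd:.2f}")
```

Changing the seed gives different samples. Small samples tend to move around more from run to run, while large samples tend to stay close to the parameters of the distribution, although no single run is guaranteed to show this.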

$X$ has a discrete uniform distribution across the integers $1,2,3,4,5,6$ and $7$.

Calculate the mean and standard deviation of the distribution to two decimal places if necessary.

$\mu=\editable{}$

$\sigma=\editable{}$

A sample of size $25$ is taken from this distribution and the graph and table of results are shown below.

| Value | Frequency |
| --- | --- |
| $1$ | $2$ |
| $2$ | $4$ |
| $3$ | $7$ |
| $4$ | $3$ |
| $5$ | $4$ |
| $6$ | $3$ |
| $7$ | $2$ |

Calculate the mean and standard deviation of this sample to two decimal places if necessary.

$\overline{x}=\editable{}$

$s=\editable{}$

A sample of size $100$ is taken from this distribution and the graph and table of results are shown below.

| Value | Frequency |
| --- | --- |
| $1$ | $16$ |
| $2$ | $7$ |
| $3$ | $13$ |
| $4$ | $21$ |
| $5$ | $18$ |
| $6$ | $13$ |
| $7$ | $12$ |

Calculate the mean and standard deviation of this sample to two decimal places if necessary.

$\overline{x}=\editable{}$

$s=\editable{}$

Another sample of size $100$ is taken from this distribution and the graph and table of results are shown below.

| Value | Frequency |
| --- | --- |
| $1$ | $19$ |
| $2$ | $16$ |
| $3$ | $8$ |
| $4$ | $13$ |
| $5$ | $19$ |
| $6$ | $14$ |
| $7$ | $11$ |

Calculate the mean and standard deviation of this sample to two decimal places if necessary.

$\overline{x}=\editable{}$

$s=\editable{}$

By considering the three samples and the original distribution, select all statements that apply.

A. If another sample of size $25$ was taken, the mean and standard deviation would be the same as for the above sample of size $25$.

B. The larger the sample, the more likely the graph is close to the graph of the population distribution.

C. The larger the sample, the more likely the mean and standard deviation are close to the parameters of the population distribution.

D. A larger sample must have a mean and standard deviation closer to those of the population than a smaller sample.
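Calculations like those in the question above can be checked with a short Python sketch. It assumes the sample standard deviation $s$ uses the $n-1$ divisor (`statistics.stdev`); if your course uses the $n$ divisor, `statistics.pstdev` is the matching function. The helper name `expand` and the dictionary labels are illustrative only.

```python
import math
import statistics


def expand(freq_table):
    """Turn a {value: frequency} table into the list of individual observations."""
    return [value for value, count in freq_table.items() for _ in range(count)]


# Distribution: each of the integers 1..7 has probability 1/7.
values = range(1, 8)
mu = sum(values) / 7
sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / 7)
print(f"distribution: mean={mu:.2f}, sd={sigma:.2f}")

# The three frequency tables from the question.
samples = {
    "n=25": {1: 2, 2: 4, 3: 7, 4: 3, 5: 4, 6: 3, 7: 2},
    "n=100 (first)": {1: 16, 2: 7, 3: 13, 4: 21, 5: 18, 6: 13, 7: 12},
    "n=100 (second)": {1: 19, 2: 16, 3: 8, 4: 13, 5: 19, 6: 14, 7: 11},
}

for label, table in samples.items():
    data = expand(table)
    # Assumption: s is computed with the n - 1 divisor.
    print(f"{label}: mean={statistics.mean(data):.2f}, s={statistics.stdev(data):.2f}")
```

Comparing the printed sample statistics with the distribution's mean and standard deviation shows how much individual samples can differ from the distribution and from each other.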

$X$ has a Bernoulli distribution with $P\left(X=0\right)=0.21$ and $P\left(X=1\right)=0.79$.

Calculate the mean and standard deviation of the distribution to two decimal places if necessary.

$\mu=\editable{}$

$\sigma=\editable{}$

A sample of size $25$ is taken from this distribution and the graph and table of results are shown below.

| Value | Frequency |
| --- | --- |
| $0$ | $6$ |
| $1$ | $19$ |

Calculate the mean and standard deviation of this sample to two decimal places if necessary.

$\overline{x}=\editable{}$

$s=\editable{}$

A sample of size $100$ is taken from this distribution and the graph and table of results are shown below.

| Value | Frequency |
| --- | --- |
| $0$ | $18$ |
| $1$ | $82$ |

Calculate the mean and standard deviation of this sample to two decimal places if necessary.

$\overline{x}=\editable{}$

$s=\editable{}$

A sample of size $100$ is taken from this distribution and the graph and table of results are shown below.

| Value | Frequency |
| --- | --- |
| $0$ | $22$ |
| $1$ | $78$ |

Calculate the mean and standard deviation of this sample to two decimal places if necessary.

$\overline{x}=\editable{}$

$s=\editable{}$

By considering the three samples and the original distribution, select all statements that apply.

A. If another sample of size $25$ was taken, the mean and standard deviation would be the same as for the above sample of size $25$.

B. The larger the sample, the more likely the graph is close to the graph of the population distribution.

C. A larger sample must have a mean and standard deviation closer to those of the population than a smaller sample.

D. The larger the sample, the more likely the mean and standard deviation are close to the parameters of the population distribution.
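For reference, the mean and standard deviation of a Bernoulli distribution follow directly from the definitions of expected value and variance. Writing $p$ for $P\left(X=1\right)$ (a symbol introduced here for convenience), a short derivation is:

$$\mu = 0\times\left(1-p\right) + 1\times p = p$$

$$\sigma^2 = \left(0-p\right)^2\left(1-p\right) + \left(1-p\right)^2 p = p\left(1-p\right)$$

$$\sigma = \sqrt{p\left(1-p\right)}$$

With $p=0.79$ as in the question above, this gives $\mu=0.79$ and $\sigma=\sqrt{0.79\times 0.21}\approx 0.41$.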

understand the concept of a random sample

discuss sources of bias in samples, and procedures to ensure randomness

investigate the variability of random samples from various types of distributions, including uniform, normal and Bernoulli, using graphical displays of real and simulated data