Populations and Samples

NZ Level 8 (NZC) Level 3 (NCEA) [In development]

Populations and Simple Random Samples

Lesson

When taking a sample, it is important to keep in mind the possibility of statistical bias.

There are a number of types of bias that we need to discuss.

The first of these is referred to as estimator bias.

Suppose, for example, a random sample is taken from a population and the sample mean is calculated as $\overline{x}=\frac{\Sigma x}{n}$, where $n$ is the number in the sample. This single sample statistic is expected to have a value somewhere in the vicinity of the population mean $\mu$. Of course, that is the motivation for taking the sample in the first place.

Using the formula $\frac{\Sigma x}{n}$, we know that the size of the deviation from $\mu$ will depend to a large extent on the size of the sample taken. However, the formula chosen to calculate the sample mean can be shown to be an unbiased estimator of $\mu$.

An unbiased estimator of the population mean is one in which the average value of many repeated $n$-sized sample means is expected to equal the population mean.

Suppose an alternative estimator had been chosen, for example $\overline{x_{alt}}=\frac{\Sigma x}{n+1}$. This new formula possesses an in-built bias: its expected value is $\frac{n}{n+1}\mu$, so for a positive population mean, repeated sampling will systematically produce estimates that are, on average, slightly lower than $\mu$. Such an estimator would be a biased estimator.
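The difference between the two estimators can be seen in a small simulation. The sketch below is illustrative only: the normal population with mean 50 and standard deviation 10, the sample size of 5, the number of repeats and the function name `simulate_estimators` are all assumptions chosen for the example, not part of the lesson.

```python
import random
import statistics


def simulate_estimators(population_mean=50.0, population_sd=10.0,
                        n=5, repeats=20000, seed=1):
    """Average two estimators of the mean over many repeated samples.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    usual, alt = [], []
    for _ in range(repeats):
        # Draw one random sample of size n from the assumed population.
        sample = [rng.gauss(population_mean, population_sd) for _ in range(n)]
        total = sum(sample)
        usual.append(total / n)        # sample mean: sum(x) / n
        alt.append(total / (n + 1))    # alternative estimator: sum(x) / (n + 1)
    # Average each estimator across all repeated samples.
    return statistics.fmean(usual), statistics.fmean(alt)


avg_usual, avg_alt = simulate_estimators()
print(f"average of sample means:         {avg_usual:.2f}")
print(f"average of alternative estimates: {avg_alt:.2f}")
```

With these assumed values, the average of the sample means should land close to 50, while the average of the alternative estimates should land close to $\frac{5}{6}\times 50\approx 41.7$.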

The important point of course is to choose estimators that have no in-built bias.

Bias is not restricted to just estimator bias. There is bias that stems from the way we sample. Our methodology needs to ensure that the sample is representative of the population we are interested in, and there are well established statistical techniques that can be used to avoid biased samples.

Selection bias typically occurs when certain individuals are more likely to be involved in the study than other individuals. Reporting bias occurs when information becomes skewed because of the difficulty of obtaining some of the data.

For example, suppose we randomly sample $1000$ households about their enjoyment of skiing. We conduct the survey by a household door knock on the weekends of the three winter months in the year. The recorded response by each door knocker might be one of three letters for each household:

- "$Y$" for yes, I like skiing
- "$N$" for no, I don't like skiing
- "$U$" for unknown (the occupants were not home)

There is a possibility that the proportion of "N" responses, given by $p=\frac{N}{N+Y}$, will be higher than expected simply because a lot of the regular skiers (expected to respond with $Y$) are quite likely to be away at the snowfields. This is an example of selection bias.
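A rough simulation can show how this effect inflates $p$. Everything numerical below is a made-up assumption for illustration: that 40% of households like skiing, that skiers are home for the door knock half the time, and that non-skiers are home 90% of the time. The function name `door_knock_survey` is likewise hypothetical.

```python
import random

# Hypothetical values, chosen only to illustrate the effect.
TRUE_SKIER_SHARE = 0.40      # assumed share of households that like skiing
SKIER_HOME_PROB = 0.50       # assumed chance a skiing household is home
NON_SKIER_HOME_PROB = 0.90   # assumed chance a non-skiing household is home


def door_knock_survey(households=1000, seed=2):
    """Simulate the door knock and count Y, N and U responses."""
    rng = random.Random(seed)
    counts = {"Y": 0, "N": 0, "U": 0}
    for _ in range(households):
        likes_skiing = rng.random() < TRUE_SKIER_SHARE
        home_prob = SKIER_HOME_PROB if likes_skiing else NON_SKIER_HOME_PROB
        if rng.random() >= home_prob:
            counts["U"] += 1   # nobody home, response unknown
        elif likes_skiing:
            counts["Y"] += 1
        else:
            counts["N"] += 1
    return counts


counts = door_knock_survey()
observed_p = counts["N"] / (counts["N"] + counts["Y"])
print(counts)
print(f"observed proportion of N responses: {observed_p:.2f}")
print(f"true proportion who do not like skiing: {1 - TRUE_SKIER_SHARE:.2f}")
```

Under these assumptions the observed proportion of "N" responses comes out noticeably above the true 60%, because skiing households are under-represented among those who answer the door.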

There is also the possibility of reporting bias because we really have no idea how the $U$ households feel about skiing.

Design flaw bias is unintentional bias caused by the actual measuring device used to collect the sample. For example, if a teacher is estimating a class's understanding of a particular mathematical procedure, and the administered sample test questions are either poorly written, or refer to some other procedure, or do not represent a comprehensive scoping of examples, then the results will be a poor reflection of the true understanding. The results will be biased.

Included under this heading is interviewer bias. This is bias arising from the way the interviewer asks questions of the sample participants. In effect, then, interviewer bias is really a design flaw. It can be avoided if the questions are carefully thought out, written down and distributed to interviewers who are themselves coached in how to deliver the questions verbally.

Also included under this heading is completion bias. This is bias caused by partially completed surveys. It comes under the broad heading of design flaw bias because in most cases partially completed survey returns arise from poorly considered surveys. Questions can be ambiguous and confusing and/or require long responses, and the participant can easily stop answering after a while, or begin to answer questions carelessly. There are strategies to avoid this. For example, the number of questions could be minimised, and the questions could call for simple but direct responses. Telling the participant the approximate length of time required to answer the questions is also a good idea.

Good research should always be completely independent of any funding organisation. Bias can occur when pressure, whether intended or otherwise, is put on researchers to reach certain conclusions that may be favourable to the funding body. For example, a cigarette company might spend money on researching the effects of smoking. It would clearly be in their favour to arrive at conclusions that disputed the health hazards of smoking. Any pressure to reach desired conclusions introduces funding bias.

Analytic bias is bias arising from the way sample data is analysed. It is important to realise that data is turned into information by the analysis. This leaves the possibility of bias coming from the analytic choices made by the researcher. As a simple example, the average house price in the sample set $\$200000$, $\$300000$ and $\$1000000$ could be stated as $\$300000$ (the median) or $\$500000$ (the mean), depending on the choice of statistical average used.
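The two figures in the house price example can be checked with a few lines of code. This is just a quick sketch using Python's standard `statistics` module; the variable names are chosen for the example.

```python
import statistics

# The three house prices from the example above.
prices = [200_000, 300_000, 1_000_000]

median_price = statistics.median(prices)  # middle value when sorted
mean_price = statistics.mean(prices)      # sum of values divided by count

print(f"median: ${median_price:,.0f}")
print(f"mean:   ${mean_price:,.0f}")
```

The single high price pulls the mean well above the median, which is why the choice between the two averages changes the story the data appears to tell.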

Exclusion bias occurs when a particular subset of a population is systematically omitted from a study. For example, suppose a random sample of household couples is surveyed to ascertain the number of children they have. If in defining couples we admit married couples only, then we are guilty of introducing exclusion bias. Because we have systematically omitted all couples in a de facto relationship, our data might well become biased.

The biases above are certainly not an exhaustive list. There are many others. What is important is that, when taking a random sample, you think about the real possibility of bias creeping into the analysis. You should also ensure that the sample estimators used are statistically unbiased.

The Skin Cancer Council wants to survey the population to approximate the average amount of time someone spends in the sun each day.

Which of the following methods could minimise completion bias in the survey responses?

A. Requiring responders to note the exact times of the day that they spend in the sun.

B. Calling people during standard work hours.

C. Making sure the survey questions are comprehensive by having many questions.

D. Having one short question where responders select from given ranges of values for the number of hours they spend in the sun.

A radio station conducts a poll asking its listeners to call in to say if they are for or against restrictions on scalpers selling tickets for gigs at a higher price.

Why is this not an appropriate way to conduct a poll? Select all that apply.

A. A large variety of people are likely to call.

B. A person can call more than once, so they could be counted more than once.

C. People with stronger views are more likely to call than those who don't have a strong view.

D. Young people are more likely to call than elderly pensioners.

Make inferences from surveys and experiments: A determining estimates and confidence intervals for means, proportions, and differences, recognising the relevance of the central limit theorem B using methods such as resampling or randomisation to assess

Use statistical methods to make a formal inference