3.09 Representation, bias and ethics

Lesson

How we gather data can affect the conclusions we end up making, so there are a number of important points to consider.

Populations and samples

If we have data for every element of the set we are studying, then we call this dataset a population. If we only have data for some subset of the set we are studying, then we call this dataset a sample.

For example, if we were looking at the lengths of fish in a pond and measured every fish, we would call this dataset a population. If we only measured some of the fish, we would call this dataset a sample.

Statistics calculated from a population are more reliable than those calculated from a sample, because they use every value in the set we are studying.

A sample leaves out some of the data points, and we don't know how the excluded values compare with the ones that were included. As you can see below, it is possible for a sample to give different statistics from the population.

The lengths of fish in a lake recorded in a histogram for both the population and sample
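
To see this with numbers, here is a minimal Python sketch. The fish lengths are made up (randomly generated around 30 cm) purely for illustration, and the pond size of 200 fish and the sample size of 20 are assumptions, not data from this lesson. It compares the mean of the whole population with the mean of one random sample.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: the length (cm) of every one of 200 fish in a pond.
population = [round(random.gauss(30, 5), 1) for _ in range(200)]

# A sample: 20 of those fish, chosen at random without replacement.
sample = random.sample(population, 20)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"Population mean: {population_mean:.2f} cm")
print(f"Sample mean:     {sample_mean:.2f} cm")
```

Drawing a new sample from the same population will generally give a slightly different sample mean, while the population mean does not change.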

Representation and bias

Practically speaking, we can't always use the entire population, so we often have to use samples instead. In these cases, it is important to make sure that the sample properly represents the population. If the sample does not represent the population, then we might have introduced sampling bias into our data.

For example, imagine we wanted to determine the average growth rate of a species of tree. If we took a sample of only the trees that had the best growing conditions, then the average growth rate of the sample would probably be higher than that of the entire population of these trees. So this sampling method gives us a biased result.

The heights of a species of trees recorded in a histogram for both the population and biased sample
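
The same idea can be sketched in Python. Everything here is hypothetical: the 500 trees, the "conditions" score between 0 and 1, and the assumption that growth rate rises with that score are all invented for illustration. The sketch compares a sample of the trees with the best conditions against a random sample of the same size.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical population of 500 trees. Each tree gets a made-up
# "conditions" score from 0 (poor) to 1 (ideal), and its growth rate
# (cm per year) is assumed to increase with that score, plus some noise.
trees = []
for _ in range(500):
    conditions = random.uniform(0, 1)
    growth = 30 + 20 * conditions + random.gauss(0, 2)
    trees.append((conditions, growth))

# Biased sample: the 50 trees with the best growing conditions.
best_first = sorted(trees, key=lambda tree: tree[0], reverse=True)
biased_sample = best_first[:50]

# Random sample: 50 trees chosen without looking at their conditions.
random_sample = random.sample(trees, 50)


def mean_growth(group):
    """Mean growth rate (cm per year) of a list of (conditions, growth) pairs."""
    return statistics.mean(rate for _, rate in group)


population_mean = mean_growth(trees)
biased_mean = mean_growth(biased_sample)
random_mean = mean_growth(random_sample)

print(f"Population mean:    {population_mean:.1f} cm/year")
print(f"Random sample mean: {random_mean:.1f} cm/year")
print(f"Biased sample mean: {biased_mean:.1f} cm/year")
```

Under these assumptions the biased sample overstates the average growth rate, while the random sample usually lands much closer to the population value.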

There are many ways that a sampling method can bias the data. In the particular case of data about people, it is important to make sure the sample properly represents the population in terms of age, beliefs, culture, ethnicity, gender, location, and similar qualities, if what we are measuring is affected by any of these factors.

The population depends on what we want to measure. For example, when measuring the lifespan of animals, our population wouldn't be all animals on Earth. It would more likely be all the animals of a particular species.

Privacy and ethics

Presenting biased data as unbiased data is an ethical concern. It is unethical to deliberately mislead people by hiding relevant information we have, which includes how we've sampled the data. Anything that could be relevant to the investigation should be made clear.

In the same way that it's important to avoid bias in selecting data, it is important to avoid bias in presenting or interpreting it. It is also important to keep in mind that publishing identifying information about people without their consent is a breach of privacy, which is a significant ethical concern.

Remember!

Population - The dataset containing every element of the set we are studying.

Sample - A subset of a population which is used for calculating statistics.

Bias - A situation where a sample does not represent the population.

Practice questions

Question 1

To find the relationship between the distance students travel to a specific school and their attendance rate, a statistical inquiry is carried out over all the students in the school.

Is this a population or sample?

A. Population

B. Sample

Question 2

To investigate the relationship between hours per week spent on cardiovascular training and resting heart rate, $50$ men aged between 20 and 30 were randomly sampled.

Is this sample biased? If so, describe why.

Outcomes

MA12-8

solves problems using appropriate statistical processes
