10. Representing Data

Lesson

The growth in computerised technology has made it possible to analyse ever larger and more complex data sets, resulting in more efficient, responsive and adaptable processes. It has allowed for advances in a range of fields such as medicine, environmental science, transportation, manufacturing and logistics.

At the same time, data has also been used to influence opinion and create political divisions. It has intruded on personal privacy and created greater inequality. It is increasingly important for us to understand how data is collected and used, and be aware of the impact it may have on our lives and our future.

Statistics is concerned with the collection and analysis of data. It normally follows a process:

- Plan a question that can be answered with data
- Decide on how much data to collect
- Collect the data by conducting a survey, questionnaire, experiment or observation
- Represent the data using a table or chart
- Analyse the data
- Interpret the data and draw conclusions related to the original question being asked

One decision that has to be made, before data collection begins, is whether to collect data from every member of a population, or to only collect data from a sample of members within that population.

In statistics, a population refers to **every** member within any particular group of interest. It could be the entire population of a country, all the students in a school, all the frogs living in a wetland, all the trees in a forest, or all the cars in a parking lot.

A survey conducted on every member of a population is called a census. In Australia, a nation-wide census is conducted every five years by the Australian Bureau of Statistics (ABS). Data obtained from the census is used by the government to plan for the future direction of the country.

Collecting data from every member of a population is the most accurate way of gathering information, but it is not always the most practical, and can be very expensive. For these reasons, data is often gathered from a smaller group, or sample, that can be used to **estimate** the characteristics of the wider population.

The size of the sample is an important consideration. If the sample is too large, it may be too expensive or time-consuming to collect the data. If it is too small, the sample may not be representative of the population.

In many cases, a population will contain subgroups that each require proportional representation within the sample. As an example, consider a survey of a school where half the population are female, and gender is an important factor in the results of the survey. If the sample contains a higher proportion of males than females, then the sample is not representative of the population. We would say that the sample is biased rather than fair.

Bias in a sample means that certain characteristics of the population are over- or under-represented. One way to reduce bias is to use some form of **randomisation** in the sampling method.

There are four main methods used to sample a population:

- Simple random sampling
- Systematic sampling
- Self-selected sampling
- Stratified sampling

Each method has advantages and disadvantages, depending on the situation.

In simple random sampling, a sample is formed by selecting members from the population at random, where each member of the population has an equal chance of being selected. In simple cases, a sample could be created by drawing names from a hat. For most samples, though, it is more common to use a random number generator.

Generating random numbers

Calculators and spreadsheet applications usually have a **random number generator** that can generate random decimals between $0$ and $1$.

These can be used to create random numbers within any range of values. For example, if we require a random number between:

- $1$ and $50$, we would multiply the randomly-generated decimal by $50$, then round the answer up to the next whole number.
- $20$ and $30$, we would multiply the randomly-generated decimal by $10$ (the width of the range), add $20$, then round the answer up to the next whole number. This gives whole numbers from $21$ to $30$; to include $20$ as well, multiply by $11$ and add $19$ instead.
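The scaling idea in these examples can be sketched in Python. This is only an illustration, not a prescribed method: the function name `scale_to_range` is made up for this example, and it assumes we want every whole number from `low` to `high` inclusive, so it multiplies by the count of whole numbers in that range (`high - low + 1`) before rounding up.

```python
import math
import random


def scale_to_range(decimal, low, high):
    """Turn a decimal between 0 and 1 into a whole number from low to high.

    Hypothetical helper for illustration. It multiplies by the number of
    whole numbers in the range, rounds up, then shifts the result so the
    smallest possible value is `low`.
    """
    count = high - low + 1
    # Guard (an assumption of this sketch): a generator may return exactly 0,
    # which would round up to 0, so we treat that case as 1.
    step = max(1, math.ceil(decimal * count))
    return low - 1 + step


print(scale_to_range(0.532, 1, 50))   # 27, since 0.532 x 50 = 26.6 rounds up to 27
print(scale_to_range(0.532, 20, 30))  # 25, since 0.532 x 11 = 5.852 rounds up to 6

# Using the computer's own random decimal instead of a calculator value
print(scale_to_range(random.random(), 1, 50))
```

Python's standard library also offers `random.randint(low, high)`, which returns a random whole number in the same inclusive range directly.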

While simple random sampling can be one of the cheapest and least time-consuming methods, it may not adequately represent every sub-group within larger populations.

Sam wants to randomly select five people to use in a sample from a population of twenty. He begins by assigning each person in the population a number from $1$ to $20$. Using the random number generator on his calculator, he then generates five random numbers between $0$ and $1$:

$0.532$, $0.805$, $0.686$, $0.774$, $0.272$

Help Sam by converting these values to numbers between $1$ and $20$, so he can use them to select members for his sample.

**Solution**

To get random numbers between $1$ and $20$, we multiply each value by $20$ and round up to the next whole number:

$$\begin{aligned}
\text{First random number} &= 0.532 \times 20 \\
&= 10.64 \\
&\rightarrow 11 \quad \text{(rounded up to the next whole number)}
\end{aligned}$$

Repeating this for the remaining values gives us five random numbers between $1$ and $20$:

$11$, $17$, $14$, $16$, $6$

Anyone who was assigned one of these numbers in the population would be selected for the sample.

**Note**: in this case we wanted five unique random numbers. If our calculator generated two decimals that were close together, we could end up with the same value when converted to a number between $1$ and $20$. In that situation, we would keep generating additional random numbers until we had five that were unique.
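The worked example and the note about unique numbers can be sketched in Python as follows. The function name `unique_random_numbers` is hypothetical, and the loop simply keeps generating decimals until enough different numbers have been collected, as described above.

```python
import math
import random

# Sam's five calculator values, converted as in the worked solution:
# multiply by 20, then round up to the next whole number.
sam_values = [0.532, 0.805, 0.686, 0.774, 0.272]
sam_numbers = [math.ceil(value * 20) for value in sam_values]
print(sam_numbers)  # [11, 17, 14, 16, 6]


def unique_random_numbers(how_many, population_size):
    """Generate random numbers from 1 to population_size with no repeats.

    Hypothetical helper for illustration only.
    """
    chosen = []
    while len(chosen) < how_many:
        decimal = random.random()
        # max(1, ...) guards against a decimal of exactly 0 (sketch assumption)
        number = max(1, math.ceil(decimal * population_size))
        if number not in chosen:  # skip repeats and generate another
            chosen.append(number)
    return chosen


picks = unique_random_numbers(5, 20)
print(picks)  # five different numbers between 1 and 20; varies on each run
```

In practice, `random.sample(range(1, 21), 5)` from Python's standard library does the same job in one step, because it never selects the same member twice.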

In systematic sampling, a sample is formed by choosing a random starting point, then selecting members from the population at regular intervals (e.g. every $5$th member). In other words, it uses a 'system' for selection. This method is often favoured by manufacturers for sampling products on a production line.

While systematic sampling is often simpler to carry out than simple random sampling, it can still under- or over-represent a subgroup within the population.

A machine produces $400$ items a day. At what interval should an item be selected in order to obtain a systematic sample of $25$ items?

**Solution**

$$\begin{aligned}
\text{Interval required} &= \frac{\text{population size}}{\text{sample size}} \\
&= \frac{400}{25} \\
&= 16
\end{aligned}$$

Therefore, every $16$th item should be selected.
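As a rough sketch, the interval calculation and the selection itself could be written in Python like this. The function name `systematic_sample` and the choice of starting item are assumptions made for illustration; the interval is the population size divided by the sample size, $400 \div 25 = 16$.

```python
import random


def systematic_sample(population_size, sample_size, start=None):
    """Return the interval and the item numbers chosen by systematic sampling.

    Hypothetical helper. Items are numbered from 1 to population_size.
    If no starting item is given, one is chosen at random from the first
    interval, which matches the 'random starting point' idea.
    """
    interval = population_size // sample_size
    if start is None:
        start = random.randint(1, interval)
    selected = [start + i * interval for i in range(sample_size)]
    return interval, selected


# The production line example, with an assumed starting item of 7
interval, items = systematic_sample(400, 25, start=7)
print(interval)   # 16
print(items[:4])  # [7, 23, 39, 55]
print(len(items)) # 25
```

Whatever starting point is chosen within the first $16$ items, the last selected item is at most item $400$, so the sample always fits within the day's production.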

In self-selected sampling, a sample is formed by members of the population who volunteer themselves for selection. This is a common sampling method in the field of medicine, where volunteers may be asked to take part in a medical trial.

This method tends to be used in situations where it may be difficult to randomly select people, perhaps due to ethical or logistical reasons. Because the people who choose to volunteer may differ from those who do not, a self-selected sample may not be truly representative of the wider population.

In stratified sampling, a sample is formed by selecting the same proportion of members from each subgroup (or stratum) within a population. This is particularly appropriate when there are clearly defined subgroups, and each one requires adequate representation.

For example, if a survey was being conducted on the student population of a school, and it was important to represent all year levels, a sample might be formed by randomly selecting $10\%$ of students from each year group.

While stratified sampling can be more expensive and time-consuming than other methods, it can also provide the best representation of the wider population.

Martha wants to use a stratified sample of $50$ members to survey her school's population. If there are $70$ teachers and $1100$ students at her school:

- How many teachers should be in the sample?
- How many students should be in the sample?

**Solution**

- The total school population is $1170$ people. If teachers make up $70$ out of $1170$, we need to find the same proportion of teachers in a sample of $50$ people.
  Number of teachers in sample $= \frac{70}{1170}\times 50 = 2.991\ldots \approx 3$ (rounded to the nearest whole number)
- We use the same approach to find the number of students in the sample.
  Number of students in sample $= \frac{1100}{1170}\times 50 = 47.008\ldots \approx 47$ (rounded to the nearest whole number)

**Note**: we need to check that the number of teachers and students add to the required sample size of $50$, i.e. $3$ teachers $+$ $47$ students $=$ $50$.
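The proportional calculation for each subgroup can be sketched in Python. The function name `stratified_allocation` is made up for this illustration; it applies the same steps as the solution above (subgroup size divided by total population, multiplied by the sample size, then rounded).

```python
def stratified_allocation(strata_sizes, sample_size):
    """Share a sample between subgroups in proportion to their sizes.

    Hypothetical helper. `strata_sizes` maps each subgroup name to the
    number of members it has in the population.
    """
    total = sum(strata_sizes.values())
    return {
        name: round(size / total * sample_size)
        for name, size in strata_sizes.items()
    }


allocation = stratified_allocation({"teachers": 70, "students": 1100}, 50)
print(allocation)                # {'teachers': 3, 'students': 47}
print(sum(allocation.values()))  # 50, matching the required sample size
```

Because each subgroup is rounded separately, the parts will not always add up to the required sample size, so the final check in the note is worth doing every time.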

A school principal wants to estimate the number of students who ride a bicycle to school.

Which two samples would avoid introducing bias?

A. All students who are in the school band.

B. $8$ students in the hallway.

C. Ten students from each grade, chosen at random.

D. $130$ students during the lunch periods.

Out of $2160$ students in a school, $216$ were chosen at random and asked their favourite colour out of red, blue and yellow. Of these, $99$ chose red, $63$ chose blue and $54$ chose yellow.

One in every how many students at the school was sampled?

Estimate the total number of students in the whole school who prefer the colour red.

Estimate the total number of students in the whole school who prefer the colour blue.

Estimate the total number of students in the whole school who prefer the colour yellow.

Data can be collected in a variety of different ways:

- Questionnaires and surveys
- Experiments and simulations
- Observational studies
- Data logging (most websites do this automatically)

The design of the data collection method and the way it is implemented can have a big impact on the quality of the data obtained. Below, we look at some of the considerations when using a questionnaire for collecting data, but similar considerations would apply for each of the other methods.

Collecting data from people in a sample or population often involves asking them questions and recording their responses. One of the most common survey methods is the questionnaire, where participants answer a set of questions on a printed or online form. Other survey methods, like personal interviews, can be used to collect data in a similar way.

The quality of the data collected depends a lot on the quality of the questions being asked. Some important considerations when designing questionnaires are:

- Using simple language that is easy to understand
- Respecting people's right to privacy
- Considering the ethics of the issue being analysed
- Asking questions that are clear and concise, with no ambiguity
- Asking questions that are fair and unbiased
- Only asking questions that are relevant
- Allowing for a range of different responses
- Keeping the overall length of the survey appropriate
- Ensuring the survey is easy to complete

The two main types of questions used in questionnaires are 'closed' and 'open'.

**Closed questions** are those where the participant answers by selecting YES or NO, or by choosing from a list of options (multiple choice, or indicating a value on a scale). This is by far the most common question type because it is relatively easy to answer and can be recorded automatically by a computer.

**Open questions** are those that require a more detailed written response. These questions may allow for a more accurate response and can often pick up details that are missed in closed question types. The main drawback with open questions is that they are difficult to analyse using computerised methods. They may be more appropriate, though, for interview-style surveys.

One of the main issues with questionnaires is knowing whether or not the participant has understood the question or answered the question accurately. There is also the issue of what to do if a question was not answered at all.

A commonly used method of estimating the population of a species is capture-recapture. As the name suggests, this method involves capturing individuals, releasing them, then capturing them again. Simple, right? In practice, it is slightly more complicated, and is based on ratios.

The idea is to capture some individuals, mark them in some way so you know you have already captured them, then release them back into the environment. After a little time has passed, we do the same thing again but this time we compare the number of marked individuals to the number captured. With this information we can estimate the total population!

Seems a bit of a stretch, doesn't it? Actually it boils down to a single ratio: the number of individuals we capture in one go compared to the total population. Let's introduce some notation: let $n$ be the number in the first capture, $N$ be the total population, $k$ be the number of marked individuals in the second capture and $K$ the total number in the second capture. Let's say that during the first capture we mark a certain proportion of the total population; this proportion would be $n:N$. During the second capture, given random conditions and no bias in the selection of individuals, we would expect the ratio $k:K$ of marked individuals to the number in the capture to be the same as $n:N$. That's it! With this assumption we can form an equation and solve for $N$.

$$\begin{aligned}
n:N &= k:K \\
\frac{N}{n} &= \frac{K}{k} \\
N &= \frac{nK}{k}
\end{aligned}$$

Although this method is quite straightforward, in practice there are some drawbacks:

- It depends on the mobility of individuals. For example, it doesn't work with plants or other stationary organisms. Also, if your sampling area is too small, marked individuals may range outside of it and affect your result.
- For it to be effective you must do the second capture shortly after the first, to avoid the possibility of individuals dying, seasonal change and so on, but this also reduces the probability that the population has adequately mixed back together.
- It actually estimates the population in the sampling area, which can then be extrapolated to a larger area. This is often reasonable, but one area does not always reflect the population density of a more general area.

That said, there are ways around all these issues and ecologists still use this method every day. One way to get more accurate and reliable results is to do three or more captures, comparing the ratio of marked individuals multiple times. Can you think of any other solutions to the drawbacks above?

Idat wants to estimate the total population of skinks in his backyard. He performs a capture-recapture, marking 12 individuals in the first capture and finding 3 marked out of the 9 individuals in the second capture. What is the total population?

**Think:** We have three of the variables used in the capture-recapture formula: the number in the first capture $n$, the number of marked individuals in the second capture $k$, and the number in the second capture $K$.

**Do:** Assign the variables correctly then, using the formula, substitute and evaluate $N$.

$$\begin{aligned}
N &= \frac{nK}{k} \\
&= \frac{12\times 9}{3} \\
&= 36
\end{aligned}$$

**Reflect:** Would this number be typical of any backyard? What sort of factors could affect the population of skinks in a particular backyard?
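The capture-recapture calculation above can also be sketched in Python. The function name `estimate_population` and its parameter names are chosen for this illustration; they correspond to $n$, $K$ and $k$ in the formula $N = \frac{nK}{k}$.

```python
def estimate_population(first_capture, second_capture, marked_in_second):
    """Estimate total population N = n * K / k using capture-recapture.

    Hypothetical helper:
      first_capture    is n, the number marked in the first capture
      second_capture   is K, the total number in the second capture
      marked_in_second is k, the marked individuals in the second capture
    """
    if marked_in_second == 0:
        # With no marked individuals recaptured, the ratio cannot be formed.
        raise ValueError("no marked individuals were recaptured")
    return first_capture * second_capture / marked_in_second


# The skink example: 12 marked, then 3 marked out of 9 in the second capture
estimate = estimate_population(12, 9, 3)
print(estimate)  # 36.0
```

The result is only an estimate: it relies on the assumption stated earlier that the marked individuals have mixed back into the population and that the second capture is unbiased.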