1.01 Statistical design

Introduction

Statistics can be used to answer many questions we have about the world in which we live. However, we need to make sure the collected data is accurate. In this lesson, we will learn how to design a plan for gathering data that accurately represents the question being answered.

Statistical design

In statistics, a population refers to every member within any particular group of interest. A survey conducted on every member of a population is called a census.

Collecting data from every member of a population is the most accurate way of gathering information, but it is not always the most practical and can be very expensive. Collecting data from a subset of the population, called a sample, can be quicker and less expensive.

When summarizing the collected data, we use different terms depending on whether the data came from a sample or from the whole population.

Parameter

A number that summarizes data from a population

Statistic

A number that summarizes data from a sample

If the methods of collecting the data were unbiased and the sample was representative of the population, the statistic may be used to understand the population. When we apply a statistic to a population, we are making an inference.
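
The difference between a parameter and a statistic can be illustrated with a short simulation. The Python sketch below is only an illustration: the simulated population of 10,000 heights, the sample size of 200, and the names `population`, `parameter_mean`, and `statistic_mean` are all assumptions made for this example.

```python
import random

# Hypothetical population: 10,000 simulated heights in cm (illustrative data only).
rng = random.Random(42)
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Parameter: a number that summarizes data from the whole population.
parameter_mean = sum(population) / len(population)

# Statistic: the same summary, computed from a random sample of the population.
sample = rng.sample(population, 200)
statistic_mean = sum(sample) / len(sample)

print(f"Parameter (population mean): {parameter_mean:.1f}")
print(f"Statistic (sample mean):     {statistic_mean:.1f}")
```

Because the sample is chosen at random, the statistic should land close to the parameter, which is what makes an inference about the population reasonable.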

There are three methods of statistical design that can be used: a sample survey, an observational study, or an experiment.

Sample survey

A research method used to collect information from a group of individuals

This design is best when the information must be provided by a person. Popular sample surveys are ones that ask about an opinion, a feeling, or a preference.

Observational study

A study where a group is watched and monitored with no outside intervention

An observational study is best for determining a correlation between two variables of interest, but it cannot determine whether one factor was a cause of another factor.

Experiment

A planned method that is randomly applied to a group with the intent of finding cause and effect relationships

This design is best for determining cause and effect relationships. In an experiment, subjects are separated into a control group and an experimental group. A treatment is applied to the experimental group but not to the control group. This helps determine whether or not the treatment applied to the experimental group was the cause of any differences from the control group.
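
One possible way to separate subjects into a control group and an experimental group is to shuffle them and split the list in half. The sketch below assumes 20 hypothetical plants; the function name `assign_groups` and the `seed` parameter are illustrative choices, not a prescribed method.

```python
import random

def assign_groups(subjects, seed=None):
    """Randomly split subjects into a control group and an experimental group."""
    rng = random.Random(seed)
    shuffled = list(subjects)   # copy so the original order is left unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subjects: 20 plants labeled plant_1 to plant_20.
plants = [f"plant_{i}" for i in range(1, 21)]
control, experimental = assign_groups(plants, seed=1)

print("Control group:     ", control)
print("Experimental group:", experimental)
```

Random assignment means that any differences between the two groups, other than the treatment, are due to chance rather than to how the groups were chosen.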

Examples

Example 1

A city councilman wants to determine whether a new skateboard park or a new ice-skating rink should be built as the new community building project. The new project will be located in the city park.

a

Identify the target population.

Worked Solution
Apply the idea

The target population will be the members of the community or residents of the city. These will be the people using, building, and paying for the upkeep of the park.

Reflect and check

The city may hope that residents of other cities will be drawn to the park because of the new project, but the target population will be the people most likely to use the skateboard park or ice-skating rink.

b

What design methodology would be best to find out how the community feels about the two proposed community building projects?

Worked Solution
Apply the idea

Since we want to find out the community's opinions about the two building project options, we should use a survey design.

c

Explain why surveying members of the ice hockey team about their preference is not representative of the population.

Worked Solution
Create a strategy

For the sample to be representative of the population, different types of community members of varying ages with various hobbies should have an opportunity to express their opinion.

Apply the idea

An ice hockey team will most likely prefer a new ice-skating rink, but we do not know the preferences of the entire community.

The team members may also be around the same age, and this age does not represent all ages within the community.

Finally, there may not be many team members, but the population of the city might be relatively large in comparison.

Example 2

Determine whether each of the following situations demonstrates a sample survey, an experiment, or an observational study.

a

A grocery store wants to know if their customers prefer using the self-checkouts or if they prefer using the standard checkout lanes that are staffed.

Worked Solution
Create a strategy

To determine which type of design is best for this situation, we need to determine how the data can be collected.

Apply the idea

Because the grocery store wants to know their customers' opinions, they will need to ask their customers about their preference on checkout style. A sample survey is the best design for gathering this information.

Reflect and check

If the grocery store wanted to know which types of checkouts were used more frequently, an observational study would have been the better design.

b

A group of students wants to know how different levels of fertilizer affect plant growth.

Worked Solution
Create a strategy

A sample survey would not make sense in this situation. An observational study determines correlation, but it cannot determine if the fertilizer was the cause of specific growth differences. An experiment can determine cause and effect relationships.

Apply the idea

Since the students want to know if different levels were the cause of plant growth levels, the students should design an experiment.

Reflect and check

Although an observational study could have identified a correlation between fertilizer levels and plant growth, it cannot show that the fertilizer caused the differences in growth. Other factors (like sun or water levels) may also have caused a difference in plant growth.

If they use an experiment, the students can control the other factors (like sun and water levels) to make sure all plants receive the same amount. This will help them determine if the fertilizer was truly the cause of the plant growth.

c

Endangered, wild wolves were reintroduced to Yellowstone National Park. The conservationists want to know if, and by how much, the population of wolves is growing.

Worked Solution
Create a strategy

A sample survey would not make sense in this context. An observational study does not interfere with the lifestyle of the wolves. An experiment would use certain factors to attempt to change the population of the wolves with the purpose of determining if these factors had an effect on the population.

Apply the idea

Because the conservationists do not want to control the population or try to make it grow by using a certain tactic, an observational study is the best type of design for collecting data.

The conservationists would still need to find a way to track the population, like tagging the wolves or placing trackers on them, but this tactic does not try to increase the population. It is a method of observing how the population grows naturally.

Example 3

A researcher captures 400 fish in a lake, tags them, then releases them. The following day, he captures 1200 fish, of which 100 have tags attached to them.

a

Describe a statistic that could be calculated from the given information.

Worked Solution
Create a strategy

A statistic summarizes data from a sample. The population in this problem is the total number of fish in the lake, so the 400 fish and the 1200 fish are samples of the population.

Apply the idea

The second sentence tells us 100 of the 1200 fish from the sample had tags. This is a statistic, and we can rewrite it as a percentage.

\begin{aligned}
\dfrac{100}{1200} &= \dfrac{1}{12} \\
&= 0.08\overline{3} \\
&= 8\frac{1}{3}\%
\end{aligned}

One statistic is that 8\frac{1}{3}\% of the fish caught from the lake on the second day were tagged.

b

What inference can be made about the population based on the statistic from part (a)?

Worked Solution
Create a strategy

If the second sample is representative of the lake, the proportion of tagged fish in the second sample should be approximately equal to the proportion of tagged fish in the total population. The fraction of tagged fish in the second sample is \dfrac{1}{12}, and it is known that there are 400 tagged fish in the lake.

Apply the idea

\begin{aligned}
\dfrac{\text{Tagged fish from sample }1}{\text{Total population}} &= \dfrac{\text{Tagged fish from sample }2}{\text{Total fish from sample }2} \\
\dfrac{400}{\text{Total population}} &= \dfrac{1}{12} \\
4800 &= \text{Total population}
\end{aligned}

Therefore, there are an estimated 4800 fish in the lake.
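
The same proportion reasoning can be written as a small function. This is a minimal sketch that assumes the second sample is representative of the lake; the function name `estimate_population` and its parameter names are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction

def estimate_population(tagged_first, caught_second, tagged_in_second):
    """Estimate the total population from a tag-and-recapture count.

    Assumes the proportion of tagged fish in the second sample matches
    the proportion of tagged fish in the whole population.
    """
    if tagged_in_second == 0:
        raise ValueError("No tagged fish were recaptured, so no estimate is possible.")
    proportion_tagged = Fraction(tagged_in_second, caught_second)
    # tagged_first / total = proportion_tagged, so total = tagged_first / proportion_tagged
    return tagged_first / proportion_tagged

estimate = estimate_population(400, 1200, 100)
print(estimate)  # 4800
```

Using `Fraction` keeps the arithmetic exact, so the repeating decimal 0.08\overline{3} never has to be rounded.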

Reflect and check

To check our answer, we can multiply the total population by the statistic from part (a) to see if it gives us the 400 fish originally tagged.

\begin{aligned}
4800\cdot 8\frac{1}{3}\% &= 4800\cdot \frac{\frac{25}{3}}{100} && \text{Rewrite the percentage as a fraction} \\
&= 48\cdot \frac{25}{3} && \text{Divide 4800 by 100} \\
&= 16\cdot 25 && \text{Divide 48 by 3} \\
&= 400 && \text{Evaluate the multiplication}
\end{aligned}

Idea summary

A parameter is a number that summarizes data from a population. A statistic is a number that summarizes data from a sample. A survey conducted on every member of a population is called a census.

If the sample was representative of the population, the statistic may be used to understand the population. When a statistic is applied to a population, we are making an inference.

There are three methods of statistical design:

  1. Sample survey - used to gather information from a group of individuals

  2. Observational study - used to determine a correlation between two variables without outside intervention

  3. Experiment - used to determine cause and effect relationships

Sampling methods and bias

To avoid bias when gathering sample data, it is important that the method in which the data is collected is random.

Random sample

A sample selected from the population in such a way that every object has an equal chance of being selected

Randomness is one way to ensure that the sample is representative of the population. A few different methods of creating a sample are described below.

| Sampling method | Description |
| --- | --- |
| Systematic sample | Objects are chosen based on a consistent rule. |
| Stratified sample | Objects are separated into groups based on a characteristic. Objects are then randomly selected from each group. |
| Cluster sample | Objects in the population are randomly separated into groups. One or some of the groups are randomly selected, then objects within the selected groups are randomly chosen. |
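
The sampling methods described above can be sketched in Python. Everything specific here is an assumption made for illustration: the population of 60 hypothetical students spread evenly over grades 9 to 12, the sample sizes, and the variable names. A simple random sample is included for comparison.

```python
import random

rng = random.Random(0)

# Hypothetical population: 60 students, 15 in each of grades 9 to 12.
students = [{"id": i, "grade": 9 + i % 4} for i in range(60)]

# Simple random sample: every student has an equal chance of being chosen.
simple = rng.sample(students, 12)

# Systematic sample: a consistent rule (every 5th student, from a random start).
start = rng.randrange(5)
systematic = students[start::5]

# Stratified sample: group by a characteristic (grade), then randomly
# select 3 students from every group.
stratified = []
for grade in (9, 10, 11, 12):
    stratum = [s for s in students if s["grade"] == grade]
    stratified.extend(rng.sample(stratum, 3))

# Cluster sample: randomly separate students into 6 groups of 10, randomly
# select 2 of the groups, then randomly choose 5 students within each.
shuffled = rng.sample(students, len(students))
clusters = [shuffled[i:i + 10] for i in range(0, len(shuffled), 10)]
chosen_clusters = rng.sample(clusters, 2)
cluster = [s for group in chosen_clusters for s in rng.sample(group, 5)]

for name, chosen in [("simple", simple), ("systematic", systematic),
                     ("stratified", stratified), ("cluster", cluster)]:
    print(name, sorted(s["id"] for s in chosen))
```

Note that the stratified sample draws from every group, while the cluster sample draws from only the selected groups.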

Exploration

A city mayor needs to decide which intersections in the city need stop signs, which ones need stoplights, and which ones should be converted to roundabouts. She decides to give a survey to determine how people feel about the current road conditions. She intends to prioritize fixing the intersections that cause the most frustration for drivers.

  1. Decide whether or not the results from the following surveys are an accurate representation of the population.

  2. If they are not an accurate representation, explain why.

  • The mayor asks the principal of the local high school to give the survey to 100 random students who drive themselves to school.

  • The mayor gives the survey to everyone in her neighborhood.

  • The mayor sets up a booth at a festival and asks anyone who drives and is interested to fill out a survey.

  • The mayor randomly selects 100 members from the voter registration and conducts phone surveys.

  • The mayor asks the staff members at the DMV (Department of Motor Vehicles) to ask everyone who comes in that day if they'd be willing to participate in the survey.

If a sample is not representative of the entire population, we cannot use the survey to draw conclusions or make inferences about the population. Instead, we say that the survey has bias. There are a number of potential sources of bias that we should avoid:

  1. Poor sampling techniques

    • If the people being surveyed do not resemble the population, the survey is likely to be biased.

    • Convenience samples, where samples are chosen because they are easily available, introduce bias. These groups are likely to have particular traits in common that are not representative of the population as a whole.

    • Self-selected samples, where people volunteer their input, introduce bias. People who choose to self-select often have strong opinions that might not be representative of the population as a whole.

  2. Too small of a sample

    • In general, the larger the number of people being surveyed, the closer the results will be to those of a census. This is known as the Law of Large Numbers.

  3. Poor question wording

    • If the question asked does not answer the purpose of the study, it cannot be used to interpret the variable of interest.

  4. Using loaded or leading questions

    • Avoid questions which use words that suggest preference, invoke emotion, or might otherwise influence the results of the survey.
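
The effect of sample size can be explored with a short simulation. In the sketch below, the population of 10,000 drivers, the assumed true proportion of 30\% frustrated drivers, and the function name `sample_proportion` are all hypothetical values chosen for illustration.

```python
import random

rng = random.Random(7)

# Hypothetical population of 10,000 drivers: 1 = frustrated, 0 = not frustrated.
# The true proportion (the parameter) is assumed to be 0.30.
population = [1] * 3000 + [0] * 7000
parameter = sum(population) / len(population)

def sample_proportion(n):
    """Return the proportion of frustrated drivers in a random sample of size n."""
    sample = rng.sample(population, n)
    return sum(sample) / n

print("Parameter:", parameter)
for n in (10, 100, 1000, 10_000):
    print(f"Sample size {n:>6}: statistic = {sample_proportion(n):.3f}")
```

Small samples can give statistics far from the parameter, while larger samples tend to land closer to it. When the sample size equals the population size, the sample is a census and the statistic equals the parameter.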

Examples

Example 4

The nutrition team at a school wants to know to what extent students are making healthy lunch choices. The school cafeteria offers a salad bar, a hot lunch option and also has vending machines available. For each of the following proposed sampling method designs, describe the sampling method and decide if it is biased.

a

Observe the lunch choices of 3 randomly selected students from each of 15 randomly selected lunch tables in the cafeteria.

Worked Solution
Create a strategy

The students in the cafeteria have naturally grouped themselves by sitting at different tables.

Apply the idea

This is a cluster sample because students were grouped at tables, 15 tables were randomly selected, and then 3 students were randomly selected from each of those tables.

However, this sample is biased because the way the students grouped themselves is not random. Students will likely sit by friends, and there may be many students at some tables and fewer students at other tables. Self-selection bias was introduced into the sampling method.

Reflect and check

The main differences between cluster samples and stratified samples are:

  • Stratified samples are grouped by a common characteristic (gender, age, location, etc.) but cluster groups do not necessarily have common characteristics

  • Objects are chosen from each group in a stratified sample, but objects are chosen from only one or a few groups in a cluster sample

b

Survey 20 random students in the hallway between periods after lunch.

Worked Solution
Apply the idea

This is a simple random sample because any of the students in the hallway would have an equal chance of being surveyed. This sample is biased because 20 students is a small number and likely not representative of the entire population of students in the school.

We might also question whether or not sampled students would be entirely truthful about their lunch choices. They might not want to admit they bought chips and soda for lunch from the vending machine if they thought someone would judge their choices. A survey may not be the most accurate design methodology to answer this research question.

Reflect and check

If the school is very small, then 20 students may be considered representative of the population. However, it may be worth taking a census instead of a sample if the school is that small.

c

Observe the lunch choices of 50 randomly selected students from each grade level.

Worked Solution
Apply the idea

Students are grouped by grade level, then students are randomly chosen from each grade. This is a stratified sample, and it is not biased because there is a large number of students chosen overall and the method of choosing them was random.

Reflect and check

Although this is a non-biased sampling method, it may not be the most practical. It may depend on who is doing the observing. It's likely the observer won't know the names and faces of the randomly selected students. Students are also unlikely to cooperate with wearing some kind of identification that shows they were randomly selected for the study. Also, knowing their lunch choices are being observed could make students choose differently than they would have otherwise.

d

Observe the lunch choices of every 5th student that enters the cafeteria on a particular day.

Worked Solution
Apply the idea

This is a systematic sample because the students were chosen by a rule. It is not biased because the rule ensures random selection, and it was applied to all students in the cafeteria on a given day. It is also a method that would be practical to implement, and can be subtle, so students' lunch choices will not be influenced.

Reflect and check

Systematic samples can be biased if they are applied to a subset of the population that is not representative of the entire population. For example, selecting every 5th student on the track team would not be representative of the rest of the students in the school.

Example 5

Students in a certain state must take 4 years of math in high school. A school in that state is deciding whether or not to add a statistics course as an additional option for students' senior year.

Mario surveyed 20 students from his junior year advanced pre-calculus class to find out whether juniors at his school think the school should offer a statistics class. 30\% of students said yes, and 70\% of students said no. The school has 750 juniors.

a

State if 70\% is a statistic or parameter. Explain how you know.

Worked Solution
Create a strategy

We know that a statistic is a number that summarizes data from a sample, which may be applied to the population if the sampling was unbiased.

We also know that a parameter is a number that summarizes data from a population.

Therefore, we need to determine whether 70\% summarizes data from a sample or a population.

Apply the idea

Mario has interviewed 20 juniors from his math class in order to determine what students from the junior class think. Because there are more than 20 students in the junior class, we know that 20 students is a sample. This means that 70\% summarizes data from a sample.

Therefore, 70\% is a statistic.

b

State the sampling method that Mario used to gather data. Explain your reasoning.

Worked Solution
Create a strategy

From part (a), we know that Mario has interviewed a sample of students from his school, as opposed to the entire school population. He used a sample survey to collect the data.

Apply the idea

Mario only chose students who were in his math class. This is a convenience sample because it is easy for him to collect data from those students. He probably knows those students, they are close to him in proximity, and he does not have to go to much effort to survey them.

c

Write an invalid conclusion based on Mario's survey results. Explain why the claim is invalid.

Worked Solution
Create a strategy

We can create an invalid conclusion by claiming something is a fact when it is not necessarily true. One way to do this is by making an inference about the population when there is a source of potential bias in the data.

Apply the idea

An example of an invalid claim could be:

30\% of all juniors at Mario's school want to add statistics as a 4th math course.

This claim is invalid because Mario interviewed only students in an advanced pre-calculus class. These students may be more likely than other juniors to prefer taking calculus in their senior year, so they would be more inclined to say no to a statistics course. This is a potential source of bias.

We know that Mario sampled 20 students, but the junior population is 750. If we compare this to the overall number of students, this sample might be too small to represent the population. This is a potential flaw or source of bias. The Law of Large Numbers states that the larger the sample, the more closely the sample statistic will represent the true population parameter.

Reflect and check

There are many invalid conclusions that we could make based on Mario's survey results. It's important to always consider the conclusion and look to the data for confirmation.

d

Describe a plan that Mario can use to decide if there is enough interest in a senior-level statistics course next year. The plan should include the statistical question being asked, the design of the study, the target population, the sample size, and the sampling method.

Worked Solution
Create a strategy

There are multiple designs that would serve the purpose of the study and gain the desired information. What is important is that we can defend our choices to show they accurately represent the population.

Apply the idea

The statistical question being asked is "How many juniors are interested in taking a senior-level statistics course as the 4th math course option for the next school year?"

The type of design that should be used is a survey because the school wants to know the students' opinions.

The target population is the junior class because they would be the ones taking the course next year. A sample of 150 students would be a good sample size. This is \dfrac{150}{750}=\dfrac{1}{5} of the juniors which is a good proportion.

One way to create a random sample is to get a list of names of all the juniors. Then, Mario can separate the list of names into 25 groups of 30. Next, he can assign each group a number and use a random number generator to choose 10 groups from the 25 total groups.

Finally, he can assign each person in each group a number and use the random number generator again to choose 15 students to survey from each group. Although this method will take time and effort, it ensures that every student in the sample was chosen completely by chance.
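
Mario's selection plan can be sketched with a random number generator in Python. The student labels such as `junior_001`, the seed, and the variable names are hypothetical. The sketch also assumes the list of names is split into groups in list order, which the plan does not specify.

```python
import random

rng = random.Random(2024)

# Hypothetical list of the 750 juniors, labeled junior_001 to junior_750.
juniors = [f"junior_{i:03d}" for i in range(1, 751)]

# Step 1: separate the list into 25 groups of 30 names.
groups = [juniors[i:i + 30] for i in range(0, len(juniors), 30)]

# Step 2: randomly choose 10 of the 25 groups.
chosen_groups = rng.sample(groups, 10)

# Step 3: randomly choose 15 students from each chosen group.
survey_sample = [name for group in chosen_groups for name in rng.sample(group, 15)]

print(len(groups), "groups of", len(groups[0]))
print(len(survey_sample), "students selected")
```

Selecting 15 students from each of 10 groups gives the planned sample of 150 students, which is \dfrac{1}{5} of the junior class.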

Idea summary

When choosing a sample for a survey, observational study, or experiment, it is important that the sample is chosen randomly so that it is representative of the population. The following are various random sampling methods:

  • Simple random sample

  • Systematic sample

  • Stratified sample

  • Cluster sample

The following are potential sources of bias and should be avoided when conducting a survey, observational study, or experiment:

  • Poor sampling methods

  • Too small of a sample

  • Poor question wording

  • Using loaded or leading questions

Outcomes

S.IC.A.1

Understand statistics as a process for making inferences about population parameters based on a random sample from that population.

S.IC.B.3

Recognize the purposes of and differences among sample surveys, experiments, and observational studies; explain how randomization relates to each.
