
4.01 Data collection and sampling

Formulate questions for univariate data

The statistical investigation process begins with the need to solve a real-world problem and aims to reflect the way statisticians work. The data cycle gives us a structure to follow:

A data cycle with four stages. At the top, there is Formulate questions represented by a speech bubble with a question mark. To the right, Collect or acquire data is shown with an icon of a person and a magnifying glass. At the bottom, Organize and represent data is illustrated with a dot plot. To the left, Analyze and communicate results is indicated by a person with charts. Clockwise arrows are drawn from one stage to the next.

To help us formulate or write our question, we can think about whether we will get univariate data or bivariate data.

Univariate data

Information gathered around a single characteristic. This data can be numerical or categorical.

Displays include: pictographs, bar graphs, line graphs, line plots/dot plots, stem-and-leaf plots, circle graphs, and histograms.

Example:

Scores on assessments, time spent looking at social media, hours spent on an activity

A histogram of the heights of students.

This histogram displays numerical data of the heights of students in a class.

Notice that there is only one characteristic (or attribute), height, that is being explored. The axes are the attribute and the frequencies. We can only compare the heights of different groups/bins within the data set.
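As a rough illustration of how a histogram groups a single attribute into bins, the sketch below counts some heights into 10 cm bins. The heights are made up for this example (they are not the class data in the histogram), and the function name `bin_frequencies` and the bin width are assumptions chosen for illustration.

```python
from collections import Counter

def bin_frequencies(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical heights in cm (made up for illustration only)
heights_cm = [148, 152, 155, 157, 160, 161, 163, 165, 168, 172]

frequencies = bin_frequencies(heights_cm, 10)
for start, count in frequencies.items():
    # Whole-number heights, so each bin runs from start to start + 9
    print(f"{start}-{start + 9} cm: {count}")
```

Each bin start and its count correspond to one bar of a histogram: the attribute (height) goes on one axis and the frequency on the other.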

Bivariate data

Data represented by two variables. This data is typically numerical.

It needs to be collected in a table, so that we are comparing the two attributes for the same member of the population.

Example:

Time versus distance, age versus height

\begin{array}{c|c|c}
\text{Person} & \text{Age (years)} & \text{Bone density } (g \text{/} cm^{3}) \\
\hline
\text{Alia} & 30 & 1.35 \\
\text{Boyd} & 40 & 1.28 \\
\text{Cato} & 50 & 1.22 \\
\text{Daria} & 60 & 1.10 \\
\text{Eve} & 70 & 0.97
\end{array}

For a study about bone health, a person’s bone density is measured against their age.

Notice that each person’s age and bone density make a pair of values in the bivariate data set.
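One possible way to picture these pairs is to store each person's two values together. The sketch below uses the ages and bone densities from the table; the variable names (`records`, `pairs`, `ages`) are assumptions made for this illustration.

```python
# Each row keeps one person's age and bone density together.
records = [
    ("Alia", 30, 1.35),
    ("Boyd", 40, 1.28),
    ("Cato", 50, 1.22),
    ("Daria", 60, 1.10),
    ("Eve", 70, 0.97),
]

# Bivariate data: one (age, bone density) pair per person
pairs = [(age, density) for _, age, density in records]

# Univariate data: only one attribute is looked at
ages = [age for _, age, _ in records]

print(pairs)
print(ages)
```

Keeping the two values in the same row is what lets us compare both attributes for the same member of the population.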

A clear question helps us know what kind of data to gather and who to collect it from. The type of question we ask can lead us to collect different data.

When formulating a question, it should be a statistical question. It must have all the following features:

  • Can be answered by collecting data

  • Anticipate some degree of variability - more than one possible answer

  • Provide data that can be represented in a visual display - the type of display will depend on the type of data and the question

Statistical question | Not statistical question
How much money do professional female athletes make? | How much money does NCAA star Caitlin Clark make in Name-Image-Likeness deals?
How long do people keep leftovers in their fridge before eating them? | Does mayonnaise need to be kept in the fridge after opening?
How much time is spent on social media? | Do you use social media?
What are the five most popular video games among 8th graders? | Does your best friend play Minecraft?

Examples

Example 1

Select the question(s) which could be answered by collecting univariate data. Select all that apply.

A
What is the median house price in Virginia?
B
Do I need to file taxes every year as a part-time employee?
C
What is the relationship between reaction time and hours of sleep the previous night?
D
What is the distribution of salaries of professional rugby players in North America?
Worked Solution
Create a strategy

For univariate data each member of the sample or population would have one characteristic recorded.

Apply the idea

Let's go through each of the options:

Option A: The characteristic that would be collected for each house in the sample would be its selling price or value. This is a single characteristic and would be different for different houses. This could be answered by collecting univariate data.

Option B: There is no characteristic being collected here. This is a factual question with a single answer, so univariate data would not be collected to answer it.

Option C: The characteristics being collected from each member of the sample would be reaction time and amount of sleep. There are two characteristics being compared, so this is bivariate data, not univariate.

Option D: The characteristic that would be collected for each North American professional rugby player in the sample would be their salary. This is a single characteristic and would be different for different players. This could be answered by collecting univariate data.

The correct answers are A and D.

Reflect and check

Although Option A could be answered with univariate data, it is not a statistical question as there would be a single correct answer at a given point in time.

Example 2

Determine whether or not each question is a statistical question. Explain why or why not.

a

How many books did my teacher read last year?

Worked Solution
Create a strategy

A statistical question is one that does not have a single possible answer, requires data collection, and can be summarized with a data display.

Apply the idea

This question is not a statistical question because it has only one possible answer and there is not enough data to create a display.

Reflect and check

This could be reworded to "For students in my class, how did the number of books they read last year vary?"

b

How many steps do most students in our school walk each day?

Worked Solution
Create a strategy

This question has a clear population of "students in our school". We now need to consider if the data collected will have more than one possible answer and can be displayed.

Apply the idea

This is a statistical question because:

  • We would need to collect data to answer this question.

  • A variety of answers are possible, likely ranging from 2000 to 20\,000 steps depending on level of activity.

  • A histogram could be used to display the results.

Reflect and check

The data would be univariate data.

c

What is the range of money spent on snacks per week by those who pack a lunch?

Worked Solution
Create a strategy

We need to consider whether the data collected will have more than one possible answer and can be analyzed.

Apply the idea

This is a statistical question because:

  • We would need to collect data to answer this question.

  • A variety of answers are possible.

  • A histogram could be used to display the results.

Reflect and check

The population would need to be made more specific before collecting data.

Idea summary

Univariate data is data where only one attribute or characteristic is collected.

Before we can collect data, we need a clear statistical question that:

  • Can be answered by collecting data

  • Has some amount of variation - more than one possible answer

  • Results in data that can be shown in a visual display - the type of display will depend on the type of data and the question

Collect data without bias

To answer a statistical question, we first collect the necessary data. There are several main methods for data collection:

  • Observation: Watching and noting things as they happen

  • Measurement: Using tools to find out how much, how long, or how heavy something is

  • Survey: Asking people questions to get information

  • Experiment: Doing tests in a controlled way to get data

  • Acquire existing data: Using a secondary source, usually an online database, to get raw or summarized data.

As we have seen, collecting data from every member of the population can be very expensive and take a lot of time. In the United States, every ten years, data is collected from the whole population. This is called the census. We may use census data as a secondary source: https://data.census.gov/.

To save time and money, we can collect data from a subset of the population, called a sample. However, we need to be sure that our sample is representative of the population.

A group of seven individuals labeled as 'Population' and a group of 3 individuals pulled out from the previous group labeled as 'sample'.
Statistical bias

Any aspect of the data cycle process that leads to a difference between the conclusion and the actual truth for the population.

Sampling bias

A situation that occurs when some members of a population are more likely to be chosen over other members of the same population for a specific reason.

When there is sampling bias, the sample is not random.

Convenience sample

A sample where members of the population are chosen because they are close by or easy to survey.

When there is bias in the data cycle, we may get misleading or inaccurate conclusions. Here are some specific types of bias:

Undercoverage bias

Occurs when the sample is selected in a way that causes a certain subgroup of the population to be under-represented.

Example:

A researcher has a sample that is 25\% female even though the population is about 50\% female.

Exclusion bias

Occurs when data collectors do not allow certain members of the population to be part of the sample.

Example:

A researcher does not include any homeschooled children in a study about education.

Self-selection bias

Occurs when the sample is made up of people who have self-selected, or volunteered.

This is sometimes called volunteer response bias.

Example:

A study is only made up of people who are very passionate about the subject because they self-selected.

There are a number of ways we can avoid bias in our sample, including:

  • Having a sample that is large enough to represent the characteristics of the population. In general, the larger a randomly selected sample is, the closer its results tend to be to those of the population.

  • Having a sample that is selected without strategically choosing more people from a certain group.

  • Randomly selecting the sample.
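To make the idea of random selection concrete, here is a minimal sketch that uses Python's standard `random` module. The roster of 200 student ID numbers, the sample size of 20, and the function name `simple_random_sample` are all assumptions made up for this illustration.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Pick sample_size different members, each equally likely to be chosen."""
    rng = random.Random(seed)  # a seed makes the selection repeatable
    return rng.sample(population, sample_size)

# Hypothetical roster of 200 student ID numbers
roster = list(range(1, 201))

# Random sample: every student has the same chance of being picked
random_sample = simple_random_sample(roster, 20, seed=1)

# Convenience sample: the first 20 on the list, easy but not random
convenience_sample = roster[:20]

print(random_sample)
print(convenience_sample)
```

In the random sample, no student is more likely to be chosen than any other. The convenience sample always contains the same first 20 students, so students later in the roster have no chance of being included.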

Once our sample is selected, it is possible to introduce even more bias, such as:

Observer bias

For an observational study, when the observer brings a particular perspective that may affect the data.

This is sometimes called experimenter bias.

Example:

A doctor may say a blood test is incorrect based on biased thoughts about a patient.

Measurement bias

A consistent error in measurements that leads to data which is not true.

Example:

A scale that is not well calibrated.
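The effect of a poorly calibrated scale can be sketched with a small calculation. In the example below, the true weights and the 2 kg error are hypothetical values chosen for illustration, and `apply_scale_offset` is an assumed helper name.

```python
from statistics import mean

def apply_scale_offset(true_weights, offset):
    """Simulate a scale that reads every weight too high by the same offset."""
    return [w + offset for w in true_weights]

# Hypothetical true weights in kg and a hypothetical 2 kg calibration error
true_weights_kg = [50, 55, 60, 65, 70]
measured_kg = apply_scale_offset(true_weights_kg, 2)

print(mean(true_weights_kg))  # average of the true weights
print(mean(measured_kg))      # average of the scale readings
```

Because every reading is off by the same amount, the average of the readings is off by that amount too. Collecting more readings with the same scale would not remove this error.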

Examples

Example 3

Determine whether each situation demonstrates a sample survey, an experiment, or an observational study.

a

A grocery store wants to know if their customers would use self-checkouts if they were added or if they prefer using the standard checkout lanes that are staffed.

Worked Solution
Create a strategy

To determine which type of design is best for this situation, we need to determine how the data can be collected.

Apply the idea

Because the grocery store wants to know their customers' opinions, they will need to ask their customers about their preferred checkout style. A sample survey is the best design for gathering this information.

Reflect and check

If the grocery store installed self-checkouts and wanted to know which types of checkouts were used more frequently, an observational study would be a suitable design.

b

A group of students wants to know how different levels of fertilizer affect plant growth.

Worked Solution
Create a strategy

A sample survey would not make sense in this situation. An observational study determines correlation, but it cannot determine if the fertilizer was the cause of specific growth differences. An experiment can determine cause and effect relationships.

Apply the idea

Since the students want to know if different levels were the cause of plant growth levels, the students should design an experiment.

Reflect and check

Although an observational study could have identified a correlation between fertilizer and plant growth, it cannot show that the fertilizer levels caused the differences in plant height. Other factors (like sun or water levels) may also have caused a difference in plant growth.

If they use an experiment, the students can control the other factors (like sun and water levels) to make sure all plants receive the same amount. This will help them determine if the fertilizer was truly the cause of the plant growth.

c

Endangered, wild wolves were reintroduced to Yellowstone National Park. The conservationists want to know if, and by how much, the population of wolves is growing.

Worked Solution
Create a strategy

A sample survey would not make sense in this context. An observational study does not interfere with the lifestyle of the wolves. An experiment would use certain factors to attempt to change the population of the wolves with the purpose of determining if these factors had an effect on the population.

Apply the idea

Because the conservationists do not want to control the population or try to make it grow by using a certain tactic, an observational study is the best type of design for collecting data.

The conservationists would still need to find a way to track the population, like tagging the wolves or placing trackers on them, but this tactic does not try to increase the population. It is a method of observing how the population grows naturally.

Example 4

A city council wants to determine whether a new skateboard park or a new ice skating rink should be built as the new community building project. The new project will be located in the city park.

a

Identify the target population.

Worked Solution
Apply the idea

The target population will be the members of the community or residents of the city. These will be the people using, building, and paying for the upkeep of the park.

Reflect and check

The city may hope that residents of other cities will be drawn to the park because of the new project, but the target population will be the people most likely to use the skateboard park or ice skating rink.

b

What design methodology would be best to find out how the community feels about the two proposed community building projects?

A
Observation
B
Measurement
C
Survey
D
Acquire secondary data
Worked Solution
Create a strategy

We are looking for people's opinions on a project. We should consider whether we could get public opinions from each method.

Apply the idea

Let's look at each option:

Option A: Can we watch the public and determine what they want? No, it would be difficult to just watch people and see if they would be interested in ice skating or not.

Option B: Can we measure the public and determine what they want? No, their interest in skateboarding or ice skating cannot be measured using measurement tools.

Option C: Can we ask the public and determine what they want? Yes, by asking them a specific question about their preferences for skateboarding or ice skating we can get the data to answer this question.

Option D: Would there be existing data to answer this question? No, it is highly unlikely that this question has already been asked to a sample which represents the current population.

The answer is C: Survey.

c

Explain why using the local ice hockey team as the sample would not give representative data.

Worked Solution
Create a strategy

For the sample to be representative of the population, different types of community members of varying ages with various hobbies should have an opportunity to express their opinion.

Apply the idea

An ice hockey team will most likely prefer a new ice skating rink, but we do not know the preferences of the entire community.

The team members may also be around the same age, and this age does not represent all ages within the community.

Finally, there may not be many team members, but the population of the city might be relatively large in comparison.

Example 5

For each survey question and sample, determine whether the results are likely to be biased or not. Explain your answer.

a

To answer the question "How much time do students at my school spend practicing a musical instrument per week?", Yvonne surveys the people in her jazz quartet.

Worked Solution
Create a strategy

First we can look at whether or not the sample is representative of the population. If it is a good sample, then we should consider if there are any ways the data that is collected could be biased by the data collector.

Apply the idea

The results would likely be biased.

A quartet only has four people, so that sample would be too small to represent the population. Also, this sample would miss out on people who play non-jazz instruments or do not play an instrument. It is also a convenience sample, so it would not be random.

b

To answer the question, "What range of speeds do people drive on I95 throughout the day?", Lachlan uses a radar gun to observe and measure the speeds of 100 cars in the right lane between the hours of 8 AM and 9 AM.

Worked Solution
Create a strategy

We need to consider whether the cars whose speed was recorded represent all of the cars that travel on I95.

Apply the idea

The results would likely be biased.

One main cause of bias would be that between 8 AM and 9 AM, it is rush hour traffic, so the speeds at that point in time would be very different than midday or at night.

There is also experimenter bias as he only measures cars in the right lane which are usually slower than cars in the left lane.

Reflect and check

There could also be more experimenter bias if he chose cars that were easier to follow like red cars. If his radar gun was not working well, he could also have measurement bias.

c

To answer the question "How much rain does Middleburg, VA get per month?" Tricia uses historical weather data from a reliable source for the past 20 years.

Worked Solution
Create a strategy

Tricia is using secondary data, not primary data, so bias would come from using an unreliable source.

Apply the idea

The data would likely be unbiased.

She uses a reliable source which would likely have used proper measurement tools.

Reflect and check

When using secondary sources, we can compare the results from two different sources to check how accurate the results are.

Idea summary

After we formulate a clear statistical question, we use the data cycle to collect, show, and explain information. To get data, we can use methods like:

  • Watching (Observation)

  • Measuring

  • Asking questions (Survey)

  • Doing experiments

  • Acquiring existing secondary data

If the sample is representative of the population, the data may be used to understand the population. There are a number of potential sources of bias including:

  • A sample that does not resemble the population.

  • A sample that is too small to be representative.

  • A sample that is not randomly selected, such as a convenience sample.

Outcomes

8.PS.2

The student will apply the data cycle (formulate questions; collect or acquire data; organize and represent data; and analyze data and communicate results) with a focus on boxplots.

8.PS.2a

Formulate questions that require the collection or acquisition of data with a focus on boxplots.

8.PS.2b

Determine the data needed to answer a formulated question and collect the data (or acquire existing data) using various methods (e.g., observations, measurement, surveys, experiments).

8.PS.2c

Determine how statistical bias might affect whether the data collected from the sample is representative of the larger population.
