The statistical investigation process begins with the need to solve a real-world problem and aims to reflect the way statisticians work. The data cycle gives us a structure to follow.
To help us formulate or write our question, we can think about whether we will get univariate data or bivariate data.
Person | Age (years) | Bone density (g/cm³) |
---|---|---|
Alia | 30 | 1.35 |
Boyd | 40 | 1.28 |
Cato | 50 | 1.22 |
Daria | 60 | 1.10 |
Eve | 70 | 0.97 |
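The table above records two attributes for each person, age and bone density. As a minimal sketch of the difference between univariate and bivariate data, we can store the table rows in Python and pull out either one attribute or a pair of attributes. The variable names here are our own choices for illustration, not part of the lesson.

```python
# Rows copied from the table above: (person, age in years, bone density in g/cm^3)
records = [
    ("Alia", 30, 1.35),
    ("Boyd", 40, 1.28),
    ("Cato", 50, 1.22),
    ("Daria", 60, 1.10),
    ("Eve", 70, 0.97),
]

# Univariate data: one attribute per person (here, bone density only).
bone_density = [density for _, _, density in records]

# Bivariate data: two attributes per person, kept together as pairs.
age_and_density = [(age, density) for _, age, density in records]

print(bone_density)
print(age_and_density)
```

A question such as "What is the typical bone density?" needs only the first list, while a question such as "How does bone density change with age?" needs the pairs.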
A clear question helps us know what kind of data to gather and who to collect it from. The type of question we ask can lead us to collect different data.
The question we formulate should be a statistical question. It must have all of the following features:
Can be answered by collecting data
Anticipates some degree of variability - more than one possible answer
Provides data that can be represented in a visual display - the type of display will depend on the type of data and the question
Statistical question | Not statistical question |
---|---|
How much money do professional female athletes make? | How much money does NCAA star Caitlin Clark make in Name-Image-Likeness deals? |
How long do people keep leftovers in their fridge before eating them? | Does mayonnaise need to be kept in the fridge after opening? |
How much time is spent on social media? | Do you use social media? |
What are the five most popular video games among 8th graders? | Does your best friend play Minecraft? |
Select the question(s) which could be answered by collecting univariate data. Select all that apply.
Determine whether or not each question is a statistical question. Explain why or why not.
How many books did my teacher read last year?
How many steps do most students in our school walk each day?
What is the range of money spent on snacks per week by those who pack a lunch?
Univariate data is data where only one attribute or characteristic is collected.
Before we can collect data, we need a clear statistical question that:
Can be answered by collecting data
Has some amount of variation - more than one possible answer
Results in data that can be shown in a visual display - the type of display will depend on the type of data and the question
To answer a statistical question, we first collect the necessary data. There are several main methods for data collection:
Observation: Watching and noting things as they happen
Measurement: Using tools to find out how much, how long, or how heavy something is
Survey: Asking people questions to get information
Experiment: Doing tests in a controlled way to get data
Acquire existing data: Using a secondary source, usually an online database, to get raw or summarized data.
As we have seen, collecting data from every member of the population can be very expensive and take a lot of time. In the United States, every ten years, data is collected from the whole population. This is called the census. We may use census data as a secondary source: https://data.census.gov/.
To save time and money, we can collect data from a subset of the population, called a sample. However, we need to be sure that our sample is representative of the population.
When there is bias in the data cycle, we may get misleading or inaccurate conclusions. Here are some specific types of bias:
There are a number of ways we can avoid bias in our sample, including:
Having a sample that is large enough to represent the characteristics of the population. The larger the sample size, the closer the results tend to be to those of the population.
Having a sample that is selected without deliberately choosing more people from a certain group.
Randomly selecting the sample.
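One way to select a sample randomly is to give every member of the population the same chance of being picked. The sketch below does this with Python's standard `random` module. The roster of 500 student ID numbers, the sample size of 50, and the function name `simple_random_sample` are all hypothetical, chosen only to illustrate the idea.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Pick sample_size distinct members, each equally likely to be chosen."""
    rng = random.Random(seed)  # a seed makes the selection repeatable
    return rng.sample(population, sample_size)

# Hypothetical roster: student ID numbers 1 to 500.
student_ids = list(range(1, 501))

# Select 50 students without favoring any group.
sample = simple_random_sample(student_ids, 50, seed=1)
print(len(sample))
```

Because the computer makes the selection, nobody can steer the sample toward a particular group, whether on purpose or by accident.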
Once our sample is selected, it is possible to introduce even more bias, such as:
Determine whether each situation demonstrates a sample survey, an experiment, or an observational study.
A grocery store wants to know if their customers would use self-checkouts if they were added or if they prefer using the standard checkout lanes that are staffed.
A group of students wants to know how different levels of fertilizer affect plant growth.
Endangered, wild wolves were reintroduced to Yellowstone National Park. The conservationists want to know if, and by how much, the population of wolves is growing.
A city council wants to determine whether a new skateboard park or a new ice skating rink should be built as the new community building project. The new project will be located in the city park.
Identify the target population.
What design methodology would be best to find out how the community feels about the two proposed community building projects?
Explain why using the local ice hockey team as the sample would not give representative data.
For each survey question and sample, determine whether the results are likely to be biased or not. Explain your answer.
To answer the question "How much time do students at my school spend practicing a musical instrument per week?", Yvonne surveys the people in her jazz quartet.
To answer the question, "What range of speeds do people drive on I-95 throughout the day?", Lachlan uses a radar gun to observe and measure the speeds of 100 cars in the right lane between the hours of 8 AM and 9 AM.
To answer the question "How much rain does Middleburg, VA get per month?", Tricia uses historical weather data from a reliable source for the past 20 years.
After we formulate a clear statistical question, we use the data cycle to collect, show, and explain information. To get data, we can use methods like:
Watching (Observation)
Measuring
Asking questions (Survey)
Doing experiments
Acquiring existing secondary data
If the sample is representative of the population, the data may be used to understand the population. There are a number of potential sources of bias, including:
A sample that does not resemble the population.
A sample that is too small to be representative.
A sample that is not randomly selected, such as a convenience sample.
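To see how a convenience sample can mislead us, here is a small simulation. Every number in it is invented for illustration: we assume a school of 400 students, 40 of whom play in a band and practice 3 to 8 hours a week, while the rest practice 0 to 2 hours. The sketch compares a convenience sample of four band members with a random sample of 40 students.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical school of 400 students; all numbers are invented.
# 40 band students practice 3 to 8 hours a week; the other 360 practice 0 to 2.
population = (
    [("band", rng.uniform(3, 8)) for _ in range(40)]
    + [("other", rng.uniform(0, 2)) for _ in range(360)]
)

population_mean = mean(hours for _, hours in population)

# Convenience sample: four band members who are easy to ask.
convenience = [hours for group, hours in population if group == "band"][:4]

# Random sample: 40 students, each equally likely to be chosen.
random_sample = [hours for _, hours in rng.sample(population, 40)]

print(f"population mean:  {population_mean:.2f} hours")
print(f"convenience mean: {mean(convenience):.2f} hours")
print(f"random mean:      {mean(random_sample):.2f} hours")
```

With these made-up numbers, the convenience sample always overestimates the school's average practice time, because it contains only students who practice a lot. A random sample will usually land closer to the population mean, although any single random sample can still miss by chance.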