
5.04 Formulate questions and collect data

Formulate questions

The statistical investigation process, known as the data cycle, is how we solve a real-world problem by collecting and analyzing data.

A data cycle with four stages. At the top, there is Formulate questions represented by a speech bubble with a question mark. To the right, Collect or acquire data is shown with an icon of a person and a magnifying glass. At the bottom, Organize and represent data is illustrated with a dot plot. To the left, Analyze and communicate results is indicated by a person with charts. Clockwise arrows are drawn from one stage to the next.

A well formulated question in statistics should be written in a way that it has more than one possible answer. Answering the question should require collecting data (primary data) or finding data that someone else has already collected (secondary data). It should also be clear who the population is.

10 different dog breeds

For example, we could formulate the question:

"How long does it take for a dog to get adopted at the local shelter?"

This is a good question because:

  • We need to collect data to answer it

  • Different dogs take different amounts of time to get adopted

  • The population is clear - dogs at our local shelter

When we write our question, it can be helpful to think about what type of data we will be collecting.

Numerical data

Values or observations that can be measured

It can be displayed in line plots, circle graphs, and stem-and-leaf plots.

Example:

Number of cousins, height, 100 m sprint time

Numerical data is sometimes called quantitative data because it is about quantities.

Days to adopt
Stem | Leaf
0 | 1 1 2 2 4 5 5 6 6 6 7 7 7 8 8 8 9 9
1 | 0 0 1 1 1 3 3 4 6 7 7 7 9
2 | 1 3 3 7
Key: 1 | 4 = 14 days

This stem-and-leaf plot shows numerical data. We can see that more dogs are adopted in less than 10 days than in 10 to 19 days, but it is difficult to see any trends without doing calculations.

Notice that this display shows all of the individual data points.

In this case we are looking at discrete numerical data.
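As a rough sketch of how a stem-and-leaf plot is built, the Python below splits each value into a stem (the tens digit) and a leaf (the ones digit). The list holds the values read from the plot above; the function name `stem_and_leaf` is our own choice for illustration.

```python
from collections import defaultdict

# Days to adopt, read off the stem-and-leaf plot above.
days_to_adopt = [
    1, 1, 2, 2, 4, 5, 5, 6, 6, 6, 7, 7, 7, 8, 8, 8, 9, 9,
    10, 10, 11, 11, 11, 13, 13, 14, 16, 17, 17, 17, 19,
    21, 23, 23, 27,
]

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted list of leaves (ones digits)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 14 -> stem 1, leaf 4
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(days_to_adopt)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Every value appears as its own leaf, which is why this display keeps all of the individual data points.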

If we are looking for an overall summary of a large data set instead of individual data points, we may group the data into bins or classes after collecting the data.

Grouped numerical data

Numerical data that is organized into ranges of values, or intervals

Example:

1–5, 6–10, 11–15, …
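One possible way to sort values into intervals like these is sketched below in Python. The adoption times in `days` are made-up values for illustration only, and the helper `interval_label` is a name we chose; it assumes every interval is 5 days wide and that the first interval starts at 1.

```python
from collections import Counter

# Hypothetical adoption times in days (illustrative values only).
days = [1, 3, 5, 6, 6, 9, 10, 11, 14, 15, 18, 22]

def interval_label(value, width=5):
    """Return the label of the interval that contains value, e.g. 7 -> '6-10'."""
    low = ((value - 1) // width) * width + 1  # lower end of the interval
    return f"{low}-{low + width - 1}"

# Count how many values fall in each interval.
groups = Counter(interval_label(d) for d in days)
for label, count in groups.items():
    print(label, count)
```

After grouping, we only know how many values fall in each interval, not the individual values themselves.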

A circle graph about the number of days it usually takes for different dog breeds to get adopted. Ask your teacher for more information.

The circle graph shows grouped numerical data, so we can see trends more clearly. For example, it is most common for a dog to take 5–10 days to get adopted, and about \dfrac{3}{4} of dogs are adopted in less than 16 days.

Notice that this display does not preserve the individual data values.

Sometimes a data cycle will create more questions like "How much time do cats spend in the shelter?" We can repeat the data cycle with these new populations and collect or acquire more data to try to answer these questions.

Examples

Example 1

Determine if each question would result in numerical data or not. If it is numerical, explain if the collected data could be grouped or if we need to keep the individual data values.

a

What kind of transportation do students at my school use to get to school?

Worked Solution
Create a strategy

We want to check whether the responses to this question would be numerical, like a quantity, or not numerical, like a category or word.

Apply the idea

Answers to this question could include things like:

  • Bike

  • Bus

  • Car

  • Walk

These are categories, not quantities or numbers, so this data is not numerical.

Reflect and check

This data could be organized in a pictograph, which is used for categorical data, not numerical data.

A labeled chart titled Transport Survey displays various modes of transportation and their corresponding usage, represented by icons. Each row, labeled with a mode of transport, contains icons depicting that mode. From top to bottom: three bus icons in the Bus row; four shoe icons in the Walk row; two car icons in the Car row; and one bicycle icon in the Bike row. Below the chart, a note indicates 1 image equals to 9 people.
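To read this pictograph, we multiply the number of icons in each row by the key (1 image equals 9 people). A small Python sketch of that arithmetic, using the icon counts described above (the variable names are our own):

```python
# Icon counts read from the Transport Survey pictograph.
PEOPLE_PER_ICON = 9  # from the key: 1 image equals 9 people
icons = {"Bus": 3, "Walk": 4, "Car": 2, "Bike": 1}

# Convert icons to people for each mode of transport.
people = {mode: count * PEOPLE_PER_ICON for mode, count in icons.items()}
total = sum(people.values())

print(people)  # {'Bus': 27, 'Walk': 36, 'Car': 18, 'Bike': 9}
print(total)   # 90
```

The results are counts of people in each category, which is why the data itself (the mode of transport) is categorical.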
b

How do the heights of students in your class vary?

Worked Solution
Create a strategy

We want to know if we can measure or do calculations with the answers to this question.

Apply the idea

Height is something that we can measure, so this is numerical data.

If we measure to the nearest quarter inch, it is unlikely that any two students in the class would be the exact same height, so we would need to group this numerical data in order to organize and analyze it.

Reflect and check

If we tried to make a display, like a dot plot or bar graph, without grouping or rounding to the nearest inch, there would be too many values along the horizontal axis.

c

What is a typical score for a hockey team in a single NHL game?

Worked Solution
Create a strategy

We need to be able to count or measure the score for it to be numerical.

Apply the idea

The score of a hockey game is the number of goals, so it is numerical.

NHL hockey games are generally fairly low scoring, with an average of around 3 goals per game and the highest score ever being 14 goals. This data does not need to be grouped, as it does not have a very wide spread of possible responses.

Reflect and check

For the first game of the 2023-2024 season, this is what a dot plot would look like:

A dot plot graph represents the number of goals scored in the first game of the 2023/2024 season. The horizontal axis is labeled Number of goals scored in the first game of the 2023/2024 season with values ranging from 0 to 8. The vertical accumulation of dots for each number of goals shows the frequency. There is a varying number of green dots aligned above each number from 0 to 8, indicating the count of occurrences for each goal tally.
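A dot plot like this comes from counting how often each score occurs. The sketch below shows one way to do that in Python. The `goals` list is made up for illustration and is not the real data from the 2023-2024 season.

```python
from collections import Counter

# Hypothetical goals scored by each team in one round of games (illustrative only).
goals = [2, 3, 1, 4, 3, 0, 2, 5, 3, 2]

# Count how many times each score occurs.
frequency = Counter(goals)

# Print one row per score, with one '*' for each occurrence.
for score in range(min(goals), max(goals) + 1):
    print(score, "*" * frequency[score])
```

Because there are only a few different scores, each one can have its own column without any grouping.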

Example 2

Is each question well formulated for the data cycle? Explain why or why not.

a

How many years has the Boston Red Sox baseball team been around?

Worked Solution
Create a strategy

A well formulated question should have a variety of possible answers and relate to a specific population.

Apply the idea

There is one answer and no clear population, so this is not a well formulated question for the data cycle.

Reflect and check

A related question we could use the data cycle for is "How many years does the average MLB player play for?"

b

How do the heights of 7th and 8th graders at my school compare?

Worked Solution
Create a strategy

A well formulated question should have a variety of possible answers and relate to a specific population.

Apply the idea

This is a well formulated question as height is a clear attribute with different possible answers.

Reflect and check

This data would be numerical and would need to be grouped.

c

What is the distribution of ages at the local martial arts studio?

Worked Solution
Create a strategy

There is a clear population of people at the local martial arts studio. Now we need to check if data could be collected to give a variety of answers.

Apply the idea

Different martial arts athletes will have different ages, so this is something we could find data on. This is a well formulated question.

Idea summary

We follow the data cycle to help us formulate questions and use data to answer them.

The questions we ask may lead to numerical data. This data may be left as individual data values or grouped when it is displayed.

Well formulated questions should have more than one possible answer and clearly identify the population we are looking to investigate.

Data collection and sampling

When we have questions, we use different ways to collect data to find answers:

  • Observation: Watching and noting things as they happen

  • Measurement: Using tools to find out how much, how long, or how heavy something is

  • Survey: Asking people questions to get information

  • Experiment: Doing tests in a controlled way to get data. For example, planting two identical plants, giving one sunlight and the other only artificial light, and observing the differences

  • Acquire existing secondary data: Use data which was collected by a reliable source like census data, Common Online Data Analysis Platform (CODAP), or National Oceanic and Atmospheric Administration (NOAA) weather data.

We should choose a method that is realistic and ethical. This means the method must be possible to carry out and must be kind to all participants.

For the question: "How long does it take for dogs to get adopted from shelters across the US?"

Realistic and ethical: Acquiring secondary data from a reliable source like the SPCA

Not ethical: Stealing one dog, putting it in a shelter and seeing how long it takes to get adopted

Not realistic: Surveying every animal shelter in the US

It can be too time consuming to survey the whole population, so we can select a sample, a smaller group from the population. The process of choosing the people or subjects for the sample is called sampling.

The sample should be:

  • representative of the population, by having the same characteristics

  • randomly selected

  • big enough to give reliable data

A picture showing a population, representative sample and poor sample using dogs.

A larger sample size is usually better because it makes a representative sample more likely, simply by including more of the population. However, a larger sample size can be a lot more expensive, time consuming, and difficult to organize. We need to balance a realistic sample size and reliable results.

Random sample

A sample where each member of the population has an equal chance of being selected.

Random samples can be used to help ensure that the sample is representative of the population.

A good sample can be used to make reasonable assumptions about the whole population. A sample that is too small or was not selected randomly can lead to an incorrect conclusion.
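The Python sketch below illustrates these ideas under some assumptions. The population is hypothetical (200 students, 120 of whom would answer "yes"), and the function name `estimate_yes_fraction` is our own. It uses `random.sample` from Python's standard library, which picks members without replacement so that each one is equally likely to be chosen.

```python
import random

# Hypothetical population: 200 students, 120 of whom would answer "yes".
population = ["yes"] * 120 + ["no"] * 80

def estimate_yes_fraction(population, sample_size, seed=None):
    """Draw a random sample and return the fraction of "yes" answers in it."""
    rng = random.Random(seed)  # a fixed seed makes the run repeatable
    # Pick sample_size members without replacement, each equally likely.
    sample = rng.sample(population, sample_size)
    return sample.count("yes") / sample_size

# Compare estimates from samples of different sizes.
for size in (5, 20, 80, 200):
    print(size, estimate_yes_fraction(population, size, seed=1))
```

When the sample is the whole population (200), the estimate equals the true fraction, 0.6. Smaller samples can land further from it, depending on which members happen to be drawn.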

Exploration

In previous grades, we have used line graphs, line plots, stem-and-leaf plots, and circle graphs.

For a short exploration of the data cycle, let the population be your class and explore a question which uses numerical data.

  1. Formulate a question that you could easily collect data on.

  2. Describe a realistic process for collecting the data. Would the sample be representative of the population? Explain.

  3. Collect the data.

  4. Would it make sense to leave the individual data values, or to group them?

  5. Represent the data visually.

  6. What does this data tell you about your original question?

Examples

Example 3

Hannah has chosen to collect information using a sample.

a

What are the advantages of doing a sample? Select all that apply.

A
It is cheaper to conduct.
B
Any sample will represent the population.
C
It is more accurate.
D
It takes less time.
Worked Solution
Create a strategy

Using a sample requires asking fewer people than if looking at the whole population.

Apply the idea

Since a sample requires surveying fewer people, it would usually be cheaper to conduct and it would take less time.

The correct options are A and D.

Reflect and check

For option B, our goal is to have the sample represent the population, but we have to be very careful to select a good sample that represents the population. Not all samples are good samples.

b

What are the disadvantages of doing a sample? Select all that apply.

A
It takes more time.
B
It is more expensive to conduct.
C
It is less accurate.
D
There can be poor sampling.
Worked Solution
Create a strategy

Use the fact that samples do not survey the whole population being considered.

Apply the idea

Because the whole population is not being surveyed, the results of a sample would be less accurate. Also, the sample may not be random or may not be large enough, which would be poor sampling.

The correct options are C and D.

Example 4

A middle school principal wants to determine whether students would support adding soccer as a new after school sports team. Anyone who attends the school would be able to join the team.

a

Identify the target population.

Worked Solution
Apply the idea

The target population will be students at that school. This will impact students since students would get the opportunity to play on the team, but also because money spent adding a soccer team could not be spent adding other opportunities at the school.

Reflect and check

This decision could impact other people such as parents, coaches, and school staff, but the target population will be the students that have the opportunity to join the team.

b

What method would be best to find out how the students feel about the addition of a new sport?

A
Observation
B
Measurement
C
Survey
D
Acquire secondary data
Worked Solution
Create a strategy

The chosen method should be realistic and ethical to carry out. It should also give data that is useful for the specific population.

Apply the idea

Since we want to find out students' opinions about the addition of soccer, we should use a survey.

c

Explain why surveying members of the football team about their preference is not representative of the population.

Worked Solution
Create a strategy

For the sample to be representative of the population, different types of students of varying ages with various hobbies should have an opportunity to express their opinion.

Apply the idea

The football team members could be more athletic than other students, or they might not want a new sport that could draw players away from their team.

Finally, the team may have only a few members, while the total number of students at the school may be much larger in comparison.

Example 5

Donovan wants to explore temperature trends in his hometown over the last 50 years.

a

Formulate a question to help him complete his investigation.

Worked Solution
Create a strategy

Donovan needs to choose a particular attribute to explore, like temperature, wind, or precipitation (rain, snow, hail).

Apply the idea

Donovan could formulate the question:

"How has the number of days below freezing in Richmond, Virginia varied over the last 50 years?"

Reflect and check

It might be helpful to know what type of data is collected by weather centers before formulating the question. For example, they usually record amount of precipitation (mm of rain) per day, but not the length of time it was raining for each day.

b

Could he use observation, measurement, survey, experiment, or acquire secondary sources? Explain.

Worked Solution
Create a strategy

He is looking for data on times in the past, likely before he was born. He cannot go back in time and collect the data himself.

Apply the idea

Donovan should acquire secondary sources. He can't use observation, measurement, survey, or experiments, because the data from those is typically real-time, not historical (in the past).

Reflect and check

For a different question, like, "Do you think winters are the same as 50 years ago, colder, or warmer?" he could use a different method.

c

Explain how Donovan could collect data that could be used to answer his question from part (a).

Worked Solution
Create a strategy

He would need to find a reliable secondary source that has the data he is looking for. He may need to reformat or summarize the raw data.

Apply the idea

He can submit a request through the National Centers for Environmental Information, which is run by NOAA. He will get an email with a spreadsheet that shows the raw data he selected. From there, he can reformat it to find the total for each year.

For example, here is the raw data in tables:

Year1974197519761977197819791980198119821983
Number of days below freezing02512813911138
Year1984198519861987198819891990199119921993
Number of days below freezing985511130115
Year1994199519961997199819992000200120022003
Number of days below freezing851733213138
Year2004200520062007200820092010201120122013
Number of days below freezing8612146202
Year2014201520162017201820192020202120222023
Number of days below freezing71095720230
Reflect and check

We could organize and analyze this numerical data with a line graph to see the trend over time.

A line graph showing the number of days below freezing from 1980 to 2020. Ask your teacher for more information.

The longer the time period we look at, the more work it would be to analyze. However, the results would be more reliable and tell a more complete story.

d

How do the temperature trends in your hometown compare to Donovan's?

Worked Solution
Create a strategy

We can use NOAA's database, or something similar for other countries, to find data for our location or as close as possible. Usually, airports collect excellent weather data, so we may need to use the closest airport.

For hometowns in the US, we can copy and paste this link into our internet browser: https://www.ncei.noaa.gov/weather-climate-links

Then ask your teacher for help with finding and summarizing the data you need for your hometown.

Apply the idea

Here are some questions you can use to compare Donovan's hometown (Richmond, VA) with yours:

  • In what ways are the temperature trends similar?

  • In what ways are the temperature trends different?

  • What was the coldest year in your hometown compared to Donovan's, which was 1996?

  • Does the climate in your hometown vary more or less compared to Donovan's?

Answers will vary based on the hometown.

Reflect and check

Sometimes we can find nice data displays showing the data we are interested in. Other times, we need to take raw data and summarize it ourselves. Learning to use some formulas and tools in spreadsheets like "SUM" and "Remove duplicates" can be very helpful.
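The same kind of summarizing can also be sketched in Python. The example below is only an illustration: the daily records are made up, and we assume that a "day below freezing" means a day whose minimum temperature was below 32 °F, which may differ from the definition used in a real data set.

```python
from collections import Counter
from datetime import date

FREEZING_F = 32  # freezing point of water in degrees Fahrenheit

# Hypothetical raw daily records: (date, daily minimum temperature in °F).
daily_records = [
    (date(2022, 1, 3), 28),
    (date(2022, 1, 4), 35),
    (date(2022, 12, 24), 15),
    (date(2023, 1, 10), 40),
    (date(2023, 2, 2), 31),
]

def days_below_freezing_per_year(records):
    """Count, for each year, the days whose minimum temperature was below freezing."""
    counts = Counter()
    for day, min_temp in records:
        # Add 1 for a freezing day, 0 otherwise, so every year gets an entry.
        counts[day.year] += 1 if min_temp < FREEZING_F else 0
    return dict(counts)

totals = days_below_freezing_per_year(daily_records)
print(totals)  # {2022: 2, 2023: 1}
```

Each yearly total could then be plotted as one point on a line graph.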

Idea summary

After we formulate a clear question, we use the data cycle to collect, show, and explain information. To get data, we can use methods like:

  • Watching (Observation)

  • Measuring

  • Asking questions (Survey)

  • Doing experiments

  • Acquiring existing secondary data

It's important to choose the right method based on the question we have.

When we collect data from a sample, we need to make sure that it is representative of the population. We can do this by making sure the sample is randomly selected, is big enough, and has the same characteristics as the population.

Outcomes

7.PS.2

The student will apply the data cycle (formulate questions; collect or acquire data; organize and represent data; and analyze data and communicate results) with a focus on histograms.

7.PS.2a

Formulate questions that require the collection or acquisition of data with a focus on histograms.

7.PS.2b

Determine the data needed to answer a formulated question and collect the data (or acquire existing data) using various methods (e.g., observations, measurement, surveys, experiments).

7.PS.2c

Determine how sample size and randomness will ensure that the data collected is a sample that is representative of a larger population.
