
9.01 Data and sampling

Formulate questions for bivariate data

Data is a crucial aspect of our day-to-day lives. It helps us understand the world around us and make informed decisions. The process of working with data involves a series of steps commonly referred to as the data cycle. This cycle includes formulating questions, collecting or acquiring data, organizing and representing data, analyzing data, and communicating results.

A data cycle with four stages. At the top, there is Formulate questions represented by a speech bubble with a question mark. To the right, Collect or acquire data is shown with an icon of a person and a magnifying glass. At the bottom, Organize and represent data is illustrated with a dot plot. To the left, Analyze and communicate results is indicated by a person with charts. Clockwise arrows are drawn from one stage to the next.

To help us formulate appropriate questions, it is helpful to know what type of data we want to explore.

Univariate data

Information gathered around a single characteristic. This data can be numerical or categorical.

Displays include: bar graphs, line plots/dot plots, stem-and-leaf plots, circle graphs, histograms, and boxplots.

Example:

Grades in a course, time spent outside, preferred type of fruit

Histogram titled 'Amount of sleep per night'. The horizontal axis shows the hours of sleep from 7 to 15, and the vertical axis shows the frequency from 0 to 10. The following are the bins and the corresponding frequencies: 7-8, 2; 8-9, 7; 9-10,9; 10-11, 6; 11-12, 4; 12-13, 3; 13-14, 2; and 14-15, 1.

This histogram displays numerical data of the length of time people slept per night.

Notice that there is only one characteristic (or attribute), time, that is being explored. The axes are the attribute and the frequencies. We can only compare the amount of sleep in different groups/bins within the data set.
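As a small illustration, the bins and frequencies read from the histogram can be stored and summarized in Python. This is only a sketch; the names `sleep_frequencies`, `total_people`, and `most_common_bin` are chosen here for illustration.

```python
# Bins (hours of sleep) and frequencies read from the histogram description.
sleep_frequencies = {
    "7-8": 2, "8-9": 7, "9-10": 9, "10-11": 6,
    "11-12": 4, "12-13": 3, "13-14": 2, "14-15": 1,
}

# Total number of people in the data set.
total_people = sum(sleep_frequencies.values())

# The bin with the highest frequency.
most_common_bin = max(sleep_frequencies, key=sleep_frequencies.get)

print(total_people)     # 34
print(most_common_bin)  # 9-10
```

Only one attribute, hours of sleep, appears in the data, so all we can do is compare the frequencies of the bins.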

Variables

Variables are quantities or qualities that can be measured or classified.

Independent variable

The variable that is varied or controlled to explore the effect it has on the dependent variable.

Dependent variable

The variable that depends on the independent variable.

We typically want to explore the effect that the independent variable has on the dependent variable.

Bivariate data

Bivariate data is data collected on two different variables for each member of a sample, so that the two variables can be compared against each other. This data is typically numerical.

Example:

Age versus height, or 1-mile time versus 5-mile time

| Person | Age (years) | Systolic blood pressure (mmHg) |
|--------|-------------|--------------------------------|
| Art    | 30          | 121                            |
| Kumi   | 40          | 140                            |
| Isla   | 50          | 134                            |
| Daria  | 60          | 154                            |
| Xia    | 70          | 146                            |

For a study about heart health, a person’s systolic blood pressure is measured against their age.

Their age is the independent variable and can be any value. Their systolic blood pressure is the dependent variable that is recorded against their age.

Notice that each person’s age and systolic blood pressure make a pair of values in the bivariate data set.
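One way to picture this pairing is to store each person's two measurements together. The short Python sketch below uses the values from the table above; the variable names are chosen here for illustration only.

```python
# Each tuple is one person's (age in years, systolic blood pressure in mmHg),
# taken from the table above.
blood_pressure_data = {
    "Art": (30, 121),
    "Kumi": (40, 140),
    "Isla": (50, 134),
    "Daria": (60, 154),
    "Xia": (70, 146),
}

# Separate the pairs into the independent variable (age)
# and the dependent variable (systolic blood pressure).
ages = [age for age, _ in blood_pressure_data.values()]
pressures = [pressure for _, pressure in blood_pressure_data.values()]

for name, (age, pressure) in blood_pressure_data.items():
    print(f"{name}: age {age}, systolic blood pressure {pressure} mmHg")
```

Keeping the two values in one pair matters: if the ages and blood pressures were shuffled separately, the relationship between them would be lost.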

To start working with bivariate data, we need to formulate a statistical question.

Statistical question

A question that can be answered by collecting data and whose answer may vary depending on the sample the data is collected from.

Also called an investigative question.

| Statistical question | Not statistical question |
|----------------------|--------------------------|
| Is there a relationship between age and systolic blood pressure? | What is your blood pressure? |
| Do test scores increase as the amount of time studying increases? | How long did you study and what was your grade? |
| Does how long a pen lasts impact the cost of the pen? | Are there pens under five dollars that will last all year? |

A statistical question is different from a survey question, which is asked of the people in a study.

We need to make sure that questions are not leading people to answer a particular way. This means not using emotive language or suggesting a particular answer.

| Good survey question | Leading question |
|----------------------|------------------|
| Do you watch soccer? | Do you watch the most popular sport in the world, soccer? |
| How would you rate your meal? | What did you think of the meal from the outstanding chef? |
| What was your average speed driving here? | Did you do the wrong thing and go over the speed limit to get here? |

Notice how the good questions are very neutral and the leading questions may encourage people to respond in a particular way.

Examples

Example 1

Select the question(s) which could be answered by collecting bivariate data. Select all that apply.

A
How much does it typically cost to own a horse?
B
Does the amount of fertilizer used impact the number of tomatoes a plant produces?
C
What is the relationship between reaction time and age?
D
What interest rate can I expect on a car loan?
Worked Solution
Create a strategy

For bivariate data, each member of the sample or population has two characteristics recorded.

Apply the idea

Let's go through each of the options:

Option A: The characteristic that would be collected for each horse would be the cost of owning. This is a single characteristic and would be different for different horses. This could be answered by collecting univariate data, not bivariate.

Option B: For each tomato plant we would record how much fertilizer was used and the number of tomatoes produced. These are two numerical characteristics, so this could be answered with bivariate data.

Option C: The characteristics being collected from each member of the sample would be reaction time and age. There are two characteristics being compared, so this is bivariate data.

Option D: There is no characteristic being collected here. This is a fact with a single answer, so would not have bivariate data collected to answer it.

The correct answers are B and C.

Reflect and check

For option B, we can't collect

Example 2

The local ice cream shop has noticed that their sales seem to be related to the temperature outside. They want to investigate this relationship more closely.

Write a statistical question related to this scenario.

Worked Solution
Create a strategy

When creating a statistical question, it's important that data can be collected to answer the question. For bivariate data, the question should focus on the relationship between the two variables involved, which in this case are temperature and ice cream sales.

Apply the idea

Some possible statistical questions might be:

  1. Does an increase in temperature lead to an increase in ice cream sales?

  2. Is there a relationship between temperature and the number of ice creams sold?

  3. Is there a specific temperature range that results in the most ice cream sales?

Each of these questions is clear and concise, focusing specifically on the relationship between temperature and ice cream sales. They each propose a different aspect of the relationship to investigate, making them effective statistical questions.

Reflect and check

After one round of the data cycle we may formulate a new question to further explore the topic.

Example 3

In a study conducted at a high school, students' study times (in hours) and their corresponding test scores (out of 100) were recorded for one particular examination. The data is to be analyzed to understand the relationship between study time and test scores.

a

Identify the variables involved in this scenario.

Worked Solution
Create a strategy

A variable is any characteristic, number, or quantity that can be measured or counted.

There are two types of variables: dependent and independent. The dependent variable is what is being measured or observed (the outcome), while the independent variable is what is being manipulated or changed (the likely cause).

Apply the idea

In this case, the two variables are study time and test scores.

The independent variable is the study time. This is because it is the variable that we think will cause changes in the test scores.

The dependent variable is the test score. This is because it may change in response to changes in the study time. Test scores are what we are interested in predicting or explaining.

Reflect and check

It's important to consider potential confounding variables in any study. A confounding variable is an outside influence that may impact one or both of the variables. These may lead to a false conclusion. For example, the difficulty of the test, the student's previous knowledge, and other external factors (like health, sleep, etc.) could all potentially impact a student's test score.

b

Rewrite this question so it is not leading and it could be used to accurately collect data.

"We believe students who study more do better on tests. How much time did you spend studying last night? Did you do well on the test?"

Worked Solution
Create a strategy

We need to make sure that the question is not leading people to answer a particular way. This means not using emotive language or suggesting a particular answer.

Apply the idea

"How much time did you spend studying last night? Did you do well on the test?"

Idea summary
  • Data Cycle: A process for working with data, which includes formulating questions, collecting data, organizing and representing data, analyzing data, and communicating results.

  • Bivariate Data: A type of data that involves two variables, allowing us to explore potential relationships between them.

  • Statistical question: A clear, concise question that can be answered by collecting and analyzing data.

  • Variables: Quantities or qualities that can be measured or classified, which help explain a given situation or answer statistical questions.

Examples of bivariate data analysis include studying the relationship between a person's study time and their test scores, or a city's population density and its air quality.

Collect data using samples

Once we have formulated our statistical question and identified our variables, we are now ready to collect the necessary data.

There are several main methods for data collection:

  • Observation: Watching and noting things as they happen

  • Measurement: Using tools to find out how much, how long, or how heavy something is

  • Survey: Asking people questions to get information

  • Experiment: Doing tests in a controlled way to get data

  • Acquire existing data: Using a secondary source, usually an online database, to get raw or summarized data.

This is where sampling may come into play. Sampling methods are techniques used to select a representative subset of the population, known as a sample. A population refers to every member of a group. A sample is a subset (or a smaller group) of the population. The type of sampling method we choose can greatly influence the quality of our data and, therefore, the conclusions we draw from it.

Collecting data from every member of a population is the most accurate way of gathering information, but it is not always the most practical and can be very expensive or time consuming. Typically, a sample survey is instead done on a sample from the population to make it quicker and less expensive.

There are several types of sampling methods, each with its own strengths and weaknesses. The goal is to have a sample that is representative, meaning it has the same characteristics as the population.

A concept of sampling from a population. On the left is a large circle labeled Population containing many diverse cartoon faces representing individuals. On the right is a smaller circle labeled Sample with a subset of the faces from the population, connected by an arrow indicating selection from the larger group to the smaller.

Simple Random Sampling

A sampling method where every member of the population has an equal chance of being selected.

Image of a hat with names of people to be drawn.

In this method, a sample is formed by selecting members from the population at random, where each member of the population has an equally likely chance of being selected. In simple cases, a sample could be created by drawing names from a hat. For most samples though, it is more common to use a random number generator.
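As a rough sketch of how a random number generator can stand in for the hat, the Python example below draws a simple random sample with the standard library's `random` module. The population of 200 numbered students, the sample size of 20, and the function name `simple_random_sample` are assumptions made for illustration.

```python
import random

def simple_random_sample(population, sample_size, seed=None):
    """Return sample_size members chosen without replacement, each equally likely."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(population, sample_size)

# Hypothetical population: students numbered 1 to 200.
students = list(range(1, 201))

sample = simple_random_sample(students, 20, seed=1)
print(sorted(sample))
```

Because the selection is made without replacement, no student can appear in the sample twice.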

While simple random sampling can be one of the cheapest and least time-consuming methods, it may not be representative of every subgroup within larger populations.

If a sample is not representative, we may say it is biased.

One particular type of biased or non-random sample is a convenience sample.

Convenience sample

A sample of people who are convenient, easy to ask, or close by. For example, your class or people in your building.

They typically aren't representative of the population.

There are other sampling techniques that can lead to more representative samples. These include:

  • Systematic Sampling: A sampling technique where a starting point is chosen at random, and then items are chosen at regular intervals. This method is often used by manufacturers for sampling products on a production line. For example, we may call every tenth business in the phone book or select every fifth bottle from a production line.

    Image of 9 people where the third, sixth, and ninth persons are selected as samples.
  • Stratified Sampling: A sampling method that involves dividing the population into subgroups, or strata, and then selecting a separate random sample from each stratum. If one subgroup is larger than another, then we should select proportionally more people from that stratum. For example, dividing a group into children, adults, and seniors and then selecting a proportional number of people from each group.

    Image of 9 people divided by 3 to form subgroups.
  • Cluster sampling: A sampling method where the population is divided into groups, or clusters. Then, a random sample of clusters is selected, and all members within selected clusters are included in the sample. It is like taking a sample of small samples.

    An image showing population formed by clusters, and a sample group formed by getting two clusters from the population.
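Systematic and cluster sampling can be sketched in a few lines of Python. This is only an illustration under assumed data: the 50 numbered bottles, the four homerooms, and the function names `systematic_sample` and `cluster_sample` are all hypothetical.

```python
import random

def systematic_sample(population, interval, seed=None):
    """Pick a random starting point in the first interval, then take every interval-th member."""
    rng = random.Random(seed)
    start = rng.randrange(interval)
    return population[start::interval]

def cluster_sample(clusters, number_of_clusters, seed=None):
    """Randomly choose whole clusters and include every member of each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), number_of_clusters)
    return [member for name in chosen for member in clusters[name]]

# Hypothetical production line of 50 numbered bottles, sampling every fifth bottle.
bottles = list(range(1, 51))
print(systematic_sample(bottles, 5, seed=2))

# Hypothetical school with four homerooms acting as clusters.
homerooms = {
    "Room A": ["Ava", "Ben", "Cal"],
    "Room B": ["Dee", "Eli", "Fay"],
    "Room C": ["Gus", "Hal", "Ivy"],
    "Room D": ["Jo", "Kai", "Lee"],
}
print(cluster_sample(homerooms, 2, seed=2))
```

In the systematic sample only the starting point is random, while in the cluster sample the randomness decides which whole groups are included.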

It's important to select an appropriate method to collect a representative sample. This will help to accurately analyze the relationship between our two variables. For example, if we wanted to investigate the relationship between exercise frequency and overall health across different age groups, we might choose a stratified sample to ensure we collect data from all age groups.
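A stratified sample with proportional selection might look like the Python sketch below. The three age groups, their sizes, and the function name `stratified_sample` are assumptions made for illustration, not data from a real study.

```python
import random

def stratified_sample(strata, total_sample_size, seed=None):
    """Draw a separate random sample from each stratum, sized in proportion to the stratum."""
    rng = random.Random(seed)
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Proportional share, rounded; with other group sizes,
        # rounding can make the total differ slightly from the target.
        share = round(total_sample_size * len(members) / population_size)
        sample[name] = rng.sample(members, share)
    return sample

# Hypothetical population of 100 people split into three age groups.
age_groups = {
    "children": [f"child {i}" for i in range(1, 21)],  # 20 people
    "adults": [f"adult {i}" for i in range(1, 51)],    # 50 people
    "seniors": [f"senior {i}" for i in range(1, 31)],  # 30 people
}

chosen = stratified_sample(age_groups, 10, seed=3)
for group, members in chosen.items():
    print(group, len(members))
```

With these group sizes, a sample of 10 contains 2 children, 5 adults, and 3 seniors, matching each group's share of the population.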

Examples

Example 4

Determine whether each situation demonstrates a sample survey, an experiment, or an observational study.

a

A hospital wants to compare the recovery rates of patients using a new dosage versus those using a standard treatment.

Worked Solution
Create a strategy

A sample survey would not make sense in this situation. An observational study can identify a correlation, but it cannot determine whether the dosage caused the differences in recovery. An experiment can determine cause and effect relationships.

Apply the idea

Since the hospital wants to know if different dosages were the cause of the recovery rates, the hospital should design an experiment.

Reflect and check

Although an observational study could have identified a correlation between dosage and recovery, other factors (like family support or prior health levels) may have also caused a difference in recovery.

If they use an experiment, the hospital can control the other factors. This will help them determine if the dosage was truly the cause of the recovery rate.

b

A new book has been released. A library wants to know if their patrons prefer e-books or physical books to determine how many of each to get.

Worked Solution
Create a strategy

To determine which type of design is best for this situation, we need to determine how the data can be collected.

Apply the idea

Because the library wants to know its patrons' opinions, it will need to ask them about their preferred book type. A sample survey is the best design for gathering this information.

Reflect and check

If the library got an equal number of each and wanted to know which was getting more use, an observational study would be a suitable design.

c

A gym wants to know if there is a relationship between the number of visits someone made in their first month and the length of time they will be at the gym.

Worked Solution
Create a strategy

We can first think about if the data already exists or if the gym would need to create it.

Apply the idea

Since we are looking for data on something that has already happened, acquiring existing data would be the best choice.

Reflect and check

If people were surveyed, they would be unlikely to remember how many times they went in their first month, so a survey would not be a good choice.

Example 5

A social worker notices that many of the children he helps say they want to go to the playground during their sessions. He is curious if there is a relationship between age of children and time spent at playgrounds each week in his town.

a

Identify the target population.

Worked Solution
Apply the idea

The target population would be children in the social worker's town.

Reflect and check

He could look at all children, but playground usage would likely vary significantly based on population density, country, and number of playgrounds.

b

Which method would be best to collect the relevant data?

A
Observation
B
Measurement
C
Survey
D
Acquire secondary data
Worked Solution
Create a strategy

We want to know somewhat personal information like age and habits.

Apply the idea

Let's look at each option:

Option A: Can we watch the children and determine their age and how much time they spend at the playground each week? No, it would be difficult to just watch and get this information.

Option B: Can we measure the children and determine their age and how much time they spend at the playground each week? No, their age and habits are not able to be measured using measurement tools.

Option C: Can we ask the children (or their guardian) their age and how much time they spend at the playground each week? Yes, by asking them a specific question about their age and playground habits we can get the data we need.

Option D: Would there be existing data to answer this question? No, it is unlikely that this question has already been asked to a sample which represents the current population.

The answer is C: Survey.

c

Explain why doing a sample of the children at a park one Monday morning would not be a good sample.

Worked Solution
Create a strategy

For the sample to be representative of the population, different children of varying ages with various playground habits should be included.

Apply the idea

Because these children are simply the easiest to reach, this is a convenience sample, which is unlikely to be representative.

This sample likely wouldn't include school age children who would be at school, not at the playground on a Monday morning. It also wouldn't include children who don't regularly get to go to the playground.

Finally, there may not be many children at the park, but the population of children might be large in comparison.

Example 6

Dr. Jane is a health researcher and she formulated the question "Is there a relationship between the frequency of exercise and overall health among working adults in Washington, DC?" She wants to collect data for her research.

Choose an appropriate sampling method.

Worked Solution
Create a strategy

When selecting a sampling method, Dr. Jane needs to consider several factors such as the size of her target population, the resources she has available, and potential biases that could influence the results.

Apply the idea

An appropriate sampling method for this study could be stratified sampling.

Considering Washington, DC's large and diverse population, stratified sampling would ensure that all segments of the population are represented in the sample. Dr. Jane could divide the population into different strata based on factors like age, occupation, or ZIP code, and then randomly select participants from each stratum.

Reflect and check

While stratified sampling can provide a representative sample, it can be more complex and time-consuming to implement, and it might not be feasible if information about the different strata is not readily available.

An alternative method might be simple random sampling, where every individual in the population has an equal chance of being selected. However, this method might not guarantee that all segments of the population are adequately represented, especially for a diverse population.

Idea summary

After we formulate a clear statistical question, we use the data cycle to collect, show, and explain information. To get data, we can use methods like:

  • Watching (Observation)

  • Measuring

  • Asking questions (Survey)

  • Doing experiments

  • Acquiring existing secondary data

Sampling methods are techniques to collect data from a representative subset of the population, known as a sample.

  • Population: every member of a group.

  • Sample: a subset of the population.

Types of sampling methods include:

  • Simple Random Sampling: every member of the population has an equal chance of being selected.

  • Systematic Sampling: involves selecting every nth member of the population.

  • Stratified Sampling: dividing the population into subgroups, and then selecting a separate random sample from each subgroup.

  • Cluster Sampling: the population is divided into groups, or clusters. Then, a random sample of clusters is selected, and all members within selected clusters are included in the sample.

The type of sampling method chosen can greatly influence the quality of data collected and the conclusions drawn from it.

Outcomes

A.ST.1

The student will apply the data cycle (formulate questions; collect or acquire data; organize and represent data; and analyze data and communicate results) with a focus on representing bivariate data in scatterplots and determining the curve of best fit using linear and quadratic functions.

A.ST.1a

Formulate investigative questions that require the collection or acquisition of bivariate data.

A.ST.1b

Determine what variables could be used to explain a given contextual problem or situation or answer investigative questions.

A.ST.1c

Determine an appropriate method to collect a representative sample, which could include a simple random sample, to answer an investigative question.
