7. Probability & Statistics

The statistical investigation process begins with the need to solve a real-world problem and aims to reflect the way statisticians work. The data cycle gives us a structure to follow.

Recall that **bivariate data** is data which is represented by two variables. This data is typically numerical and can be organized with a scatterplot.

We need to identify the variables that will be explored in the data cycle. Identifying the **independent variable** and **dependent variable** is important for formulating our questions and for accurate data analysis.

Previously, we have only looked at one set of bivariate data at a time, but we can also compare bivariate data across the groups of a categorical variable.

Categorical variables can be added to a scatterplot using color or different symbols.

Once we know what variables we will be exploring, we need to formulate a question that requires the collection of data that can be analyzed using a data display.

We need to look at the data source when questions are formulated. We can consider:

What population do we want to make a conclusion about?

How can we find relevant data? Is it easy to acquire secondary data that already exists?

Who will be using the conclusions from the analysis?

Once the question has been formulated, we need to determine how to collect or acquire the necessary data. Here are some ways to collect data:

**Research** using secondary sources to find existing data.

**Surveys** can be done by asking each member of the representative sample two questions or giving them a questionnaire. Answers can be more open-ended than in a poll.

**Observations** can be made by watching members of the sample and noting particular characteristics.

**Scientific experiments** can be done by carefully selecting a sample and controlling as many other variables as possible, then varying the independent variable and measuring how the dependent variable responds.

There are many formulas and ways to estimate or predict someone's adult height, such as doubling their height at age 2 or using a combination of their biological parents' heights. With a partner or in a small group, use the data cycle to explore relationships involving adult height.

Brainstorm potential investigative questions.

What would the variables be for each question? Are they all numerical or are some categorical?

Do you think there could be a single model that would be accurate across all demographics like race and gender? Explain.

What method would be best to explore this relationship?

When doing a survey or using secondary sources, it is important that the data is collected from a **sample** that is **representative** of the population, so that our analysis of the data is valid.

Representative means that the characteristics of the sample should be similar to those of the population.

In this course, we will aim to collect larger data sets because they provide a reasonable approximation for the population.

When there is bias in the data cycle, we may get misleading or inaccurate conclusions.

Sampling bias can occur due to undercoverage or exclusion when a particular subgroup is under-represented or fully excluded.

There are a number of ways we can avoid bias in our sample, including:

Having a sample that is large enough to represent the characteristics of the population. The larger the sample size, the closer the results tend to be to those of the population.

Having a sample that is selected without strategically choosing more people from a certain group.

Randomly selecting the sample.
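As a rough illustration of random selection, the sketch below draws a simple random sample using Python's standard `random` module. The population of 500 student ID numbers, the sample size of 50, and the helper name `simple_random_sample` are all assumptions made up for this example.

```python
import random

# Hypothetical population: student ID numbers 1 through 500 (assumed for illustration).
population = list(range(1, 501))

def simple_random_sample(members, size, seed=None):
    """Return `size` distinct members, with every member equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(members, size)

sample = simple_random_sample(population, 50, seed=1)
print(len(sample))  # number of members selected
```

Because the computer makes the selection, no group can be chosen more often on purpose, which is one way to reduce sampling bias.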

First identify the variables, then write an investigative question related to each scenario.

a

The local souvenir shop has noticed that their sweatshirt sales seem to be related to the temperature outside. They want to investigate this relationship more closely.

Worked Solution

b

For a school in Fairfax, VA, the principal noticed that the number of days missed by a student in September is a good predictor of the number of total days they will be absent throughout the year. She wants to investigate this relationship.

Worked Solution

c

A baker wants to adjust his pricing model to be more competitive. He wants to look at the price he charged for a cake compared to the time it took to create. He is curious whether he would need specific models for wedding cakes versus birthday cakes, or if the same model would be appropriate for all kinds of cakes.

Worked Solution

For each investigative question, select which data collection technique would be best. Explain your answer.

A

Observation

B

Polls

C

Research

D

Scientific experiment

E

Survey

a

For the intersection of Chain Bridge Road and Eaton Place, Louisa notices that sometimes she can walk through easily, but sometimes she gets stuck in a crowd.

She asks the question "For the intersection at Chain Bridge Road and Eaton Place, how can the relationship between pedestrian density (people per square yard) and walking speed (feet per second) be modeled?"

Worked Solution

b

Polly loves attending concerts, but finds that she often can't see the stage because of the taller people in front of her. This leads her to ask the question:

"For those who attend concerts at the local venue, is there a relationship between height and amount spent on concerts in a year?"

Worked Solution

Idea summary

**Bivariate data** is data which is represented by two variables. This data is typically numerical and can be organized with a scatterplot.

To explore bivariate data, we first need to formulate an **investigative question** and then we can determine how to collect or acquire the necessary data, such as:

**Research** using secondary sources to find existing data.

**Surveys** can be done by asking each member of the representative sample two questions or giving them a questionnaire.

**Observations** can be made by watching members of the sample and noting particular characteristics.

**Scientific experiments** can be done by controlling other variables, then varying the independent variable and measuring the dependent variable.

We often display bivariate data using a **scatterplot**, where the independent variable is placed on the horizontal axis and the dependent variable on the vertical axis. We can describe the relationship based on how closely the points follow a particular model, using the terms:

Form, usually described as a **linear relationship** or **nonlinear relationship**

Strength, describing how closely the data points match the model line or curve

For linear relationships, we may also describe their direction as positive or negative.

As a review, here are some examples:

A linear relationship that is strong and positive

A nonlinear relationship that is weak

For larger data sets, we can use technology such as spreadsheets and graphing calculators to graph scatterplots. This is especially helpful when we also want to analyze which model or equation would be most appropriate.
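To give a sense of what such technology might compute behind the scenes, here is a minimal sketch of fitting a straight line to bivariate data by least squares. The data (hours studied and quiz scores) are hypothetical, and the correlation value `r` is one common numerical summary of strength that is an assumption of this example rather than something the lesson requires.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def linear_fit(xs, ys):
    """Least-squares line y = slope * x + intercept, plus the correlation r."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)  # closer to 1 or -1 suggests a stronger linear trend
    return slope, intercept, r

# Hypothetical data: hours studied (independent) and quiz score (dependent).
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 58, 61, 70, 74, 81]

slope, intercept, r = linear_fit(hours, scores)
direction = "positive" if slope > 0 else "negative"
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}, direction={direction}")
```

The sign of the slope corresponds to the direction of the relationship, and a line is only a sensible model when the scatterplot suggests a linear form.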

A group of patients participating in a medical trial were given different dosages of the same medication. In addition to the medication provided, some of the patients also take insulin while others do not. The doctors running the trial then rated the effectiveness of the medication for each patient.

Use the checkboxes to display different subsets of data values in this scatterplot.

What relationship is this scatterplot trying to explore? Formulate a question that this scatterplot could be used to help answer.

When the entire data set is displayed the same way, how would you describe the trend?

When just the people in the study who are taking insulin are shown, how would you describe the trend?

When just the people in the study who are not taking insulin are shown, how would you describe the trend?

Formulate a new question for a second round of the data cycle based on what you notice from the categories in this scatterplot.

While working through the data cycle, we will often uncover patterns or relationships we didn't think of before that can lead us to new investigative questions. For instance, we may realize that by grouping data into categories we uncover relationships between variables that appeared to be unrelated when the data was combined.
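The sketch below illustrates this idea with invented numbers. The records are hypothetical and are not the data from the medical trial above; the group labels "A" and "B" and the helper `slope` are assumptions for the example. It compares the least-squares slope for all points together with the slope within each group.

```python
def slope(points):
    """Least-squares slope of the line through a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical (dosage, effectiveness rating, group) records, invented for illustration.
records = [
    (1, 2, "A"), (2, 4, "A"), (3, 6, "A"), (4, 8, "A"), (5, 10, "A"),
    (1, 10, "B"), (2, 8, "B"), (3, 6, "B"), (4, 4, "B"), (5, 2, "B"),
]

# Slope when the categories are ignored.
combined = slope([(x, y) for x, y, _ in records])

# Slope within each category.
by_group = {}
for group in ("A", "B"):
    by_group[group] = slope([(x, y) for x, y, g in records if g == group])

print(combined, by_group)
```

In this made-up data set the combined points show no overall linear trend, while each group on its own shows a clear trend in opposite directions.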

The table shows the grades of 12 students in English and French.

Student | Grade in English | Grade in French |
---|---|---|
1 | 85 | 89 |
2 | 71 | 71 |
3 | 57 | 56 |
4 | 60 | 62 |
5 | 79 | 86 |
6 | 76 | 76 |
7 | 71 | 77 |
8 | 91 | 86 |
9 | 50 | 90 |
10 | 49 | 47 |
11 | 66 | 67 |
12 | 92 | 92 |

a

Which of the following scatterplots correctly represents the above data?

A

B

C

Worked Solution

b

Is the relationship between students' English and French grades positive or negative?

A

Positive

B

Negative

Worked Solution

c

Is the relationship between students' English and French grades strong or weak?

A

Strong

B

Weak

Worked Solution

Idea summary

When describing a relationship shown in a scatterplot, we can describe the:

Form, usually described as a **linear relationship** or **nonlinear relationship**

Strength, describing how closely the data points match the model line or curve

Direction, usually described as a **positive relationship** or **negative relationship**