We have seen the data cycle for univariate data. Now we will look at the process for numerical bivariate data, where two pieces of data are collected from each study participant. We use these pairs to explore the relationship between the two variables.
This means that when we formulate statistical questions, we will need to ask about relationships between two variables. For example:
What is the relationship between education level and salary?
What is the relationship between number of followers and number of posts per day?
Does age impact bone density?
After formulating a question, we need to collect data. For bivariate data, we need two pieces of data for each data point. For a survey, this means asking each person two questions; for an experiment, it means taking two measurements in each trial.
Participant | Age (years) | Bone density (g/cm³) |
---|---|---|
Nila | 30 | 1.35 |
Orland | 40 | 1.28 |
Pei | 50 | 1.22 |
Qi | 60 | 1.10 |
Reina | 70 | 0.97 |
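The pairing in the table above can be sketched in code. This is a minimal illustration, not part of the original lesson: the variable names `participants`, `ages`, and `densities` are assumptions chosen for readability, and the values are copied from the table.

```python
# Each participant contributes one (age, bone density) pair, copied from the table above.
participants = {
    "Nila": (30, 1.35),
    "Orland": (40, 1.28),
    "Pei": (50, 1.22),
    "Qi": (60, 1.10),
    "Reina": (70, 0.97),
}

# Every data point needs exactly two measurements to count as bivariate.
assert all(len(pair) == 2 for pair in participants.values())

# Split the pairs into the two variables, keeping the order so values stay matched.
ages = [age for age, _ in participants.values()]
densities = [density for _, density in participants.values()]

for name, (age, density) in participants.items():
    print(f"{name}: age {age} years, bone density {density} g/cm^3")
```

Keeping the two values together as a pair matters: if the ages and densities were shuffled independently, the relationship between the variables would be lost.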
We often display bivariate data using a scatterplot.
Slide the slider for n to change the number of data points.
Slide the other slider to change the relationship type of the data set.
What do you think a positive relationship means?
Describe what a negative relationship looks like.
For a particular type of relationship, if you change the number of data points, does this change the general shape of the scatterplot?
While data points might display non-linear forms like curves, we'll focus on linear models.
A scatterplot can suggest different kinds of linear relationships between variables. Linear relationships can be positive (rising) or negative (falling). We can identify the relationship based on how the points "slope" from left to right.
A pattern between two variables is known as a relationship or association. It's important to note that the existence of a relationship between two variables in a scatterplot does not necessarily imply that one causes the other. For example, there is a clear relationship between height and stride length. However, it doesn't mean that if you take big steps you'll grow taller.
The scatterplot shows the relationship between sea temperature and the amount of healthy coral.
Which variable is the independent variable?
Which variable is the dependent variable?
Describe the relationship between sea temperature and the amount of healthy coral.
The table shows the number of traffic accidents associated with a sample of drivers of different age groups.
Age | Accidents |
---|---|
20 | 41 |
25 | 44 |
30 | 39 |
35 | 34 |
40 | 30 |
45 | 25 |
50 | 22 |
55 | 18 |
60 | 19 |
65 | 17 |
Construct a scatterplot to represent the above data.
Does the scatterplot show no relationship, a positive relationship, or a negative relationship?
As a person's age increases, does the number of accidents they are involved in increase, decrease, or neither?
Zenaida is curious about relationships involving the follower-to-following ratio on social media for different levels of influencers.
Formulate a statistical question that could be used to explore this aspect of social media.
Identify the independent and dependent variables.
Collect data which could be used to help answer her question.
Based on the collected data, for students in your class, is there a relationship between the number of followers someone has and the number of people they are following?
We can create scatterplots to help us identify patterns and relationships between two variables.
A scatterplot can suggest different kinds of linear relationships between variables. Linear relationships can be positive (rising) or negative (falling). A scatterplot may also show no relationship if the points are randomly scattered with no positive or negative pattern.
No relationship: The scatterplot suggests that there is no definite positive or negative pattern.
Positive relationship: The scatterplot suggests that as variable 1 increases, variable 2 also increases.
Negative relationship: The scatterplot suggests that as variable 1 increases, variable 2 decreases.
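The three descriptions above can also be checked numerically. The sketch below is an added illustration, not the lesson's method (the lesson reads the direction from the scatterplot). It assumes a hypothetical helper, `direction`, that adds up the products of each point's deviations from the two means: the sum is positive when the variables tend to rise together and negative when one falls as the other rises. The data are the age and accident counts from the traffic accident table in this lesson.

```python
# (age, accidents) pairs copied from the traffic accident table in this lesson.
data = [(20, 41), (25, 44), (30, 39), (35, 34), (40, 30),
        (45, 25), (50, 22), (55, 18), (60, 19), (65, 17)]

def direction(pairs):
    """Classify the direction of a relationship from the sign of the deviation products.

    Hypothetical helper for illustration only: it reports direction,
    not how strong or how linear the relationship is.
    """
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # A product is positive when both values sit on the same side of their means,
    # and negative when they sit on opposite sides.
    total = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "no relationship"

print(direction(data))  # prints "negative"
```

A randomly scattered data set will rarely give a total of exactly zero, so a small positive or negative total should not be read as a clear relationship. The scatterplot remains the better tool for judging whether a pattern is really there.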
A line of best fit is a straight line that best represents the data on a scatterplot. The line may pass exactly through all of the points, some of the points, or none of the points. It always represents the general trend of the data.
Lines of best fit are helpful because we can use them to make interpretations and predictions about the data.
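As an added illustration of making a prediction from a line, the sketch below fits a line to the bone density data from earlier in this lesson and uses it to estimate a value. The lesson draws lines of best fit by eye; this example instead uses the least-squares formula, a standard way to calculate such a line. The function name `fit_line` and the choice of age 55 are assumptions made for the example.

```python
def fit_line(pairs):
    """Return (slope, intercept) of the least-squares line through the pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sum of deviation products, and sum of squared deviations in x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    # The least-squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

# (age in years, bone density in g/cm^3) pairs from the table in this lesson.
bone_data = [(30, 1.35), (40, 1.28), (50, 1.22), (60, 1.10), (70, 0.97)]

slope, intercept = fit_line(bone_data)
predicted = slope * 55 + intercept  # age 55 lies inside the range of the data

print(f"bone density = {slope:.4f} * age + {intercept:.3f}")
print(f"predicted bone density at age 55: {predicted:.2f} g/cm^3")
```

The negative slope matches the falling pattern in the table. With only five data points, the prediction is a rough estimate, and it is most trustworthy for ages inside the range that was actually measured.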
Data was collected on the weight of someone's backpack versus their age.
Each scatterplot shows the same data, but with different lines of fit.
Which line would you choose as a line of best fit? Explain.
Which one would you not choose? Explain.
What are the differences between the line you would choose and the line you would not choose?
To draw a line of best fit by eye, balance the number of points above the line with the number of points below the line as much as possible and ensure the line follows the trend of the data.
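The balancing rule above can be sketched as a quick check on a candidate line. This is an added illustration: the helper `count_above_below` and both candidate lines are hypothetical, chosen only to show one balanced and one unbalanced case. The data are the age and accident counts from the traffic accident table in this lesson.

```python
def count_above_below(pairs, slope, intercept):
    """Count data points above and below the line y = slope * x + intercept."""
    above = sum(1 for x, y in pairs if y > slope * x + intercept)
    below = sum(1 for x, y in pairs if y < slope * x + intercept)
    return above, below  # points exactly on the line are counted in neither group

# (age, accidents) pairs copied from the traffic accident table in this lesson.
accidents = [(20, 41), (25, 44), (30, 39), (35, 34), (40, 30),
             (45, 25), (50, 22), (55, 18), (60, 19), (65, 17)]

# Hypothetical candidate 1: accidents = -0.625 * age + 55.5
print(count_above_below(accidents, -0.625, 55.5))  # prints (5, 5)

# Hypothetical candidate 2: same slope, but drawn too high on the graph
print(count_above_below(accidents, -0.625, 65))    # prints (0, 10)
```

The first candidate splits the points evenly, while the second leaves every point below the line. An even count is not enough on its own: the line also needs to follow the trend, so the points above and below should be spread along its whole length rather than bunched at one end.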
Straight lines are often used to model relationships between two quantities. For scatterplots that model linear relationships, we can describe the relationship as positive, negative, or no relationship. The line of best fit for a positive relationship will have a positive slope and one for a negative relationship will have a negative slope.
The closer the points fit to the line, the more confident we can be about the relationship between the two variables.
But remember, just because two variables are related, this does not mean that one causes the other.
The following scatterplot shows the data for two variables, x and y.
Draw a line of best fit for the data.
Amit is working on budgeting with his mom. He learned about the 30% guideline, which says that housing costs should be at most 30% of your income.
He wants to explore the statistical question "What relationships exist between monthly housing costs and income?"
Explain how Amit could collect or acquire data to help answer his question, then collect or acquire some data.
Draw a line of best fit to summarize the data.
Describe the relationship as positive, negative, or no relationship.
Based on the clustering of the points around the line, what conclusions can you draw from the scatterplot?
How can we clearly communicate the results?