4.05 Scatterplots and lines of best fit

Collect and organize bivariate data

We have seen the data cycle for univariate data. Now we will look at the process for numerical bivariate data, where two pieces of data are collected from each study participant. We use these pairs to explore the relationship between the two variables.

This means that when we formulate statistical questions, we will need to ask about relationships between two variables. For example:

  • What is the relationship between education level and salary?

  • What is the relationship between number of followers and number of posts per day?

  • Does age impact bone density?

After formulating a question, we need to collect data. For bivariate data, we need two pieces of data for each data point. This means that in a survey we need to ask each person two questions, and in an experiment we need to take two measurements for each trial.

Name      Age (years)   Bone density (g/cm³)
Nila      30            1.35
Orland    40            1.28
Pei       50            1.22
Qi        60            1.10
Reina     70            0.97

In a study about bone health, a person’s bone density is measured against their age.

Their age is the independent variable and can be any value. Their bone density is the dependent variable that is recorded against their age.

A single data point in a bivariate data set is written in the form \left(x,\,y\right), with x being the independent variable and y being the dependent variable.
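
As a rough illustration (not part of the original lesson), the ages and bone densities from the table can be stored as a list of (x, y) pairs. The variable names here are assumptions chosen for readability.

```python
# Ages (independent variable, x) and bone densities (dependent variable, y)
# copied from the table in this lesson.
ages = [30, 40, 50, 60, 70]
bone_densities = [1.35, 1.28, 1.22, 1.10, 0.97]

# Pair each age with the bone density recorded for the same person.
data_points = list(zip(ages, bone_densities))

for x, y in data_points:
    print(f"({x}, {y})")
```

Each printed pair corresponds to one person in the study, with the independent variable first.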

We often display bivariate data using a scatterplot.

A scatterplot titled Bone density, with Age (years) on the horizontal axis and Bone density (g/cm³) on the vertical axis.

In a scatterplot, we plot the points with the value of the independent variable on the horizontal axis and the value of the dependent variable on the vertical axis.

Scatterplots can often reveal patterns between two variables.

We may then check whether age is related to bone density by analyzing the relationship between the variables graphically.

Exploration

Slide the slider for n to change the number of data points.

Slide the other slider to change the relationship type of the data set.

  1. What do you think a positive relationship means?

  2. Describe what a negative relationship looks like.

  3. For a particular type of relationship, if you change the number of data points, does this change the general shape of the scatterplot?

While data points might display non-linear forms like curves, we'll focus on linear models.

A scatterplot can suggest different kinds of linear relationships between variables. Linear relationships can be positive (rising) or negative (falling). We can identify the relationship based on how the points "slope" from left to right.

A scatterplot showing a positive (rising) relationship. The points form a diagonal line rising from left to right.
Positive (rising) relationship
A scatterplot showing a negative (falling) relationship. The points form a diagonal line falling from left to right.
Negative (falling) relationship
A scatterplot showing no relationship. The points are spread out and do not form any pattern.
No relationship
A scatterplot showing no relationship. The points form a horizontal line.
No relationship

A pattern between two variables is known as a relationship or association. It's important to note that the existence of a relationship between two variables in a scatterplot does not necessarily imply that one causes the other. For example, there is a clear relationship between height and stride length. However, it doesn't mean that if you take big steps you'll grow taller.

Examples

Example 1

The scatterplot shows the relationship between sea temperature and the amount of healthy coral.

A scatterplot showing a negative correlation between sea temperature on the x axis and coral on the y axis.
a

Which variable is the independent variable?

A
Level of healthy coral
B
Sea temperature
Worked Solution
Create a strategy

An independent variable is a variable that stands alone and is not changed by the other variables you are measuring. It is also typically on the x-axis.

Apply the idea

Sea temperature is the independent variable because it is on the x-axis and it would likely affect the coral, but the coral would not change the temperature. The correct answer is B.

b

Which variable is the dependent variable?

A
Level of healthy coral
B
Sea temperature
Worked Solution
Create a strategy

The dependent variable is placed on the vertical axis and is affected by the independent variable.

Apply the idea

The level of healthy coral is determined by sea temperature and is on the vertical axis, making it the dependent variable. So, the correct answer is A.

c

Describe the relationship between sea temperature and the amount of healthy coral.

Worked Solution
Create a strategy

We want to know if the relationship is negative, positive, or if there is no relationship.

Apply the idea

The scatterplot shows a negative (falling) relationship.

This means that as the sea temperature increases, the amount of coral decreases.

Example 2

The table shows the number of traffic accidents associated with a sample of drivers of different age groups.

Age   Accidents
20    41
25    44
30    39
35    34
40    30
45    25
50    22
55    18
60    19
65    17
a

Construct a scatterplot to represent the above data.

Worked Solution
Create a strategy

Draw the scatterplot by plotting each point from the table.

Apply the idea
A scatterplot with Age on the horizontal axis and Accidents on the vertical axis.

Age is the independent variable, so it goes on the horizontal axis. Accidents is the dependent variable, so it goes on the vertical axis.

So the first row from the table corresponds to the point \left(20 , \,41\right) on the graph.

b

Does the scatterplot show no relationship, a positive relationship, or a negative relationship?

Worked Solution
Create a strategy

The points will look random if there is no relationship.

If the pattern of points slopes up from bottom left to top right, it indicates a positive linear relationship between the variables being studied.

If the pattern of points slopes down from top left to bottom right, it indicates a negative linear relationship.

Apply the idea

The points in the scatterplot are going down from left to right. This means this is a negative relationship.
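
One rough way to double-check a trend read from a scatterplot is to compare the average number of accidents for the younger half of the drivers with the average for the older half. This is a simple heuristic added for illustration, not the lesson's method, and the variable names are assumptions. The data come from the table in this example.

```python
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
accidents = [41, 44, 39, 34, 30, 25, 22, 18, 19, 17]

# The ages are already in increasing order, so the first half of the list
# holds the younger drivers and the second half holds the older drivers.
half = len(ages) // 2
younger_mean = sum(accidents[:half]) / half
older_mean = sum(accidents[half:]) / (len(accidents) - half)

if older_mean < younger_mean:
    trend = "falling"
elif older_mean > younger_mean:
    trend = "rising"
else:
    trend = "no clear trend"

print(younger_mean, older_mean, trend)
```

A lower average for the older half agrees with the falling pattern seen in the scatterplot. A comparison like this only hints at direction; it does not replace looking at the plot.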

c

As a person's age increases, does the number of accidents they are involved in increase, decrease, or neither?

Worked Solution
Create a strategy

Check the trend of the data on the scatterplot.

Apply the idea

Based on the scatterplot, as one variable increases, the other one decreases. So as age increases, the number of accidents decreases.

Reflect and check

We cannot say that being younger causes you to have more accidents. However, a possible reason for this relationship could be that older drivers have had more experience or drive less often, so are less likely to get in an accident.

Example 3

Zenaida is curious about relationships involving follower to following ratio on social media for different levels of influencers.

a

Formulate a statistical question that could be used to explore this aspect of social media.

Worked Solution
Create a strategy

A statistical question should require data to be collected, have some variety in the answers, and allow for a data display.

Apply the idea

We could ask "For the students in my class that use social media, is there a relationship between the number of followers someone has and the number of people they are following?"

Reflect and check

There is more than one possible statistical question for this context.

Another possible statistical question is "For social media users, is there a relationship between the number of followers someone has and their follower to following ratio?"

b

Identify the independent and dependent variables.

Worked Solution
Create a strategy

A social media user only has direct control over how many people they follow, not how many people follow them.

Apply the idea

There are two attributes that need to be collected.

Independent variable: Number of people they are following

Dependent variable: Number of followers they have

Reflect and check

Zenaida could ask each person to answer a survey with two questions.

  1. How many people do you follow on this social media platform?

  2. How many followers do you have on this social media platform?

c

Collect data which could be used to help answer her question.

Worked Solution
Create a strategy

We can ask each student if they have a particular social media platform. If they do, we can ask them to find the number of followers and following on their profile.

Apply the idea

For example, we could get this data set:

Following31651432695551407221492234736
Followers344178132211290452323251257494
Following105651816262168428615643333318
Followers4242183054734332807617719316
Reflect and check

These numbers can change drastically, so the data would only be valid for a short amount of time, but the trend might still be relevant.

We could also sort the data into different sets based on social media platform to see if different trends arise for different platforms.

d

Based on the collected data, for students in your class, is there a relationship between the number of followers someone has and the number of people they are following?

Worked Solution
Create a strategy

We should create a scatterplot to see if there is a positive (rising) pattern, negative (falling) pattern, or no pattern.

Apply the idea
A scatterplot with Following on the horizontal axis and Followers on the vertical axis.

This scatterplot shows that there is a positive (rising) relationship between the number of people you are following and the number of followers you have.

This means that as the number of people you are following increases, so does the number of followers you have.

Reflect and check

The scatterplot for your data may show a different relationship, or no relationship at all.

If the population were broader and included influencers and celebrities, the data might not follow this trend.

Idea summary

We can create scatterplots to help us identify patterns and relationships between two variables.

A scatterplot can suggest different kinds of linear relationships between variables. Linear relationships can be positive (rising) or negative (falling). A scatterplot may also show no relationship if it is randomly scattered with no positive or negative pattern.

Three scatterplots showing no relationship, a positive relationship, and a negative relationship.
  • No relationship: The scatterplot suggests that there is no definite positive or negative pattern.

  • Positive relationship: The scatterplot suggests that as variable 1 increases, variable 2 also increases.

  • Negative relationship: The scatterplot suggests that as variable 1 increases, variable 2 decreases.

Line of best fit

A line of best fit is a straight line that best represents the data on a scatterplot. The line may pass exactly through all of the points, some of the points, or none of the points. It always represents the general trend of the data.

Lines of best fit are helpful because we can use them to make interpretations and predictions about the data.

Exploration

Data was collected on the weight of someone's backpack versus their age.

Each scatterplot shows the same data, but with different lines of fit.

Four scatterplots showing the same data, but with different lines of fit. Ask your teacher for more info.
  1. Which line would you choose as a line of best fit? Explain.

  2. Which one would you not choose? Explain.

  3. What are the differences between the line you would choose and the line you would not choose?

To draw a line of best fit by eye, balance the number of points above the line with the number of points below the line as much as possible and ensure the line follows the trend of the data.

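A scatterplot with Independent variable on the horizontal axis and Dependent variable on the vertical axis, with a line drawn through the data.
This is an example of a good line of best fit
A scatterplot with Independent variable on the horizontal axis and Dependent variable on the vertical axis, with a line drawn through the data.
This is an example of a poor line of best fit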
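
The balancing idea can be sketched in Python by counting how many points lie above and below a candidate line. This is an illustrative check rather than the lesson's procedure: the function name `count_above_below` and the candidate line y = -0.65x + 56.5 are assumptions, and the points are the age and accident values from Example 2.

```python
def count_above_below(points, slope, intercept):
    """Count the points strictly above and strictly below y = slope*x + intercept."""
    above = 0
    below = 0
    for x, y in points:
        line_y = slope * x + intercept
        if y > line_y:
            above += 1
        elif y < line_y:
            below += 1
    return above, below


# (age, accidents) pairs from Example 2.
points = [(20, 41), (25, 44), (30, 39), (35, 34), (40, 30),
          (45, 25), (50, 22), (55, 18), (60, 19), (65, 17)]

# Hypothetical candidate line chosen by eye.
above, below = count_above_below(points, slope=-0.65, intercept=56.5)
print(above, below)
```

For this candidate line the counts come out equal, with 5 points above and 5 below. Equal counts alone do not make a good line, since the line must also follow the trend of the data.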

Straight lines are often used to model relationships between two quantities. For scatterplots that model linear relationships, we can describe the relationship as positive, negative, or no relationship. The line of best fit for a positive relationship will have a positive slope and one for a negative relationship will have a negative slope.

The closer the points fit to the line, the more confident we can be about the relationship between the two variables.

A scatterplot of y against x with a line of best fit.
Points are all very close to the line, so we can have higher confidence in the model
A scatterplot of y against x with a line of best fit.
Points are more spread out from the line, so we'll have lower confidence in the model
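
One way to put a number on "how close the points are to the line" is the average vertical distance between the points and the line. The sketch below is an added illustration under stated assumptions: both data sets are made up, the line y = 0.04x + 0.5 is hypothetical, and `mean_vertical_distance` is a name chosen here, not a standard function.

```python
def mean_vertical_distance(points, slope, intercept):
    """Average vertical distance between each point and y = slope*x + intercept."""
    distances = [abs(y - (slope * x + intercept)) for x, y in points]
    return sum(distances) / len(distances)


# Hypothetical data sets, both roughly following y = 0.04x + 0.5.
tight_points = [(10, 0.9), (30, 1.8), (50, 2.4), (70, 3.4), (90, 4.0)]
spread_points = [(10, 1.6), (30, 1.0), (50, 3.3), (70, 2.5), (90, 4.6)]

tight = mean_vertical_distance(tight_points, slope=0.04, intercept=0.5)
spread = mean_vertical_distance(spread_points, slope=0.04, intercept=0.5)

# A smaller average distance means the points sit closer to the line.
print(tight < spread)
```

The data set with the smaller average distance is the one where we can have more confidence in the linear model.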

But remember, just because two variables are related, this does not mean that one causes the other.

Examples

Example 4

The following scatterplot shows the data for two variables, x and y.

A scatterplot of y against x, with both axes numbered from 1 to 10.

Draw a line of best fit for the data.

Worked Solution
Create a strategy

Draw a line that follows the trend of the points and has about the same number of points above and below it.

Apply the idea
The same scatterplot of y against x, with a line of best fit drawn through the data.

Here is an example of a line of best fit that follows the trend of the data and has 4 points above the line and 4 points below the line.

Example 5

Amit is working on budgeting with his mom. He learned about the 30\% guideline, which says that housing costs should be at most 30\% of your income.

He wants to explore the statistical question "What relationships exist between monthly housing costs and income?"

a

Explain how Amit could collect or acquire data to help answer his question, then collect or acquire some data.

Worked Solution
Create a strategy

We can find a lot of aggregate (summary) data online, but finding raw data comparing the exact two variables of interest for a specific region can sometimes be difficult.

He could do a survey to answer this question where the population is his neighborhood. The survey should be anonymous, as this could be very personal information.

Apply the idea

He could choose a random house or apartment on each street and ask a resident to fill in an anonymous form with the questions "What is your monthly household income?" and "What are your monthly housing costs?"

His data could look something like this:

Monthly income98317324103326135771751791627539763134819
Housing cost300012454200122718521243488161919571494
Monthly income2235370615372200454635521652195881132950
Housing cost69312235848941864145769488144621711
Reflect and check

This would be a fairly small sample for a whole neighborhood, but could indicate what the relationships might be.

b

Draw a line of best fit to summarize the data.

Worked Solution
Create a strategy

To draw a line of best fit, we must first create a scatterplot. Then we can draw a line that goes through the data with about half of the points above and about half below the line.

We can start by opening a statistics calculator and typing the data into the spreadsheet.

A screenshot of the GeoGebra statistics tool showing how to input a given set of data. Speak to your teacher for more details.

Next, we can highlight all the data and click Two Variable Regression Analysis. In other programs, we might need to insert a chart or zoom to see the data.

A screenshot of the GeoGebra statistics tool showing how to select the Two Variable Regression Analysis option. Speak to your teacher for more details.

This will give us a scatterplot.

A screenshot of the GeoGebra statistics tool showing how to generate the scatterplot of a given data set. Speak to your teacher for more details.
Apply the idea
A scatterplot with Income on the horizontal axis and Housing on the vertical axis, with a line of best fit drawn by eye.

The independent variable is the income and the dependent variable is the housing cost.

There are about 10 points above and 10 points below the line.

Some of the points are very close to the line, while others are farther away.

Reflect and check

Most technology tools have the ability to plot a line of best fit for us. In GeoGebra, we use the drop down menu for Regression Model and click Linear. We can see that the line of best fit looks similar to the one done by eye.

A screenshot of the GeoGebra statistics tool showing how to display the equation of the line of best fit. Speak to your teacher for more details.
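
Regression tools commonly fit a line using the least-squares method. Assuming that is what a linear regression option does, the calculation can be sketched in plain Python. The function name `least_squares_line` is hypothetical, and the bone density table from earlier in this lesson is used as sample data.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Age (years) and bone density (g/cm³) from the table earlier in the lesson.
ages = [30, 40, 50, 60, 70]
bone_densities = [1.35, 1.28, 1.22, 1.10, 0.97]

slope, intercept = least_squares_line(ages, bone_densities)
print(f"y = {slope:.4f}x + {intercept:.3f}")

# Use the line to estimate bone density at an age that was not measured.
predicted_at_55 = slope * 55 + intercept
print(f"{predicted_at_55:.3f}")
```

The negative slope matches the falling pattern in the bone density scatterplot, and the last two lines show how a fitted line can be used to make a prediction.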
c

Describe the relationship as positive, negative, or no relationship.

Worked Solution
Create a strategy

A positive relationship will have the points rising from left to right.

A negative relationship will have the points falling from left to right.

If the points are randomly scattered, there is no relationship.

Apply the idea

As the income increases, the cost of housing also increases. This means that the points are rising, so there is a positive relationship.

Reflect and check

The slope of the line of best fit is related to the type of relationship. In this case, the slope of the line of best fit is positive, so the relationship is also positive.

d

Based on the clustering of the points around the line, what conclusions can you draw from the scatterplot?

Worked Solution
Create a strategy

We can look at the direction of the line, how close the points are to the line, and patterns within the data set.

Apply the idea
A scatterplot with Income on the horizontal axis and Housing on the vertical axis, with a line of best fit.

For those with monthly incomes around \$2000, the points are quite clustered around the line, so the relationship is stronger there.

However, some of the points are quite far from the line, so we cannot be completely confident in any conclusions we make.

We could draw the conclusion that there is a moderately strong, positive relationship between income and housing costs. Also, we can say that for higher incomes, housing costs vary more than for lower incomes.

Reflect and check

This could lead to further questions about the proportion of income spent on housing, like "What is the distribution for percentage of income spent on housing costs?"
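
The proportion in that follow-up question is one division per household. A minimal Python sketch, using made-up income and housing values rather than Amit's data (the variable names are also assumptions):

```python
# Hypothetical (monthly income, monthly housing cost) pairs in dollars.
households = [(4000, 1200), (2500, 1000), (8000, 2000)]

GUIDELINE = 0.30  # the 30% guideline mentioned at the start of this example

# Proportion of income spent on housing for each household.
ratios = [housing / income for income, housing in households]

# True when a household spends at most 30% of its income on housing.
within_guideline = [ratio <= GUIDELINE for ratio in ratios]

for (income, housing), ratio, ok in zip(households, ratios, within_guideline):
    print(f"income {income}, housing {housing}: ratio {ratio:.2f}, within guideline: {ok}")
```

A list of ratios like this is the kind of data that could then be displayed in a histogram.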

e

How can we clearly communicate the results?

Worked Solution
Create a strategy

We can first show the raw data, then the display, then show the conclusions.

Apply the idea

The raw data was collected using a random convenience sample in one specific neighborhood, so likely does not represent the wider US population. Surveys were done anonymously.

Monthly income98317324103326135771751791627539763134819
Housing cost300012454200122718521243488161919571494
Monthly income2235370615372200454635521652195881132950
Housing cost69312235848941864145769488144621711

The data can be summarized with scatterplots. The first scatterplot shows the whole data set. The other two split it into incomes below and above \$6000 per month.

A scatterplot of the whole data set, with Income on the horizontal axis and Housing on the vertical axis, with a line of best fit.

We can see that overall, there is a positive relationship between income and housing costs. For lower incomes, there is a stronger pattern than for higher incomes.

A scatterplot of the lower incomes, with Income on the horizontal axis and Housing on the vertical axis, with a line of best fit.

We can see that for the lower incomes, more points are above the line, so they are spending a higher proportion of their income on housing.

A scatterplot of the higher incomes, with Income on the horizontal axis and Housing on the vertical axis, with a line of best fit.

We can see that for the higher incomes, more points are below the line, so they are spending a lower proportion of their income on housing.

Overall, the higher your income, the more you spend on housing.

Reflect and check

This histogram shows the ratios of housing to income and could help to communicate the results. Different bin widths can tell a slightly different story.

A histogram with frequency from 0 to 7 on the vertical axis and Housing : income ratio on the horizontal axis. The bins are 0.05 wide. The frequencies are 3 for 0.2-0.25, 1 for 0.25-0.3, 7 for 0.3-0.35, 1 for 0.35-0.4, 6 for 0.4-0.45, and 2 for 0.55-0.6.
Idea summary
A scatterplot of y against x with a line of best fit.

When drawing a line of best fit by eye, balance the number of points above the line with the number of points below the line, and place the line as close as possible to the points.

The closer the points fit to the line, the more confident we can be about the relationship between the two variables.

Outcomes

8.PS.3

The student will apply the data cycle (formulate questions; collect or acquire data; organize and represent data; and analyze data and communicate results) with a focus on scatterplots.

8.PS.3a

Formulate questions that require the collection or acquisition of data with a focus on scatterplots.

8.PS.3b

Determine the data needed to answer a formulated question and collect the data (or acquire existing data) of no more than 20 items using various methods (e.g., observations, measurement, surveys, experiments).

8.PS.3c

Organize and represent numeric bivariate data using scatterplots with and without the use of technology.

8.PS.3d

Make observations about a set of data points in a scatterplot as having a positive linear relationship, a negative linear relationship, or no relationship.

8.PS.3e

Analyze and justify the relationship of the quantitative bivariate data represented in scatterplots.

8.PS.3f

Sketch the line of best fit for data represented in a scatterplot.
