8.03 Scatter plots and lines of fit

Introduction

We met lines of fit informally in 8th grade, where we used them to describe linear associations between two quantitative variables. This lesson uses lines of fit as a foundation for discussing correlation and causation.

Scatter plots and lines of fit

When looking at bivariate data, it can often appear that the two variables are correlated.

Correlation

A relationship between two variables

It is important to be able to distinguish between causal relationships (when changes in one variable cause changes in the other variable) and non-causal relationships.

Causation

A relationship between two events where one event causes the other

To claim a correlation between two variables, we can examine mathematical calculations that measure the strength of the association between them. Causation can only be determined from an appropriately designed statistical experiment.

For quantitative data, we can describe an association as positive or negative, as well as whether the association is strong or weak (or if there is no association).

Exploration

Consider the graph shown:

[Figure: scatter plot titled "Car Value". Horizontal axis: time since purchase (years), x, from 1 to 5. Vertical axis: value (in thousands of dollars), y, from 5 to 30.]

  1. Is there a correlation between the years since purchased and the value in thousands of dollars?
  2. Which of the lines on the graph is the line of best fit?

A line of best fit (or trend line) is a straight line that best represents the data on a scatter plot. We can use lines of best fit to help us make predictions or conclusions about the data.

To draw a line of best fit by eye, balance the number of points above the line with the number of points below the line. We should generally ignore outliers as they can skew the line of best fit.
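
The balancing idea above can be checked numerically. The following is a minimal sketch, assuming a small set of invented points and two candidate lines chosen only for illustration; the function name `balance` is hypothetical. It counts how many points fall above and below each candidate line.

```python
# Hypothetical (x, y) points, invented for illustration only.
points = [(1, 28.0), (2, 26.5), (3, 23.0), (4, 22.5), (5, 19.0), (6, 17.5)]


def balance(points, slope, intercept):
    """Count points strictly above and strictly below y = slope * x + intercept."""
    above = sum(1 for x, y in points if y > slope * x + intercept)
    below = sum(1 for x, y in points if y < slope * x + intercept)
    return above, below


# Two candidate lines with the same slope but different intercepts.
balanced_line = balance(points, slope=-2.2, intercept=30.5)
too_high_line = balance(points, slope=-2.2, intercept=35.0)

print("candidate 1 (above, below):", balanced_line)
print("candidate 2 (above, below):", too_high_line)
```

A line with roughly equal counts on each side is a reasonable by-eye candidate. Equal counts alone do not guarantee a good fit, so the line should also follow the overall direction of the points.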

The analysis of bivariate data should include:

  • Form, usually described as a linear association or nonlinear association
  • Strength, describing how closely the data points match the form
  • Direction, usually described as positive association or negative association

A scatter plot can be used to display bivariate data once the independent and dependent variables are defined.

The correlation coefficient, r, is a statistic that can describe both the strength and direction of a linear association.
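
The lesson does not give a formula for r. For reference, and assuming r here means the Pearson correlation coefficient, one common way to write it for n data points (x_i, y_i) with means \bar{x} and \bar{y} is:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The sign of the numerator gives the direction of the association, and the value of r always lies between -1 and 1.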

[Figure: six scatter plots of y against x, illustrating different values of r:]

  • Perfect positive correlation, r=1
  • Perfect negative correlation, r=-1
  • Strong negative correlation, r=-0.974
  • Weak positive correlation, r=0.306
  • Moderate negative correlation, r=-0.684
  • No correlation, r=0.072

It is important to be able to distinguish between causal relationships (when changes in one variable cause changes in the other variable) and correlation where the two variables are related, but one variable does not necessarily influence the other.

Examples

Example 1

A study was conducted to find the relationship between the age at which a child first speaks and their level of intelligence as teenagers. The following table shows the ages of some teenagers when they first spoke and their results in an aptitude test:

Age when first spoke (months) | 14 | 27 | 9 | 16 | 21 | 17 | 10 | 7 | 19 | 24
Aptitude test results | 96 | 69 | 93 | 101 | 87 | 92 | 99 | 104 | 93 | 97
a

Create a scatter plot to model the data.

Worked Solution
Create a strategy

Let x= age when the child first spoke and y= aptitude test results as a teen.

The minimum value for x is 7 and the maximum is 27, so we can use a scale of 5 to label the x-axis. The minimum value for y is 69 and the maximum is 104, so we can use a scale of 20 to label the y-axis.

Apply the idea
[Figure: scatter plot of the data. Horizontal axis: age (months), from 5 to 25. Vertical axis: aptitude score, from 20 to 100.]
b

Sketch an approximate line of best fit for the scatter plot and interpret the y-intercept.

Worked Solution
Create a strategy

Estimate a line of best fit by balancing the number of points above the line with the number of points below the line.

Apply the idea
[Figure: the same scatter plot with an approximate line of best fit. Horizontal axis: age (months), from 5 to 25. Vertical axis: aptitude score, from 20 to 100.]

The y-intercept is about 109, which is the predicted aptitude test result when x=0. This suggests that a child who first speaks at 0 months old would score about 109 on the aptitude test as a teenager.
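
The by-eye estimate of the y-intercept can be compared with a calculated line. The sketch below assumes a least-squares fit, which is not the method used in this lesson, and uses the ten data pairs from the table in Example 1. The function name `least_squares_line` is hypothetical.

```python
# Data transcribed from the Example 1 table.
ages = [14, 27, 9, 16, 21, 17, 10, 7, 19, 24]          # months
scores = [96, 69, 93, 101, 87, 92, 99, 104, 93, 97]    # aptitude test results


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = least_squares_line(ages, scores)
print(f"slope = {slope:.2f}, y-intercept = {intercept:.1f}")
```

Under these assumptions the calculated y-intercept comes out at about 110, close to the by-eye estimate of 109. A line drawn by eye will rarely match a calculated line exactly.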

Reflect and check

Note that children do not begin speaking within their first month of birth, so the y-intercept is theoretical.

c

Estimate the correlation coefficient and describe the association between the variables.

Worked Solution
Create a strategy

The correlation coefficient will be a value between -1 and 1, depending on the strength and direction of the association.

Describe the association with the following attributes:

  • Form: linear, quadratic, or exponential
  • Strength: strong or weak
  • Direction: positive or negative
Apply the idea

There is a strong, negative, linear association between the age when a child first spoke and their aptitude test score as a teenager. The correlation coefficient is between -0.9 and -0.7.
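
The estimate can be checked by calculation. This is a minimal sketch, assuming the Pearson correlation coefficient and the ten data pairs from the table in Example 1. The function name `pearson_r` is hypothetical.

```python
import math

# Data transcribed from the Example 1 table.
ages = [14, 27, 9, 16, 21, 17, 10, 7, 19, 24]          # months
scores = [96, 69, 93, 101, 87, 92, 99, 104, 93, 97]    # aptitude test results


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(ages, scores)
print(f"r = {r:.3f}")
```

For this data the calculated value is negative and roughly -0.7, which sits at the weaker end of the estimated range.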

Reflect and check

The closer the points are to following the form of the association (here, a straight line), the stronger the association will be. Extreme data values have a large impact on correlation.

d

Determine if there is enough evidence to suggest a causal relationship between the age when a child first speaks and their intelligence as teenagers.

Worked Solution
Apply the idea

No. The data show a correlation, but correlation alone is not enough to establish causation.

Reflect and check

An association between two quantities is evidence to suggest that the value of one quantity can be predicted with some accuracy given the other quantity, but is not enough evidence to suggest that changes in one quantity directly cause changes in the other.

Example 2

Determine whether the following statement is true or false:

"There is a causal relationship between number of cigarettes a person smokes and their life expectancy."

Worked Solution
Apply the idea

True. It is generally understood that smoking cigarettes can cause diseases such as cancer, which is known to have an effect on life expectancy.

Reflect and check

The evidence of a causal relationship usually comes from generally accepted truths or verified research studies. A causal relationship is not confirmed by finding an association between variables.

Example 3

Consider the graph showing the relationship between the years since purchasing a car and its value in thousands of dollars:

[Figure: scatter plot titled "Car Value". Horizontal axis: time since purchase (years), x, from 1 to 5. Vertical axis: value (in thousands of dollars), y, from 5 to 30.]

a

The equation of the line of best fit for the data is y=-2.2x+30.5. Interpret the slope and y-intercept of the line.

Worked Solution
Create a strategy

Use the labels on the axes of the graph to determine the units of the slope and y-intercept.

Apply the idea

The y-intercept of (0,30.5) on the graph means that at the time of purchasing the car, it would have a value of $30,500.

The slope of -2.2 means that each year, the car's value would decrease by $2,200.

b

Make a prediction about the value of a car after 10 years.

Worked Solution
Create a strategy

Since 10 years after purchase is not shown on the graph, we can use the equation of the line of best fit to determine the value of a car at that time.

Apply the idea

We have

y=-2.2(10)+30.5=8.5

Based on the equation of the line of best fit, a car that is initially valued at $30,500 will be worth $8,500 ten years after it was purchased.
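
The same substitution can be written as a small function. This is a minimal sketch using the line of best fit y=-2.2x+30.5 from this example; the function name `predicted_value` is hypothetical.

```python
def predicted_value(years, slope=-2.2, intercept=30.5):
    """Predicted car value, in thousands of dollars, from y = -2.2x + 30.5."""
    return slope * years + intercept


value_at_purchase = predicted_value(0)    # the y-intercept
value_after_10 = predicted_value(10)      # the prediction from part b

print(f"At purchase: {value_at_purchase:.1f} thousand dollars")
print(f"After 10 years: {value_after_10:.1f} thousand dollars")
```

Because the line keeps decreasing, it eventually predicts a negative value (for example at 14 years), so predictions far beyond the plotted data should be treated with caution.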

c

Estimate the correlation coefficient and describe the association between the variables and explain whether there is a causal relationship.

Worked Solution
Apply the idea

There is a strong, negative, linear association between the years since purchasing a car and its value in thousands of dollars. The correlation coefficient is between -1 and -0.9. Since correlation does not mean causation, we cannot say whether there is a causal relationship, regardless of the strength of the association.

Reflect and check

Remember, only a carefully designed statistical experiment can determine causation. There are other factors to consider when deciding whether a relationship is causal. A car that is kept in a garage for many years and rarely driven may hold a different value than a car used for a daily commute. Certain sought-after car models may fluctuate in value over time. So a line of best fit and a strong association can give an estimate of a car's value, but they cannot determine whether the relationship is causal.

Idea summary

The analysis of bivariate data should include:

  • Form, usually described as a linear association or nonlinear association
  • Strength, describing how closely the data points match the form
  • Direction, usually described as positive association or negative association

Recall that the correlation coefficient can describe both the strength and direction of a linear association.

Outcomes

S.ID.B.6

Represent data on two quantitative variables on a scatter plot, and describe how the variables are related.

S.ID.B.6.C

Fit a linear function for a scatter plot that suggests a linear association.

S.ID.C.7

Interpret the slope (rate of change) and the intercept (constant term) of a linear model in the context of the data.

S.ID.C.9

Distinguish between correlation and causation.
