
2.01 Bivariate data and line of best fit

Lesson

Least-squares regression line

In the previous chapter, we learnt to describe the relationship between numerical variables. If there is a relationship, we can recognise when it is linear or non-linear, strong or weak, and positive or negative.

We can draw a line, by eye, that best fits the points in a scatter plot. The images below show some examples of a line of best fit. The line of best fit does not necessarily go through the data points. It is positioned so that it minimises the overall distance from the points.

Three graphs showing different lines of best fit. Ask your teacher for more information.

Constructing a line of best fit in this way can be very useful because the line summarises all of the data points in a way that allows us to predict the value of the response (dependent) variable that corresponds to a given value of the explanatory (independent) variable, or vice-versa. However, it takes a certain amount of judgement to choose the position and angle of the line. Constructing the line by eye is not very reliable, especially if the scatter plot does not show a strong relationship.

Fortunately, there are mathematical calculations that we can use to determine a line of best fit with more consistency. In this lesson, we will learn to construct a least-squares regression line.

Exploration

Try moving the points to see how the least-squares regression line follows the data.


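The least-squares regression line balances the data so that the vertical distances of the points above the line and below the line cancel out. It often, though not always, leaves a similar number of points on each side.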

The least-squares regression line is a straight line, so it can be represented by a linear function in gradient-intercept form: y=mx+c, where y is the predicted value of the response (dependent) variable, x is the explanatory (independent) variable, m is the gradient of the line, and c is the vertical intercept (or y-intercept) of the line.

It is common practice to use the symbol \hat{y} (which is pronounced "y-hat"), instead of y, as the variable for the predicted value of the response variable.

Unless we are working with a very small data set, the calculations to determine the least-squares regression line are too tedious to do by hand. In this course, we will use technology for these calculations.
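For readers curious about what the technology is doing, here is a minimal Python sketch of the standard least-squares formulas: the gradient is m = S_xy / S_xx and the intercept is c = (mean of y) - m × (mean of x), where S_xy is the sum of (x - mean of x)(y - mean of y) and S_xx is the sum of (x - mean of x)^2. The function name `least_squares_line` and the small data set are hypothetical and chosen only for illustration. They are not part of the lesson.

```python
def least_squares_line(xs, ys):
    """Return (m, c) for the least-squares line y = m*x + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    m = s_xy / s_xx            # gradient
    c = y_mean - m * x_mean    # vertical intercept
    return m, c


# Hypothetical data, not taken from the lesson.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, c = least_squares_line(xs, ys)
print(f"y = {m:.2f}x + {c:.2f}")  # prints: y = 0.60x + 2.20
```

A calculator's linear regression function carries out an equivalent calculation, which is why entering the data carefully matters more than the arithmetic itself.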

Examples

Example 1

Use technology to find the line of best fit for the data below.

x | 24 | 37 | 19 | 31 | 32 | 22 | 14 | 30 | 23 | 40
y | -7 | -8 | -3 | -6 | -9 | -8 | -2 | -8 | -8 | -12

Write the equation with the coefficient and constant term rounded to two decimal places.

Worked Solution
Create a strategy

Use the linear regression function on your calculator.

Apply the idea

Using the Statistics mode, enter each x-value along with its y-value into a data table on your calculator, then find the linear regression.

Write down the equation of the line in the form y=mx+c: y=-0.29x+0.67

Idea summary

The least-squares regression line is a straight line with equation of the form:

y=mx+c

where:

  • y is the predicted value of the response (dependent) variable

  • x is the explanatory (independent) variable

  • m is the gradient of the line

  • c is the vertical intercept (or y-intercept) of the line.

Interpret the line of best fit

Now that we can determine the equation of the least-squares regression line, we are able to make important observations about the original data. When we are analysing data, it is important that we consider the context.

As an example, suppose that for a group of children, y represents vocabulary size in words (the response variable) and x represents the number of pages read per day (the explanatory variable).

Suppose also that technology gives the least-squares regression line, in the form y=mx+c, as y=14.87x+30.26.

The m value displayed is the gradient of the least-squares regression line. Since this is a positive number, it indicates that there is a positive relationship between the variables. In the context of this example, this tells us that for each additional page of reading per day, a child's vocabulary increases by approximately 15 words.

The c value provided is the vertical intercept of the line of best fit and, in the context of this example, tells us that a child that does no reading would have a vocabulary of approximately 30 words.
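The interpretation above can be checked numerically. The following is a small sketch that uses the coefficients quoted in this lesson for the vocabulary example. The function name `predicted_vocabulary` is hypothetical.

```python
# Coefficients quoted in the lesson for the vocabulary example.
m = 14.87  # gradient: extra words per additional page read per day
c = 30.26  # vertical intercept: predicted vocabulary with no reading


def predicted_vocabulary(pages_per_day):
    """Predicted vocabulary (words) from y = m*x + c."""
    return m * pages_per_day + c


# With no reading (x = 0) the prediction is just the intercept.
print(round(predicted_vocabulary(0), 2))
# One extra page per day changes the prediction by the gradient.
print(round(predicted_vocabulary(5) - predicted_vocabulary(4), 2))
```

The first value is the intercept and the second is the gradient, matching the two interpretations given in the text.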

Given a least-squares regression line of the form y=mx+c:

The m value shows the gradient:

  • if the gradient is positive, when the explanatory variable increases by 1 unit, the response variable increases by m units.

  • if the gradient is negative, when the explanatory variable increases by 1 unit, the response variable decreases by |m| units.

The c value shows the vertical intercept (also known as the y-intercept):

  • when the explanatory variable is 0, the value of the response variable is c.

When we interpret the vertical intercept, we need to consider if it makes sense for the explanatory variable to be zero and the response variable to have a value indicated by c.

Examples

Example 2

The prices of various second-hand Mitsubishi Lancers are shown below.

Age | 1 | 2 | 0 | 5 | 7 | 4 | 3 | 4 | 8 | 2
Value (\$) | 16\,000 | 13\,000 | 21\,990 | 10\,000 | 8600 | 12\,500 | 11\,000 | 11\,000 | 4500 | 14\,500
a

Find the equation of the least-squares regression line for the price (y) in terms of age (x). Round all values to the nearest integer.

Worked Solution
Create a strategy

Use the linear regression function on your calculator.

Apply the idea

Using the Statistics mode, enter each x-value along with its y-value into a data table on your calculator, then find the linear regression.

Write down the equation of the line in the form y=mx+c:y=-1694x+18\,407

b

State the value of the y-intercept.

Worked Solution
Create a strategy

Recall the equation y=mx+c, where c is the y-intercept of the line.

Apply the idea

From the equation y=-1694x+18\,407:c=18\,407

c

Complete the statement: The value of the y-intercept indicates that when a Mitsubishi Lancer is brand new, its value is, on average, ____ dollars.

Worked Solution
Create a strategy

Use the y-intercept.

Apply the idea

When the Mitsubishi is brand new the age will be x=0. So the value, y, will be equal to the y-intercept which is \$18\,407.

The value of the y-intercept indicates that when a Mitsubishi Lancer is brand new, its value is, on average, 18\,407 dollars.

d

Does the interpretation in the previous part make sense in this context?

A
Yes, when the explanatory variable has a value of zero, this is still within the data range and the value of the dependent variable makes sense.
B
No, when the explanatory variable has a value of zero, this is outside the data range and the value of the dependent variable does not make sense.
Worked Solution
Create a strategy

Check whether the value in part (c) lies in the data range, and whether the statement makes sense.

Apply the idea

We can see in the table that the age (x) has a data range from 0 to 8. Since a brand new Mitsubishi Lancer has x=0, this is within the data range.

When a car is brand new it has a high value. So the statement makes sense.

So the answer is Option A.

e

State the gradient of the line.

Worked Solution
Create a strategy

Use the equation of the line where m is the coefficient of x.

Apply the idea

From the equation y=-1694x+18\,407:m=-1694

f

Which of the following is true?

A
If the explanatory variable increases by 1 unit, then the dependent variable increases by 1694 units.
B
If the explanatory variable increases by 1 unit, then the dependent variable increases by 18\,407 units.
C
If the explanatory variable increases by 1 unit, then the dependent variable decreases by 18\,407 units.
D
If the explanatory variable increases by 1 unit, then the dependent variable decreases by 1694 units.
Worked Solution
Create a strategy

Consider the size and sign of the gradient.

Apply the idea

From part (e), we identified the gradient as m=-1694.

Since the gradient is negative, the dependent variable will decrease for every 1 unit the independent variable increases. The amount it is decreasing by is 1694.

So the correct answer is D.

Idea summary

Given a least-squares regression line of the form y=mx+c:

The m value shows the gradient:

  • if the gradient is positive, when the explanatory variable increases by 1 unit, the response variable increases by m units.

  • if the gradient is negative, when the explanatory variable increases by 1 unit, the response variable decreases by |m| units.

The c value shows the vertical intercept (also known as the y-intercept):

  • when the explanatory variable is 0, the value of the response variable is c.

Correlation coefficient

The correlation coefficient was introduced in the previous chapter as a measure that tells us the strength of a relationship between two variables. It is denoted by the letter r. The sign of r also tells us the direction of the relationship.

A perfect positive correlation has a value of 1. That means that if we graphed the variables on a scatter plot, it would show that all the data points lie exactly on a straight line with a positive gradient.

A perfect negative correlation has a value of -1 in which case all data points lie exactly on a straight line with a negative gradient.

A value of 0 indicates that there is no linear relationship between the variables.

To be more descriptive about the relationship between variables, we can divide the range of values from -1 to 1 into intervals with descriptions like "weak", "moderate" and "strong", as in the figure below.

A number line showing values and descriptions of correlation from negative 1 to 1. Ask your teacher for more information.

The strength of correlation depends on the size of the r value, so we can ignore the positive or negative sign:

  • if the size is less than 0.5, then we have a weak correlation;

  • if the size is between 0.5 and 0.8, then we have a moderate correlation;

  • if the size is greater than 0.8, then we have a strong correlation.
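The classification above can be sketched as a short Python function. The function name `describe_correlation` is hypothetical, and treating a size of exactly 0.5 or 0.8 as "moderate" is an assumption, since the lesson does not say which category the boundary values belong to.

```python
def describe_correlation(r):
    """Describe r using the lesson's cut-offs of 0.5 and 0.8."""
    if r == 0:
        return "no linear correlation"
    size = abs(r)  # strength depends only on the size of r
    if size < 0.5:
        strength = "weak"
    elif size <= 0.8:
        # Assumption: the boundary values 0.5 and 0.8 count as moderate.
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


print(describe_correlation(0.47))   # prints: weak positive
print(describe_correlation(-0.72))  # prints: moderate negative
print(describe_correlation(0.96))   # prints: strong positive
```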

Examples

Example 3

Given the following data:

x | 1 | 4 | 7 | 10 | 13 | 16 | 19
y | 4 | 4.25 | 4.55 | 4.4 | 4.45 | 4.75 | 4.2
a

Calculate the correlation coefficient and give your answer to two decimal places.

Worked Solution
Create a strategy

Use technology to find the correlation coefficient.

Apply the idea

Using the Statistics mode in your calculator, enter each x-value along with its y-value into a data table on your calculator then find the linear regression.

Look for the correlation coefficient (r):r=0.47

b

Choose the best description of this correlation.

A
Weak negative
B
Strong positive
C
Moderate negative
D
Weak positive
E
Strong negative
F
Moderate positive
Worked Solution
Create a strategy

Use the figure below to identify the best description of the correlation:

A number line showing values and descriptions of correlation from negative 1 to 1. Ask your teacher for more information.
Apply the idea

The value of r=0.47 is positive, and between 0 and 0.5. So there is a weak positive correlation between the variables.

The correct answer is D.

Idea summary

The correlation coefficient, r, tells us the strength and direction of the correlation between two variables.

If r is negative the direction of the correlation is negative. If r is positive the direction of the correlation is positive.

The strength of the correlation depends on the size of r as shown below:

A number line showing values and descriptions of correlation from negative 1 to 1. Ask your teacher for more information.

Coefficient of determination

The coefficient of determination is written r^2. As the notation suggests, it is the square of the correlation coefficient (r), and it also measures the relationship between two variables.

r^2 tells us the proportion of the variation in the response variable (y) that can be explained by the variation in the explanatory variable (x).

For example, if r^2=0.92 then we can say that 92\% of the variation in the response variable is explained by the variation in the explanatory variable.

Alternatively, we can say that 8\% of the variation in the response variable is not explained by the variation in the explanatory variable.

Unlike the correlation coefficient, the coefficient of determination is never negative, so it does not indicate whether there is a positive or negative relationship between variables.

The closer r^2 is to 1, the more the variation in the response variable is explained by the variation in the explanatory variable.

If we already have the value of r, we can square the value to get r^2.

So if r=0.8, then r^2=0.64, and if r=-0.9, then r^2=0.81.
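The conversions between r and r^2 can be sketched in a few lines of Python. The function names are hypothetical. Note that going from r^2 back to r requires knowing the direction of the relationship, because squaring discards the sign.

```python
import math


def coefficient_of_determination(r):
    """Square the correlation coefficient to get r^2."""
    return r ** 2


def correlation_from_r_squared(r_squared, positive=True):
    """Recover r from r^2; the sign must be supplied by the caller."""
    root = math.sqrt(r_squared)
    return root if positive else -root


print(round(coefficient_of_determination(0.8), 2))    # prints: 0.64
print(round(coefficient_of_determination(-0.9), 2))   # prints: 0.81
print(round(correlation_from_r_squared(0.80, positive=False), 4))  # prints: -0.8944
```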

If we don't already have the value of r, our calculators will calculate it for us. In fact, the r^2 value is given on the same screen as the r value.

Just as with r, even when two variables have a strong relationship (that is, r^2 is close to 1), we cannot conclude that one variable causes change in the other variable.

Examples

Example 4

A linear association between two data sets is such that the coefficient of determination, r^2, is 0.80.

a

Calculate the correlation coefficient if the relationship is negative. Give your answer to four decimal places.

Worked Solution
Create a strategy

Take the negative square root of the coefficient of determination.

Apply the idea

Remember that the relationship in this example is negative.

r = -\sqrt{0.80}    (Take the negative square root of r^2)
  = -0.8944    (Evaluate)
b

Hence choose the option that best describes the strength of the relationship.

A
Weak
B
Moderate
C
Strong
Worked Solution
Create a strategy

Use the figure below to identify the best description of the correlation:

A number line showing values and descriptions of correlation from negative 1 to 1. Ask your teacher for more information.
Apply the idea

The correlation coefficient is close to -1, so there is a strong relationship.

The correct answer is C.

Example 5

A linear association between two data sets is such that the correlation coefficient is -0.72.

What proportion of the variation can be explained by the linear relationship? Give your answer to the nearest percent.

Worked Solution
Create a strategy

The proportion of variation is also the same as the coefficient of determination. So, we square the correlation coefficient.

Apply the idea
r^2 = (-0.72)^2    (Square -0.72)
    = 0.5184 \approx 0.52    (Evaluate and round to two decimal places)

So 52\% of the variation can be explained by the linear relationship.

Example 6

The heights (in \text{cm}) and the weights (in \text{kg}) of 8 primary school children are shown on the scattergraph below.

A scattergraph of Weight (vertical axis, from 45 to 60) against Height (horizontal axis, from 110 to 145). Ask your teacher for more information.
a

Calculate the value of the coefficient of determination. Give your answer to two decimal places.

Worked Solution
Create a strategy

Use the linear regression function on your calculator.

Apply the idea

Using the Statistics mode, enter each x-coordinate along with its y-coordinate into a data table on your calculator then find the linear regression.

Look for the coefficient of determination (r^2):r^2=0.93

b

Calculate the value of the correlation coefficient. Give your answer to two decimal places.

Worked Solution
Create a strategy

Take the positive square root of the coefficient of determination, because the relationship between height and weight here is positive.

Apply the idea
r = \sqrt{0.93}    (Take the square root of 0.93)
  = 0.96    (Evaluate)
Reflect and check

Or we could use the same procedure as in part (a), and look for the correlation coefficient (r): r=0.96

c

What percentage of the variation in weight is accounted for by the height of the child? Give your answer to the nearest whole percent.

Worked Solution
Create a strategy

Convert the coefficient of determination into a percentage.

Apply the idea

We have already found a value for the coefficient of determination (as a decimal).

0.93 = 0.93 \times 100\%    (Multiply by 100\%)
     = 93\%    (Evaluate)
d

Consider these two comments on the claim “The weight of a child is primarily influenced by their height.”

Which do you think is most correct?

A
This claim is valid and is supported by the strong relationship between the two variables.
B
While this claim is supported by a strong relationship between the two variables, we cannot state the causality as there may be other factors influencing the outcome.
Worked Solution
Create a strategy

Consider your answers to previous parts of this question and choose the statement that you think is most correct.

Apply the idea

The claim describes a strong relationship as the coefficient of determination, 0.93, is close to 1. But a high coefficient of determination (or correlation) value does not necessarily imply causation.

So the correct answer is B.

Idea summary

r^2 tells us the proportion of the variation in the response variable (y) that can be explained by the variation in the explanatory variable (x).

Outcomes

ACMGM054

calculate and interpret the correlation coefficient (r) to quantify the strength of a linear association

ACMGM057

model a linear relationship by fitting a least-squares line to the data

ACMGM059

interpret the intercept and slope of the fitted line

ACMGM060

use the coefficient of determination to assess the strength of a linear association in terms of the explained variation
