Australia, VIC
VCE 12 General 2023

3.01 Least squares regression line

Lesson

Introduction

A line of best fit is a straight line that best represents bivariate data on a scatter plot. This line may pass through some of the points, none of the points, or all of the points. Once we have determined the equation of a line of best fit, we can use it to make predictions.

The least squares line of best fit

The most common method of fitting a line of best fit to a scatter plot of bivariate data is using the least squares method. This is then called a least squares line of best fit, or sometimes a Least Squares Regression Line.

In a sentence, it is a technique that minimises the sum of the squares of the vertical distances from the data points to a straight line. Perhaps an easier way to understand it is through a demonstration.

Exploration

Experiment with this GeoGebra applet.

  1. Refresh the applet and generate a new scatter plot to experiment with.

  2. Drag the slider to Stage 2 and move the blue dots around the rectangle until you have a line which you think is the Line of Best Fit.

  3. Drag the slider to Stage 3 to see what are called residuals. For now, think of these as the distance between the line and the actual data.

  4. Drag the slider to Stage 4. Here you see the residuals turn to squares. Now move the blue dots on the line around again and try to make the total area of all the squares combined as small as possible. (Note that, depending on the scale, these squares may appear rectangular.)

  5. Drag the slider to Stage 5. Your blue Line of Best Fit should be very close to the green Least Squares Regression Line. And the sum of your squares should be very close to the least sum of the squares.


In Stage 2, we try to position the line so that roughly equal numbers of data points lie above and below it. In Stage 4, we minimise the total area of all the squares combined. By doing this, when we move to Stage 5 our line should be very close to the green line, which is the least squares regression line.

The equation of a least squares line of best fit is the same as the equation of a straight line, but is usually presented in a different form. We may recall the equation of a straight line as y=mx+c or y=ax+b.

However, a least squares line is usually written as y=a+bx or RV=a+b\times EV, where RV stands for the Response variable and EV stands for the Explanatory variable.
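In this notation, the quantity that the least squares method minimises can be written out explicitly. The following is a sketch that assumes the data points are labelled (x_1,y_1) to (x_n,y_n); this labelling is introduced here for illustration and is not part of the lesson's notation.

```latex
\text{residual}_i = y_i - (a + b x_i), \qquad
\text{sum of squares} = \sum_{i=1}^{n} \bigl(y_i - (a + b x_i)\bigr)^2
```

The least squares line is the choice of a and b that makes this sum as small as possible.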

Plotting a least squares line on a graph is the same as plotting the graph of a straight line, and finding the equation of a least squares line from a graph is the same as finding the equation of a straight line from a graph.

Most of the time, you will be required to calculate the equation of the Least Squares Regression Line using technology.

Here's a video on how to use the TI-Nspire to create a scatter graph and calculate the equation of the Least Squares Regression Line.


Examples

Example 1

The table shows the number of people who went to watch a movie x weeks after it was released.

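\begin{array}{|l|c|c|c|c|c|c|c|}
\hline
\text{Weeks }(x) & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\
\hline
\text{Number of people }(y) & 37 & 37 & 33 & 33 & 29 & 29 & 25 \\
\hline
\end{array}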
a

Plot the points from the table.

Worked Solution
Create a strategy

Plot each x-value along with its corresponding y-value.

Apply the idea
[Scatter plot of the data: x-axis marked from 1 to 8, y-axis marked from 25 to 40.]

The points from the table have the coordinates (1,37), \,(2,37), \,(3,33), \,(4,33), \,(5,29), \,(6,29), \,(7,25).

b

If a line of best fit were drawn to approximate the relationship, which of the following could be its equation?

A
y=-2x+40
B
y=2x+40
C
y=-2x
D
y=2x
Worked Solution
Create a strategy

Check the trend in the scatterplot.

Apply the idea

We can see that the trend in the scatterplot is decreasing, which means the line has a negative gradient, so options B and D are incorrect. Option C is also incorrect because it has a y-intercept of zero and so gives negative values of y for every positive x, whereas all of the data values are positive.

So the correct answer is option A.
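The same conclusion can be checked numerically. The sketch below is illustrative only: it totals the squared vertical distances from the data points to each of the four candidate lines, and the function and variable names are assumptions made for this example.

```python
# Data from the table in Example 1: weeks since release (x), attendance (y).
weeks = [1, 2, 3, 4, 5, 6, 7]
people = [37, 37, 33, 33, 29, 29, 25]

# The four candidate lines from part (b), written as (gradient, y-intercept).
options = {
    "A": (-2, 40),
    "B": (2, 40),
    "C": (-2, 0),
    "D": (2, 0),
}


def sum_of_squared_residuals(gradient, intercept, xs, ys):
    """Add up the squared vertical distances from each point to the line."""
    return sum((y - (gradient * x + intercept)) ** 2 for x, y in zip(xs, ys))


totals = {
    label: sum_of_squared_residuals(m, c, weeks, people)
    for label, (m, c) in options.items()
}
best_option = min(totals, key=totals.get)

for label, total in totals.items():
    print(label, total)
print("Smallest total:", best_option)
```

Option A gives a total of 7, far smaller than the totals for the other three options, which agrees with the reasoning from the trend.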

c

Graph the line of best fit whose equation is given by y=-2x+40.

Worked Solution
Create a strategy

To graph the line, identify any two points that satisfy the equation. One point may be the y-intercept.

Apply the idea

By substituting x=0 into the equation, we have: \begin{aligned} y&=-2(0)+40 \\ y&=40 \end{aligned}

For a second point, substituting x=2, we have: \begin{aligned} y&=-2(2)+40 \\ y&=36 \end{aligned}

[Graph of the data with the line y=-2x+40: x-axis marked from 1 to 8, y-axis marked from 25 to 40.]

Here is the scatterplot with its line of best fit.

Reflect and check

We can see that the line of best fit follows the trend of the data and has a roughly equal number of points above and below the line (three above and four below).
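As a further check, the least squares line can be calculated directly from the data in the table. The sketch below is a minimal illustration and not the method used in this lesson, which relies on a calculator. It uses a standard sum-of-products formula for the slope, and the function name is an assumption made for this example.

```python
from statistics import mean

# Data from the table in Example 1.
weeks = [1, 2, 3, 4, 5, 6, 7]
people = [37, 37, 33, 33, 29, 29, 25]


def least_squares_line(xs, ys):
    """Return (intercept a, slope b) of the least squares line y = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of products of deviations, and sum of squared deviations in x.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    # The intercept is chosen so that the line passes through (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope


a, b = least_squares_line(weeks, people)
print(f"y = {a:.2f} + ({b:.2f})x")
```

For this data the slope is -2 and the intercept is approximately 39.86, so the given line y=-2x+40 is a close approximation to the least squares line.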

d

Use the equation of the line of best fit to predict the number of people who went to watch the movie 12 weeks after it was released.

Worked Solution
Create a strategy

Substitute x=12 into the equation.

Apply the idea
\begin{aligned} \text{Number of people} &= -2(12)+40 && \text{Substitute } x=12 \\ &= -24+40 && \text{Perform the multiplication} \\ &= 16 && \text{Evaluate} \end{aligned}
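The substitution can also be sketched in code. The example below is illustrative, with a hypothetical function name. It also checks whether week 12 lies within the weeks covered by the table, because a prediction made outside the range of the data is generally less reliable than one made inside it.

```python
def predict_people(weeks_since_release):
    """Evaluate the line of best fit y = -2x + 40 from part (c)."""
    return -2 * weeks_since_release + 40


observed_weeks = range(1, 8)  # the table covers weeks 1 to 7

prediction = predict_people(12)
is_outside_data = 12 not in observed_weeks

print(prediction)       # 16
print(is_outside_data)  # True
```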
Idea summary

The equation of a straight line is usually written as:

  • y=mx+c

  • y=ax+b

The equation of a least squares regression line is usually written as:

  • y=a+bx

  • RV=a+b\times EV

    • where RV stands for the Response variable and EV stands for the Explanatory variable

Calculate the least squares line from summary statistics

If we aren't given the data set but instead have certain statistics calculated from the data set, we can still calculate the equation of the least squares regression line.

We will use the following formulae: y=a+bx,\,b=r\dfrac{s_y}{s_x}, and a=\overline{y}-b\overline{x} where s_x is the standard deviation of x, \,s_y is the standard deviation of y, \,\overline{x} is the mean of x, \,\overline{y} is the mean of y, and r is the correlation coefficient.
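One consequence of the intercept formula is worth verifying: the least squares line always passes through the point of means (\overline{x},\overline{y}). A short check, substituting x=\overline{x} into y=a+bx and using a=\overline{y}-b\overline{x}:

```latex
\begin{aligned}
y &= a + b\overline{x} \\
  &= (\overline{y} - b\overline{x}) + b\overline{x} \\
  &= \overline{y}
\end{aligned}
```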

Examples

Example 2

A bivariate data set contains 10 data points with the following summary statistics:

  • \overline{x}=5.13

  • s_x=2.85

  • \overline{y}=18.81

  • s_y=7.54

  • r=0.993

a

Calculate the slope of the least squares regression line. Give your answer to two decimal places.

Worked Solution
Create a strategy

Use the formula: b=\dfrac{rs_y}{s_x}

Apply the idea
\begin{aligned} b &= \dfrac{0.993 \times 7.54}{2.85} && \text{Substitute } r=0.993,\,s_y=7.54,\,s_x=2.85 \\ &= 2.63 && \text{Evaluate using your calculator} \end{aligned}
b

Using the rounded value of the previous part, calculate the vertical intercept of the least squares regression line. Give your answer to two decimal places.

Worked Solution
Create a strategy

Use the formula: a=\overline{y}-b\overline{x}

Apply the idea

In part (a), we found the slope of the line to be 2.63.

\begin{aligned} a &= 18.81-2.63 \times 5.13 && \text{Substitute } \overline{y}=18.81,\,b=2.63,\,\overline{x}=5.13 \\ &= 5.32 && \text{Evaluate using your calculator} \end{aligned}
c

State the equation of the least squares regression line.

Worked Solution
Create a strategy

Use the formula: y=a+bx

Apply the idea

Substitute the values we found in parts (a) and (b): y=5.32+2.63x
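The three parts of this example can be reproduced with a short script. This is a minimal sketch with a hypothetical function name. It follows the example by rounding the slope to two decimal places before using it to calculate the intercept.

```python
def line_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (a, b) for the line y = a + b*x from summary statistics.

    The slope is rounded to two decimal places before the intercept
    is calculated, as Example 2 part (b) asks.
    """
    b = round(r * s_y / s_x, 2)      # b = r * s_y / s_x
    a = round(y_bar - b * x_bar, 2)  # a = y_bar - b * x_bar
    return a, b


a, b = line_from_summary(x_bar=5.13, s_x=2.85, y_bar=18.81, s_y=7.54, r=0.993)
print(f"y = {a} + {b}x")  # y = 5.32 + 2.63x
```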

Example 3

The equation for the line of best fit is given by P=-4t+116, where t is time. This relationship shows that over time, P is:

A
remaining constant
B
decreasing
C
increasing
Worked Solution
Create a strategy

Recall that the sign of the gradient determines the trend of the data.

Apply the idea

In the equation, the gradient is the coefficient of the variable t. In this case, it is -4. Because the gradient is negative, P decreases as t increases. So the correct option is B.
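To see the decrease concretely, we can evaluate P at two consecutive values of t. The values t=0 and t=1 are chosen here only for illustration.

```latex
\begin{aligned}
t = 0: \quad P &= -4(0) + 116 = 116 \\
t = 1: \quad P &= -4(1) + 116 = 112
\end{aligned}
```

Each increase of 1 in t lowers P by 4, which matches the gradient of -4.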

Idea summary

The following are the formulae for calculating the least squares regression line: y=a+bx \qquad b=r\dfrac{s_y}{s_x}\qquad a=\overline{y}-b\overline{x}

where:

  • a= the y-intercept

  • b= the slope

  • s_x= the standard deviation of x

  • s_y= the standard deviation of y

  • \overline{x}= the mean of x

  • \overline{y}= the mean of y

  • r= the correlation coefficient

Outcomes

U3.AoS1.23

answer statistical questions that require a knowledge of the associations between pairs of variables

U3.AoS1.11

least squares line and its use in modelling linear associations

U3.AoS1.25

use the least squares line of best fit to model and analyse the linear association between two numerical variables and interpret the model in the context of the association being modelled
