
2.03 Residuals

Lesson

Residuals

When we want to analyse the fit of a linear model to a bivariate set of data, we start by analysing the value of the correlation coefficient, r.

However, a strong value for r does not guarantee that the relationship is linear. Looking at the data more closely, we might realise that it actually follows a curve.

The scatter plots below illustrate this idea. Both sets of data have a strong correlation, with similar lines of best fit.

On the left, the data points appear to be scattered randomly above and below the least-squares regression line. This randomness is expected when the linear model is suitable for the data.

The image shows two scatter plots of Profit (vertical axis, 10 to 40) against Sales (horizontal axis, 20 to 80), each with a line of best fit. The left plot is titled "Linear data" and the right plot is titled "Non-linear data".

However, on the right, the scatter plot shows a distinct pattern in the arrangement of the data points: they start below the line of best fit, then sit above the line, before returning below it. Any pattern such as this suggests that the linear model is not appropriate.

To help us recognise any patterns and determine the suitability of a linear model, we can use a tool called a residual plot.

Residuals are the vertical distances from each data point to a line. When your calculator determines the least-squares regression line, it is minimising the residuals (actually the sum of the squares of the residuals) to choose the optimal coefficients for the line of best fit.

The scatter plots below show how the residuals are short when the line of best fit is chosen appropriately, and longer for a line that is a poor fit to the data.

Good fit (least-squares regression line)

A scatter plot showing the residuals from each point to the line of best fit, which are mostly small. Ask your teacher for more information.

Poor fit

A scatter plot showing the residuals from each point to a poorly fitting line, which are mostly large. Ask your teacher for more information.

Exploration

Experiment with this interactive tool to practise finding a good fit for data. The aim is to minimise the sum of the squares of the residual values.


The closer the regression line is to the data points, the smaller the residuals are.
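To see what "minimising the sum of the squares of the residuals" means in practice, here is a minimal Python sketch. It uses the standard closed-form formulas for the slope and intercept of the least-squares line, which the lesson leaves to technology. The data values and function names are hypothetical and chosen only for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Add up (actual - predicted)^2 over all data points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)
steeper = sum_squared_residuals(xs, ys, slope + 0.5, intercept)
higher = sum_squared_residuals(xs, ys, slope, intercept + 1)

print(f"least-squares line: y = {slope:.2f}x + {intercept:.2f}")
print(f"sum of squared residuals: {best:.3f}")
print(f"with a steeper line: {steeper:.3f}")
print(f"with a higher line: {higher:.3f}")
```

Changing either the slope or the intercept away from the least-squares values makes the sum of squared residuals larger, which is the same effect the interactive tool lets us explore by hand.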

To calculate a residual for a data value:

\text{Residual} = \text{Actual value} - \text{Predicted value}

\text{Residual} = y - \hat{y}

Remember that the predicted value, \hat{y}, is obtained from the equation of the least-squares regression line (or the y-coordinate of the corresponding point on the line of best fit).

A positive residual means the actual data point is above the least-squares regression line, and a negative residual means the actual data point is below the line.

Using the above relationships between the residual, actual value and predicted values, we are able to calculate any one of these values if we know the other two.

For instance, if the predicted value is 22 and the actual value is 19, then we can calculate the residual: \begin{aligned} \text{Residual} &= y - \hat{y} \\ &= 19-22 \\ &= -3 \end{aligned}

If the residual is equal to 5 and the predicted value is 18, then we can calculate the actual value, with some rearranging to solve the equation: \begin{aligned} \text{Residual} &= y - \hat{y} \\ 5 &= y-18 \\ y &= 5+18 \\ &= 23 \end{aligned}

Similarly, if the residual is equal to -7 and the actual value is 4, then we can calculate the predicted value (without knowing the equation of the least-squares regression line): \begin{aligned} \text{Residual} &= y - \hat{y} \\ -7 &= 4-\hat{y} \\ \hat{y} &= 4+7 \\ &= 11 \end{aligned}
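The three calculations above can also be checked with a short Python sketch. The function names are illustrative only and are not part of the lesson.

```python
def residual(actual, predicted):
    """Residual = actual value - predicted value."""
    return actual - predicted


def actual_from(residual_value, predicted):
    """Rearranged: actual value = residual + predicted value."""
    return residual_value + predicted


def predicted_from(residual_value, actual):
    """Rearranged: predicted value = actual value - residual."""
    return actual - residual_value


print(residual(19, 22))       # -3
print(actual_from(5, 18))     # 23
print(predicted_from(-7, 4))  # 11
```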

Examples

Example 1

The following table shows the sets of data (x,\, y) and the predicted \hat{y}-values based on a least-squares regression line. Complete the table by finding the residuals.

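\begin{array}{c|ccccc}
x & 1 & 3 & 5 & 7 & 9 \\ \hline
y & 22.7 & 22.3 & 24.2 & 21.8 & 21.5 \\ \hline
\hat{y} & 25.2 & 23.4 & 21.6 & 19.8 & 18 \\ \hline
\text{Residuals} & & & & &
\end{array}
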
Worked Solution
Create a strategy

Use the formula: \text{Residual} = y - \hat{y}

Apply the idea

Solving for the first column:

\begin{aligned} \text{Residual} &= 22.7 - 25.2 && \text{Substitute the values} \\ &= -2.5 && \text{Evaluate} \end{aligned}

We can use the same process to find the remaining values shown in the completed table:

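\begin{array}{c|ccccc}
x & 1 & 3 & 5 & 7 & 9 \\ \hline
y & 22.7 & 22.3 & 24.2 & 21.8 & 21.5 \\ \hline
\hat{y} & 25.2 & 23.4 & 21.6 & 19.8 & 18.0 \\ \hline
\text{Residuals} & -2.5 & -1.1 & 2.6 & 2.0 & 3.5
\end{array}
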
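As a cross-check on the completed table, the same subtraction can be applied to every column at once. This is a minimal Python sketch with illustrative variable names. The rounding to one decimal place is an assumption made here to match the precision of the table and to tidy floating-point error.

```python
y_actual = [22.7, 22.3, 24.2, 21.8, 21.5]
y_predicted = [25.2, 23.4, 21.6, 19.8, 18.0]

# Residual = actual value - predicted value, rounded to one decimal place
residuals = [round(y - y_hat, 1) for y, y_hat in zip(y_actual, y_predicted)]

print(residuals)  # [-2.5, -1.1, 2.6, 2.0, 3.5]
```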
Idea summary

Line of best fit - The line which most closely models a set of bivariate data.

Least-squares regression - A technique for finding the line of best fit, which would then be called the least-squares regression line. This technique involves minimising the sum of the squares of the residuals, which is best done with technology.

Residual - The vertical distance between a data point and the line of best fit.

Calculating residuals:

\text{Residual} = \text{Actual value} - \text{Predicted value}

\text{Residual} = y - \hat{y}

Remember that the predicted value, \hat{y}, is obtained from the equation of the least-squares regression line.

Residual plot

To help us recognise any patterns and determine the suitability of a linear model, we can use a tool called a residual plot.

Exploration

To explore further, use this applet to move the points and see how the residuals are measured vertically from the least-squares regression line. Then switch to the "Residuals" view to see how these residuals can be converted to a residual plot.


Residual plots show residual values on the y-axis and the independent variable on the x-axis.

The pattern in a residual plot can help you determine what is wrong with your model. For example, it may reveal clear outliers in the data, or a pattern in the residuals that the linear model fails to capture, which makes its predictions inaccurate.

Examples

Example 2

The residual plot for a set of data is shown below.

The image shows a residual plot with x from 1 to 10 on the horizontal axis and residuals from -5 to 5 on the vertical axis (labelled y).

Which of these scatter plots shows the original data set?

The image shows four scatter plots labelled A, B, C and D, each with a line of best fit. Each plot has x from 2 to 10 on the horizontal axis and y from 5 to 20 on the vertical axis.

Worked Solution
Create a strategy

Check whether each residual lies above or below the horizontal axis, then match this to whether each data point lies above or below the line of best fit.

Apply the idea

Looking at the residual plot, we can see three residuals above and two below the x-axis.

Only options A and B have three data points above the line of best fit, and two below. So the other options are incorrect.

If we look at the first data point on the residual plot, we see that it is above the x-axis. So the first data point, at x=3, should be above the line.

Of options A and B, only option A has its first data point above the line of best fit, so option A is correct.

Idea summary

Residual plot - A graph that displays the residual for each point, rather than the actual data points.

If the residual data points are above the x-axis, then the original data points should be above the line of best fit.

If the residual data points are below the x-axis, then the original data points should be below the line of best fit.

Construct and analyse a residual plot

Once we've plotted our residuals against the independent variable, we want to analyse the plot for the suitability of using a linear regression model.

If the linear model is suitable:

  • The residuals are randomly scattered above and below the horizontal axis

  • No clustering of the residuals

  • Residuals are relatively small in size

If the linear model is not suitable:

  • The residual plot will show a clear pattern and/or

  • The residuals are relatively large in size

Here are some examples where the residual plot indicates that a linear model is suitable or not.

The image shows three different residual plots. Ask your teacher for more information.

Linear model is suitable

The image shows three different residual plots. Ask your teacher for more information.

Linear model is not suitable

Examples

Example 3

The table shows a company's costs y (in millions) in week x. The equation y=5x+12 is being used to model the data.

a

Complete the table of residuals:

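\begin{array}{c|c|c|c}
x & y & \text{Model value} & \text{Residual} \\ \hline
1 & 22 & & \\
2 & 25 & & \\
4 & 33 & & \\
6 & 39 & & \\
9 & 53 & & \\
12 & 69 & & \\
14 & 81 & & \\
17 & 99 & &
\end{array}
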
Worked Solution
Create a strategy

To calculate the model value, substitute each value of x into the model equation y=5x+12.

To calculate the residual value, subtract the model value from y.

Apply the idea

Solving for the first row:

\begin{aligned} \text{Model value} &= 5 \times 1 + 12 && \text{Substitute } x=1 \\ &= 17 && \text{Evaluate} \\ \text{Residual} &= 22-17 && \text{Subtract the model value from } y \\ &= 5 && \text{Evaluate} \end{aligned}

The same process can be done for the remaining values shown in the completed table:

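\begin{array}{c|c|c|c}
x & y & \text{Model value} & \text{Residual} \\ \hline
1 & 22 & 17 & 5 \\
2 & 25 & 22 & 3 \\
4 & 33 & 32 & 1 \\
6 & 39 & 42 & -3 \\
9 & 53 & 57 & -4 \\
12 & 69 & 72 & -3 \\
14 & 81 & 82 & -1 \\
17 & 99 & 97 & 2
\end{array}
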
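The completed table can be reproduced with a short Python sketch that applies the model y=5x+12 to each week and subtracts the result from the actual cost. The variable and function names are illustrative only.

```python
def model(x):
    """Model value from the equation y = 5x + 12."""
    return 5 * x + 12


weeks = [1, 2, 4, 6, 9, 12, 14, 17]
costs = [22, 25, 33, 39, 53, 69, 81, 99]

model_values = [model(x) for x in weeks]
# Residual = actual value - model value
residuals = [y - m for y, m in zip(costs, model_values)]

for x, y, m, r in zip(weeks, costs, model_values, residuals):
    print(x, y, m, r)
```

Pairing each week with its residual gives the coordinates needed for a residual plot.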
b

Plot the residuals on the scatter plot.

Worked Solution
Create a strategy

Plot each value of x along with its residual value.

Apply the idea

The points we are plotting have coordinates: (1,5),(2,3),(4,1),(6,-3),(9,-4),(12,-3),(14,-1),(17,2).

Here's the residual plot:

The image shows the residual plot, with x from 2 to 18 on the horizontal axis and the residual from -5 to 5 on the vertical axis.

c

Is this model a good fit for the data?

Worked Solution
Create a strategy

Check whether the residual plot in part (b) shows a pattern.

Apply the idea

No, the linear model is not a good fit for this data as there is a parabolic pattern present in the residual plot.
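One rough way to make the pattern concrete is to look at the signs of the residuals in order of x. The Python sketch below is only an informal heuristic added here for illustration, not a formal test and not part of the lesson's method. It counts how many runs of consecutive same-sign residuals there are.

```python
# Residuals from the completed table, in order of increasing x
residuals = [5, 3, 1, -3, -4, -3, -1, 2]

# "+" for a point above the line, "-" for a point below it
# (none of these residuals is zero, so that case is not handled here)
signs = ["+" if r > 0 else "-" for r in residuals]

# A new run starts each time the sign changes from one point to the next
runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)

print("".join(signs))  # +++----+
print(runs)            # 3
```

The residuals are positive, then negative, then positive again, in just three long runs. Randomly scattered residuals would usually switch sign more often, so this is consistent with the curved pattern seen in the plot.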

Idea summary

If the linear model is suitable:

  • The residuals are randomly scattered above and below the horizontal axis

  • No clustering of the residuals

  • Residuals are relatively small in size

If the linear model is not suitable:

  • The residual plot will show a clear pattern and/or

  • The residuals are relatively large in size

Outcomes

ACMGM058

use a residual plot to assess the appropriateness of fitting a linear model to the data
