2.03 Residuals

Lesson

When we want to analyse the fit of a linear model to a bivariate set of data, we start by analysing the value of the correlation coefficient. 

However, we might find a strong value for $r$ and yet, on looking at the data more closely, realise that the relationship is not actually linear but curved.

The scatter plots below illustrate this idea.  Both sets of data have a strong correlation, with similar lines of best fit. 

On the left, the data points appear to be scattered randomly above and below the least-squares regression line.  This randomness is expected when the linear model is suitable for the data.

However, on the right, the scatter plot shows a distinct pattern in the arrangement of the data points: they start below the line of best fit, move above the line, and then return below it.  Any pattern such as this suggests that the linear model is not appropriate.

Linear data

Non-linear data

 

To help us recognise any patterns and determine the suitability of a linear model, we can use a tool called a residual plot.

Residuals are the vertical distances from each data point to a line. When your calculator determines the least-squares regression line, it is minimising the residuals (actually the sum of the squares of the residuals) to choose the optimal coefficients for the line of best fit.
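As a rough illustration of what "minimising the sum of the squares of the residuals" means, the sketch below fits a line to a small made-up data set (the numbers are not from this lesson) using the standard formulas for the slope and intercept, then compares its sum of squared residuals with that of a slightly different line. The function names are illustrative only.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of the squared vertical distances from the points to the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Made-up illustrative data (an assumption, not taken from the lesson).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = least_squares_line(xs, ys)
best = sum_squared_residuals(xs, ys, slope, intercept)

# A line with a different slope should give a larger total.
other = sum_squared_residuals(xs, ys, slope + 0.5, intercept)
print(best < other)  # True
```

Changing the slope or intercept in any direction away from the least-squares values increases the total, which is why the calculator reports that particular line.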

The scatter plots below show how the residuals are short when the line of best fit is chosen appropriately, and longer for a line that is a poor fit to the data.

Good fit (least-squares regression line)

Poor fit

 

To explore further, use this applet to move the points and see how the residuals are measured vertically from the least-squares regression line.  Then switch to the "Residuals" view to see how these residuals can be converted to a residual plot.

 

Summary

Line of best fit - The line which most closely models a set of bivariate data.

Least-squares regression - A technique for finding the line of best fit, which would then be called the least-squares regression line.  This technique involves minimising the sum of the squares of the residuals, which is best done with technology.

Residual - The vertical distance between a data point and the line of best fit.

Residual plot - A graph that displays the residual for each point, rather than the actual data points.

 

Calculating residuals

$\text{Residual}=\text{Actual value}-\text{Predicted value}$

$\text{Residual}=y-\hat{y}$

Remember that the predicted value, $\hat{y}$, is obtained from the equation of the least-squares regression line.

 

A positive residual means the actual data point is above the least-squares regression line, and a negative residual means the actual data point is below the line.

Using the above relationships between the residual, actual value and predicted values, we are able to calculate any one of these values if we know the other two. 

For instance, if the predicted value is $22$ and the actual value is $19$, then we can calculate the residual:

$\text{residual}=y-\hat{y}$
$\text{residual}=19-22$
$\text{residual}=-3$

If the residual is equal to $5$ and the predicted value is $18$, then we can calculate the actual value, with some rearranging to solve the equation:

$\text{residual}=y-\hat{y}$
$5=y-18$
$y=5+18$
$y=23$

Similarly, if the residual is equal to $-7$ and the actual value is $4$, then we can calculate the predicted value (without knowing the equation of the least-squares regression line):

$\text{residual}=y-\hat{y}$
$-7=4-\hat{y}$
$\hat{y}=4+7$
$\hat{y}=11$
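The three rearrangements above can be sketched as small Python functions. The function names are illustrative only; each one is just $\text{residual}=y-\hat{y}$ solved for a different quantity.

```python
def residual(actual, predicted):
    """residual = actual value - predicted value"""
    return actual - predicted


def actual_from(residual_value, predicted):
    """Rearranged: actual value = residual + predicted value"""
    return residual_value + predicted


def predicted_from(residual_value, actual):
    """Rearranged: predicted value = actual value - residual"""
    return actual - residual_value


print(residual(19, 22))       # -3
print(actual_from(5, 18))     # 23
print(predicted_from(-7, 4))  # 11
```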

 

Practice question

Question 1

The following table shows the sets of data $\left(x,y\right)$ and the predicted $\hat{y}$ values based on a least-squares regression line. Complete the table by finding the residuals.

| $x$-values | $1$ | $3$ | $5$ | $7$ | $9$ |
| --- | --- | --- | --- | --- | --- |
| $y$-values | $22.7$ | $22.3$ | $24.2$ | $21.8$ | $21.5$ |
| $\hat{y}$ | $25.2$ | $23.4$ | $21.6$ | $19.8$ | $18$ |
| Residuals | | | | | |
     

Question 2

The residual plot for a set of data is shown below.

  1. Which of these scatter plots shows the original data set?

    A

    B

    C

    D


 

Constructing a residual plot


Worked example

The table shows a company's profit $P$ (in millions of dollars) for total monthly sales $S$. The equation $P=0.4S-10$ is being used to model the data.

(a) Complete the table with the predicted profit and residuals, based on the linear model.

| Sales $S$ | Profit $P$ | Predicted profit $\hat{P}$ | Residual $P-\hat{P}$ |
| --- | --- | --- | --- |
| $30$ | $-8$ | | |
| $80$ | $24$ | | |
| $50$ | $12$ | | |
| $100$ | $23$ | | |
| $60$ | $17$ | | |
| $70$ | $23$ | | |
| $90$ | $24$ | | |
| $40$ | $3$ | | |

 

 

Think: Calculate the predicted value $\hat{P}$ and the residual for each of the given $S$ values.

Do: The residual is calculated using the formula $\text{residual}=y-\hat{y}$.

  1. Calculate the predicted values $\hat{P}$:
    • substitute each value of $S$ into the equation of the linear model.
  2. Calculate the residual values:
    • subtract the predicted value $\hat{P}$ from the corresponding actual $P$ value.

The required substitutions and calculations for the first row are:

$\hat{P}=0.4\times30-10$
$\hat{P}=2$
$\text{residual}=-8-2$
$\text{residual}=-10$

 

The remaining values are shown in the completed table:

| Sales $S$ | Profit $P$ | Predicted profit $\hat{P}$ | Residual $P-\hat{P}$ |
| --- | --- | --- | --- |
| $30$ | $-8$ | $2$ | $-10$ |
| $80$ | $24$ | $22$ | $2$ |
| $50$ | $12$ | $10$ | $2$ |
| $100$ | $23$ | $30$ | $-7$ |
| $60$ | $17$ | $14$ | $3$ |
| $70$ | $23$ | $18$ | $5$ |
| $90$ | $24$ | $26$ | $-2$ |
| $40$ | $3$ | $6$ | $-3$ |
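The completed table can be checked with a short Python sketch. It uses the sales and profit values from part (a) and the given model $P=0.4S-10$; the variable and function names are illustrative only, and values are rounded to one decimal place to avoid floating-point noise.

```python
# Sales S and actual profit P from the table in part (a).
sales = [30, 80, 50, 100, 60, 70, 90, 40]
profit = [-8, 24, 12, 23, 17, 23, 24, 3]


def predicted_profit(s):
    """Predicted profit from the given linear model P = 0.4S - 10."""
    return 0.4 * s - 10


predicted = [round(predicted_profit(s), 1) for s in sales]
residuals = [round(p - p_hat, 1) for p, p_hat in zip(profit, predicted)]

for s, p, p_hat, r in zip(sales, profit, predicted, residuals):
    print(s, p, p_hat, r)
```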

 

(b) Construct a residual plot for the data in part (a).

Think: Each value of $S$ and the corresponding residual value will make up the coordinates for each point on the residual plot.

Do: Construct the graph, choosing appropriate scales and labelling the axes.  Take care to place each point accurately.

(c) Is this model a good fit for the data?  Justify your answer.

Think: If the linear model is a good fit, the residual plot should show a random scattering of points above and below $0$, with no obvious pattern.

Do: No, a linear model is not a good fit for this data as there is a pattern present in the residual plot.
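One informal way to see the pattern without drawing the graph is to list the residuals in order of increasing $S$ and look at their signs. The sketch below does this for the residuals from part (a). It is only an aid for spotting a pattern, not a formal test, and the variable names are illustrative.

```python
# Sales S and residuals from the completed table in part (a).
sales = [30, 80, 50, 100, 60, 70, 90, 40]
residuals = [-10, 2, 2, -7, 3, 5, -2, -3]

# Order the points from left to right, as they would appear on the plot.
points = sorted(zip(sales, residuals))

for s, r in points:
    print(s, r)

# Record whether each residual lies above (+) or below (-) the horizontal axis.
signs = "".join("+" if r > 0 else "-" if r < 0 else "0" for _, r in points)
print(signs)  # --++++--
```

The residuals start below the axis, rise above it, then fall below again. This is the kind of curved pattern that indicates a linear model is not a good fit.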

 

Residual plots with a calculator

Calculating residuals and constructing the residual plot manually for a large set of data is tedious, so we can use our CAS calculator to do this for us.

Select the brand of calculator you use below to work through an example of using a calculator to generate a residual plot.

 

Casio Classpad

How to use the CASIO Classpad to complete the following tasks regarding creating residual plots.

Consider the data set given below:

| $x$ | $2$ | $4$ | $5$ | $7$ | $11$ | $15$ | $16$ | $19$ | $22$ | $25$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $y$ | $1.5$ | $5.8$ | $6.9$ | $13.2$ | $20.0$ | $34.5$ | $34.7$ | $41.0$ | $49.2$ | $55.1$ |

  1. Use your calculator to generate the residual plot associated with the least-squares regression line for the data.

TI Nspire

How to use the TI Nspire to complete the following tasks regarding creating residual plots.

Consider the data set given below:

| $x$ | $2$ | $4$ | $5$ | $7$ | $11$ | $15$ | $16$ | $19$ | $22$ | $25$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $y$ | $1.5$ | $5.8$ | $6.9$ | $13.2$ | $20.0$ | $34.5$ | $34.7$ | $41.0$ | $49.2$ | $55.1$ |

  1. Use your calculator to generate the residual plot associated with the least-squares regression line for the data.
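If a CAS calculator is not to hand, the same residuals can be computed in Python. This is a minimal sketch, not a replacement for the calculator steps: it fits the least-squares regression line to the data set above using the standard slope and intercept formulas, then lists the residual for each $x$ value. The function name is illustrative only.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Data set from the calculator example above.
xs = [2, 4, 5, 7, 11, 15, 16, 19, 22, 25]
ys = [1.5, 5.8, 6.9, 13.2, 20.0, 34.5, 34.7, 41.0, 49.2, 55.1]

slope, intercept = least_squares_line(xs, ys)

# Residual = actual value - predicted value, for each data point.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

for x, r in zip(xs, residuals):
    print(f"x = {x:>2}   residual = {r:+.2f}")
```

Plotting each residual against its $x$ value gives the residual plot. For a least-squares line with an intercept, the residuals always sum to zero (up to rounding), which is a handy check on the working.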

 

Analysing the residual plot

Once we've plotted our residuals against the independent variable, we want to analyse the plot for the suitability of using a linear regression model.

Analysing residuals

If the linear model is suitable:

  • The residuals are randomly scattered above and below the horizontal axis
  • There is no clustering of the residuals
  • The residuals are relatively small in size

If the linear model is NOT suitable:

  • The residual plot will show a clear pattern and/or
  • The residuals are relatively large in size

 

Here are some examples where the residual plot indicates that a linear model is suitable or not.

Linear model is suitable

Linear model is NOT suitable

 

If we take a look at the image below, we see on the left a scatterplot and a linear regression line fitted to some data. On the right we see the residual plot for the data.

 

Were we to only look at the scatterplot and the strong correlation ($0.994$), we'd assume a linear model was appropriate. But when we examine the residual plot, there is certainly a pattern evident in the residuals (in this case, a parabolic pattern) and so we might need to rethink what sort of model might best suit this data.
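The same effect can be reproduced with a small made-up data set. The sketch below is an illustration under assumed data, not the data behind the image above: it takes points lying exactly on the curve $y=x^2$, computes the correlation coefficient and the least-squares line, and then lists the signs of the residuals from left to right.

```python
from math import sqrt

# Made-up data lying exactly on the curve y = x^2 (an assumption for illustration).
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)

# Correlation coefficient and least-squares regression line.
r = sxy / sqrt(sxx * syy)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
signs = "".join("+" if res > 0 else "-" for res in residuals)

print(round(r, 2))  # 0.97
print(signs)        # ++------++
```

The correlation is strong, yet the residuals are positive at both ends and negative in the middle: a parabolic pattern, showing that the linear model misses the curvature in the data.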

 

Practice question

Question 3

The table below shows the residual values after a least-squares regression line has been fitted to a set of data.

| $x$ | $12$ | $20$ | $10$ | $18$ | $9$ | $20$ |
| --- | --- | --- | --- | --- | --- | --- |
| Residual | $-4$ | $-2$ | $5$ | $2$ | $3$ | $-1$ |

 

  1. Create a residual plot for this data set.

     


  2. Which of the following best describes the suitability of a linear model for this data set?

    A. A linear model is suitable because there is a distinct pattern in the residual plot.

    B. A linear model is not suitable because there is no pattern in the residual plot.

    C. A linear model is not suitable because there is a distinct pattern in the residual plot.

    D. A linear model is suitable because there is no pattern in the residual plot.


Question 4

The table shows a company's costs $y$ (in millions) in week $x$. The equation $y=5x+12$ is being used to model the data.

  1. Complete the table of residuals:

| $x$ | $y$ | Value generated by model | Residual |
| --- | --- | --- | --- |
| $1$ | $22$ | | |
| $2$ | $25$ | | |
| $4$ | $33$ | | |
| $6$ | $39$ | | |
| $9$ | $53$ | | |
| $12$ | $69$ | | |
| $14$ | $81$ | | |
| $17$ | $99$ | | |

  2. Plot the residuals on the scatter plot.


  3. Is this model a good fit for the data?

    A. Yes

    B. No


 

Outcomes

ACMGM058

use a residual plot to assess the appropriateness of fitting a linear model to the data
