When we want to analyse the fit of a linear model to a bivariate set of data, we start by analysing the value of the correlation coefficient.
However, we might find a strong value for the correlation coefficient, r, and yet realise on closer inspection that the relationship is not actually linear but curved.
The scatter plots below illustrate this idea. Both sets of data have a strong correlation, with similar lines of best fit.
On the left, the data points appear to be scattered randomly above and below the least-squares regression line. This randomness is expected when the linear model is suitable for the data.
However, on the right, the scatter plot shows a distinct pattern in the arrangement of the data points: they start below the line of best fit, move above it, and then return below it. Any pattern such as this suggests that the linear model is not appropriate.
To help us recognise any patterns and determine the suitability of a linear model, we can use a tool called a residual plot.
Residuals are the vertical distances from each data point to a line. When your calculator determines the least-squares regression line, it chooses the coefficients that minimise the sum of the squares of the residuals.
The scatter plots below show how the residuals are short when the line of best fit is chosen appropriately, and longer for a line that is a poor fit to the data.
[Figure: two scatter plots with residuals drawn, titled "Good fit (least-squares regression line)" and "Poor fit"]
Experiment with this interactive tool to practise finding a good fit for data. The aim is to minimise the sum of the squares of the residual values.
The closer the regression line is to the data points, the smaller the residuals are.
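The idea of minimising the sum of the squares of the residuals can be sketched in a few lines of Python. This is only an illustration: the data values and the "other" line are invented for the example, and the slope and intercept are found with the standard least-squares formulas rather than with any particular calculator's method.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_residuals(xs, ys, slope, intercept):
    """Add up (actual - predicted)^2 over every data point."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, invented for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

best_slope, best_intercept = least_squares_line(xs, ys)
best_total = sum_squared_residuals(xs, ys, best_slope, best_intercept)

# An arbitrary alternative line (slope 1.5, intercept 2.0) for comparison
other_total = sum_squared_residuals(xs, ys, 1.5, 2.0)

print(f"least-squares line: y = {best_slope:.2f}x + {best_intercept:.2f}")
print(f"sum of squared residuals (least-squares line): {best_total:.3f}")
print(f"sum of squared residuals (other line):         {other_total:.3f}")
```

Whatever alternative line is tried, its sum of squared residuals cannot be smaller than that of the least-squares regression line.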
To calculate a residual for a data value:
\text{Residual} = \text{Actual value} - \text{Predicted value}
\text{Residual} = y - \hat{y}
Remember that the predicted value, \hat{y}, is obtained from the equation of the least-squares regression line (or the y-coordinate of the corresponding point on the line of best fit).
A positive residual means the actual data point is above the least-squares regression line, and a negative residual means the actual data point is below the line.
Using the above relationships between the residual, actual value and predicted values, we are able to calculate any one of these values if we know the other two.
For instance, if the predicted value is 22 and the actual value is 19, then we can calculate the residual: \begin{aligned} \text{Residual} &= y - \hat{y} \\ &= 19-22 \\ &= -3 \end{aligned}
If the residual is equal to 5 and the predicted value is 18, then we can calculate the actual value, with some rearranging to solve the equation: \begin{aligned} \text{Residual} &= y - \hat{y} \\ 5 &= y-18 \\ y &= 5+18 \\ &= 23 \end{aligned}
Similarly, if the residual is equal to -7 and the actual value is 4, then we can calculate the predicted value (without knowing the equation of the least-squares regression line): \begin{aligned} \text{Residual} &= y - \hat{y} \\ -7 &= 4-\hat{y} \\ \hat{y} &= 4+7 \\ &= 11 \end{aligned}
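The three rearrangements above can be written as small Python functions. The function names are made up for this sketch; each one is just the formula Residual = y - \hat{y} solved for a different quantity.

```python
def residual(actual, predicted):
    """Residual = actual value - predicted value."""
    return actual - predicted


def actual_value(residual_value, predicted):
    """Rearranged: actual value = residual + predicted value."""
    return residual_value + predicted


def predicted_value(residual_value, actual):
    """Rearranged: predicted value = actual value - residual."""
    return actual - residual_value


print(residual(19, 22))        # -3
print(actual_value(5, 18))     # 23
print(predicted_value(-7, 4))  # 11
```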
The following table shows the sets of data (x,\, y) and the predicted \hat{y}-values based on a least-squares regression line. Complete the table by finding the residuals.
x | 1 | 3 | 5 | 7 | 9 |
---|---|---|---|---|---|
y | 22.7 | 22.3 | 24.2 | 21.8 | 21.5 |
\hat{y} | 25.2 | 23.4 | 21.6 | 19.8 | 18 |
\text{Residuals} | | | | | |
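One way to check a completed table is to let Python do the subtraction for each column. This is a minimal sketch using the values from the table; the rounding to one decimal place is an assumption made here to tidy up floating-point arithmetic.

```python
xs = [1, 3, 5, 7, 9]
ys = [22.7, 22.3, 24.2, 21.8, 21.5]      # actual values, y
predicted = [25.2, 23.4, 21.6, 19.8, 18]  # predicted values, y-hat

# Residual = actual value - predicted value, rounded to 1 decimal place
residuals = [round(y - y_hat, 1) for y, y_hat in zip(ys, predicted)]

for x, r in zip(xs, residuals):
    print(f"x = {x}: residual = {r}")
```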
Line of best fit - The line which most closely models a set of bivariate data.
Least-squares regression - A technique for finding the line of best fit, which would then be called the least-squares regression line. This technique involves minimising the sum of the squares of the residuals, which is best done with technology.
Residual - The vertical distance between a data point and the line of best fit.
Calculating residuals:
\text{Residual} = \text{Actual value} - \text{Predicted value}
\text{Residual} = y - \hat{y}
Remember that the predicted value, \hat{y}, is obtained from the equation of the least-squares regression line.
To help us recognise any patterns and determine the suitability of a linear model, we can use a tool called a residual plot.
To explore further, use this applet to move the points and see how the residuals are measured vertically from the least-squares regression line. Then switch to the "Residuals" view to see how these residuals can be converted to a residual plot.
Residual plots show residual values on the y-axis and the independent variable on the x-axis.
A pattern in a residual plot can help you determine what is wrong with your model. For example, it may reveal clear outliers in the data, or a pattern in the data that the linear model fails to capture, which makes its predictions unreliable.
The residual plot for a set of data is shown below.
Which of these scatter plots shows the original data set?
Residual plot - A graph that displays the residual for each point, rather than the actual data points.
If the residual data points are above the x-axis, then the original data points should be above the line of best fit.
If the residual data points are below the x-axis, then the original data points should be below the line of best fit.
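The link between a scatter plot and its residual plot can be sketched in Python. The data and the line \hat{y} = 2x + 2 below are hypothetical, chosen only so the arithmetic is easy to follow; each residual-plot point keeps the original x-value and replaces y with the residual.

```python
def residual_plot_points(xs, ys, slope, intercept):
    """Pair each x-value with its residual, y - y_hat."""
    return [(x, y - (slope * x + intercept)) for x, y in zip(xs, ys)]


def side_of_line(residual_value):
    """Say where the original data point sits relative to the line."""
    if residual_value > 0:
        return "above"
    if residual_value < 0:
        return "below"
    return "on"


# Hypothetical data and line, invented for illustration only
xs = [1, 2, 3, 4]
ys = [5, 5, 8, 11]
slope, intercept = 2, 2  # y_hat = 2x + 2

points = residual_plot_points(xs, ys, slope, intercept)
for x, r in points:
    print(f"x = {x}, residual = {r}, original point is {side_of_line(r)} the line")
```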
Once we've plotted our residuals against the independent variable, we want to analyse the plot for the suitability of using a linear regression model.
If the linear model is suitable:
The residuals are randomly scattered above and below the horizontal axis
No clustering of the residuals
Residuals are relatively small in size
If the linear model is not suitable:
The residual plot will show a clear pattern and/or
The residuals are relatively large in size
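Judging a residual plot is normally done by eye, but a rough numerical hint is possible. The sketch below counts how often the residuals change sign when listed in order of increasing x. This heuristic is an assumption added here for illustration, not a formal test: randomly scattered residuals tend to change sign often, while a curved pattern produces long runs of the same sign. Both lists of residuals are invented.

```python
def count_sign_changes(residuals):
    """Count sign changes between consecutive non-zero residuals.

    The residuals must be listed in order of increasing x.
    """
    signs = [r > 0 for r in residuals if r != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


# Hypothetical residuals, ordered by x, invented for illustration only
random_like = [0.4, -0.3, 0.2, -0.5, 0.1, -0.2, 0.3]
curved = [-1.2, -0.5, 0.6, 1.1, 0.7, -0.4, -1.3]

print(count_sign_changes(random_like))  # 6
print(count_sign_changes(curved))       # 2
```

A low count alone does not prove the model is unsuitable, so the plot itself should still be inspected.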
Here are some examples where the residual plot indicates that a linear model is suitable or not.
The table shows a company's costs y (in millions) in week x. The equation y=5x+12 is being used to model the data.
Complete the table of residuals:
x | y | \text{Model value} | \text{Residual} |
---|---|---|---|
1 | 22 | | |
2 | 25 | | |
4 | 33 | | |
6 | 39 | | |
9 | 53 | | |
12 | 69 | | |
14 | 81 | | |
17 | 99 | | |
Plot the residuals on the scatter plot.
Is this model a good fit for the data?
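The table for this question can be checked with a short Python sketch. It uses the model y = 5x + 12 and the data given in the question; the variable names are made up for the example.

```python
xs = [1, 2, 4, 6, 9, 12, 14, 17]
ys = [22, 25, 33, 39, 53, 69, 81, 99]

# Model value from the given equation y = 5x + 12
model_values = [5 * x + 12 for x in xs]

# Residual = actual value - model value
residuals = [y - m for y, m in zip(ys, model_values)]

print("x | y | model value | residual")
for x, y, m, r in zip(xs, ys, model_values, residuals):
    print(f"{x} | {y} | {m} | {r}")
```

Reading down the residual column, the values start positive, become negative, and then turn positive again as x increases. A run of signs like this is a pattern rather than random scatter.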