When we want to assess how well a linear model fits a set of bivariate data, we start by looking at the value of the correlation coefficient.
However, we might find a strong value for $r$ and then, on looking at the data more closely, realise that the relationship is not actually linear but curved.
The scatter plots below illustrate this idea. Both sets of data have a strong correlation, with similar lines of best fit.
On the left, the data points appear to be scattered randomly above and below the least-squares regression line. This randomness is expected when the linear model is suitable for the data.
However, on the right, the scatter plot shows a distinct pattern in the arrangement of the data points: they start below the line of best fit, then move above the line, before returning below it. Any pattern such as this suggests that the linear model is not appropriate.
[Figure: two scatter plots with lines of best fit. Left: linear data. Right: non-linear data.]
To help us recognise any patterns and determine the suitability of a linear model, we can use a tool called a residual plot.
Residuals are the vertical distances from each data point to a line. When your calculator determines the least-squares regression line, it is minimising the residuals (actually the sum of the squares of the residuals) to choose the optimal coefficients for the line of best fit.
The scatter plots below show how the residuals are short when the line of best fit is chosen appropriately, and longer for a line that is a poor fit to the data.
[Figure: two scatter plots with residuals drawn as vertical segments. Panels: good fit (least-squares regression line); poor fit.]
Experiment with this interactive tool to practise finding a good fit for data. The aim is to minimise the sum of the squares of the residual values.
To explore further, use this applet to move the points and show how the residuals are measured vertically from the least-squares regression line. Then switch to "Residuals" to see how these residuals can be converted to a residual plot.
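As a rough illustration of what "minimising the sum of the squares of the residuals" means, the Python sketch below compares two candidate lines for a small data set. The data points, the alternative line $y=x+3$ and the function names are all invented for this example. The slope and intercept are found with the standard least-squares formulas.

```python
# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]


def sum_squared_residuals(slope, intercept, xs, ys):
    """Sum of (actual - predicted)^2 for the line y = slope*x + intercept."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Slope and intercept that minimise the sum of squared residuals."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return slope, intercept


best_slope, best_intercept = least_squares_line(xs, ys)
ssr_best = sum_squared_residuals(best_slope, best_intercept, xs, ys)

# An arbitrary alternative line, y = x + 3, chosen as a deliberately poor fit.
ssr_poor = sum_squared_residuals(1.0, 3.0, xs, ys)

print(f"least-squares line: y = {best_slope:.2f}x + {best_intercept:.2f}")
print(f"sum of squared residuals (least-squares line): {ssr_best:.2f}")
print(f"sum of squared residuals (y = x + 3): {ssr_poor:.2f}")
```

The least-squares line gives the smaller sum here, as it must: no other straight line can produce a smaller sum of squared residuals for the same data.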
Line of best fit - The line which most closely models a set of bivariate data.
Least-squares regression - A technique for finding the line of best fit, which would then be called the least-squares regression line. This technique involves minimising the sum of the squares of the residuals, which is best done with technology.
Residual - The vertical distance between a data point and the line of best fit.
Residual plot - A graph that displays the residual for each point, rather than the actual data points.
$\text{Residual}=\text{Actual value}-\text{Predicted value}$
$\text{Residual}=y-\hat{y}$
Remember that the predicted value, $\hat{y}$, is obtained from the equation of the least-squares regression line.
A positive residual means the actual data point is above the least-squares regression line, and a negative residual means the actual data point is below the line.
Using the above relationship between the residual, the actual value and the predicted value, we can calculate any one of these values if we know the other two.
For instance, if the predicted value is $22$ and the actual value is $19$, then we can calculate the residual:
$$\begin{aligned}
\text{residual}&=y-\hat{y}\\
&=19-22\\
&=-3
\end{aligned}$$
If the residual is equal to $5$ and the predicted value is $18$, then we can calculate the actual value, with some rearranging to solve the equation:
$$\begin{aligned}
\text{residual}&=y-\hat{y}\\
5&=y-18\\
y&=5+18\\
&=23
\end{aligned}$$
Similarly, if the residual is equal to $-7$ and the actual value is $4$, then we can calculate the predicted value (without knowing the equation of the least-squares regression line):
$$\begin{aligned}
\text{residual}&=y-\hat{y}\\
-7&=4-\hat{y}\\
\hat{y}&=4+7\\
&=11
\end{aligned}$$
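The three rearrangements above can be sketched in Python as follows. The function names are made up for this illustration; each one is just the formula $\text{residual}=y-\hat{y}$ solved for a different quantity.

```python
def residual(actual, predicted):
    """residual = actual value - predicted value"""
    return actual - predicted


def actual_from(residual_value, predicted):
    """Rearranged for the actual value: y = residual + y-hat"""
    return residual_value + predicted


def predicted_from(residual_value, actual):
    """Rearranged for the predicted value: y-hat = y - residual"""
    return actual - residual_value


print(residual(19, 22))       # -3
print(actual_from(5, 18))     # 23
print(predicted_from(-7, 4))  # 11
```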
The following table shows a set of data $\left(x,y\right)$ and the predicted $\hat{y}$ values based on a least-squares regression line. Complete the table by finding the residuals.
$x$-values | $1$ | $3$ | $5$ | $7$ | $9$ |
---|---|---|---|---|---|
$y$-values | $22.7$ | $22.3$ | $24.2$ | $21.8$ | $21.5$ |
$\hat{y}$ | $25.2$ | $23.4$ | $21.6$ | $19.8$ | $18$ |
Residuals | | | | | |
The residual plot for a set of data is shown below.
Which of these scatter plots shows the original data set?
The table shows a company's profit $P$ (in millions of dollars) for total monthly sales $S$. The equation $P=0.4S-10$ is being used to model the data.
(a) Complete the table with the predicted profit and residuals, based on the linear model.
Sales $S$ | Profit $P$ | Predicted profit $\hat{P}$ | Residual $P-\hat{P}$ |
---|---|---|---|
$30$ | $-8$ | | |
$80$ | $24$ | | |
$50$ | $12$ | | |
$100$ | $23$ | | |
$60$ | $17$ | | |
$70$ | $23$ | | |
$90$ | $24$ | | |
$40$ | $3$ | | |
Think: Calculate the predicted value $\hat{P}$ and the residual for each of the given $S$ values.
Do: The residual is calculated using the formula $\text{residual}=y-\hat{y}$, which for this data is $P-\hat{P}$.
The required substitutions and calculations for the first row are:
$$\begin{aligned}
\hat{P}&=0.4\times30-10\\
&=2\\
\text{residual}&=-8-2\\
&=-10
\end{aligned}$$
The remaining values are shown in the completed table:
Sales $S$ | Profit $P$ | Predicted profit $\hat{P}$ | Residual $P-\hat{P}$ |
---|---|---|---|
$30$ | $-8$ | $2$ | $-10$ |
$80$ | $24$ | $22$ | $2$ |
$50$ | $12$ | $10$ | $2$ |
$100$ | $23$ | $30$ | $-7$ |
$60$ | $17$ | $14$ | $3$ |
$70$ | $23$ | $18$ | $5$ |
$90$ | $24$ | $26$ | $-2$ |
$40$ | $3$ | $6$ | $-3$ |
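The completed table can be checked with a short Python sketch. The sales and profit values and the model $P=0.4S-10$ come from the question; the variable and function names are made up for this illustration, and the rounding to one decimal place is only there to tidy up floating-point arithmetic.

```python
# (S, P) pairs from the table, in the order given.
data = [(30, -8), (80, 24), (50, 12), (100, 23),
        (60, 17), (70, 23), (90, 24), (40, 3)]


def predicted_profit(sales):
    """Linear model from the question: P-hat = 0.4S - 10."""
    return 0.4 * sales - 10


rows = []
for sales, profit in data:
    p_hat = predicted_profit(sales)
    # Residual = actual profit - predicted profit.
    rows.append((sales, profit, round(p_hat, 1), round(profit - p_hat, 1)))

for sales, profit, p_hat, res in rows:
    print(f"S = {sales:>3}  P = {profit:>3}  predicted = {p_hat:>5}  residual = {res:>6}")
```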
(b) Construct a residual plot for the data in part (a).
Think: Each value of $S$ and the corresponding residual value will make up the coordinates for each point on the residual plot.
Do: Construct the graph, choosing appropriate scales and labelling the axes. Take care to place each point accurately.
(c) Is this model a good fit for the data? Justify your answer.
Think: If the linear model is a good fit, the residual plot should show points scattered randomly above and below $0$, with no obvious pattern.
Do: No, a linear model is not a good fit for this data as there is a pattern present in the residual plot.
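One informal way to see the pattern without drawing the plot is to list the signs of the residuals in order of increasing $S$. The sketch below does this in Python, using the residuals from the completed table in part (a). The variable names are made up for this illustration, and a run of signs is only a rough check, not a replacement for looking at the residual plot.

```python
# Sales S mapped to its residual, taken from the completed table in part (a).
residuals_by_sales = {30: -10, 80: 2, 50: 2, 100: -7,
                      60: 3, 70: 5, 90: -2, 40: -3}

# Sort by S so the signs appear in left-to-right order on the residual plot.
ordered = sorted(residuals_by_sales.items())
signs = "".join("+" if r > 0 else "-" if r < 0 else "0" for _, r in ordered)

print(signs)  # --++++--
```

The residuals are negative for small $S$, positive in the middle and negative again for large $S$, which is the curved pattern seen in the residual plot.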
Calculating residuals and constructing the residual plot manually for a large set of data is tedious, so we can use our CAS calculator to do this for us.
Select the brand of calculator you use below to work through an example of using a calculator to generate a residual plot.
Casio Classpad
How to use the CASIO Classpad to complete the following tasks regarding creating residual plots.
Consider the data set given below:
$x$ | $2$ | $4$ | $5$ | $7$ | $11$ | $15$ | $16$ | $19$ | $22$ | $25$ |
---|---|---|---|---|---|---|---|---|---|---|
$y$ | $1.5$ | $5.8$ | $6.9$ | $13.2$ | $20.0$ | $34.5$ | $34.7$ | $41.0$ | $49.2$ | $55.1$ |
Use your calculator to generate the residual plot associated with the least squares regression line for the data.
TI Nspire
How to use the TI Nspire to complete the following tasks regarding creating residual plots.
Consider the data set given below:
$x$ | $2$ | $4$ | $5$ | $7$ | $11$ | $15$ | $16$ | $19$ | $22$ | $25$ |
---|---|---|---|---|---|---|---|---|---|---|
$y$ | $1.5$ | $5.8$ | $6.9$ | $13.2$ | $20.0$ | $34.5$ | $34.7$ | $41.0$ | $49.2$ | $55.1$ |
Use your calculator to generate the residual plot associated with the least squares regression line for the data.
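If a CAS calculator is not to hand, the same residuals can be computed with a few lines of Python. This is only a sketch and not a substitute for the calculator steps: it uses the data set above and the standard least-squares formulas for the slope and intercept, with variable names chosen for this illustration. It prints the residuals rather than drawing them; plotting each residual against its $x$ value would give the residual plot.

```python
# Data set from the table above.
xs = [2, 4, 5, 7, 11, 15, 16, 19, 22, 25]
ys = [1.5, 5.8, 6.9, 13.2, 20.0, 34.5, 34.7, 41.0, 49.2, 55.1]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Standard least-squares formulas for the slope and intercept.
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = y_mean - slope * x_mean

# Residual = actual value - predicted value, one per data point.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print(f"y = {slope:.3f}x + {intercept:.3f}")
for x, r in zip(xs, residuals):
    print(f"x = {x:>2}  residual = {r:+.2f}")
```

A useful sanity check is that the residuals from a least-squares regression line always sum to zero, up to rounding error.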
Once we've plotted our residuals against the independent variable, we want to analyse the plot for the suitability of using a linear regression model.
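If the linear model is suitable: the residuals are scattered randomly above and below zero, with no obvious pattern.
If the linear model is NOT suitable: the residuals show a clear pattern, such as a curve.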
Here are some examples where the residual plot indicates that a linear model is suitable or not.
[Figure: example residual plots. Panels: linear model is suitable; linear model is NOT suitable.]
If we take a look at the image below, we see on the left a scatterplot and a linear regression line fitted to some data. On the right we see the residual plot for the data.
Were we to only look at the scatterplot and the strong correlation ($0.994$), we'd assume a linear model was appropriate. But when we examine the residual plot, there is certainly a pattern evident in the residuals (in this case, a parabolic pattern) and so we might need to rethink what sort of model might best suit this data.
The table shows a company's costs $y$ (in millions) in week $x$. The equation $y=5x+12$ is being used to model the data.
Complete the table of residuals:
$x$ | $y$ | Value generated by model | Residual |
---|---|---|---|
$1$ | $22$ | $17$ | $5$ |
$2$ | $25$ | $22$ | $3$ |
$4$ | $33$ | $32$ | |
$6$ | $39$ | | |
$9$ | $53$ | | |
$12$ | $69$ | | |
$14$ | $81$ | | |
$17$ | $99$ | | |
Plot the residuals on the scatter plot.
Is this model a good fit for the data?
Yes
No