6.09 Linear regressions and residual plots

Lesson

Least squares regression

Recall that when we examine a set of bivariate data, we first examine the data and its scatter plot to look for a correlation between the two variables. Mathematically, we can calculate the correlation coefficient ($r$) between the two variables.

Previously, we looked at how we could estimate the line of best fit by eye, but in order to make more accurate predictions, we want to calculate the equation of this line mathematically. We will use the least squares regression line as our line of best fit.

What does least squares mean?

When we have a line of best fit, there is a difference between the actual points and the line. The vertical distance between a data point and the line of best fit is called the residual.

The best way to understand it is through a demonstration. Follow the steps below for the GeoGebra applet. 

  1. Refresh the applet and generate a new scatter plot to experiment with.
  2. Drag the slider to Stage 2 and move the blue dots around the rectangle until you have a line which you think is the line of best fit.
  3. Drag the slider to Stage 3 and you'll see the residuals. For now, think of these as the distance between your line and the actual data.
  4. Drag the slider to Stage 4. Here you see the residuals turn to squares. Now move the blue dots on the line around again and try to make the total area of all the squares combined to be as small as possible.
  5. Drag the slider to Stage 5. How did you do? Your blue Line of Best Fit should be very close to the green Least Squares Regression Line. And the sum of your squares should be very close to the least sum of the squares.

Remember!

Residual: The vertical difference between a data point and the line of best fit.

Least squares regression line: The line of best fit which minimizes the sum of the areas of the squares formed by the residuals.
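
Stages 4 and 5 of the applet can be mimicked in a few lines of Python. The sketch below is only an illustration: the five data points are made up for the example, and the helper names `least_squares_line` and `sum_of_squares` are our own, not part of any library. It computes the least squares regression line from the standard closed-form formulas for the slope and intercept, then checks that nudging the line makes the total area of the squares larger.

```python
# Illustrative data only (not taken from the lesson).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_of_squares(slope, intercept, xs, ys):
    """Total area of the squares built on the residuals of the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


slope, intercept = least_squares_line(xs, ys)
best = sum_of_squares(slope, intercept, xs, ys)

# Lines with a slightly different slope or intercept give a larger total.
nudged_slope = sum_of_squares(slope + 0.1, intercept, xs, ys)
nudged_intercept = sum_of_squares(slope, intercept + 0.5, xs, ys)

print(f"least squares line: y = {slope:.1f}x + {intercept:.1f}")
print(f"sum of squares: {best:.2f}")
print(f"after nudging the slope: {nudged_slope:.2f}")
print(f"after nudging the intercept: {nudged_intercept:.2f}")
```

Trying other nudges should always give a total at least as large as the one for the least squares line, which is what "least squares" means.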

Calculating the Least Squares Regression Line

Most of the time, you will be required to calculate the equation of the least squares regression line using technology. As mentioned previously, there are lots of tools available. In the previous investigation we looked at using Google Sheets. We'll include the instructions for the TI-83 or TI-84.

  1. Enter all of your data by pressing [STAT] and then selecting 1:Edit. Remember that your independent variable should go in L1 and your dependent variable in L2.
  2. Once your data is in, press [STAT], then select CALC and 4:LinReg(ax+b).
  3. Your equation will be $y=ax+b$, with $a$ and $b$ filled in with the given values.

Making predictions

Once we have our least squares regression line, we're ready to start making predictions. Since our line is the best possible fit for the data we have, we can use it as a model to predict the likely value of the dependent variable for a chosen value of the independent variable.

The process of predicting is two-fold:

  • Firstly we need to substitute the $x$ value we are interested in into our Least Squares Regression Line
  • Then we need to consider whether our prediction is reliable or not

Analyzing the validity of our prediction

After you have used your linear model to make a prediction for a particular value for the independent variable, you now need to determine how valid or reliable it is.

To do this, you need to consider two things:

  1. How strong is the correlation?
  2. Is the data being interpolated or extrapolated?

Interpolation means you have used an $x$ value in your prediction that is within the available range of data that you were working with. For example, if the $x$ values range between $35$ and $98$, then any $x$ value you choose within this range would be considered an interpolation.

Extrapolation means you have used an $x$ value in your prediction that is outside the available range of data. For example, if the $x$ values range between $35$ and $98$, then anything below $35$ or above $98$ would be considered an extrapolation.

Let's put it all together.

  • If my correlation is strong and I have used interpolation, then my prediction will be reliable.
  • If my correlation is moderate to weak and I have used interpolation, then my prediction is far less reliable.
  • If my correlation is strong and I have used extrapolation, then my prediction will be far less reliable, especially if I extrapolate far beyond the range of available data.
  • The worst of all is if I have a weak correlation and I extrapolate. There's little chance of this prediction being reliable.
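
The four cases above can be summarized in a small helper. This is a rough sketch rather than a rule: the function name `describe_prediction` is hypothetical, and the cutoff of $0.7$ for calling a correlation "strong" is an assumption made for the example (the lesson does not fix a threshold). The range $35$ to $98$ is the one used in the interpolation example, and the $r$ values are illustrative.

```python
def describe_prediction(x, x_min, x_max, r, strong_cutoff=0.7):
    """Classify a prediction made at x from a model fitted on x_min..x_max.

    strong_cutoff is an assumed threshold for calling |r| "strong".
    """
    interpolated = x_min <= x <= x_max
    strong = abs(r) >= strong_cutoff

    if strong and interpolated:
        return "reliable: strong correlation and interpolation"
    if interpolated:
        return "less reliable: interpolation, but moderate to weak correlation"
    if strong:
        return "less reliable: strong correlation, but extrapolation"
    return "unreliable: moderate to weak correlation and extrapolation"


# x values in the data range from 35 to 98; the r values are illustrative.
print(describe_prediction(60, 35, 98, r=-0.9))
print(describe_prediction(60, 35, 98, r=0.4))
print(describe_prediction(120, 35, 98, r=-0.9))
print(describe_prediction(120, 35, 98, r=0.4))
```

Note that the helper uses the absolute value of $r$, since a strong negative correlation supports a prediction just as well as a strong positive one.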

Practice questions

QUESTION 1

During an alcohol education program, $10$ adults were offered up to $6$ drinks and were then given a simulated driving test where they scored a result out of a possible $100$.

| Number of drinks ($x$) | $3$ | $2$ | $6$ | $4$ | $4$ | $1$ | $6$ | $3$ | $4$ | $2$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Driving score ($y$) | $66$ | $61$ | $43$ | $58$ | $56$ | $73$ | $31$ | $64$ | $55$ | $62$ |

  1. Using a graphics calculator (or other technology), calculate the correlation coefficient between these variables.

    Give your answer to two decimal places.

  2. Choose the description which best describes the statistical relationship between these two variables.

    A. Strong negative linear relationship
    B. Moderate negative linear relationship
    C. Weak relationship
    D. Moderate positive linear relationship
    E. Strong positive linear relationship

  3. Use your graphing calculator to form an equation for the least squares regression line of $y$ on $x$.

    Give your answer in the form $y=mx+b$. Give all values to one decimal place.

  4. Use your regression line to predict the driving score of a young adult who consumed $5$ drinks.

    Give your answer to one decimal place.

  5. Choose the description which best describes the validity of the prediction in part 4.

    A. Despite a strong correlation, unreliable due to extrapolation.
    B. Despite an interpolated prediction, unreliable due to a moderate to weak correlation.
    C. Very unreliable due to extrapolation and a moderate to weak correlation.
    D. Reliable due to interpolation and a strong correlation.

Residual plots

We have only looked at linear models so far and we need to remember that the correlation coefficient only relates to a linear fit. However, sometimes a non-linear model might be better. 

We might find we have a strong value for $r$, but looking at the data more closely, we realize that it is not actually linear, but instead is curved in shape.

So we need another tool with which to analyze the suitability of fitting a linear model to our data. That tool is a plot of the residuals.

How do we calculate residuals?

Calculating Residuals

Residual = Actual Value - Predicted Value

Residual = $y_a-y_p$

Remember that the predicted value is obtained from the equation of the Least Squares Regression Line.

A positive residual means the raw data point is above the Least Squares Regression Line and a negative residual means the raw data point is below the line. You can see the residuals in Stage 3 of the GeoGebra applet earlier in this lesson.

Calculating residuals manually for a large set of data is a bit tedious. We would need to add two columns to our table of values, one for the predicted value and one for the residual.

Consider the table of values below, where we are using $y=2x+1$ as the line of best fit.

| $x$ | $y$ (actual) | $y_p$ (predicted) | Residual |
|---|---|---|---|
| $3$ | $6$ | $2\times3+1=7$ | $6-7=-1$ |
| $4$ | $9$ | $2\times4+1=9$ | $9-9=0$ |

Instead of doing this by hand, we can use spreadsheets or a graphing calculator to do the calculations for us all at once. 
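
As one possible way to do this with technology, here is a short Python sketch that reproduces the two rows of the table above using the line $y=2x+1$. The function name `residuals` is our own choice for illustration.

```python
def residuals(xs, ys, slope, intercept):
    """Return actual minus predicted for each data point."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# The two data points from the table, with y = 2x + 1 as the line of best fit.
xs = [3, 4]
ys = [6, 9]

for x, y, res in zip(xs, ys, residuals(xs, ys, slope=2, intercept=1)):
    predicted = 2 * x + 1
    print(f"x={x}, actual={y}, predicted={predicted}, residual={res}")
```

The same function works for any number of points, so a longer table only needs longer `xs` and `ys` lists.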

Once we have calculated all of the residuals, we can create a new scatter plot of the residuals against $x$.

Analyzing the Residual Plot

Once we've plotted our residuals against the independent variable, we want to analyze the plot for the suitability of using a linear regression model.

There are a few things that indicate that a linear model is suitable for the data set.

  1. The residuals are randomly scattered above and below the horizontal axis.
  2. There is no clustering of the residuals.
  3. The residuals are all a similar distance from the horizontal axis.

Key Idea

If a linear model is suitable, we will see a random residual plot.

If a linear model is not suitable, we will see a pattern in the residual plot. We should then consider non-linear models.

If we take a look at the image below, we see on the left a scatter plot and a linear regression line fitted to some data. On the right, we see the residual plot for the data.

Were we to only look at the scatter plot and the strong correlation ($r=0.9944$), we'd assume a linear model was perfect. But when we examine the residual plot, there is certainly a pattern evident in the residuals (in this case, a parabolic pattern) and so we might need to rethink what sort of model might best suit this data.
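
The same effect can be reproduced with a tiny data set. In the sketch below, the points are made up for illustration and lie exactly on the curve $y=x^2$; the helper name `fit_line` is hypothetical. Even though the correlation coefficient comes out above $0.95$, the residuals follow a clear U-shaped pattern (positive, then negative, then positive again), which is the signal that a linear model is not suitable.

```python
from math import sqrt

# Illustrative data lying on the curve y = x^2 (not taken from the lesson).
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]


def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = fit_line(xs, ys)

# Residual = actual value - predicted value, in order of increasing x.
residual_list = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print(f"r = {r:.4f}")
print("residuals:", residual_list)
```

Plotting `residual_list` against `xs` would show the curved pattern directly, whereas a suitable linear model would give residuals with no visible structure.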

Practice questions

QUESTION 2

The following table shows the sets of data $\left(x,y\right)$ and the predicted $\hat{y}$ values based on a least-squares regression line. Complete the table by finding the residuals.

| $x$-values | $1$ | $3$ | $5$ | $7$ | $9$ |
|---|---|---|---|---|---|
| $y$-values | $22.7$ | $22.3$ | $24.2$ | $21.8$ | $21.5$ |
| $\hat{y}$ | $25.2$ | $23.4$ | $21.6$ | $19.8$ | $18$ |
| Residuals | | | | | |

QUESTION 3

The table shows a company's costs $y$ (in millions) in week $x$. The equation $y=5x+12$ is being used to model the data.

  1. Complete the table of residuals:

    | $x$ | $y$ | Value generated by model | Residual |
    |---|---|---|---|
    | $1$ | $22$ | | |
    | $2$ | $25$ | | |
    | $4$ | $33$ | | |
    | $6$ | $39$ | | |
    | $9$ | $53$ | | |
    | $12$ | $69$ | | |
    | $14$ | $81$ | | |
    | $17$ | $99$ | | |

  2. Plot the residuals on the scatter plot.

  3. Is this model a good fit for the data?

    A. Yes
    B. No

Outcomes

S-ID.6.b'

Informally assess the fit of a function by plotting and analyzing residuals. '[Linear focus; discuss general principle.]

S-ID.7

Interpret the slope (rate of change) and the intercept (constant term) of a linear model in the context of the data.

S-ID.8

Compute (using technology) and interpret the correlation coefficient of a linear fit.
