Recall that when we have a set of bivariate data, we first examine the data and its scatter plot to look for a correlation between the two variables. Mathematically, we can calculate the correlation coefficient ($r$) between the two variables.
Previously, we looked at how we could estimate the line of best fit by eye, but in order to make more accurate predictions, we want to calculate the equation of this line mathematically. We will use the least squares regression line as our line of best fit.
When we have a line of best fit, there is a difference between the actual points and the line. The vertical distance between a data point and the line of best fit is called the residual.
The best way to understand it is through a demonstration. Follow the steps below for the GeoGebra applet.
Drag the slider to Stage 5. How did you do? Your blue Line of Best Fit should be very close to the green Least Squares Regression Line, and the sum of your squares should be very close to the minimum sum of squares.
Residual: The vertical difference between a data point and the line of best fit.
Least squares regression line: The line of best fit which minimizes the sum of the areas of the squares formed by the residuals.
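In symbols: if the line of best fit is $y=mx+b$, then the least squares regression line is the choice of $m$ and $b$ that makes the sum of the squared residuals, $\sum_{i=1}^{n}\left(y_i-\left(mx_i+b\right)\right)^2$, as small as possible.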
Most of the time, you will be required to calculate the equation of the least squares regression line using technology. As mentioned previously, there are lots of tools available. In the previous investigation we looked at using Google Sheets. We'll include the instructions for the TI-83 or TI-84.
1. Enter the data by pressing [STAT] and then selecting 1:Edit. Remember that your independent variable should go in L1 and your dependent variable in L2.
2. Press [STAT], then select CALC and 4:LinReg(ax+b).
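If you don't have a graphing calculator handy, the same calculation can be done in a spreadsheet or in a few lines of code. Here is a minimal Python sketch, assuming numpy is installed; the data values below are made-up placeholders that you would replace with your own lists.

```python
import numpy as np

# Placeholder data: x plays the role of list L1, y the role of list L2.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

a, b = np.polyfit(x, y, 1)        # slope a and intercept b, as in LinReg(ax+b)
r = np.corrcoef(x, y)[0, 1]       # correlation coefficient

print(f"y = {a:.2f}x + {b:.2f}, r = {r:.2f}")
```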
Once we have our least squares regression line, we're ready to start making predictions. Since our line is the best possible fit for the data we have, we can use it as a model to predict the likely value of the dependent variable for a given value of the independent variable.
The process of predicting is two-fold: first we substitute the chosen value of the independent variable into our linear model to obtain a prediction, and then we need to determine how valid or reliable that prediction is.
To do this, you need to consider two things: whether the prediction is an interpolation or an extrapolation, and how strong the correlation is.
Interpolation means you have used an $x$ value in your prediction that is within the range of the data you were working with. For example, if the $x$ values range between $35$ and $98$, then any $x$ value chosen within this range would be considered an interpolation.
Extrapolation means you have used an $x$ value in your prediction that is outside the range of the data. If the $x$ values range between $35$ and $98$, then anything below $35$ or above $98$ would be considered an extrapolation.
Let's put it all together.
During an alcohol education program, $10$ adults were offered up to $6$ drinks and were then given a simulated driving test where they scored a result out of a possible $100$.
Number of drinks ($x$) | $3$ | $2$ | $6$ | $4$ | $4$ | $1$ | $6$ | $3$ | $4$ | $2$ |
---|---|---|---|---|---|---|---|---|---|---|
Driving score ($y$) | $66$ | $61$ | $43$ | $58$ | $56$ | $73$ | $31$ | $64$ | $55$ | $62$ |
Using a graphics calculator (or other technology), calculate the correlation coefficient between these variables.
Give your answer to two decimal places.
Choose the description which best describes the statistical relationship between these two variables.
Strong negative linear relationship
Moderate negative linear relationship
Weak relationship
Moderate positive linear relationship
Strong positive linear relationship
Use your graphing calculator to form an equation for the least squares regression line of $y$ on $x$.
Give your answer in the form $y=mx+b$. Give all values to one decimal place.
Use your regression line to predict the driving score of a young adult who consumed $5$ drinks.
Give your answer to one decimal place.
Choose the description which best describes the validity of the prediction in part (d).
Despite a strong correlation, unreliable due to extrapolation.
Despite an interpolated prediction, unreliable due to a moderate to weak correlation.
Very unreliable due to extrapolation and a moderate to weak correlation.
Reliable due to interpolation and a strong correlation.
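If you'd like to check your calculator work with code, here is a minimal Python sketch (assuming numpy is available) that reproduces parts (a) through (d) from the table above; the range check in the comments relates to part (e).

```python
import numpy as np

drinks = np.array([3, 2, 6, 4, 4, 1, 6, 3, 4, 2])             # number of drinks (x)
score  = np.array([66, 61, 43, 58, 56, 73, 31, 64, 55, 62])   # driving score (y)

r = np.corrcoef(drinks, score)[0, 1]   # part (a): correlation coefficient
m, b = np.polyfit(drinks, score, 1)    # part (c): least squares line y = mx + b
predicted = m * 5 + b                  # part (d): predicted score after 5 drinks

# Part (e): 5 lies between min(drinks) = 1 and max(drinks) = 6,
# so the prediction is an interpolation.
print(f"r = {r:.2f}")
print(f"y = {m:.1f}x + {b:.1f}")
print(f"Predicted score for 5 drinks: {predicted:.1f}")
```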
So far we have only looked at linear models, and we need to remember that the correlation coefficient only measures the strength of a linear fit. However, sometimes a non-linear model might be better.
We might find we have a strong value of $r$, but on looking at the data more closely, we realize that it is not actually linear, but instead curved in shape.
So we need another tool with which to analyze the suitability of fitting a linear model to our data: a plot of the residuals.
Residual = Actual Value − Predicted Value
Residual = $y_a-y_p$
Remember that the predicted value is obtained from the equation of the Least Squares Regression Line.
A positive residual means the raw data point is above the Least Squares Regression Line and a negative residual means the raw data point is below the line. You can see the residuals in Stage 3 of the GeoGebra applet earlier in this lesson.
Calculating residuals manually for a large set of data is a bit tedious. We would need to add two columns to our table of values: one for the predicted value and one for the residual.
Consider the table of values below, where we are using $y=2x+1$ as the line of best fit.
$x$ | $y$ (actual) | $y_p$ (predicted) | Residual |
---|---|---|---|
$3$ | $6$ | $2\times3+1=7$ | $6-7=-1$ |
$4$ | $9$ | $2\times4+1=9$ | $9-9=0$ |
Instead of doing this by hand, we can use spreadsheets or a graphing calculator to do the calculations for us all at once.
Once we have calculated all of the residuals, we can create a new scatterplot of $x$x versus residual.
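As a quick sketch of how this might look in code (assuming Python with numpy and matplotlib), the two rows from the small table above would be handled like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# Data and line of best fit y = 2x + 1 from the small example above.
x = np.array([3, 4])
y = np.array([6, 9])

predicted = 2 * x + 1        # predicted values from the line of best fit
residuals = y - predicted    # residual = actual - predicted

plt.scatter(x, residuals)          # residual plot: x versus residual
plt.axhline(0, linestyle="--")     # reference line at residual = 0
plt.xlabel("x")
plt.ylabel("Residual")
plt.show()
```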
Once we've plotted our residuals against the independent variable, we want to analyze the plot for the suitability of using a linear regression model.
There are a few things that indicate that a linear model is suitable for the data set.
If a linear model is suitable, we will see a random residual plot.
If a linear model is not suitable, we will see a pattern in the residual plot. We should then consider non-linear models.
If we take a look at the image below, we see on the left a scatter plot and a linear regression line fitted to some data. On the right, we see the residual plot for the data.
Were we to only look at the scatter plot and the strong correlation (0.9944), we'd assume a linear model was perfect. But when we examine the residual plot, there is certainly a pattern evident in the residuals (in this case, a parabolic pattern) and so we might need to rethink what sort of model might best suit this data.
The following table shows the sets of data $\left(x,y\right)$ and the predicted $\hat{y}$ values based on a least-squares regression line. Complete the table by finding the residuals.
$x$-values | $1$ | $3$ | $5$ | $7$ | $9$ |
---|---|---|---|---|---|
$y$-values | $22.7$ | $22.3$ | $24.2$ | $21.8$ | $21.5$ |
$\hat{y}$ | $25.2$ | $23.4$ | $21.6$ | $19.8$ | $18$ |
Residuals | $\editable{}$ | $\editable{}$ | $\editable{}$ | $\editable{}$ | $\editable{}$ |
The table shows a company's costs $y$ (in millions) in week $x$. The equation $y=5x+12$ is being used to model the data.
Complete the table of residuals:
$x$ | $y$ | Value generated by model | Residual |
---|---|---|---|
$1$ | $22$ | $17$ | $5$ |
$2$ | $25$ | $22$ | $3$ |
$4$ | $33$ | $32$ | $\editable{}$ |
$6$ | $39$ | $\editable{}$ | $\editable{}$ |
$9$ | $53$ | $\editable{}$ | $\editable{}$ |
$12$ | $69$ | $\editable{}$ | $\editable{}$ |
$14$ | $81$ | $\editable{}$ | $\editable{}$ |
$17$ | $99$ | $\editable{}$ | $\editable{}$ |
Plot the residuals on the scatter plot.
Is this model a good fit for the data?
Yes
No