AustraliaVIC
VCE 12 General 2023

3.02 Interpret and make predictions

Lesson

Predictions

Once we have a least squares line, we're ready to start making predictions. Since our line is the best possible fit for the data we have, we can use it as a model to predict the likely value for the response variable based on a value for the explanatory variable or vice versa.

For example:

  • Given a value for the explanatory variable, x, we can substitute it into the equation to find a value for the response variable, y.

  • Likewise, given a value for the response variable, y, we can substitute it into the equation and solve for the corresponding value of the explanatory variable, x.

If the value used is within the range of data, this type of prediction is called interpolation. However, if the value used lies outside the range of data, then this is called extrapolation. The further outside the range of data your chosen value is, the less reliable the prediction.
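As a minimal sketch of these ideas, the Python below substitutes a value into a line y=a+bx, solves the same equation for x, and labels a prediction as interpolation or extrapolation. The coefficients, the data values and the function names are all hypothetical and chosen only for illustration.

```python
def predict_response(a, b, x):
    """Substitute x into y = a + b*x to predict the response variable."""
    return a + b * x


def solve_for_explanatory(a, b, y):
    """Rearrange y = a + b*x to find the x value that gives a chosen y."""
    return (y - a) / b


def prediction_type(x, x_data):
    """Interpolation if x is within the range of the observed x values."""
    if min(x_data) <= x <= max(x_data):
        return "interpolation"
    return "extrapolation"


# Hypothetical coefficients and explanatory values (not from the lesson).
a, b = 2.0, 1.5
x_data = [1, 3, 4, 7, 10]

for x in (5, 25):
    y_hat = predict_response(a, b, x)
    print(f"x = {x}: predicted y = {y_hat} ({prediction_type(x, x_data)})")

print(solve_for_explanatory(a, b, 9.5))
```

Here x = 5 lies between the smallest and largest observed x values, while x = 25 lies well outside them, so the second prediction should be treated with much more caution.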

Idea summary

Interpolation means you have used an x value in your prediction that is within the range of x values in the data that you were working with.

Extrapolation means you have used an x value in your prediction that is outside the range of x values in the data.

Interpret the slope, intercept, and variation

The slope, or gradient, of a least squares line tells us the average rate of change of the response variable with respect to the explanatory variable. We usually say that, on average, the response variable increases or decreases by a certain amount for each one-unit increase in the explanatory variable.

The y-intercept tells us what the response variable is predicted to be when the explanatory variable is 0.

The coefficient of determination, r^2, can be used to explain the variation of the response variable in terms of the variation of the explanatory variable. This is usually expressed as a percentage. For example, given a coefficient of determination of 0.75, then we can say that 75\% of the variation in the response variable is explained by the variation in the explanatory variable.
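The wording of this interpretation follows a fixed pattern, which the short sketch below turns into a template. The function name and the variable names passed to it are assumptions made for illustration; only the value 0.75 comes from the text above.

```python
def describe_r_squared(r_squared, response, explanatory):
    """Express r^2 as a percentage inside the standard interpretation."""
    percent = r_squared * 100
    return (
        f"{percent:.1f}% of the variation in {response} is explained "
        f"by the variation in {explanatory}."
    )


# "cost" and "week" are placeholder variable names.
print(describe_r_squared(0.75, "cost", "week"))
```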

Examples

Example 1

A least squares regression line is given by y=6.72+3.59x.

a

State the gradient of the line.

Worked Solution
Create a strategy

Recall the equation y=a+bx, where b is the gradient of the line.

Apply the idea

b=3.59

b

Which of the following is true?

A
The gradient of the line indicates that the bivariate data set has a negative correlation.
B
The gradient of the line indicates that the bivariate data set has a positive correlation.
Worked Solution
Create a strategy

Check the sign of the gradient.

Apply the idea

Remember that a positive gradient means a positive correlation, and a negative gradient means a negative correlation. So the correct answer is B.

c

Which of the following is true?

A
If x increases by 1 unit, then y increases by 3.59 units.
B
If x increases by 1 unit, then y decreases by 3.59 units.
C
If x increases by 1 unit, then y decreases by 6.72 units.
D
If x increases by 1 unit, then y increases by 6.72 units.
Worked Solution
Create a strategy

Recall that in a linear function the gradient can be represented by b=\dfrac{\text{rise}}{\text{run}}, where the rise along the y-axis is divided by the run along the x-axis.

Apply the idea

From part (a), we identified the gradient as b=3.59.

Writing the gradient as \dfrac{3.59}{1}, we can see that y changes by 3.59 units for every 1 unit increase in x. The gradient is positive, which means y is increasing.

So the correct answer is A.

d

State the value of the y-intercept.

Worked Solution
Create a strategy

Recall the equation y=a+bx, where a is the y-intercept of the line.

Apply the idea

a=6.72

Reflect and check

To check the value of a, we can substitute 0 for x and solve for y.

\begin{aligned}
y &= a+bx && \text{Write the equation} \\
  &= 6.72+3.59(0) && \text{Substitute the values} \\
  &= 6.72 && \text{Evaluate}
\end{aligned}
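The same check can be run numerically. This small sketch evaluates the line from Example 1 at x=0 and x=1: the first value is the y-intercept, and the difference between the two is the gradient. The function name is an assumption for illustration.

```python
def example_line(x, a=6.72, b=3.59):
    """The least squares line from Example 1, y = 6.72 + 3.59x."""
    return a + b * x


intercept = example_line(0)                       # value of y when x = 0
change_per_unit = example_line(1) - example_line(0)  # rise for a run of 1

print(intercept)
print(round(change_per_unit, 2))
```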
Idea summary

Given a least squares regression line of the form y=a+bx:

The b value shows the gradient:

  • if the gradient is positive, when the explanatory variable increases by 1 unit, the response variable increases by b units.

  • if the gradient is negative, when the explanatory variable increases by 1 unit, the response variable decreases by |b| units.

The y-intercept tells us what the response variable is predicted to be when the explanatory variable is 0.

The coefficient of determination, r^2, can be used to explain the variation of the response variable in terms of the variation of the explanatory variable.

Residuals

When we want to analyse the fit of a linear model to a bivariate set of data, we start by analysing the value of the correlation coefficient. Often, however, we might find we have a strong value for r, but when looking at the data more closely, we realise that it is not actually linear, but instead is curved in shape.

So we need another method to analyse the suitability of fitting a linear model to our data. We do this by analysing the residuals. There are two ways we can assess the suitability of our least squares line using residuals.

The first is to look at an individual value and calculate the residual.

To calculate a residual for a data value:

\text{Residual} = \text{Raw data} - \text{Predicted value}

Remember that the predicted value is obtained from the equation of the least squares regression line.

A positive residual means the raw data point is above the least squares regression line and a negative residual means the raw data point is below the line.
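As a rough illustration, the sketch below calculates one residual and reports whether the point sits above or below the line. It reuses the line from Example 1; the observed point (2, 15.0) and the function names are hypothetical.

```python
def residual(observed_y, predicted_y):
    """Residual = raw data value - predicted value."""
    return observed_y - predicted_y


def position(res):
    """Describe where the raw data point lies relative to the line."""
    if res > 0:
        return "above the line"
    if res < 0:
        return "below the line"
    return "on the line"


a, b = 6.72, 3.59          # line from Example 1
x_obs, y_obs = 2, 15.0     # hypothetical observed data point

predicted = a + b * x_obs
res = residual(y_obs, predicted)
print(round(predicted, 2), round(res, 2), position(res))
```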

Idea summary

To calculate a residual for a data value:

\text{Residual} = \text{Raw data} - \text{Predicted value}

Remember that the predicted value is obtained from the equation of the least squares regression line.

A positive residual means the raw data point is above the least squares regression line and a negative residual means the raw data point is below the line.

Residual plots

Another way is to look at a plot of all the residuals. When looking at this residual plot, there are a few things that indicate that a linear model is suitable for the data set.

  1. The residuals are randomly scattered above and below the horizontal axis

  2. No clustering of the residuals

  3. Residuals are all a similar distance from the horizontal axis

So, if a linear model is not suitable, we will see a pattern in the residual plot, rather than a random scatter.

If we take a look at the image below, we see on the left a scatterplot and a linear regression line fitted to some data. On the right, we see the residual plot for the data.

Were we to only look at the scatterplot and the strong correlation (0.9944), we'd assume a linear model was perfect. But when we examine the residual plot, there is certainly a pattern evident in the residuals (in this case, a parabolic pattern) and so we might need to rethink what sort of model might best suit this data.

A CAS calculator showing residuals. Ask your teacher for more information.

Examples

Example 2

The table shows a company's costs y (in millions) in week x. The equation y=5x+12 is being used to model the data.

a

Complete the table of residuals:

x     y     Model value    Residual
1     22
2     25
4     33
6     39
9     53
12    69
14    81
17    99
Worked Solution
Create a strategy

To calculate the model value, substitute each value of x into the equation of the model, y=5x+12.

To calculate the residual value, subtract the model value from y.

Apply the idea

Solving for the first row:

\begin{aligned}
\text{Model value} &= 5 \times 1 + 12 && \text{Substitute } x=1 \\
                   &= 17 && \text{Evaluate} \\
\text{Residual} &= 22-17 && \text{Subtract the model value from } y \\
                &= 5 && \text{Evaluate}
\end{aligned}

The same process can be done for the remaining values shown in the completed table:

x     y     Model value    Residual
1     22    17             5
2     25    22             3
4     33    32             1
6     39    42             -3
9     53    57             -4
12    69    72             -3
14    81    82             -1
17    99    97             2
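The table can also be reproduced with a few lines of Python. This is only a sketch of the calculation in part (a), using the data and the model y=5x+12 from the question; the variable and function names are assumptions.

```python
xs = [1, 2, 4, 6, 9, 12, 14, 17]
ys = [22, 25, 33, 39, 53, 69, 81, 99]


def model(x):
    """Model value from the equation y = 5x + 12."""
    return 5 * x + 12


# One row per data point: x, y, model value, residual (y - model value).
rows = [(x, y, model(x), y - model(x)) for x, y in zip(xs, ys)]
residuals = [row[3] for row in rows]

print("x   y   model  residual")
for x, y, fitted, res in rows:
    print(f"{x:<3} {y:<3} {fitted:<6} {res}")
```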
b

Plot the residuals on the scatter plot.

Worked Solution
Create a strategy

Plot each value of x along with its residual value.

Apply the idea

The points we are plotting have coordinates: (1,5),(2,3),(4,1),(6,-3),(9,-4),(12,-3),(14,-1),(17,2).

Here's the residual plot:

[Residual plot: x on the horizontal axis (ticks from 2 to 18) and Residual on the vertical axis (ticks from -5 to 5), showing the eight points listed above.]
c

Is this model a good fit for the data?

Worked Solution
Create a strategy

Check whether the residuals plotted in part (b) form a pattern.

Apply the idea

No, the linear model is not a good fit for this data as there is a parabolic pattern present in the residual plot.
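One informal way to see this pattern without the plot is to look at the signs of the residuals in order of x. The sketch below is only a rough heuristic, not a formal test and not part of the worked solution: a random scatter would usually switch between positive and negative quite often, whereas a curved pattern produces long runs of the same sign.

```python
# Residuals from part (a), listed in order of increasing x.
residuals = [5, 3, 1, -3, -4, -3, -1, 2]

signs = ["+" if r > 0 else "-" for r in residuals]

# Count how many times consecutive residuals have opposite signs.
sign_changes = sum(
    1 for prev, curr in zip(residuals, residuals[1:]) if prev * curr < 0
)

print("".join(signs))
print("sign changes:", sign_changes)
```

The residuals are positive, then negative for four points in a row, then positive again, which matches the curved shape seen in the residual plot.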

Example 3

The least squares regression line is given by y=a+bx. An x-value of 5 gives a predicted value of y=9, and an x-value of 8 gives a predicted value of y=3. Find the equation of the least squares regression line.

Worked Solution
Create a strategy

To find b, use the equation: b=\dfrac{y_2 - y_1}{x_2 - x_1}

To find a, we can substitute the value of b and the coordinates of any point the regression line passes through to form an equation in terms of only a.

Apply the idea

To find the slope, substitute the coordinates:

\begin{aligned}
b &= \dfrac{3-9}{8-5} && \text{Substitute } y_2 = 3,\,y_1 = 9,\,x_2 = 8,\,x_1 = 5 \\
  &= \dfrac{-6}{3} && \text{Evaluate both parts} \\
  &= -2 && \text{Simplify}
\end{aligned}

To find the intercept, substitute b=-2 back into the equation and use the point (5,\,9), since the line passes through it.

\begin{aligned}
y &= a+bx && \text{Rewrite the equation} \\
9 &= a-2\times5 && \text{Substitute } b=-2,\,x=5,\,y=9 \\
9 &= a-10 && \text{Evaluate the product} \\
a &= 19 && \text{Add 10 to both sides}
\end{aligned}

Substitute a=19 and b=-2 back into the equation: y=19-2x
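The two steps of this solution can be sketched as a short function that takes two points on the line and returns a and b. The function name is an assumption; the points are the two predictions given in the question.

```python
def line_through(p1, p2):
    """Return (a, b) for the line y = a + b*x through two points."""
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)   # gradient: rise over run
    a = y1 - b * x1             # rearrange y1 = a + b*x1 for a
    return a, b


a, b = line_through((5, 9), (8, 3))
print(a, b)

# Check that the second point also satisfies y = a + b*x.
print(a + b * 8)
```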

Idea summary

When looking at a residual plot, there are a few things that indicate that a linear model is suitable for the data set.

  1. The residuals are randomly scattered above and below the horizontal axis

  2. No clustering of the residuals

  3. Residuals are all a similar distance from the horizontal axis

So, if a linear model is not suitable, we will see a pattern in the residual plot, rather than a random scatter.

Full regression analysis

A full regression analysis generally includes the following steps:

  1. Construct a scatterplot in order to observe the nature of the relationship between the variables.

  2. Calculate Pearson's correlation coefficient, r, to measure the strength of the relationship between the variables. When discussing r, always comment on its direction (positive or negative), strength (weak, moderate, strong or very strong) and form. Look for any outliers.

  3. Calculate the equation of the least squares regression line and plot it against the scatterplot. The regression equation should be written in terms of the given variable names rather than x and y.

  4. Interpret the coefficients of the regression equation. In terms of y=a+bx, a is the y-intercept, and b is the gradient. When asked to comment on a, the following statement could be used:

    • "The value of the response variable when the explanatory variable is zero, is predicted to be a."

    When asked to comment on b, the following statement could be used:

    • "On average, the response variable is estimated to change by b units for each 1 unit increase in the explanatory variable."

  5. The coefficient of determination can be used to discuss how much of the variability in the response variable can be explained by its relationship with the explanatory variable. Denoted r^2, the coefficient of determination is typically expressed as a percentage. We can use r^2 to make statements such as:

    • "(r^2 \times 100)\% of the variation in the response variable can be explained by the variation in the explanatory variable."

  6. Construct a residual plot to test the data for linearity. Once the residual plot is constructed, one of the following statements can be made:

    • "There is no clear pattern in the residual plot, so the data is linear" or

    • "There is a clear pattern in the residual plot, so the data is non-linear"

  7. Values of the explanatory variable can be substituted into the regression equation to make predictions.

    • If the value is within the range of the data, the prediction is an interpolation and is generally reliable.

    • If the value is outside the range of the data, the prediction is an extrapolation and is less reliable, increasingly so the further outside the range it lies.
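The numerical steps above (2, 3, 5, 6 and 7) can be sketched in plain Python. This is a minimal illustration rather than a replacement for CAS output: it uses the data from Example 2, the standard least squares formulas b = S_xy / S_xx and a = ȳ - b x̄, and a hypothetical new value x = 20 for the prediction. All variable names are assumptions.

```python
from math import sqrt

# Data from Example 2 (week x, costs y).
xs = [1, 2, 4, 6, 9, 12, 14, 17]
ys = [22, 25, 33, 39, 53, 69, 81, 99]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = s_xy / s_xx                  # gradient of the least squares line
a = y_bar - b * x_bar            # y-intercept
r = s_xy / sqrt(s_xx * s_yy)     # Pearson's correlation coefficient
r_squared = r ** 2               # coefficient of determination

# Residuals for the residual plot: raw data minus predicted value.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Prediction for a hypothetical new x value, flagged if extrapolated.
x_new = 20
y_new = a + b * x_new
kind = "interpolation" if min(xs) <= x_new <= max(xs) else "extrapolation"

print(f"y = {a:.2f} + {b:.2f}x")
print(f"r = {r:.4f}, r^2 = {r_squared * 100:.1f}%")
print([round(res, 2) for res in residuals])
print(f"x = {x_new}: predicted y = {y_new:.1f} ({kind})")
```

Even though r is very close to 1 for this data, the residuals still need to be checked for a pattern before accepting the linear model, and the prediction at x = 20 is an extrapolation.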

Idea summary

Regression analysis is an effective statistical tool for examining the connection between two or more variables of interest. While there are many different forms of regression analysis, they all focus on the impact of one or more independent variables on a dependent variable.

Outcomes

U3.AoS1.26

calculate the coefficient of determination, r^2, and interpret in the context of the association being modelled and use the model to make predictions, being aware of the problem of extrapolation

U3.AoS1.24

determine the equation of the least squares line giving the coefficients correct to a required number of decimal places or significant figures as specified, and distinguish between correlation and causation

U3.AoS1.25

use the least squares line of best fit to model and analyse the linear association between two numerical variables and interpret the model in the context of the association being modelled
