Once we have a least squares line, we're ready to start making predictions. Since our line is the best possible fit for the data we have, we can use it as a model to predict the likely value for the response variable based on a value for the explanatory variable or vice versa.
For example:
Given a value for the explanatory variable, x, we can substitute it into the equation to find a predicted value for the response variable, y.
Likewise, given a value for the response variable, y, we can substitute it into the equation and solve to find a value for the explanatory variable, x.
If the value used is within the range of data, this type of prediction is called interpolation. However, if the value used lies outside the range of data, then this is called extrapolation. The further outside the range of data your chosen value is, the less reliable the prediction.
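For example, suppose (purely for illustration) the least squares line for a data set is y = 6.72 + 3.59x, and that the x-values in the data range from 1 to 10. Substituting x = 4 gives y = 6.72 + 3.59 \times 4 = 21.08, and since 4 lies within the range of the data, this prediction is an interpolation. Substituting x = 20 would be an extrapolation, and much less reliable.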
This next video will demonstrate how we can make a prediction once we've input our data and calculated the least squares regression line.
Interpolation means you have used an x value in your prediction that is within the range of x values in the data that you were working with.
Extrapolation means you have used an x value in your prediction that is outside the range of x values in the data.
The slope, or gradient, of a least squares line tells us the average rate of change of the response variable with respect to the explanatory variable. We usually say that the response variable increases or decreases by a certain number of units for each unit increase in the explanatory variable.
The y-intercept tells us what the response variable is predicted to be when the explanatory variable is 0.
The coefficient of determination, r^2, can be used to explain the variation of the response variable in terms of the variation of the explanatory variable. This is usually expressed as a percentage. For example, given a coefficient of determination of 0.75, then we can say that 75\% of the variation in the response variable is explained by the variation in the explanatory variable.
A least squares regression line is given by y=6.72+3.59x.
State the gradient of the line.
Which of the following is true?
State the value of the y-intercept.
Which of the following is true?
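One way to read off the answers: comparing y = 6.72 + 3.59x with the general form y = a + bx, the gradient is b = 3.59, so on average the response variable increases by 3.59 units for each 1 unit increase in the explanatory variable. The y-intercept is a = 6.72, which is the predicted value of the response variable when the explanatory variable is 0.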
Given a least squares regression line of the form y=a+bx:
The b value shows the gradient:
if the gradient is positive, then when the explanatory variable increases by 1 unit, the response variable increases by b units.
if the gradient is negative, then when the explanatory variable increases by 1 unit, the response variable decreases by b units.
The y-intercept tells us what the response variable is predicted to be when the explanatory variable is 0.
The coefficient of determination, r^2, can be used to explain the variation of the response variable in terms of the variation of the explanatory variable.
When we want to analyse the fit of a linear model to a bivariate set of data, we start by analysing the value of the correlation coefficient. Often, however, we might find we have a strong value for r, but when looking at the data more closely, we realise that it is not actually linear, but instead is curved in shape.
So we need another method to assess the suitability of fitting a linear model to our data: analysing the residuals. There are two ways we can assess the suitability of our least squares line using residuals.
The first is to look at an individual value and calculate the residual.
To calculate a residual for a data value:
\text{Residual} = \text{Raw data} - \text{Predicted value}
Remember that the predicted value is obtained from the equation of the least squares regression line.
A positive residual means the raw data point is above the least squares regression line and a negative residual means the raw data point is below the line.
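For example, suppose (hypothetically) the regression line is y = 2x + 1 and one of the raw data points is (3, 9). The predicted value at x = 3 is 2 \times 3 + 1 = 7, so the residual is 9 - 7 = 2. Since the residual is positive, this data point lies above the least squares regression line.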
Another way is to look at a plot of all the residuals. When looking at this residual plot, there are a few things that indicate that a linear model is suitable for the data set.
The residuals are randomly scattered above and below the horizontal axis
No clustering of the residuals
Residuals are all a similar distance from the horizontal axis
So, if a linear model is not suitable, we will see a pattern in the residual plot, rather than a random scatter.
If we take a look at the image below, we see on the left a scatterplot and a linear regression line fitted to some data. On the right, we see the residual plot for the data.
Were we to only look at the scatterplot and the strong correlation (0.9944), we'd assume a linear model was perfect. But when we examine the residual plot, there is certainly a pattern evident in the residuals (in this case, a parabolic pattern) and so we might need to rethink what sort of model might best suit this data.
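As a rough illustration of this process, the sketch below fits a least squares line to some data and draws the corresponding residual plot. It assumes numpy and matplotlib are available, and the data values are made up (deliberately slightly curved) for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data with slight curvature, for illustration only
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2.3, 4.2, 6.7, 9.8, 13.5, 17.8, 22.7, 28.2])

# Fit a least squares line y = a + b*x
b, a = np.polyfit(x, y, 1)   # polyfit returns [slope, intercept]
predicted = a + b * x
residuals = y - predicted    # residual = raw data - predicted value

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(x, y)
ax1.plot(x, predicted)
ax1.set_title("Data with least squares line")
ax2.scatter(x, residuals)
ax2.axhline(0)                # residuals sit above/below this axis
ax2.set_title("Residual plot")
plt.show()
```

Even though the fitted line hugs this data closely, the residual plot shows a clear parabolic pattern rather than a random scatter, which is exactly the warning sign described above.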
The table shows a company's costs y (in millions) in week x. The equation y=5x+12 is being used to model the data.
Complete the table of residuals:
x | y | Model value | Residual
---|---|---|---
1 | 22 | |
2 | 25 | |
4 | 33 | |
6 | 39 | |
9 | 53 | |
12 | 69 | |
14 | 81 | |
17 | 99 | |
Plot the residuals on the scatter plot.
Is this model a good fit for the data?
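One possible working, using the model y = 5x + 12 to fill in each row:

x | y | Model value | Residual
---|---|---|---
1 | 22 | 17 | 5
2 | 25 | 22 | 3
4 | 33 | 32 | 1
6 | 39 | 42 | -3
9 | 53 | 57 | -4
12 | 69 | 72 | -3
14 | 81 | 82 | -1
17 | 99 | 97 | 2

Reading the residuals in order (5, 3, 1, -3, -4, -3, -1, 2), they start positive, dip negative, then return to positive: a curved pattern rather than a random scatter. This suggests a linear model may not be a good fit for this data.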
The least squares regression line is given by y=a+bx. An x-value of 5 gives a predicted value of y=9, and an x-value of 8 gives a predicted value of y=3. Find the equation of the least squares regression line.
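One way to work through this: the two predicted points (5, 9) and (8, 3) both lie on the line, so the gradient is b = \dfrac{3 - 9}{8 - 5} = -2. Substituting (5, 9) into y = a + bx gives 9 = a - 2 \times 5, so a = 19. The equation of the least squares regression line is therefore y = 19 - 2x.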
When looking at a residual plot, there are a few things that indicate that a linear model is suitable for the data set.
The residuals are randomly scattered above and below the horizontal axis
No clustering of the residuals
Residuals are all a similar distance from the horizontal axis
So, if a linear model is not suitable, we will see a pattern in the residual plot, rather than a random scatter.
A full regression analysis generally includes the following steps:
Construct a scatterplot in order to observe the nature of the relationship between the variables.
Calculate Pearson's correlation coefficient, r, to measure the strength of the relationship between the variables. When discussing r, always comment on its direction (positive or negative), strength (weak, moderate, strong or very strong) and form. Look for any outliers.
Calculate the equation of the least squares regression line and plot it on the scatterplot. The regression equation should be written in terms of the given variables, rather than a generic x and y.
Interpret the coefficients of the regression equation. In terms of y=a+bx, a is the y-intercept, and b is the gradient. When asked to comment on a, the following statement could be used:
"The value of the response variable when the explanatory variable is zero, is predicted to be a."
When asked to comment on b, the following statement could be used:
"On average, the response variable is estimated to change by b units for each 1 unit increase in the explanatory variable."
The coefficient of determination can be used to discuss how much of the variation in the response variable can be explained by the variation in the explanatory variable. Denoted r^2, the coefficient of determination is typically expressed as a percentage. We can use r^2 to make statements such as:
"r^2\% of variation in the response variable can be explained by the variation in the explanatory variable."
Construct a residual plot to test the data for linearity. Once the residual plot is constructed, one of the following statements can be made:
"There is no clear pattern in the residual plot, so the data is linear" or
"There is a clear pattern in the residual plot, so the data is non-linear"
Values of the explanatory variable can be substituted into the regression equation to make predictions.
If the value is within the range of the data, the prediction is considered reliable because it is an interpolation.
If the value is outside the range of the data, the prediction is considered unreliable because it is an extrapolation.
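For readers working with software, here is a minimal sketch of these steps in Python. It assumes numpy and scipy are installed, and the data values are made up for illustration only.

```python
import numpy as np
from scipy import stats

# Made-up illustrative data
x = np.array([2, 4, 5, 7, 9, 11, 13], dtype=float)
y = np.array([5.1, 8.9, 10.2, 14.8, 18.1, 21.9, 26.3])

# Least squares regression line y = a + bx, plus correlation coefficient r
result = stats.linregress(x, y)
a, b, r = result.intercept, result.slope, result.rvalue

print(f"Equation: y = {a:.2f} + {b:.2f}x")
print(f"r = {r:.4f}, r^2 = {r**2:.4f}")  # strength, coefficient of determination

# Prediction: interpolation if x_new is within the range of the data
x_new = 8.0
y_pred = a + b * x_new
kind = "interpolation" if x.min() <= x_new <= x.max() else "extrapolation"
print(f"Predicted y at x = {x_new}: {y_pred:.2f} ({kind})")

# Residuals for checking linearity (look for a random scatter, not a pattern)
residuals = y - (a + b * x)
print("Residuals:", np.round(residuals, 2))
```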
Regression analysis is an effective statistical tool for examining the connection between two or more variables of interest. While there are many different forms of regression analysis, they all focus on the impact of one or more explanatory (independent) variables on a response (dependent) variable.