We analyzed the association between two variables using the correlation coefficient in lesson 7.01 Scatter plots and lines of fit and lesson 7.02 Fitting functions to data. In this lesson, we'll use another calculation, the residual, to examine the association between two variables.
Consider the scatter plot of data relating the number of guests at a restaurant to the cost of the meal, along with the residual plot of the data:
From a scatter plot and a line of fit, we can further analyze an association between two variables by examining the residuals of the model.
By taking the residuals of each point in the data set and plotting them at their corresponding x-values, we form a residual plot for the data.
The residual plot is constructed using the same x-axis scale and x-coordinates from the original scatter plot, and plotting the residual values as the y-coordinates.
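The construction described above can be sketched in Python. The data points and the line of fit below are hypothetical values chosen only for illustration; they are not the restaurant data from the figure.

```python
# Hypothetical data and line of fit, chosen only for illustration.
x_values = [1, 2, 3, 4, 5]
y_values = [3, 5, 6, 10, 11]


def predict(x):
    """Hypothetical line of fit: y = 2x + 1."""
    return 2 * x + 1


# Each residual is actual minus predicted, paired with the original x-value.
residual_points = [(x, y - predict(x)) for x, y in zip(x_values, y_values)]

for x, r in residual_points:
    print(x, r)
```

Plotting each pair with the x-value on the horizontal axis and the residual on the vertical axis gives the residual plot.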
A residual plot can be used to decide if a straight line is an appropriate model for the data. It also indicates the strength of the relationship by showing how much the model over-predicts (negative residual) or under-predicts (positive residual) the actual data. Looking for unusually large residuals can help us identify outliers in the data set.
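As a minimal sketch of reading the sign and size of residuals, the snippet below labels each residual and picks out the one with the largest magnitude. The residual values and the helper name `describe_residual` are hypothetical, and the largest residual is only a candidate to check as an outlier, not a confirmed one.

```python
def describe_residual(residual):
    """Say whether the model over- or under-predicted a point."""
    if residual < 0:
        return "over-predicted"   # actual value is below the line
    if residual > 0:
        return "under-predicted"  # actual value is above the line
    return "predicted exactly"


# Hypothetical residuals, for illustration only.
residuals = [1.5, -0.5, 0.0, -6.0, 1.0]
labels = [describe_residual(r) for r in residuals]

# The residual with the largest absolute value is the first one to inspect.
largest = max(residuals, key=abs)

print(labels)
print(largest)
```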
Two key features will help provide evidence about whether or not a linear model is appropriate, and indicate the strength of the relationship:
Pattern - if the linear model fitted is appropriate, then points on the residual plot should be randomly scattered about the x-axis without a noticeable pattern.
Size of residuals - residuals that are small in size relative to the data being predicted indicate a stronger association. Large residuals would indicate the model significantly under- or over-predicts the actual data.
The following are some example scatter plots with the line of best fit and residuals, and their corresponding residual plots.
Scatter plot and residual plot of weak positive linear association:
From the scatter plot, we can see the association is positive. The residual plot has no obvious pattern, suggesting a linear model is appropriate. The residuals are relatively large, indicating a weak relationship.
Scatter plot and residual plot of strong negative linear association with an outlier:
From the scatter plot, we can see the association is negative. Other than the outlier, the residuals are relatively small, indicating a strong relationship. The outlier in the scatter plot stands out in the residual plot. Its inclusion leads to most of the data points being over-predicted by the line of best fit.
Scatter plot and residual plot of non-linear association:
The residual plot displays a clear pattern, indicating that a linear model is not appropriate for this data set.
The scatter plot shows the relationship between the electricity usage of a household and the cost of their monthly utility bill.
The equation of the line of best fit is $y = 0.255x - 81.49$.
The residual plot of the data is shown:
Interpret the strength and linearity of the association using the line of best fit and the residual plot.
Find and interpret the residual for the point $\left(930, 150\right)$.
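The residual for this point can be checked with a short calculation. This sketch assumes that x is the electricity usage and y is the cost of the bill, which matches the description of the scatter plot but is not stated explicitly; the function name `predicted_cost` is hypothetical.

```python
def predicted_cost(usage):
    """Line of best fit from the lesson: y = 0.255x - 81.49."""
    return 0.255 * usage - 81.49


actual = 150                     # y-value of the point (930, 150)
predicted = predicted_cost(930)  # value the model gives at x = 930
residual = actual - predicted    # residual = actual - predicted

print(round(predicted, 2), round(residual, 2))  # prints 155.66 -5.66
```

The residual is negative, so the line of best fit over-predicts the actual value at this point by about 5.66.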
Consider the following data set and scatter plot with a line of fit.
x | 10 | 11 | 13 | 18 | 19 | 21 | 23 | 25 | 28 | 29 | 31 |
---|---|---|---|---|---|---|---|---|---|---|---|
y | 12 | 13 | 9 | 8 | 7 | 7 | 4 | 2 | 3 | -1 | -2 |
Create a residual plot for the data.
Determine if a linear model is an appropriate choice for the data.
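One way to produce the residuals for this data set is sketched below. The line of fit used in the lesson appears only in the figure, so this sketch computes a least-squares line instead; that choice is an assumption, and the resulting residuals may differ slightly from those of the line in the plot.

```python
x = [10, 11, 13, 18, 19, 21, 23, 25, 28, 29, 31]
y = [12, 13, 9, 8, 7, 7, 4, 2, 3, -1, -2]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept (assumed line of fit, not from the lesson).
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# residual = actual - predicted, one per data point.
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

# Each (x, residual) pair is one point on the residual plot.
for xi, ri in zip(x, residuals):
    print(f"{xi:>3}  {ri:+.2f}")
```

If the printed residuals switch between positive and negative with no clear trend or curve as x increases, that supports a linear model; a systematic pattern would suggest otherwise.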
A residual plot shows the strength of the correlation between two variables. The closer the data points on a residual plot are to the x-axis, the stronger the correlation between the data. A model is considered strong when the residuals are small relative to the value being predicted. Calculate the residuals for a residual plot using the formula:
$$\text{residual} = \text{actual} - \text{predicted}$$
In general, a residual plot with points randomly dispersed about the x-axis indicates that the model is appropriate for the data.