The least-squares regression line is a linear representation of the general trend of our data.
Once we have determined the least-squares regression line, we can use it as a model to predict the likely value of the response variable based on a given value of the explanatory variable.
The process of predicting has two parts: making the prediction itself, and then judging how reliable that prediction is.
We can make predictions 'by hand' either by reading from a graph of the line of best fit or by substituting into the equation of the least-squares regression line.
The example below shows a scatter plot, with the least-squares regression line for the relationship of daily ice-cream sales versus the maximum daily temperature.
Although there is a clear trend of increasing sales as the temperature increases, it would be difficult to predict the sales for a given day from the raw data. However, we can use the least-squares regression line.
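If we want to predict the sales on a day when the temperature reaches $30$ degrees, we can follow the red arrows on the graph: start at $30$ on the horizontal (temperature) axis, move vertically up to the regression line, then move horizontally across to the vertical (sales) axis and read off the predicted value.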
When we are making predictions from a graph we should take care to work accurately, but still expect a small amount of variation due to the limited precision of working with graphs.
For the example above, the equation of the least-squares regression line is
$S=10\times T-45$
where $S$ is the number of ice cream sales and $T$ is the maximum daily temperature.
To predict the sales on a day where the maximum temperature is $30$ degrees, we simply substitute the temperature into the equation:
$S=10\times30-45=255$
So, we can predict that $255$ ice creams will be sold on that day.
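The substitution step can also be checked with a few lines of Python. This is only a sketch: the function name `predict_sales` and its default arguments are illustrative choices, while the slope $10$ and intercept $-45$ are taken from the regression equation quoted above.

```python
def predict_sales(temperature, slope=10, intercept=-45):
    """Predict ice-cream sales from the maximum daily temperature.

    Evaluates S = slope * T + intercept. The defaults are the coefficients
    of the regression line quoted in the text (an illustrative wrapper,
    not part of the original lesson).
    """
    return slope * temperature + intercept


# Substitute T = 30 into S = 10 * T - 45.
print(predict_sales(30))
```

Keeping the slope and intercept as parameters means the same function can be reused for any other regression line of the form $y=mx+c$.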
Let's look at some examples to see how we can make a prediction once we've input our data and calculated the least-squares regression line. Select the brand of calculator you use below to work through an example.
Casio Classpad
How to use the CASIO Classpad to complete the following tasks regarding linear regression and making predictions.
Consider the data obtained from a chemical process where the yield of the process is thought to be linearly related to the reaction temperature.
Temperature ($x$, $^\circ$C) | $50$ | $53$ | $54$ | $56$ | $59$ | $62$ | $65$ | $67$ | $71$ | $74$ |
---|---|---|---|---|---|---|---|---|---|---|
Yield ($y$, g) | $122$ | $118$ | $128$ | $125$ | $136$ | $144$ | $142$ | $149$ | $161$ | $168$ |
Using your calculator, find an equation for the least-squares regression line of $y$ on $x$.
Give your answer in the form $y=ax+b$. Give all values to two decimal places.
What yield is predicted at a reaction temperature of $60^\circ$C?
Give your answer to the nearest gram.
What yield is predicted at a reaction temperature of $85^\circ$C?
Give your answer to the nearest gram.
What temperature gives a predicted yield of $155$ g?
Give your answer to one decimal place.
TI Nspire
How to use the TI Nspire to complete the following tasks regarding linear regression and making predictions.
Consider the data obtained from a chemical process where the yield of the process is thought to be linearly related to the reaction temperature.
Temperature ($x$, $^\circ$C) | $50$ | $53$ | $54$ | $56$ | $59$ | $62$ | $65$ | $67$ | $71$ | $74$ |
---|---|---|---|---|---|---|---|---|---|---|
Yield ($y$, g) | $122$ | $118$ | $128$ | $125$ | $136$ | $144$ | $142$ | $149$ | $161$ | $168$ |
Using your calculator, find an equation for the least-squares regression line of $y$ on $x$.
Give your answer in the form $y=ax+b$. Give all values to two decimal places.
What yield is predicted at a reaction temperature of $60^\circ$C?
Give your answer to the nearest gram.
What yield is predicted at a reaction temperature of $85^\circ$C?
Give your answer to the nearest gram.
What temperature gives a predicted yield of $155$ g?
Give your answer to one decimal place.
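If a graphics calculator is not to hand, the same tasks can be sketched in plain Python. The helper `least_squares_line` below is a hypothetical name introduced for illustration. It uses the standard formulas slope $=S_{xy}/S_{xx}$ and intercept $=\bar{y}-\text{slope}\times\bar{x}$, and the data are the temperature and yield values from the table above.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squares of x and sum of cross-products, both about the means.
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


temperature = [50, 53, 54, 56, 59, 62, 65, 67, 71, 74]        # x, degrees C
yield_g = [122, 118, 128, 125, 136, 144, 142, 149, 161, 168]  # y, grams

a, b = least_squares_line(temperature, yield_g)
print(f"y = {a:.2f}x + {b:.2f}")

# Predict y from x by substituting into y = a*x + b.
for t in (60, 85):
    print(f"Predicted yield at {t} degrees C: {a * t + b:.0f} g")

# Predict x from y by rearranging y = a*x + b to x = (y - b) / a.
print(f"Temperature for a predicted yield of 155 g: {(155 - b) / a:.1f} degrees C")
```

The predictions here use the unrounded coefficients, so they may differ slightly from values found by substituting into an equation that has already been rounded to two decimal places.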
A bivariate data set has a line of best fit with equation $y=-8.71x+6.79$.
Predict the value of $y$ when $x=3.49$.
An important consideration when making predictions is whether the prediction falls within the range of data values for which we have actual measurements. If it does, we refer to the prediction as an interpolation. If not, we refer to the prediction as an extrapolation.
The diagram below illustrates these terms.
Interpolation means you have used an $x$ value in your prediction that is within the range of $x$ values in the data that you were working with.
Extrapolation means you have used an $x$ value in your prediction that is outside the range of $x$ values in the data.
The scatter plots below are annotated to show examples of interpolation (left) and extrapolation (right) from the line of best fit.
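To judge the reliability of the prediction we need to consider two things: the strength of the correlation, and whether the prediction is an interpolation or an extrapolation.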
We have already seen how to calculate the correlation coefficient and interpret the strength of a relationship, using this chart.
If the correlation is weak, then the data values are more widely scattered, so there will be greater uncertainty in the prediction than for data with a strong correlation.
In the examples above, the $x$ values range between $56.3$ and $79$, so a prediction using any $x$ value within this range would be considered interpolation.
A prediction using an $x$ value below $56.3$ or above $79$ would be an extrapolation.
When we are extrapolating, we are making a prediction for values where we have no similar measurements. A linear relationship may not hold outside the range of the observed data, so we should always treat an extrapolation as unreliable.
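The interpolation test described above amounts to a simple range check. A minimal sketch follows, using the range $56.3$ to $79$ from the example. The function name `classify_prediction` is an illustrative choice, and the endpoints are treated as interpolation, which matches the wording "below $56.3$ or above $79$".

```python
def classify_prediction(x, x_min, x_max):
    """Label a prediction by whether x lies inside the observed x range."""
    if x_min <= x <= x_max:
        return "interpolation"
    return "extrapolation"


# Observed x values in the example range from 56.3 to 79.
for x in (60, 56.3, 50, 85):
    print(x, classify_prediction(x, 56.3, 79))
```

With real data, `x_min` and `x_max` would normally be taken as `min(xs)` and `max(xs)` of the explanatory variable.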
Research on the number of cigarettes smoked during pregnancy and the birth weights of the newborn babies was conducted.
Average number of cigarettes per day ($x$) | $45.5$ | $13.2$ | $22.4$ | $24.4$ | $8.4$ | $36.7$ | $1.4$ | $18$ | $10.4$ | $13.3$ | $36.5$ | $19.4$ |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Birth weight in kilograms ($y$) | $3.7$ | $5.4$ | $4.7$ | $4.6$ | $5.3$ | $4.1$ | $6.7$ | $4.8$ | $5.3$ | $4.9$ | $3.6$ | $5.4$ |
Using technology, calculate the correlation coefficient between the average number of cigarettes per day and birth weight.
Round your answer to three decimal places.
Choose the description which best describes the statistical relationship between these two variables.
Strong positive linear relationship
Weak relationship
Moderate negative linear relationship
Moderate positive linear relationship
Strong negative linear relationship
Use technology to form an equation for the least-squares regression line of $y$ on $x$.
Give all values to two decimal places. Give the equation of the line in the form $y=mx+c$.
Use your regression line to predict the birth weight of a newborn whose mother smoked on average $5$ cigarettes per day.
Round your answer to two decimal places.
Choose the description which best describes the validity of the prediction in part (d).
Unreliable due to extrapolation and weak correlation.
Despite a strong correlation, unreliable due to extrapolation far from the data range where the linear trend does not continue.
Reliable due to interpolation and a strong correlation.
Despite an interpolated prediction, unreliable due to a weak correlation.
During an alcohol education programme, $10$ adults were offered up to $6$ drinks and were then given a simulated driving test where they scored a result out of a possible $100$.
Number of drinks ($x$) | $3$ | $2$ | $6$ | $4$ | $4$ | $1$ | $6$ | $3$ | $4$ | $2$ |
---|---|---|---|---|---|---|---|---|---|---|
Driving score ($y$) | $65$ | $60$ | $43$ | $59$ | $57$ | $73$ | $32$ | $63$ | $55$ | $61$ |
Using technology, calculate the correlation coefficient between these variables.
Round your answer to two decimal places.
Choose the description which best describes the statistical relationship between these two variables.
Moderate negative linear relationship
Strong negative linear relationship
Moderate positive linear relationship
Weak relationship
Strong positive linear relationship
Use your graphing calculator to form an equation for the least-squares regression line of $y$ on $x$.
Give your answer in the form $y=mx+c$. Give all values to one decimal place.
Use your regression line to predict the driving score of a young adult who consumed $5$ drinks.
Round your answer to one decimal place.
Choose the description which best describes the reliability of the prediction in part (d).
Unreliable due to extrapolation and weak correlation.
Despite a strong correlation, unreliable due to extrapolation far from the data range where the linear trend does not continue.
Despite an interpolated prediction, unreliable due to a weak correlation.
Reliable due to interpolation and a strong correlation.
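The workflow in this last question (correlation, regression line, prediction, then a reliability check) can be sketched in Python as one way of checking calculator output. The function name `summarise` is hypothetical. The correlation coefficient is computed as $r=S_{xy}/\sqrt{S_{xx}S_{yy}}$, and the data are the drinks and driving-score values from the table above.

```python
from math import sqrt


def summarise(xs, ys):
    """Return (r, slope, intercept) for the least-squares line of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = s_xy / sqrt(s_xx * s_yy)  # Pearson correlation coefficient
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


drinks = [3, 2, 6, 4, 4, 1, 6, 3, 4, 2]            # x
score = [65, 60, 43, 59, 57, 73, 32, 63, 55, 61]   # y

r, m, c = summarise(drinks, score)
x_new = 5
prediction = m * x_new + c
# Interpolation if x_new lies within the observed range of x values.
is_interpolation = min(drinks) <= x_new <= max(drinks)

print(f"r = {r:.2f}")
print(f"y = {m:.1f}x + {c:.1f}")
print(f"Predicted score for {x_new} drinks: {prediction:.1f}")
print("interpolation" if is_interpolation else "extrapolation")
```

The prediction is computed from the unrounded slope and intercept, so it can differ in the last decimal place from a by-hand substitution into the rounded equation.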