
2.02 Making predictions

Lesson

Predictions

The least-squares regression line is a linear representation of the general trend of our data.

Once we have determined the least-squares regression line, we can use it as a model to predict the likely value of the response variable based on a given value of the explanatory variable.

The process of predicting has two parts:

  • Substitute the x value into the rule for the least-squares regression line to get the predicted \hat{y} value

  • Then we need to consider whether our prediction is reliable or not

We can make predictions 'by hand' either using a graph of the line of best fit or substituting into the equation of the least-squares regression line.

The example below shows a scatter plot, with the least-squares regression line for the relationship of daily ice-cream sales versus the maximum daily temperature.

A graph showing relationship between temperature and ice cream. Ask your teacher for more information.

Although there is a clear trend of increasing sales as the temperature increases, it would be difficult to predict the sales for a given day from the raw data. However, we can use the least-squares regression line.

If we want to predict the sales on a day when the temperature reaches 30 degrees, we could do so with these steps that follow the arrows on the graph:

  1. From 30 degrees on the horizontal axis, draw a vertical line to intersect with the line of best fit.

  2. From the point of intersection, draw a horizontal line to the vertical axis.

  3. Read the predicted sales, approximately 245 ice-creams, from the vertical axis.

When we are making predictions from a graph we should take care to work accurately, but still expect a small amount of variation due to the limited precision of working with graphs.

For the example above, the equation of the least-squares regression line is: S=10\times T-45 where S is the number of ice cream sales and T is the maximum daily temperature.

To predict the sales on a day where the maximum temperature is 30 degrees, we simply substitute the temperature into the equation: \begin{aligned} S &= 10 \times 30 -45 \\ &= 255 \end{aligned}

So, we can predict that about 255 ice creams will be sold on that day, which is close to the approximate value of 245 read from the graph.

Examples

Example 1

A bivariate data set has a line of best fit with equation y=-8.71x+6.79.

Predict the value of y when x=3.49.

Worked Solution
Create a strategy

Substitute the value of x into the equation to get the predicted y value.

Apply the idea
\begin{aligned} y &= -8.71(3.49)+6.79 && \text{Substitute } x=3.49 \\ &= -23.6079 && \text{Evaluate using a calculator} \end{aligned}
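
The substitution above can also be checked with a few lines of code. This is a minimal sketch in Python; the function name `predict` and its parameter names are illustrative choices, not part of the lesson.

```python
def predict(slope: float, intercept: float, x: float) -> float:
    """Return the predicted y value on the line y = slope * x + intercept."""
    return slope * x + intercept


# Values from Example 1: y = -8.71x + 6.79, evaluated at x = 3.49.
y_hat = predict(slope=-8.71, intercept=6.79, x=3.49)
print(round(y_hat, 4))
```

Rounding to four decimal places matches the precision of the worked solution and hides the tiny floating-point error that computer arithmetic can introduce.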
Idea summary

We can make predictions either using a graph of the line of best fit or substituting into the equation of the least-squares regression line.

Interpolation and extrapolation

An important consideration when we are making predictions is recognising whether the prediction is within the range of data values for which we have actual measurements. If it is, we refer to the prediction as an interpolation. If not, we refer to the prediction as an extrapolation.

The diagram below illustrates these terms.

The image shows the area of interpolation and extrapolation on a scatterplot. Ask your teacher for more information.

Interpolation means you have used an x value in your prediction that is within the range of x values in the data that you were working with.

Extrapolation means you have used an x value in your prediction that is outside the range of x values in the data.

The scatter plots below are annotated to show examples of interpolation (left) and extrapolation (right) from the line of best fit.

A scatter plot with interpolation from a line of best fit. Ask your teacher for more information.
A scatter plot with extrapolation from a line of best fit. Ask your teacher for more information.
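
The distinction between interpolation and extrapolation can be expressed as a simple range check. The sketch below is one possible way to write it in Python. It assumes that an x value equal to the smallest or largest observed value counts as interpolation, and the temperatures are hypothetical values chosen for illustration, not read from the lesson's graphs.

```python
from typing import Sequence


def classify_prediction(x: float, observed_x: Sequence[float]) -> str:
    """Label a prediction at x as interpolation or extrapolation.

    Assumption: the endpoints of the observed range count as interpolation.
    """
    if min(observed_x) <= x <= max(observed_x):
        return "interpolation"
    return "extrapolation"


# Hypothetical maximum daily temperatures, for illustration only.
observed_temps = [18, 21, 24, 27, 30, 33]

print(classify_prediction(25, observed_temps))  # 25 is inside 18..33
print(classify_prediction(40, observed_temps))  # 40 is above 33
```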

To judge the reliability of the prediction we need to consider three things:

  • How strong is the correlation?

  • Is the prediction an interpolation or an extrapolation?

  • How many points are contained in the data set?

We have already seen how to calculate the correlation coefficient and interpret the strength of a relationship, using this chart.

A number line showing values and descriptions of correlation from negative 1 to 1. Ask your teacher for more information.

If the correlation is weak, then the data values are more widely scattered, so there will be greater uncertainty in the prediction than for data with a strong correlation.

When we are extrapolating, we are making a prediction for values where we have not made any similar measurements. It is possible that a linear relationship is not valid outside a certain range, so we always have to treat an extrapolation as unreliable.

Reliability of predictions:

  • If we are interpolating and the correlation is strong, then the prediction will be reliable.

  • If we are interpolating and the correlation is moderate or weak, then the prediction is less reliable.

  • If we are extrapolating, the prediction is unreliable, especially if the correlation is weak or we extrapolate far beyond the range of available data.

  • Predictions from a data set with a large number of points (e.g. more than 30) will be more reliable than predictions from a small data set.
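
The first three rules above can be summarised as a small decision function. This is only a sketch in Python: the function name and the return strings are illustrative, the strength of the correlation is passed in as a word (as read from the correlation chart) rather than computed, and the size of the data set is not modelled.

```python
def describe_reliability(interpolating: bool, strength: str) -> str:
    """Apply the lesson's rules of thumb for judging a prediction.

    strength must be 'strong', 'moderate' or 'weak'.
    """
    if strength not in ("strong", "moderate", "weak"):
        raise ValueError("strength must be 'strong', 'moderate' or 'weak'")
    if not interpolating:
        # Extrapolation is treated as unreliable whatever the correlation.
        return "unreliable"
    if strength == "strong":
        return "reliable"
    return "less reliable"


print(describe_reliability(interpolating=True, strength="strong"))
print(describe_reliability(interpolating=True, strength="weak"))
print(describe_reliability(interpolating=False, strength="strong"))
```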

Examples

Example 2

Research on the number of cigarettes smoked during pregnancy and the birth weights of the newborn babies was conducted.

Average number of cigarettes per day (x): 46.3, 13.0, 21.4, 25.0, 8.6, 36.5, 1.0, 17.9, 10.6, 13.4, 37.3, 18.5

Birth weight in kilograms (y): 3.9, 5.8, 5.0, 4.8, 5.5, 4.5, 7.0, 5.1, 5.5, 5.1, 3.8, 5.7
a

Using technology, calculate the correlation coefficient between the average number of cigarettes per day and birth weight. Round your answer to three decimal places.

Worked Solution
Create a strategy

Use the linear regression function on your calculator.

Apply the idea

Using the Statistics mode, enter each x-value along with its y-value into a data table on your calculator then find the linear regression.

Look for the correlation coefficient (r): r=-0.908
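
The lesson uses a calculator's Statistics mode here. As an alternative check, the correlation coefficient can be computed directly from its definition. The Python sketch below assumes the data values from the table in this example; the function and variable names are illustrative.

```python
from math import sqrt

# Data from Example 2.
cigarettes = [46.3, 13.0, 21.4, 25.0, 8.6, 36.5, 1.0, 17.9, 10.6, 13.4, 37.3, 18.5]
birth_weight = [3.9, 5.8, 5.0, 4.8, 5.5, 4.5, 7.0, 5.1, 5.5, 5.1, 3.8, 5.7]


def correlation(xs, ys):
    """Pearson correlation coefficient r of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = correlation(cigarettes, birth_weight)
print(round(r, 3))
```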

b

Choose the description which best describes the statistical relationship between these two variables.

A
Strong negative linear relationship
B
Moderate positive linear relationship
C
Weak relationship
D
Strong positive linear relationship
E
Moderate negative linear relationship
Worked Solution
Create a strategy

Use the figure below to identify the best description of the correlation:

A number line showing values and descriptions of correlation from negative 1 to 1. Ask your teacher for more information.
Apply the idea

In this case, the absolute value of the correlation coefficient is closer to 1 than to 0, and the sign is negative, which means it corresponds to a strong negative correlation.

So the correct answer is A.

c

Use technology to form an equation for the least squares regression line of y on x.

Give all values to two decimal places. Give the equation of the line in the form y=mx+c.

Worked Solution
Create a strategy

Use the linear regression function on your calculator.

Apply the idea

Using the Statistics mode, enter each x-value along with its y-value into a data table on your calculator then find the linear regression.

y=-0.06x+6.37
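
For readers without a statistics calculator, the slope and intercept of the least-squares regression line can be found from the standard formulas: the slope is the sum of products of deviations divided by the sum of squared x deviations, and the line passes through the point of means. The following Python sketch assumes the data values from the table in this example; the names are illustrative.

```python
# Data from Example 2.
cigarettes = [46.3, 13.0, 21.4, 25.0, 8.6, 36.5, 1.0, 17.9, 10.6, 13.4, 37.3, 18.5]
birth_weight = [3.9, 5.8, 5.0, 4.8, 5.5, 4.5, 7.0, 5.1, 5.5, 5.1, 3.8, 5.7]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The regression line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = least_squares_line(cigarettes, birth_weight)
print(round(slope, 2), round(intercept, 2))
```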

d

Use your regression line to predict the birth weight of a newborn whose mother smoked on average 5 cigarettes per day. Round your answer to two decimal places.

Worked Solution
Create a strategy

To use a regression line to predict a y value for a given x value, we substitute the x value into the equation.

Apply the idea

x represents the number of cigarettes per day, so we will substitute x=5 into the equation to find y.

\begin{aligned} y &= -0.06 \times 5 + 6.37 && \text{Substitute } x=5 \\ &= 6.07 && \text{Evaluate using a calculator} \end{aligned}

So the predicted birth weight is 6.07 \text{ kg}.

e

Choose the description which best describes the validity of the prediction in part (d).

A
Despite an interpolated prediction, unreliable due to a moderate to weak correlation.
B
Reliable due to interpolation and a strong correlation
C
Very unreliable due to extrapolation and a moderate to weak correlation
D
Despite a strong correlation, unreliable due to extrapolation far from the data range where the linear trend does not continue.
Worked Solution
Create a strategy

Check if x=5 lies within the original range of x-values.

Apply the idea

The x-values in the data range from 1.0 to 46.3, so x=5 lies within the range of the data and the prediction in part (d) is an interpolation. In part (b), we found there is a strong correlation between the two variables.

So the correct answer is B.

Idea summary

Interpolation means you have used an x value in your prediction that is within the range of x values in the data that you were working with.

Extrapolation means you have used an x value in your prediction that is outside the range of x values in the data.

Reliability of predictions:

  • If we are interpolating and the correlation is strong, then the prediction will be reliable.

  • If we are interpolating and the correlation is moderate or weak, then the prediction is less reliable.

  • If we are extrapolating, the prediction is unreliable, especially if the correlation is weak or we extrapolate far beyond the range of available data.

  • Predictions from a data set with a large number of points (e.g. more than 30) will be more reliable than predictions from a small data set.

Outcomes

ACMGM061

use the equation of a fitted line to make predictions

ACMGM062

distinguish between interpolation and extrapolation when using the fitted line to make predictions, recognising the potential dangers of extrapolation

ACMGM065

identify possible non-causal explanations for an association, including coincidence and confounding due to a common response to another variable, and communicate these explanations in a systematic and concise manner
