5.07 Making predictions

Making predictions

The line of best fit is a linear representation of the general trend of our data.

Once we have determined the equation of the line of best fit, we can use it as a model to predict the likely value of the response variable based on a given value of the explanatory variable.

The process of predicting has two parts:

  • Substitute the $x$-value into the rule for the line of best fit to get the predicted $y$-value.
  • Consider whether the prediction is reliable.

 

We can make predictions 'by hand' either using a graph of the line of best fit or by substituting into the equation of the line.

 

Predictions from a graph

The example below shows a scatter plot for the relationship of daily ice-cream sales versus the maximum daily temperature, along with the line of best fit.

Although there is a clear trend of increasing sales as the temperature increases, it would be difficult to predict the sales for a given day from the raw data.  However, we can estimate this by using the line of best fit.

If we want to predict the sales on a day when the temperature reaches $30$ degrees, we can follow the red arrows on the graph:

  1. From $30$ degrees on the horizontal axis, draw a vertical line up to the line of best fit.
  2. From the point of intersection, draw a horizontal line across to the vertical axis.
  3. Read the predicted sales, approximately $245$ ice creams, from the vertical axis.

 

When we are making predictions from a graph we should take care to work accurately, but still expect a small amount of variation due to the limited precision of working with graphs.

 

Predictions from an equation

For the example above, the equation of the line of best fit is: 

$S=10\times T-45$

where $S$ is the number of ice cream sales and $T$ is the maximum daily temperature.

To predict the sales on a day where the maximum temperature is $30$ degrees, we simply substitute the temperature into the equation:

$S=10\times30-45$
  $=245$

So, we can predict that $245$ ice creams will be sold on that day.

 

Practice question

Question 1

Scientists conducted a study to measure people's reaction times after different amounts of sleep.

The data is presented below with a line of best fit.

Number of hours of sleep ($x$) | $1.1$ | $1.5$ | $2.1$ | $2.5$ | $3.5$ | $4$
Reaction time in seconds ($y$) | $4.66$ | $4.1$ | $4.66$ | $3.7$ | $3.6$ | $3.4$

 

[Graph: a number plane with the $x$-axis and the $y$-axis each ranging from $0$ to $5$. Data points are plotted at $(1.1, 4.66)$, $(1.5, 4.1)$, $(2.1, 4.66)$, $(2.5, 3.7)$, $(3.5, 3.6)$ and $(4, 3.4)$. A line of best fit is drawn through the data points to highlight the overall trend.]

  1. What would be the reaction time for someone who has slept $5$ hours?

  2. Predict the number of hours someone sleeps if they have a reaction time of $4$ seconds.

Question 2

A bivariate data set has a line of best fit with equation $y=-8.71x+6.79$.

Predict the value of $y$ when $x=3.49$.

 

Interpolation and extrapolation

An important consideration when we are making predictions is recognising whether the prediction is within the range of data values for which we have actual measurements. If it is, we refer to the prediction as interpolation. If not, we refer to the prediction as extrapolation.

The bivariate data set on the right has generated a line of best fit, and the range of the $x$-values has been highlighted.

Making predictions within this range is interpolation, and making predictions outside this range is extrapolation.

 

Interpolation and extrapolation

Interpolation means you have used an $x$-value in your prediction that is within the range of $x$-values in the data that you were working with.

Extrapolation means you have used an $x$-value in your prediction that is outside the range of $x$-values in the data.

 

The scatter plots below are annotated to show examples of interpolation (left) and extrapolation (right) from the line of best fit.
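
The distinction can also be sketched as a short Python check. The function name `classify_prediction` is hypothetical, and treating the smallest and largest $x$-values as inside the range is our own assumption. The data used are the spray concentrations from the worked example later in this lesson.

```python
def classify_prediction(x_value, x_data):
    """Return 'interpolation' if x_value lies within the range of x_data, else 'extrapolation'."""
    # Assumption: the endpoints of the data range count as interpolation.
    if min(x_data) <= x_value <= max(x_data):
        return "interpolation"
    return "extrapolation"


# Spray concentrations (ml/l) from the worked example later in this lesson
concentrations = [1, 2, 4, 7, 8, 10]

for x in (5, 10.8, 35):
    print(x, classify_prediction(x, concentrations))
```

Note that this check only says whether a prediction is an extrapolation. Whether it is just beyond or well beyond the data range is still a judgement made in context.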

 

Reliability of predictions

To judge the reliability of the prediction we need to consider three things:

  • How strong is the correlation?
  • Is the prediction an interpolation or an extrapolation?
  • How many points are contained in the data set?

We have already seen how to calculate the correlation coefficient and interpret the strength of a relationship, using this chart.

Correlation coefficient

 

If the correlation is weak, then the data values are more widely scattered, so there will be greater uncertainty in the prediction than for data with a strong correlation.

An interpolated prediction is made within the range of $x$-values upon which the model was based. Since the model is designed to align closely to the data in this range, interpolated predictions will be more reliable than extrapolated predictions.

When we are extrapolating, we are assuming that the linear model continues beyond the data. This assumption is generally valid when making a prediction relatively close to the given data range. A sudden change in trend is unlikely for most of the phenomena we will model, so we will consider such predictions reliable when using a strong model.

When we extrapolate well beyond the data range, we will consider such predictions unreliable, even if the linear trend continues. Because we are predicting for values where we have made no similar measurements, the underlying linear relationship may be slightly different from the one generated by the given data, so predictions fare worse the further we go beyond the data range. There are also many cases where it is clear the linear trend cannot continue well beyond the data range. In such cases, the predictions will be unreliable and may also be unreasonable, that is, they give values that are not possible. For example, a model may show a negative linear trend between a chicken's age and the number of eggs it lays per week. However, the linear model could not continue indefinitely, as it would suggest the chicken lays a negative number of eggs at some age.

 

Reliability of predictions

If we are interpolating:

  • And the correlation is weak, then the prediction will not be reliable.
  • And the correlation is moderate, then the prediction is moderately reliable.
  • And the correlation is strong, then the prediction will be reliable.

If we are extrapolating:

  • And the correlation is weak, then the prediction will not be reliable.
  • Just beyond the data range and the correlation is moderate, then the prediction is moderately reliable.
  • Just beyond the data range and the correlation is strong, then the prediction will be reliable.
  • Well beyond the data range, then the prediction will not be reliable and may be unreasonable.

Predictions from a data set with a large number of points (e.g. more than $30$) will be more reliable than predictions from a small data set.
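
The rules in the summary above can be sketched as a simple lookup in Python. The function name `reliability` and the category labels are our own, chosen for illustration. Deciding whether a correlation is weak, moderate or strong, and whether a prediction is just beyond or well beyond the data range, is left to the reader. The sketch does not account for the size of the data set.

```python
def reliability(strength, position):
    """Look up the reliability of a prediction using the summary rules above.

    strength: 'weak', 'moderate' or 'strong' (judged from the correlation coefficient)
    position: 'interpolation', 'just beyond' or 'well beyond' the data range
    """
    if strength not in ("weak", "moderate", "strong"):
        raise ValueError("strength must be 'weak', 'moderate' or 'strong'")
    if position not in ("interpolation", "just beyond", "well beyond"):
        raise ValueError("position must be 'interpolation', 'just beyond' or 'well beyond'")

    # Extrapolating well beyond the data is unreliable whatever the correlation.
    if position == "well beyond":
        return "not reliable (and may be unreasonable)"
    # Otherwise the strength of the correlation decides.
    if strength == "weak":
        return "not reliable"
    if strength == "moderate":
        return "moderately reliable"
    return "reliable"


print(reliability("strong", "interpolation"))
print(reliability("moderate", "just beyond"))
print(reliability("strong", "well beyond"))
```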

 

Worked example

A farmer sprays his wheat fields with a fertiliser. The data below gives the yield of wheat per hectare for various spray concentrations:

Spray concentration ($x$ ml/l) | $1$ | $2$ | $4$ | $7$ | $8$ | $10$
Yield of wheat ($y$ tonnes per hectare) | $1.19$ | $1.50$ | $1.62$ | $1.65$ | $2.05$ | $2.07$

The line of best fit for this data is $\hat{y}=0.088x+1.212$ and $r=0.929$ (given to $3$ decimal places).

(a) Use the line to make predictions for the yield when using the following concentrations of fertiliser: $5$ ml/l, $10.8$ ml/l and $35$ ml/l. (Give answers to $2$ decimal places.)

Think: Substitute the concentration for the value of $x$ in the equation of the line and evaluate.

Do:

Spray concentration ($x$ ml/l) | Predicted yield of wheat ($\hat{y}$ tonnes/hectare)
$5$ | $1.65$
$10.8$ | $2.16$
$35$ | $4.29$

(b) Comment on the reliability of each prediction made in part (a).

Think: Consider the strength of the model and whether the prediction is interpolation, extrapolation just beyond the data range or extrapolation well beyond the data range.

Do:

The correlation coefficient tells us we have a strong model.

The prediction for $5$ ml/l is interpolation on a strong model and hence should be reliable.

The prediction for $10.8$ ml/l is extrapolation just beyond the data range on a strong model and hence should be reasonably reliable.

The prediction for $35$ ml/l is extrapolation well beyond the data range. We know the linear trend cannot continue indefinitely, since a crop of wheat must have a maximum possible yield, and at some point the fertiliser will no longer be effective and may even be toxic. Hence, despite the strong model, this prediction is not reliable and may not even be reasonable, as it may be beyond the maximum yield possible.
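
The figures in this worked example can be checked with a short Python sketch. It assumes the line of best fit is the least-squares regression line, which the lesson does not state explicitly, and the variable and function names are our own. The predictions use the coefficients rounded to $3$ decimal places, as quoted in the example.

```python
from math import sqrt

# Data from the worked example
x = [1, 2, 4, 7, 8, 10]                    # spray concentration (ml/l)
y = [1.19, 1.50, 1.62, 1.65, 2.05, 2.07]   # yield (tonnes per hectare)

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of squares and products of the deviations from the means
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

# Assumed method: least-squares gradient and intercept, and Pearson's r
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
r = s_xy / sqrt(s_xx * s_yy)

# Round the coefficients to 3 decimal places, as quoted in the example
m = round(slope, 3)
b = round(intercept, 3)
print(m, b, round(r, 3))  # 0.088 1.212 0.929


def predict_yield(concentration):
    """Predicted yield using the rounded line of best fit."""
    return m * concentration + b


for c in (5, 10.8, 35):
    print(c, round(predict_yield(c), 2))
```

Using the unrounded gradient and intercept gives a slightly different prediction for $35$ ml/l, which is one more sign that small differences in the model grow as we move further from the data.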

 

Practice questions

Question 3

A prediction for the $y$-value when $x=5$ is made from the data set below.

Is the prediction an extrapolation or an interpolation?

$x$ | $4$ | $7$ | $8$ | $11$ | $12$ | $13$ | $17$ | $18$ | $19$ | $20$
$y$ | $0$ | $2$ | $4$ | $7$ | $6$ | $4$ | $8$ | $8$ | $11$ | $8$

    A. Extrapolation
    B. Interpolation

Question 4

One litre of gas is raised to various temperatures and its pressure is measured.

The data has been graphed below with a line of best fit.

Temperature (K) | $300$ | $302$ | $304$ | $308$ | $310$ | $312$ | $314$ | $316$ | $318$ | $320$
Pressure (Pa) | $2400$ | $2416$ | $2434$ | $2462$ | $2478$ | $2496$ | $2512$ | $2526$ | $2546$ | $2562$

[Graph: scatter plot of pressure against temperature with a line of best fit]

  1. The pressure was not recorded when the temperature was $306$ K. Is it reasonable to use the line of best fit to predict the pressure?

    A. Yes
    B. No

  2. Predict the pressure when the temperature is $306$ K.

  3. Within which range of temperatures is it reasonable to use the line of best fit to predict pressure?

    A. $\left[300,320\right]$
    B. $\left[300,600\right]$
    C. $\left[0,320\right]$
    D. $\left[280,340\right]$

Question 5

The data below has a correlation coefficient $r=0.57$ and a line of best fit $y=0.27x-0.62$.

$x$ | $23$ | $81$ | $11$ | $44$ | $50$ | $91$ | $51$ | $95$ | $53$ | $82$
$y$ | $1$ | $35.2$ | $0.8$ | $12.8$ | $23$ | $1.7$ | $21.2$ | $35$ | $1$ | $19$

  1. Predict the value of $y$ when $x=145$.

  2. Comment on the reliability of the prediction in part (a), giving reasons.

    A. The data is strongly correlated and the prediction is an interpolation, so the prediction is reliable.
    B. The data is moderately correlated and the prediction is an interpolation, so the prediction is moderately reliable.
    C. The data is strongly correlated and the prediction is an extrapolation close to the data range, so the prediction is reliable.
    D. The data is moderately correlated and the prediction is an extrapolation far from the data range. There's no telling that the linear trend continues so the prediction is unreliable.

Question 6

Several cars underwent a brake test and their age was measured against their stopping distance. The scatter plot shows the results and a line of best fit that approximates the positive correlation.

[Graph: scatter plot of stopping distance against car age with a line of best fit]

  1. According to the line, what is the stopping distance of a car that is $6$ years old?

  2. Using the two points that lie on the line, determine the gradient of the line of best fit.

  3. Assuming the line of best fit is in the form $y=mx+b$, determine the value of $b$, the vertical intercept of the line.

  4. Use the line of best fit to estimate the stopping distance of a car that is $7.5$ years old.

  5. Is the estimation in the previous part an example of interpolation or extrapolation?

    A. Interpolation
    B. Extrapolation

  6. Is the predicted value in part (d) reliable or unreliable?

    A. Reliable
    B. Unreliable

 

Outcomes

4.1.3.4

interpret relationships in terms of the variables [complex]

4.1.3.6

use the line of best fit to make predictions, both by interpolation and extrapolation [complex]
