7.05 Making predictions

Lesson

Given a set of data relating $x$ and $y$, a line of best fit is our best estimate for how the dependent variable $y$ changes in response to the independent variable $x$, assuming that the variables have a linear relationship. The equation of the line of best fit then allows us to go one step further and make predictions about other possible ordered pairs that fit in with this relationship.

For example, we may use data from $10$ employees to obtain the equation

$y=0.0001x+0.5$

relating salary ($x$) and the number of years someone stays in their job ($y$).

This equation now allows us to predict values beyond the values given by the $10$ employees. We can predict:

  • how long someone will stay in their job ($y$) for any given salary ($x$)
  • what someone's salary would need to be ($x$) for them to stay in their job for a certain number of years ($y$).

So the line of best fit allows us to go beyond the observed data points we started with, and predict other values within the relationship.
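
As a rough sketch, both kinds of prediction can be written in Python using the salary equation from this lesson. The function names and the example inputs (a salary of 60000 and a target of 5 years) are our own choices for illustration and do not come from the original data.

```python
# Line of best fit from the lesson: y = 0.0001x + 0.5
# x = salary, y = number of years someone stays in their job
GRADIENT = 0.0001
INTERCEPT = 0.5


def predict_years(salary):
    """Predict years in the job (y) for a given salary (x)."""
    return GRADIENT * salary + INTERCEPT


def salary_for_years(years):
    """Rearrange y = mx + c into x = (y - c) / m to find the salary (x)."""
    return (years - INTERCEPT) / GRADIENT


# Illustrative inputs only (assumed, not taken from the data set).
print(round(predict_years(60000), 2))
print(round(salary_for_years(5), 2))
```

The second function is just the first one rearranged to make $x$ the subject, which is the same step we carry out by hand when we are given a $y$-value.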

Worked example

A hobby store records the age of their customers, $x$, along with the amount of money the customer spends during their visit, $y$, and generates a bivariate data set. The line of best fit for the data set is found to be

$y=2x-30$

(a) According to this model, how much money should they expect a $25$-year-old to spend in a single visit?
(b) According to this model, customers of what ages should they expect to spend at least $\$50$ per visit?

Think: Part (a) asks us to find a $y$-value using $x=25$, and part (b) asks us to find an $x$-value using $y=50$. We will substitute these values into the equation to solve for the other variable.

Do: (a) When $x=25$, $y=2\times25-30=20$, so they should expect a $25$-year-old to spend $\$20$.
(b) When $y=50$, $50=2x-30$. Rearranging to make $x$ the subject gives $x=\frac{50+30}{2}=40$, so they should expect that someone who is $40$ years old will spend $\$50$. Since this line of best fit describes a positive linear relationship, they should expect anyone $40$ years or older to spend $\$50$ or more.
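
The two substitutions in this worked example can be checked with a short Python sketch. The function names are our own, introduced only for illustration.

```python
# Line of best fit from the worked example: y = 2x - 30
# x = customer age in years, y = amount spent per visit in dollars


def expected_spend(age):
    """Part (a): substitute an age (x) to get the predicted spend (y)."""
    return 2 * age - 30


def age_for_spend(spend):
    """Part (b): rearrange y = 2x - 30 into x = (y + 30) / 2."""
    return (spend + 30) / 2


print(expected_spend(25))  # 20
print(age_for_spend(50))   # 40.0

# The gradient is positive, so every age from 40 upwards predicts at least $50.
# The upper limit of 100 is an arbitrary cut-off for this check.
print(all(expected_spend(age) >= 50 for age in range(40, 100)))  # True
```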

 

The limits of prediction

We already know that most of the data points in a data set typically do not lie exactly on the line of best fit. Other variables affect the values of a data set, and measurement error can also contribute to making the data set more spread out and the relationship less clear. The effects of other variables and errors on the data set are called noise.

So how accurate are our predictions? One factor to take into account is the part of the line we are using to make our prediction. 

The bivariate data set on the right has generated a line of best fit, and the range of the $x$-values has been highlighted.

Making predictions within this range is interpolation, and making predictions outside this range is extrapolation.

   
   
If we make a prediction from an $x$-value that is within the range of the $x$-values in the data set, then we call the prediction an interpolation. Values in this range have been used to generate the line of best fit, so predictions within this range don't require any extension of the model.

If we make a prediction using an $x$-value from outside the range of the data set, we call the prediction an extrapolation. Values outside of the range have not been used to generate the line of best fit, so there is less certainty of what the relationship is like outside of this range. Making predictions outside of this range usually results in a less accurate prediction.
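
The distinction can be sketched in Python as a simple range check. The function name and the sample $x$-values are hypothetical, and treating a value exactly equal to the smallest or largest observed $x$-value as an interpolation is our own assumption.

```python
def classify_prediction(x_new, x_observed):
    """Label a prediction by comparing x_new with the observed x-range."""
    lowest, highest = min(x_observed), max(x_observed)
    # Assumption: the endpoints of the observed range count as inside it.
    if lowest <= x_new <= highest:
        return "interpolation"
    return "extrapolation"


# Hypothetical x-values, made up for illustration only.
sample_x = [2, 3, 3, 4, 5]

print(classify_prediction(3, sample_x))   # interpolation
print(classify_prediction(20, sample_x))  # extrapolation
```

Note that this check only tells us which kind of prediction we are making. It says nothing about how strong the linear relationship is within the range.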

 

Worked example

The owner and operator of an online store selling custom computer keyboards keeps track of the number of keyboards she makes, $x$, and how much profit she makes in dollars, $y$, each week for several months. The fewest keyboards she made in any of the weeks was $2$, and the most she made was $5$. The bivariate data set has a line of best fit given by

$y=400x-650$

with a correlation coefficient of $r=0.9$.

(a) If she makes $3$ keyboards next week, how much profit does this model predict she should expect? How confident should she be in this prediction?
(b) If she makes $20$ keyboards the week after, how much profit does this model predict she should expect? How confident should she be in this prediction?

Think: The correlation coefficient is $0.9$, so she should have a high degree of confidence when making an interpolation. The observed $x$-values range from $2$ to $5$.

Do: (a) When $x=3$, $y=400\times3-650=\$550$. Since the observed linear relationship is strong ($r=0.9$) and this prediction is an interpolation, she should be confident in this prediction.
(b) When $x=20$, $y=400\times20-650=\$7350$. Since $x=20$ lies well outside the observed range of $2$ to $5$, this prediction is an extrapolation, so she should not be as confident about this prediction.
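
A short Python sketch of this worked example, pairing each predicted profit with the kind of prediction it is. The function names are our own, and counting the endpoints $2$ and $5$ as inside the range is an assumption.

```python
# Line of best fit from the worked example: y = 400x - 650
# x = keyboards made in a week, y = weekly profit in dollars
X_MIN, X_MAX = 2, 5  # fewest and most keyboards made in any observed week


def predict_profit(keyboards):
    """Substitute the number of keyboards (x) to get predicted profit (y)."""
    return 400 * keyboards - 650


def describe_prediction(keyboards):
    """Return the predicted profit together with the kind of prediction."""
    # Assumption: the endpoints 2 and 5 count as inside the observed range.
    inside = X_MIN <= keyboards <= X_MAX
    kind = "interpolation" if inside else "extrapolation"
    return predict_profit(keyboards), kind


print(describe_prediction(3))   # (550, 'interpolation')
print(describe_prediction(20))  # (7350, 'extrapolation')
```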

Outcomes

MS2-12-2

analyses representations of data in order to make inferences, predictions and draw conclusions

MS2-12-7

solves problems requiring statistical processes, including the use of the normal distribution and the correlation of bivariate data
