NZ Level 8 (NZC) Level 3 (NCEA) [In development]

Least-Squares Lines and Making Predictions

Lesson

When we examine a set of bivariate data, we first look at the data and its graph to judge the strength of the relationship between the two variables. We do this by calculating the correlation coefficient between the two sets of data. That is, we want to see how well the dependent variable correlates with the independent variable.

Why do we bother to examine the strength of the relationship between two variables? Usually because we want to use this information to formulate predictions.

For example, we might want to study the effectiveness of spending money on Facebook advertising to attract new clients. We might collect data on a business similar to ours and then use our findings to predict how many clients we could expect to sign up for a monthly advertising spend of $500.

The first thing we need to do in order to use our data to make predictions is to fit some sort of function to the shape of our data. For our purposes, we'll be looking at scattergraphs whose pattern is linear, so we will fit a linear function to our data.

You might remember this as drawing in the Line of Best Fit. The more mathematically correct name is the Least Squares Regression Line, but effectively they're the same thing.

So what is the Least Squares Regression Line?

The best way to understand it is through a demonstration.

Experiment with this GeoGebra applet.

- Refresh the applet and generate a new scattergraph to experiment with.
- Drag the slider to Stage 2 and move the blue dots around the rectangle until you have a line which you think is the Line of Best Fit.
- Drag the slider to Stage 3 and you'll see the residuals. For now, think of these as the distance between your line and the actual data.
- Drag the slider to Stage 4. Here you see the residuals turn into squares. Now move the blue dots on the line around again and try to make the total area of all the squares as small as possible.
- Drag the slider to Stage 5. How did you do? Your blue Line of Best Fit should be very close to the green Least Squares Regression Line. And the sum of your squares should be very close to the least sum of the squares.
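
The idea behind the applet can also be sketched in a few lines of Python. This is only an illustration, not part of the applet: the five data points and the "guessed" line are hypothetical, and the function names are made up for this example. It compares the total area of the squares for a guessed line with the total for the least squares line.

```python
def sum_of_squared_residuals(xs, ys, a, b):
    """Total area of the squares for the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Return (gradient, y-intercept) of the least squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    a = s_xy / s_xx
    b = y_bar - a * x_bar
    return a, b


# Hypothetical data, invented purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# A line drawn "by eye" (also hypothetical).
guess_a, guess_b = 1.8, 0.5

best_a, best_b = least_squares_line(xs, ys)

print("Guess:", sum_of_squared_residuals(xs, ys, guess_a, guess_b))
print("Least squares:", sum_of_squared_residuals(xs, ys, best_a, best_b))
```

Whatever line you guess, its sum of squares cannot be smaller than the sum for the least squares line, which is what Stage 5 of the applet shows.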

Most of the time, you will be required to calculate the equation of the Least Squares Regression Line using technology.

Here's a great video on how to use the TI-Nspire to create a scatter graph and calculate the equation of the Least Squares Regression Line.

Least Squares Regression Line Notation

$\hat{y}=ax+b$

$\hat{y}$ means the predicted value of $y$

$x$ is our independent variable

$a$ is the gradient of our line

$b$ is the $y$-intercept of our line.

Once we have our line, we're ready to start making predictions. Since our line is the best possible fit for the data we have, we can use it as a model to predict the likely value of the dependent variable for any value of the independent variable we're interested in.

The process of predicting is two-fold.

- Firstly, we need to substitute the $x$ value we are interested in into our Least Squares Regression Line.
- Then we need to consider whether our prediction is reliable or not.

Let's watch this next video and see how we can make a prediction once we've input our data and calculated the Least Squares Regression Line.

After you have used your linear model to make a prediction for a particular value for the independent variable, you now need to determine how valid or reliable it is.

To do this, you need to consider two things:

- How strong is the correlation?
- Is the prediction an interpolation or an extrapolation?

We already know all about correlation and the strength of a relationship. Let's talk about the second point.

Interpolation means you have used an $x$ value in your prediction that is within the available range of data that you were working with. In the video example, the $x$ values range between 35 and 98, so any $x$ value you choose within this range would be considered an interpolation.

Extrapolation means you have used an $x$ value in your prediction that is outside the available range of data. So in the video example, anything below 35 or above 98 would be considered an extrapolation.
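
As a small sketch of this check, the function below labels a chosen $x$ value using the range 35 to 98 from the video example. The function name is hypothetical, and treating the endpoints 35 and 98 as inside the range is an assumption of this sketch.

```python
def classify_prediction(x, x_min, x_max):
    """Label a prediction by whether x lies inside the observed range of data."""
    # Assumption: the endpoints themselves count as inside the range.
    if x_min <= x <= x_max:
        return "interpolation"
    return "extrapolation"


# Range of x values from the video example.
X_MIN, X_MAX = 35, 98

for x in (50, 98, 120):
    print(x, classify_prediction(x, X_MIN, X_MAX))
```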

Let's put it all together.

- If my correlation is strong and I have used interpolation, then my prediction will be reliable.
- If my correlation is moderate to weak and I have used interpolation, then my prediction is far less reliable.
- If my correlation is strong and I have used extrapolation, then my prediction will be far less reliable, especially if I extrapolate far beyond the range of available data.
- The worst of all is if I have a weak correlation and I extrapolate. There's little chance of this prediction being reliable.
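
The four cases above can be sketched as a simple decision function. Note the assumptions: the cut-off of 0.7 for a "strong" correlation is only one common convention and is not given in this lesson, the function name is made up, and a moderate correlation combined with extrapolation (a case the list does not mention) is grouped with the weakest case here.

```python
# Assumption: |r| of at least 0.7 counts as "strong". Conventions vary.
STRONG_CUTOFF = 0.7


def describe_reliability(r, is_interpolation):
    """Return a rough reliability label for a prediction from a linear model."""
    strong = abs(r) >= STRONG_CUTOFF
    if strong and is_interpolation:
        return "reliable"
    if is_interpolation:
        return "far less reliable: moderate to weak correlation"
    if strong:
        return "far less reliable: extrapolation"
    return "little chance of being reliable: weak correlation and extrapolation"


print(describe_reliability(-0.95, True))
print(describe_reliability(0.4, True))
print(describe_reliability(0.95, False))
print(describe_reliability(0.2, False))
```

The sign of $r$ is ignored because only the strength of the relationship matters for reliability, not its direction.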

During an alcohol education programme, $10$ adults were offered up to $6$ drinks and were then given a simulated driving test where they scored a result out of a possible $100$.

Number of drinks ($x$) | $3$ | $2$ | $6$ | $4$ | $4$ | $1$ | $6$ | $3$ | $4$ | $2$ |
---|---|---|---|---|---|---|---|---|---|---|
Driving score ($y$) | $66$ | $61$ | $43$ | $58$ | $56$ | $73$ | $31$ | $64$ | $55$ | $62$ |

Using a graphics calculator (or other technology), calculate the correlation coefficient between these variables.

Give your answer to two decimal places.

Choose the description which best describes the statistical relationship between these two variables.

- A. Strong negative linear relationship
- B. Moderate negative linear relationship
- C. Weak relationship
- D. Moderate positive linear relationship
- E. Strong positive linear relationship

Use your graphing calculator to form an equation for the least squares regression line of $y$ on $x$.

Give your answer in the form $y=mx+b$. Give all values to one decimal place.

Use your regression line to predict the driving score of a young adult who consumed $5$ drinks.

Give your answer to one decimal place.

Choose the description which best describes the validity of the prediction in part (d).

- A. Despite a strong correlation, unreliable due to extrapolation.
- B. Despite an interpolated prediction, unreliable due to a moderate to weak correlation.
- C. Very unreliable due to extrapolation and a moderate to weak correlation.
- D. Reliable due to interpolation and a strong correlation.
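
If you are using "other technology" rather than a graphics calculator, one way to check your working for this question is a short Python script like the sketch below. The variable names are made up for this example. It keeps the unrounded gradient and intercept when predicting, so the prediction can differ slightly in the last decimal place from one made with the rounded equation.

```python
from math import sqrt

# Data from the table above.
drinks = [3, 2, 6, 4, 4, 1, 6, 3, 4, 2]
scores = [66, 61, 43, 58, 56, 73, 31, 64, 55, 62]

n = len(drinks)
x_bar = sum(drinks) / n
y_bar = sum(scores) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - x_bar) ** 2 for x in drinks)
s_yy = sum((y - y_bar) ** 2 for y in scores)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(drinks, scores))

r = s_xy / sqrt(s_xx * s_yy)   # correlation coefficient
m = s_xy / s_xx                # gradient of the regression line
b = y_bar - m * x_bar          # y-intercept of the regression line

x_new = 5
predicted_score = m * x_new + b
is_interpolation = min(drinks) <= x_new <= max(drinks)

print(f"r = {r:.2f}")
print(f"y = {m:.1f}x + {b:.1f}")
print(f"Predicted score for {x_new} drinks: {predicted_score:.1f}")
print("Interpolation" if is_interpolation else "Extrapolation")
```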

A bivariate data set contains $10$ data points with the following summary statistics:

$\overline{x}=5.13$ | $s_x=2.85$ | $\overline{y}=18.81$ | $s_y=7.54$ | $r=0.993$ |

Calculate the slope of the least squares regression line.

Give your answer to two decimal places.

Using the rounded value of the previous part, calculate the vertical intercept of the least squares regression line.

Give your answer to two decimal places.

Hence state the equation of the least squares regression line.
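
The lesson does not state them, but the three parts above can be answered with two standard facts about a least squares line written as $\hat{y}=ax+b$: its gradient is $r$ times the ratio of the standard deviations, and it always passes through the point of means $(\overline{x},\overline{y})$. As a worked sketch with the summary statistics given:

$$a = r\,\frac{s_y}{s_x} = 0.993 \times \frac{7.54}{2.85} \approx 2.63$$

$$b = \overline{y} - a\,\overline{x} = 18.81 - 2.63 \times 5.13 \approx 5.32$$

so the line is approximately $\hat{y}=2.63x+5.32$.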

The equation of the least squares regression line for a data set is given by $y=bx-0.44$.

Given that the mean of $x$ is $33$ and the mean of $y$ is $108.46$, solve for the value of $b$.

Given that $s_x=114$ and $s_y=396$, solve for the correlation coefficient $r$.

What is the strength of the relationship?

- A. Moderate negative linear relationship
- B. Weak relationship
- C. Strong negative linear relationship
- D. Strong positive linear relationship
- E. Moderate positive linear relationship
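
A possible way to check this question in Python is sketched below. It assumes two standard facts that the lesson does not state explicitly: the least squares line passes through the point of means, and its gradient equals $r$ multiplied by $s_y$ divided by $s_x$. Note that in this question $b$ is the gradient, not the intercept. The variable names are made up for this example.

```python
# Values given in the question.
x_bar, y_bar = 33, 108.46
intercept = -0.44
s_x, s_y = 114, 396

# The line passes through (x_bar, y_bar): y_bar = b * x_bar + intercept.
b = (y_bar - intercept) / x_bar

# Gradient = r * s_y / s_x, rearranged to give r.
r = b * s_x / s_y

print(f"b = {b:.2f}")
print(f"r = {r:.2f}")
```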

Carry out investigations of phenomena, using the statistical enquiry cycle:

- A. conducting experiments using experimental design principles, conducting surveys, and using existing data sets
- B. finding, using, and assessing appropriate models (including linear regression for bivariate data and additive models for time-series data), seeking explanations, and making predictions
- C. using informed contextual knowledge, exploratory data analysis, and statistical inference
- D. communicating findings and evaluating all stages of the cycle.

Investigate bivariate measurement data