
3.085 Linear regression

Lesson

Least squares regression line

A mathematical way of finding the straight line that best fits a scatterplot is the least-squares regression line. Its equation has the form $y=mx+c$, where $m$ is the gradient of the line and $c$ is the vertical intercept. $y$ is the variable graphed on the vertical axis and $x$ is the variable graphed on the horizontal axis. On a scientific calculator, the equation may appear as $y=A+Bx$ instead, where $A$ is the vertical intercept and $B$ is the gradient.

The gradient and vertical intercept of the least-squares regression line are calculated from the sample means and sample standard deviations of the independent and dependent variables, together with the correlation coefficient between them.

Formulae to calculate the least squares regression line

$y=mx+c$

$m=r\frac{s_y}{s_x}$

$c=\overline{y}-m\overline{x}$

where $s_x=$ the standard deviation of $x$

where $s_y=$ the standard deviation of $y$

where $\overline{x}=$ the mean of $x$

where $\overline{y}=$ the mean of $y$

and $r=$ the correlation coefficient
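To see the formulae in action, here is a minimal Python sketch that applies them directly. The function name `least_squares_line` and the data set are made up for illustration and are not part of this lesson; the sketch assumes the sample standard deviation (dividing by $n-1$) is used for both variables.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (m, c) for y = mx + c using m = r * s_y / s_x and c = y_bar - m * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    # Correlation coefficient r, computed from its definition
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    m = r * s_y / s_x        # gradient
    c = y_bar - m * x_bar    # vertical intercept
    return m, c

# Hypothetical data, invented purely for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, c = least_squares_line(xs, ys)
print(f"y = {m:.2f}x + {c:.2f}")  # prints: y = 0.60x + 2.20
```

Because $c=\overline{y}-m\overline{x}$, the line always passes through the point $(\overline{x},\overline{y})$, which is $(3,4)$ for this data.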

Technology, such as a scientific or graphics calculator or a spreadsheet, can help us find the least-squares regression line very easily so that we do not need to use the formula. In practice, you enter the bivariate data as a set of coordinate pairs, and the line of best fit will be calculated automatically.

Let's look at what the computer or calculator is actually doing.

First, it calculates the vertical distance from each data point to a candidate line. These distances are called residuals. It then squares every residual and adds the squares together to find a total. As the line changes, this total increases or decreases. The line of best fit is the one with the smallest total.
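The calculation just described can be sketched in a few lines of Python. The data set and the guessed line are hypothetical; $0.6$ and $2.2$ are the least-squares gradient and intercept for this particular made-up data set.

```python
def sum_squared_residuals(xs, ys, m, c):
    """Total of the squared vertical distances from each point to the line y = mx + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented purely for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

guess_total = sum_squared_residuals(xs, ys, 1.0, 1.0)  # a guessed line, y = x + 1
best_total = sum_squared_residuals(xs, ys, 0.6, 2.2)   # the least-squares line

print(round(guess_total, 2))  # 4.0
print(round(best_total, 2))   # 2.4

# Nudging the least-squares line in any direction makes the total larger
nudged = [
    sum_squared_residuals(xs, ys, 0.6 + dm, 2.2 + dc)
    for dm in (-0.1, 0.1)
    for dc in (-0.1, 0.1)
]
print(all(total > best_total for total in nudged))  # True
```

The guessed line looks reasonable, but its total of $4$ is larger than the minimum of $2.4$.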


Here is a set of bivariate data:

Here is a line that we will apply least squares regression to:

We find the residuals, square them, and add them together to get a total:

This total can be lowered by moving the line. The lowest total possible comes from this line, the line of best fit for the data:

Experiment with this GeoGebra applet.

  1. Refresh the applet and generate a new scatter plot to experiment with.
  2. Drag the slider to Stage 2 and move the blue dots around the rectangle until you have a line which you think is the Line of Best Fit.
  3. Drag the slider to Stage 3 and you'll see what are called residuals. For now, think of these as the distance between your line and the actual data.
  4. Drag the slider to Stage 4. Here you see the residuals turn into squares. Now move the blue dots on the line around again and try to make the total area of all the squares as small as possible.
  5. Drag the slider to Stage 5. How did you do? Your blue Line of Best Fit should be very close to the green Least Squares Regression Line. And the sum of your squares should be very close to the least sum of the squares.

Regression methods

Least squares regression is only one way of finding the line of best fit. Statisticians will sometimes use other methods depending on the shape of the data (including outliers) and other factors to find the line of best fit.

Summary

Line of best fit - The line which most closely models a set of bivariate data.

Residual - The vertical distance between a data point and a line modelling it.

Least squares regression - A technique for finding the line of best fit involving minimising the sum of the squares of the residuals.

Practice questions

Question 1

A least squares regression line is given by $y=3.59x+6.72$.

  1. State the gradient of the line.

  2. Which of the following is true?

    A. The gradient of the line indicates that the bivariate data set has a positive correlation.

    B. The gradient of the line indicates that the bivariate data set has a negative correlation.

  3. Which of the following is true?

    A. If $x$ increases by $1$ unit, then $y$ increases by $3.59$ units.

    B. If $x$ increases by $1$ unit, then $y$ decreases by $3.59$ units.

    C. If $x$ increases by $1$ unit, then $y$ decreases by $6.72$ units.

    D. If $x$ increases by $1$ unit, then $y$ increases by $6.72$ units.

  4. State the value of the $y$-intercept.

Making predictions

Once we have our least squares regression line, we're ready to start making predictions. Since our line is the best possible fit for the data we have, we can use it as a model to predict the likely value of the dependent variable for a given value of the independent variable.

Interpolation means you have used an $x$ value in your prediction that is within the available range of data that you were working with. Suppose the $x$ values range between $35$ and $98$; then any $x$ value you choose within this range would be considered an interpolation.

Extrapolation means you have used an $x$ value in your prediction that is outside the available range of data. Suppose the $x$ values range between $35$ and $98$; then anything below $35$ or above $98$ would be considered an extrapolation.

It is important to recognise that there are limitations to interpolating and extrapolating depending on the context. Predictions made well outside the range of the data are unreliable, because we have no evidence that the linear trend continues there. Therefore, it is important that you consider the context of the variables and whether the prediction is reasonable or realistic.
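The distinction can be made concrete with a short Python sketch. The function `predict` and the line $y=0.5x+10$ are hypothetical, chosen only for illustration; the range $35$ to $98$ is the one used in the definitions above.

```python
def predict(x, m, c, x_min, x_max):
    """Predict y from the line y = mx + c and label the prediction.

    x_min and x_max are the smallest and largest x values in the data.
    """
    y_hat = m * x + c
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind

# Hypothetical regression line y = 0.5x + 10, with x data ranging from 35 to 98
print(predict(50, 0.5, 10, 35, 98))   # (35.0, 'interpolation')
print(predict(120, 0.5, 10, 35, 98))  # (70.0, 'extrapolation')
```

The label alone does not settle whether a prediction is reliable; the strength of the correlation and the context of the variables still need to be considered.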

Practice question

Question 2

During an alcohol education programme, $10$ adults were offered up to $6$ drinks and were then given a simulated driving test where they scored a result out of a possible $100$.

| Number of drinks ($x$) | $3$ | $2$ | $6$ | $4$ | $4$ | $1$ | $6$ | $3$ | $4$ | $2$ |
|---|---|---|---|---|---|---|---|---|---|---|
| Driving score ($y$) | $66$ | $61$ | $43$ | $58$ | $56$ | $73$ | $31$ | $64$ | $55$ | $62$ |

  1. Using a graphics calculator (or other technology), calculate the correlation coefficient between these variables.

    Give your answer to two decimal places.

  2. Choose the description which best describes the statistical relationship between these two variables.

    A. Strong negative linear relationship

    B. Moderate negative linear relationship

    C. Weak relationship

    D. Moderate positive linear relationship

    E. Strong positive linear relationship

  3. Use your graphics calculator to form an equation for the least squares regression line of $y$ on $x$.

    Give your answer in the form $y=mx+c$. Give all values to one decimal place.

  4. Use your regression line to predict the driving score of a young adult who consumed $5$ drinks.

    Give your answer to one decimal place.

  5. Choose the description which best describes the validity of the prediction in part 4.

    A. Despite a strong correlation, unreliable due to extrapolation.

    B. Despite an interpolated prediction, unreliable due to a moderate to weak correlation.

    C. Very unreliable due to extrapolation and a moderate to weak correlation.

    D. Reliable due to interpolation and a strong correlation.

Outcomes

MA12-8

solves problems using appropriate statistical processes
