A mathematical way of finding a straight line that best fits a scatterplot is the least-squares regression line. The equation of the least-squares regression line has the form $y=mx+c$, where $m$ is the gradient of the line and $c$ is the vertical intercept. $y$ is the variable graphed on the vertical axis and $x$ is the variable graphed on the horizontal axis. On a scientific calculator, it may appear as $y=A+Bx$ instead.
The gradient and vertical intercept of the least-squares regression line come from the sample means and sample standard deviations of the independent and dependent variables, together with the correlation coefficient between them.
$y=mx+c$
$m=r\frac{s_y}{s_x}$
$c=\overline{y}-m\overline{x}$
where $s_x$ is the standard deviation of $x$
$s_y$ is the standard deviation of $y$
$\overline{x}$ is the mean of $x$
$\overline{y}$ is the mean of $y$
and $r$ is the correlation coefficient
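If you have access to Python, these formulas can be applied directly. The sketch below assumes a recent Python (the `statistics.correlation` function needs version 3.10 or later), and the data values are made up purely for illustration:

```python
import statistics

# Made-up bivariate data, for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

x_bar = statistics.mean(x)
y_bar = statistics.mean(y)
s_x = statistics.stdev(x)          # sample standard deviation of x
s_y = statistics.stdev(y)          # sample standard deviation of y
r = statistics.correlation(x, y)   # correlation coefficient (Python 3.10+)

m = r * s_y / s_x                  # gradient
c = y_bar - m * x_bar              # vertical intercept

print(f"y = {m:.2f}x + {c:.2f}")
```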
Technology, such as a scientific or graphics calculator or a spreadsheet, can find the least-squares regression line for us, so we do not need to use the formulas by hand. In practice, you enter the bivariate data as a set of coordinate pairs and the line of best fit is calculated automatically.
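If the technology is a programming environment rather than a calculator, a library routine can return the gradient and intercept in one step. A minimal sketch, assuming numpy is available and reusing the same made-up data; it should agree with the hand-formula result above:

```python
import numpy as np

# The same made-up data, entered as coordinate pairs
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.1), (5, 9.8), (6, 12.2)]
x, y = zip(*points)

# Fit a degree-1 polynomial: np.polyfit returns (gradient, intercept)
m, c = np.polyfit(x, y, 1)
print(f"y = {m:.2f}x + {c:.2f}")
```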
Let's look at what the computer or calculator is actually doing.
First, it calculates the vertical distance from each data point to a candidate line. These distances are called residuals. It then squares every residual and adds them all together to get a total. As the line changes, this total increases or decreases. The line of best fit is the one for which this total is smallest.
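To illustrate what is being minimised, here is a sketch (again assuming numpy and using the same made-up data) that computes the sum of squared residuals for any candidate line and compares a guessed line with the least-squares line:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

def sum_of_squared_residuals(m, c):
    residuals = y - (m * x + c)     # vertical distances to the line y = mx + c
    return np.sum(residuals ** 2)

# A guessed line versus the least-squares line found by np.polyfit
m_fit, c_fit = np.polyfit(x, y, 1)
print(sum_of_squared_residuals(2.5, 0.0))      # total for a candidate line
print(sum_of_squared_residuals(m_fit, c_fit))  # smallest possible total
```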
Try manipulating this applet to find the line of best fit:
Here is a set of bivariate data:
Here is a line that we will apply least squares regression to:
We find the residuals, square them, and add them together to get a total:
This total can be lowered by moving the line. The lowest total possible comes from this line, the line of best fit for the data:
Experiment with this GeoGebra applet.
Least squares regression is only one way of finding the line of best fit. Statisticians sometimes use other methods, depending on the shape of the data, the presence of outliers, and other factors.
Line of best fit - The line which most closely models a set of bivariate data.
Residual - The vertical distance between a data point and a line modelling it.
Least squares regression - A technique for finding the line of best fit involving minimising the sum of the squares of the residuals.
A least squares regression line is given by $y=3.59x+6.72$.
State the gradient of the line.
Which of the following is true?
The gradient of the line indicates that the bivariate data set has a positive correlation.
The gradient of the line indicates that the bivariate data set has a negative correlation.
Which of the following is true?
If $x$ increases by $1$ unit, then $y$ increases by $3.59$ units.
If $x$ increases by $1$ unit, then $y$ decreases by $3.59$ units.
If $x$ increases by $1$ unit, then $y$ decreases by $6.72$ units.
If $x$ increases by $1$ unit, then $y$ increases by $6.72$ units.
State the value of the $y$-intercept.
Once we have our least squares regression line, we're ready to start making predictions. Since our line is the best possible fit for the data we have, we can use it as a model to predict the likely value of the dependent variable for a given value of the independent variable.
Interpolation means the $x$ value used in the prediction lies within the range of the available data. For example, if the $x$ values range between $35$ and $98$, then any $x$ value chosen within this range gives an interpolation.
Extrapolation means the $x$ value used in the prediction lies outside the range of the available data. With $x$ values ranging between $35$ and $98$, anything below $35$ or above $98$ gives an extrapolation.
It is important to recognise that interpolation and extrapolation have limitations that depend on the context. Predictions made well outside the range of the data can be unreliable, so always consider the context of the variables and whether the prediction is reasonable or realistic.
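As a rough sketch of how this range check could be automated, the following uses a hypothetical data set spanning $35$ to $98$, matching the example above; the function name and values are illustrative only:

```python
def classify_prediction(x_value, x_data):
    """Label a prediction as interpolation or extrapolation
    relative to the range of the observed x values."""
    if min(x_data) <= x_value <= max(x_data):
        return "interpolation"
    return "extrapolation"

# Hypothetical x data spanning 35 to 98
x_data = [35, 50, 62, 77, 98]
print(classify_prediction(60, x_data))   # interpolation
print(classify_prediction(120, x_data))  # extrapolation
```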
During an alcohol education programme, $10$ adults were offered up to $6$ drinks and were then given a simulated driving test where they scored a result out of a possible $100$.
Number of drinks ($x$) | $3$ | $2$ | $6$ | $4$ | $4$ | $1$ | $6$ | $3$ | $4$ | $2$ |
---|---|---|---|---|---|---|---|---|---|---|
Driving score ($y$) | $66$ | $61$ | $43$ | $58$ | $56$ | $73$ | $31$ | $64$ | $55$ | $62$ |
Using a graphics calculator (or other technology), calculate the correlation coefficient between these variables.
Give your answer to two decimal places.
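If the "other technology" is a programming environment, a sketch like the following could be used (assuming numpy is available); a graphics calculator's two-variable statistics mode gives the same value:

```python
import numpy as np

drinks = np.array([3, 2, 6, 4, 4, 1, 6, 3, 4, 2])
score  = np.array([66, 61, 43, 58, 56, 73, 31, 64, 55, 62])

# Pearson correlation coefficient, rounded to two decimal places
r = np.corrcoef(drinks, score)[0, 1]
print(round(r, 2))
```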
Choose the description which best describes the statistical relationship between these two variables.
Strong negative linear relationship
Moderate negative linear relationship
Weak relationship
Moderate positive linear relationship
Strong positive linear relationship
Use your graphics calculator to form an equation for the least squares regression line of $y$ on $x$.
Give your answer in the form $y=mx+c$. Give all values to one decimal place.
Use your regression line to predict the driving score of a young adult who consumed $5$ drinks.
Give your answer to one decimal place.
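Continuing the technology sketch from part (a) (again assuming numpy; the same data is re-entered so the snippet stands alone), the regression line and the prediction can be obtained like this:

```python
import numpy as np

drinks = np.array([3, 2, 6, 4, 4, 1, 6, 3, 4, 2])
score  = np.array([66, 61, 43, 58, 56, 73, 31, 64, 55, 62])

# Least squares regression line: np.polyfit returns (gradient, intercept)
m, c = np.polyfit(drinks, score, 1)
print(f"y = {m:.1f}x + {c:.1f}")

# Predicted driving score after 5 drinks; 5 lies inside the
# observed range of 1 to 6 drinks, so this is an interpolation
print(round(m * 5 + c, 1))
```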
Choose the description which best describes the validity of the prediction in part (d).
Despite a strong correlation, unreliable due to extrapolation.
Despite an interpolated prediction, unreliable due to a moderate to weak correlation.
Very unreliable due to extrapolation and a moderate to weak correlation.
Reliable due to interpolation and a strong correlation.