A line of best fit is a straight line that best represents bivariate data on a scatter plot. This line may pass through some of the points, none of the points, or all of the points. Once we have the equation of a line of best fit, we can use it to make predictions.
The most common method of fitting a line of best fit to a scatter plot of bivariate data is using the least squares method. This is then called a least squares line of best fit, or sometimes a Least Squares Regression Line.
In a sentence, it is a technique that minimises the sum of the squares of the vertical distances from the data points to a straight line. Perhaps an easier way to understand it is through a demonstration.
Experiment with this GeoGebra applet.
Refresh the applet and generate a new scatter plot to experiment with.
Drag the slider to Stage 2 and move the blue dots around the rectangle until you have a line which you think is the Line of Best Fit.
Drag the slider to Stage 3 to see what are called residuals. For now, think of these as the vertical distances between the line and the actual data points.
Drag the slider to Stage 4. Here you see the residuals turn into squares. Now move the blue dots on the line around again and try to make the total area of all the squares as small as possible. (Note: depending on the scale, these squares may appear rectangular.)
Drag the slider to Stage 5. Your blue Line of Best Fit should be very close to the green Least Squares Regression Line. And the sum of your squares should be very close to the least sum of the squares.
In Stage 2, a reasonable starting point is to position the line so that roughly equal numbers of data points lie above and below it. In Stage 4, we minimise the total area of all the squares. By Stage 5, the line should then be very close to the green least squares regression line.
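The idea behind Stages 3 to 5 can be sketched in a few lines of Python. This is only an illustration: the data set, the guessed line, and the function names are hypothetical, and the least squares line is found with the standard closed-form result for the slope and intercept rather than by dragging points.

```python
def sum_of_squared_residuals(xs, ys, a, b):
    """Total area of the squares for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Return (a, b), the intercept and slope of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data, made up for this illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

# A line placed "by eye" (like the blue line in the applet).
guess_a, guess_b = 1.5, 1.0
guess_total = sum_of_squared_residuals(xs, ys, guess_a, guess_b)

# The least squares line (like the green line in the applet).
best_a, best_b = least_squares_line(xs, ys)
best_total = sum_of_squared_residuals(xs, ys, best_a, best_b)

print(f"guessed line: sum of squares = {guess_total:.2f}")
print(f"least squares line: sum of squares = {best_total:.2f}")
```

For this made-up data the guessed line gives a larger total than the least squares line, which is what the applet shows visually: no other straight line gives a smaller total area.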
The equation of a least squares line of best fit is the same as the equation of a straight line, but is usually presented in a different form. We may recall the equation of a straight line as y=mx+c or y=ax+b.
However, a least squares line is usually written as y=a+bx or RV=a+b\times EV, where RV stands for the Response variable and EV stands for the Explanatory variable. In this form, a is the vertical intercept and b is the slope.
Plotting a least squares line on a graph is the same as plotting any straight line, and finding the equation of a least squares line from a graph is the same as finding the equation of any straight line from a graph.
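As a small sketch of reading an equation off a graph, suppose two points on the drawn line can be read accurately. The points (2, 7) and (6, 15) and the helper name `line_through` below are hypothetical, chosen only to show the arithmetic in the form y=a+bx.

```python
def line_through(p1, p2):
    """Return (a, b) for the line y = a + b*x through two points."""
    (x1, y1), (x2, y2) = p1, p2
    b = (y2 - y1) / (x2 - x1)  # slope: rise over run
    a = y1 - b * x1            # vertical intercept
    return a, b


# Two hypothetical points read from a line on a graph.
a, b = line_through((2, 7), (6, 15))
print(f"y = {a} + {b}x")  # prints "y = 3.0 + 2.0x"
```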
Most of the time, you will be required to calculate the equation of the Least Squares Regression Line using technology.
Here's a video on how to use the TI-Nspire to create a scatter graph and calculate the equation of the Least Squares Regression Line.
The table shows the number of people who went to watch a movie x weeks after it was released.
\text{Weeks }(x) | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
\text{Number of people }(y) | 37 | 37 | 33 | 33 | 29 | 29 | 25 |
Plot the points from the table.
If a line of best fit were drawn to approximate the relationship, which of the following could be its equation?
Graph the line of best fit whose equation is given by y=-2x+40.
Use the equation of the line of best fit to find the number of people who went to watch the movie 12 weeks after it was released.
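As a rough check on this example, the sketch below fits a least squares line to the table and then uses the given equation y=-2x+40 for the week 12 prediction. The variable and function names are assumptions made for illustration, and in practice a calculator or statistics package would do the fitting.

```python
weeks = [1, 2, 3, 4, 5, 6, 7]
people = [37, 37, 33, 33, 29, 29, 25]

# Least squares slope and intercept from the table.
n = len(weeks)
x_bar = sum(weeks) / n
y_bar = sum(people) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weeks, people))
s_xx = sum((x - x_bar) ** 2 for x in weeks)
b = s_xy / s_xx
a = y_bar - b * x_bar
print(f"least squares line: y = {a:.2f} + ({b:.2f})x")


def predict(x):
    """Line of best fit given in the question: y = -2x + 40."""
    return -2 * x + 40


week_12 = predict(12)
print(f"predicted number of people in week 12: {week_12}")  # 16
```

The fitted slope is -2 and the fitted intercept is about 39.86, so y=-2x+40 is a close approximation to the least squares line. Week 12 lies outside the weeks in the table, so the predicted value of 16 should be treated with some caution.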
The equations of a straight line are as follows:
y=mx+c
y=ax+b
The equation of a least squares regression line is usually written as:
y=a+bx
RV=a+b\times EV
where RV stands for the Response variable and EV stands for the Explanatory variable
If we aren't given the data set but instead have certain statistics calculated from the data set, we can still calculate the equation of the least squares regression line.
We will use the following formulae: y=a+bx,\,b=r\dfrac{s_y}{s_x}, and a=\overline{y}-b\overline{x} where s_x is the standard deviation of x, \,s_y is the standard deviation of y, \,\overline{x} is the mean of x, \,\overline{y} is the mean of y, and r is the correlation coefficient.
A bivariate data set contains 10 data points with the following summary statistics:
\overline{x}=5.13
s_x=2.85
\overline{y}=18.81
s_y=7.54
r=0.993
Calculate the slope of the least squares regression line. Give your answer to two decimal places.
Using the rounded value of the previous part, calculate the vertical intercept of the least squares regression line. Give your answer to two decimal places.
State the equation of the least squares regression line.
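One way to check the three parts of this question is to apply b=r\dfrac{s_y}{s_x} and a=\overline{y}-b\overline{x} directly. The sketch below does this in Python; the variable names are assumptions, and the rounding follows the question (two decimal places, with the rounded slope used for the intercept).

```python
# Summary statistics given in the question.
x_bar, s_x = 5.13, 2.85
y_bar, s_y = 18.81, 7.54
r = 0.993

# Slope, rounded to two decimal places as the question asks.
b = round(r * s_y / s_x, 2)

# Vertical intercept, using the rounded slope from the previous part.
a = round(y_bar - b * x_bar, 2)

print(f"slope b = {b}")      # 2.63
print(f"intercept a = {a}")  # 5.32
print(f"y = {a} + {b}x")     # y = 5.32 + 2.63x
```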
The equation for the line of best fit is given by P=-4t+116, where t is time. This relationship shows that over time, P is:
The following are the formulae for calculating the least squares regression line: y=a+bx \qquad b=r\dfrac{s_y}{s_x}\qquad a=\overline{y}-b\overline{x}
where:
a= the y-intercept
b= the slope
s_x= the standard deviation of x
s_y= the standard deviation of y
\overline{x}= the mean of x
\overline{y}= the mean of y
r= the correlation coefficient