When we display bivariate data that appears to have a linear relationship, we often wish to find a line that best models the relationship so we can see the trend and make predictions. We call this the line of best fit.

Exploration

We want to draw a line of best fit for the following scatterplot:

Let's try drawing three lines across the data and consider which is most appropriate.

We can tell straight away that $A$ is not the right line. This data appears to have a positive linear relationship, but $A$ has a negative gradient. $B$ has the correct sign for its gradient, and it passes through three points! However, there are many more points above the line than below it, and we should try to make sure the line of best fit passes through the centre of all the points. This means that line $C$ is the best fit for this data out of the three lines.

Drawing a line of best fit by eye

One method is to draw an oval around the points on the scatterplot, then cut the oval in half with a line.
The line may pass exactly through all of the points, some of the points, or none of the points.
It always represents the general trend of the data (increasing or decreasing).
The number of points above the line should be approximately the same as the number of points below the line.
Be wary of outliers (points that fall far from the general trend of the rest of the data) as they are highly influential and will skew the line of best fit. An outlier may be removed if it is a single anomaly and you wish to make more reliable predictions for the “majority” of the data. It should be made clear this was done.
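
The balance check described above can be sketched in a few lines of Python. This is only a rough illustration: the six points and the two candidate lines are made-up values, and `count_sides` is a hypothetical helper name rather than anything defined in this lesson.

```python
def count_sides(points, m, c, tol=1e-9):
    """Count points above, below, and on the line y = m*x + c."""
    above = below = on = 0
    for x, y in points:
        diff = y - (m * x + c)  # vertical distance from the line to the point
        if diff > tol:
            above += 1
        elif diff < -tol:
            below += 1
        else:
            on += 1
    return above, below, on


# Made-up data with an increasing trend (an assumption for illustration).
points = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]

print(count_sides(points, 1, 0))   # candidate line y = x
print(count_sides(points, 1, -2))  # candidate line y = x - 2
```

For these values the first line splits the points evenly (three above, three below), while the second leaves all six points above it, so the first is the better candidate. A balanced count alone does not guarantee a good fit; the line should also follow the direction of the trend.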

Below is an example of what a good line of best fit might look like.

Practice questions

Question 1

The following scatter plot shows the data for two variables, $x$ and $y$.

A scatter plot with 8 small, dark solid circular points showing data for two variables $x$ and $y$. The x-axis is labeled "x", ranging from 0 to 10 in increments of 1. The y-axis is labeled "y", ranging from 0 to 10 in increments of 1. The scatter plot has visible, light shade, continuous grid lines for each increment. The points seem to form an upward-sloping pattern. The points are located at $(1,2)$, $(2,1)$, $(3,3)$, $(4,5)$, $(5,6)$, $(6,5)$, $(7,7)$, and $(8,7)$. The coordinates of the plotted points are not explicitly labeled.

Determine which of the following graphs contains the line of best fit.

A scatter plot with 8 small, dark solid circular points showing data for two variables $x$ and $y$. The x-axis is labeled "x", ranging from 0 to 10 in increments of 1. The y-axis is labeled "y", ranging from 0 to 10 in increments of 1. The scatter plot has visible, light shade, continuous grid lines for each increment. The points seem to form an upward-sloping pattern. The points are located at $(1,2)$, $(2,1)$, $(3,3)$, $(4,5)$, $(5,6)$, $(6,5)$, $(7,7)$, and $(8,7)$. A straight bold green line runs diagonally from the lower left of the plot up to the upper right. The plotted points are positioned closely along this line; most are above it but some are below it. The coordinates of the plotted points are not explicitly labeled.

A

A scatter plot with 8 small, dark solid circular points showing data for two variables $x$ and $y$. The x-axis is labeled "x", ranging from 0 to 10 in increments of 1. The y-axis is labeled "y", ranging from 0 to 10 in increments of 1. The scatter plot has visible, light shade, continuous grid lines for each increment. The points seem to form an upward-sloping pattern. The points are located at $(1,2)$, $(2,1)$, $(3,3)$, $(4,5)$, $(5,6)$, $(6,5)$, $(7,7)$, and $(8,7)$. A straight bold green line runs diagonally from the lower left of the plot up to the upper right. The plotted points are positioned closely along this line; some are above it and some are below it. The coordinates of the plotted points are not explicitly labeled.

B

A scatter plot with 8 small, dark solid circular points showing data for two variables $x$ and $y$. The x-axis is labeled "x", ranging from 0 to 10 in increments of 1. The y-axis is labeled "y", ranging from 0 to 10 in increments of 1. The scatter plot has visible, light shade, continuous grid lines for each increment. The points seem to form an upward-sloping pattern. The points are located at $(1,2)$, $(2,1)$, $(3,3)$, $(4,5)$, $(5,6)$, $(6,5)$, $(7,7)$, and $(8,7)$. A straight bold green line runs diagonally from the lower left of the plot up to the upper right. The plotted points are positioned closely along this line; some are above it but most are below it. The coordinates of the plotted points are not explicitly labeled.

C

A scatter plot with 8 small, dark solid circular points showing data for two variables $x$ and $y$. The x-axis is labeled "x", ranging from 0 to 10 in increments of 1. The y-axis is labeled "y", ranging from 0 to 10 in increments of 1. The scatter plot has visible, light shade, continuous grid lines for each increment. The points seem to form an upward-sloping pattern. The points are located at $(1,2)$, $(2,1)$, $(3,3)$, $(4,5)$, $(5,6)$, $(6,5)$, $(7,7)$, and $(8,7)$. A straight bold green line runs diagonally from the lower left of the plot up to the upper right. The plotted points are positioned very closely along this line; some are above it and some are below it. The coordinates of the plotted points are not explicitly labeled.

D

Question 2

The following scatter plot graphs data for the number of copies of a particular book sold at various prices.

A scatter plot with the vertical axis labeled "Copies sold" ranging from $90$ to $190$ in increments of $10$, and the horizontal axis labeled "Price" ranging from $16$ to $36$ in increments of $2$. There are nine data points plotted as black solid dots that generally show a decreasing trend from the top left to the bottom right. The coordinates of the points are as follows: $\left(36,\frac{531}{5}\right)$, $\left(20,\frac{8129}{50}\right)$, $\left(22,\frac{3798}{25}\right)$, $\left(24,\frac{7523}{50}\right)$, $\left(26,\frac{1469}{10}\right)$, $\left(28,\frac{3466}{25}\right)$, $\left(30,\frac{3362}{25}\right)$, $\left(32,\frac{6281}{50}\right)$, and $\left(34,\frac{2824}{25}\right)$. The coordinates are not explicitly labeled.

Determine which of the following graphs contains the line of best fit.

A scatter plot with the vertical axis labeled "Copies sold" ranging from $90$ to $190$ in increments of $10$, and the horizontal axis labeled "Price" ranging from $16$ to $36$ in increments of $2$. There are nine data points plotted as black solid dots that generally show a decreasing trend from the top left to the bottom right. The coordinates of the points are as follows: $\left(36,\frac{531}{5}\right)$, $\left(20,\frac{8129}{50}\right)$, $\left(22,\frac{3798}{25}\right)$, $\left(24,\frac{7523}{50}\right)$, $\left(26,\frac{1469}{10}\right)$, $\left(28,\frac{3466}{25}\right)$, $\left(30,\frac{3362}{25}\right)$, $\left(32,\frac{6281}{50}\right)$, and $\left(34,\frac{2824}{25}\right)$. Although the plotted line is decreasing, it does not align with the trend of the nine data points, as some points are not close to the line. The line and the coordinates of the points are not explicitly labeled.

A

A scatter plot with the vertical axis labeled "Copies sold" ranging from $90$ to $190$ in increments of $10$, and the horizontal axis labeled "Price" ranging from $16$ to $36$ in increments of $2$. There are nine data points plotted as black solid dots that generally show a decreasing trend from the top left to the bottom right. The coordinates of the points are as follows: $\left(36,\frac{531}{5}\right)$, $\left(20,\frac{8129}{50}\right)$, $\left(22,\frac{3798}{25}\right)$, $\left(24,\frac{7523}{50}\right)$, $\left(26,\frac{1469}{10}\right)$, $\left(28,\frac{3466}{25}\right)$, $\left(30,\frac{3362}{25}\right)$, $\left(32,\frac{6281}{50}\right)$, and $\left(34,\frac{2824}{25}\right)$. Although the plotted line is decreasing, it does not align with the trend of the nine data points, as most points are positioned above the line. The line and the coordinates of the points are not explicitly labeled.

B

A scatter plot with the vertical axis labeled "Copies sold" ranging from $90$ to $190$ in increments of $10$, and the horizontal axis labeled "Price" ranging from $16$ to $36$ in increments of $2$. There are nine data points plotted as black solid dots that generally show a decreasing trend from the top left to the bottom right. The coordinates of the points are as follows: $\left(36,\frac{531}{5}\right)$, $\left(20,\frac{8129}{50}\right)$, $\left(22,\frac{3798}{25}\right)$, $\left(24,\frac{7523}{50}\right)$, $\left(26,\frac{1469}{10}\right)$, $\left(28,\frac{3466}{25}\right)$, $\left(30,\frac{3362}{25}\right)$, $\left(32,\frac{6281}{50}\right)$, and $\left(34,\frac{2824}{25}\right)$. Although the plotted line is decreasing, it does not align with the trend of the nine data points, as most points are positioned below the line. The line and the coordinates of the points are not explicitly labeled.

C

A scatter plot with the vertical axis labeled "Copies sold" ranging from $90$ to $190$ in increments of $10$, and the horizontal axis labeled "Price" ranging from $16$ to $36$ in increments of $2$. There are nine data points plotted as black solid dots that generally show a decreasing trend from the top left to the bottom right. The coordinates of the points are as follows: $\left(36,\frac{531}{5}\right)$, $\left(20,\frac{8129}{50}\right)$, $\left(22,\frac{3798}{25}\right)$, $\left(24,\frac{7523}{50}\right)$, $\left(26,\frac{1469}{10}\right)$, $\left(28,\frac{3466}{25}\right)$, $\left(30,\frac{3362}{25}\right)$, $\left(32,\frac{6281}{50}\right)$, and $\left(34,\frac{2824}{25}\right)$. The plotted line follows the trend of the nine data points, as all the data points lie very close to the line. The line and the coordinates of the points are not explicitly labeled.

D
Use the line of best fit to find the number of books that will be sold when the price is $\$33$.

A scatter plot with the vertical axis labeled "Copies sold" ranging from $90$ to $190$ in increments of $10$, and the horizontal axis labeled "Price" ranging from $16$ to $36$ in increments of $2$. There are nine data points plotted as black solid dots that generally show a decreasing trend from the top left to the bottom right. The coordinates of the points are as follows: $\left(36,\frac{531}{5}\right)$, $\left(20,\frac{8129}{50}\right)$, $\left(22,\frac{3798}{25}\right)$, $\left(24,\frac{7523}{50}\right)$, $\left(26,\frac{1469}{10}\right)$, $\left(28,\frac{3466}{25}\right)$, $\left(30,\frac{3362}{25}\right)$, $\left(32,\frac{6281}{50}\right)$, and $\left(34,\frac{2824}{25}\right)$. The line of best fit is plotted and passes through the point $\left(33,\frac{2936}{25}\right)$, which is neither plotted on the graph nor stated in the problem. The coordinates of the points are not explicitly labeled.

$112$
A
$123$
B
$105$
C
$117$
D
Use the line of best fit to find the number of books that will be sold when the price is $\$18$.
$173$
A
$181$
B
$186$
C
$166$
D
Consider the statements below.

Which of the two is most correct?
The relationship between the price of the book and the number of copies sold is positive.
A
The relationship between the price of the book and the number of copies sold is negative.
B

Interpreting the line of best fit

The line of best fit will be of the form $y=mx+c$. We can find the equation of the line approximately by reading the gradient and $y$-intercept from a line we have visually fit to the data, or we can use technology. Alternatively, we may simply be given the equation. From the equation we can ascertain the direction (positive/negative) of the relationship and can also interpret the gradient and vertical intercept in terms of the variables involved. When we are analysing data, it is important that we consider the context.
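
Reading the gradient and $y$-intercept from a line fitted by eye usually means picking two points that the line passes through. A minimal Python sketch of that calculation follows; the helper name `line_through` and the two points $(2,7)$ and $(10,23)$ are hypothetical values chosen for illustration.

```python
def line_through(p1, p2):
    """Return (m, c) for the line y = m*x + c through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no gradient")
    m = (y2 - y1) / (x2 - x1)  # rise over run
    c = y1 - m * x1            # rearrange y1 = m*x1 + c for c
    return m, c


# Two points read off a hand-drawn line (made-up values).
m, c = line_through((2, 7), (10, 23))
print(f"y = {m}x + {c}")
```

Choosing two points that are far apart on the line tends to reduce the effect of small reading errors on the gradient.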

Worked example

The average number of pages read to a child each day and the child’s growing vocabulary are measured, and the data set is given below. Here $y$ represents the vocabulary (the response variable) and $x$ represents the number of pages read per day (the explanatory variable).

Pages read per day ($x$)	$25$	$27$	$29$	$3$	$13$	$31$	$18$	$29$	$29$	$5$
Total vocabulary ($y$)	$402$	$440$	$467$	$76$	$220$	$487$	$295$	$457$	$460$	$106$

The line of best fit, in the form $y=mx+c$, is found to be $y=14.87x+30.26$.
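
The stated equation can be checked with technology. The sketch below assumes the line was obtained by the least squares method and applies the standard least squares formulas to the table above in plain Python; `least_squares` is a hypothetical helper name, not a library function.

```python
pages = [25, 27, 29, 3, 13, 31, 18, 29, 29, 5]               # x values from the table
vocab = [402, 440, 467, 76, 220, 487, 295, 457, 460, 106]   # y values from the table


def least_squares(xs, ys):
    """Return the gradient m and intercept c of the least squares line."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    c = y_mean - m * x_mean  # the line passes through (x_mean, y_mean)
    return m, c


m, c = least_squares(pages, vocab)
print(f"y = {m:.2f}x + {c:.2f}")
```

Rounded to two decimal places, the computed gradient and intercept agree with the equation given in the worked example.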

(a) Interpret the value of the gradient of the line of best fit.

Think: The gradient is the coefficient of $x$, hence $m=14.87$. Since this is a positive number, it indicates that there is a positive relationship between the variables. It also tells us that for each increase of $1$ in the independent variable, the dependent variable increases by approximately $15$.

Do: In the context of this example, this tells us that for each additional page of reading per day, a child's vocabulary increases by approximately $15$ words.

(b) Interpret the vertical intercept of the line of best fit.

Think: The $c$ value provided is the vertical intercept of the line. This value predicts the outcome when the independent variable is zero.

Do: In the context of this example, the vertical intercept tells us that a child who is read no pages per day would have a vocabulary of approximately $30$ words.
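
The same line can be used to make predictions. The sketch below substitutes a value of $x$ into $y=14.87x+30.26$ and reports whether the prediction is an interpolation (inside the range of observed $x$ values) or an extrapolation (outside it). The function names and the query values $20$ and $40$ are assumptions made for illustration.

```python
pages = [25, 27, 29, 3, 13, 31, 18, 29, 29, 5]  # observed x values from the table


def predict_vocabulary(pages_per_day, m=14.87, c=30.26):
    """Predicted vocabulary from the worked example's line of best fit."""
    return m * pages_per_day + c


def is_interpolation(x, observed_xs):
    """True when x lies within the range of the observed data."""
    return min(observed_xs) <= x <= max(observed_xs)


for x in (20, 40):  # made-up query values
    kind = "interpolation" if is_interpolation(x, pages) else "extrapolation"
    print(x, round(predict_vocabulary(x)), kind)
```

Predictions made by extrapolation are generally less reliable, because there is no data to confirm that the linear trend continues outside the observed range.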

Interpreting the least-squares regression line

Given a least squares regression line of the form $y=mx+c$:

The $m$ value shows the gradient:

if the gradient is positive, when the explanatory variable increases by $1$ unit, the response variable increases by $m$ units.
if the gradient is negative, when the explanatory variable increases by $1$ unit, the response variable decreases by $|m|$ units.

The $c$ value shows the vertical intercept (also known as the $y$-intercept):

when the explanatory variable is $0$, the value of the response variable is $c$.

When we interpret the vertical intercept, we need to consider if it makes sense for the explanatory variable to be zero and the response variable to have a value indicated by $c$.
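
These rules can be written as a small Python sketch. The function name `interpret_line` and the example values $m=-2.5$ and $c=40$ are hypothetical; note that a negative gradient is reported as a decrease of its absolute value.

```python
def interpret_line(m, c):
    """Describe the gradient and vertical intercept of y = m*x + c in words."""
    if m > 0:
        trend = f"increases by {m} units"
    elif m < 0:
        trend = f"decreases by {abs(m)} units"
    else:
        trend = "does not change"
    gradient_text = (
        "When the explanatory variable increases by 1 unit, "
        f"the response variable {trend}."
    )
    intercept_text = (
        f"When the explanatory variable is 0, the response variable is {c}."
    )
    return gradient_text, intercept_text


# Made-up line with a negative gradient: y = -2.5x + 40.
for sentence in interpret_line(-2.5, 40):
    print(sentence)
```

The code cannot judge whether an explanatory value of zero makes sense in context; that check still has to be made by the reader.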

Did you know?

Regression is the process of examining the relationship between two or more variables. The most common method for finding a linear model to fit data is called the least squares method. A line of best fit found using this method is called the least squares regression line. To find the equation from data, we can use technology such as a calculator or spreadsheet.

Practice questions

Question 3

A least squares regression line is given by $y=3.59x+6.72$.

State the gradient of the line.
Which of the following is true?
The gradient of the line indicates that the bivariate data set has a positive correlation.
A
The gradient of the line indicates that the bivariate data set has a negative correlation.
B
Which of the following is true?
If $x$ increases by $1$ unit, then $y$ increases by $3.59$ units.
A
If $x$ increases by $1$ unit, then $y$ decreases by $3.59$ units.
B
If $x$ increases by $1$ unit, then $y$ decreases by $6.72$ units.
C
If $x$ increases by $1$ unit, then $y$ increases by $6.72$ units.
D
State the value of the $y$-intercept.

Question 4

Scientists record the number of aphids ($A$) in areas with different numbers of ladybeetles ($L$) in the scatterplot below.

They calculate the line of best fit to be $A=-3.82L+3865.21$.

How much does the average aphid population change by with each extra ladybeetle? Give your answer to the nearest aphid.
What is the average aphid population of a region with no ladybeetles? Give your answer to the nearest aphid.

Question 5

The heights (in cm) and the weights (in kg) of $8$ primary school children are shown on the scattergraph below.

State the $y$-value of the $y$-intercept.
The $y$-intercept indicates that when a child is ____ cm, their average weight is ____ kg.
Does the interpretation in the previous part make sense in this context?
Yes, when the independent variable has a value of zero, this is still within the data range and the value of the dependent variable makes sense.
A
No, when the independent variable has a value of zero, this is outside the data range and the value of the dependent variable does not make sense.
B

Outcomes

4.1.3.2

find the line of best fit by eye

4.1.3.4

interpret relationships in terms of the variables [complex]

4.1.3.6

use the line of best fit to make predictions, both by interpolation and extrapolation [complex]

5.06 Line of best fit
