
5.06 Line of best fit

Lesson

When we display bivariate data that appears to have a linear relationship, we often wish to find a line that best models the relationship so we can see the trend and make predictions. We call this the line of best fit.

 

Exploration

We want to draw a line of best fit for the following scatterplot:

Let's try drawing three lines across the data and consider which is most appropriate.

We can tell straight away that $A$ is not the right line. This data appears to have a positive linear relationship, but $A$ has a negative gradient. $B$ has the correct sign for its gradient, and it passes through three points! However, there are many more points above the line than below it, and we should try to make sure the line of best fit passes through the centre of all the points. This means that line $C$ is the best fit for this data out of the three lines.

 

Drawing a line of best fit by eye

  • One method is to draw an oval around the points on the scatterplot, then cut the oval in half with a line.
  • The line may pass exactly through all of the points, some of the points, or none of the points.
  • It always represents the general trend of the data (increasing or decreasing).
  • The number of points above the line should be approximately the same as the number of points below the line.
  • Be wary of outliers (points that fall far from the general trend of the rest of the data) as they are highly influential and will skew the line of best fit. An outlier may be removed if it is a single anomaly and you wish to make more reliable predictions for the “majority” of the data. It should be made clear this was done.
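
The balance check described in the list above can be sketched in code. This is only an illustration under stated assumptions: the data points and the two candidate lines are made-up values, and `count_sides` is a hypothetical helper name, not something from this lesson.

```python
def count_sides(points, gradient, intercept):
    """Count the points above, below, and exactly on the line y = gradient*x + intercept."""
    above = below = on = 0
    for x, y in points:
        line_y = gradient * x + intercept
        if y > line_y:
            above += 1
        elif y < line_y:
            below += 1
        else:
            on += 1
    return above, below, on


# Made-up data with an increasing trend (for illustration only).
points = [(1, 3.5), (2, 4.5), (3, 7.5), (4, 8.5), (5, 11.5), (6, 12.5)]

# Candidate y = 2x + 0.5 passes through three points, but every other point is above it.
print(count_sides(points, 2, 0.5))  # (3, 0, 3)

# Candidate y = 2x + 1 passes through no points, yet splits them evenly above and below.
print(count_sides(points, 2, 1))    # (3, 3, 0)
```

The second candidate is the better fit by eye even though it touches none of the points, which matches the reasoning used to reject line $B$ in the exploration.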

Below is an example of what a good line of best fit might look like.

 

Practice questions

Question 1

The following scatter plot shows the data for two variables, $x$ and $y$.

  1. Determine which of the following graphs contains the line of best fit.

    A

    B

    C

    D

Question 2

The following scatter plot graphs data for the number of copies of a particular book sold at various prices.

A scatter plot with the vertical axis labeled "Copies sold" ranging from 90 to 190 in increments of 10, and the horizontal axis labeled "Price" ranging from 18 to 36 in increments of 2. Nine data points show a generally decreasing trend from left to right, from the highest number of copies sold at the lowest price to the lowest number of copies sold at the highest price. The points are $\left(20,\frac{8129}{50}\right)$, $\left(22,\frac{3798}{25}\right)$, $\left(24,\frac{7523}{50}\right)$, $\left(26,\frac{1469}{10}\right)$, $\left(28,\frac{3466}{25}\right)$, $\left(30,\frac{3362}{25}\right)$, $\left(32,\frac{6281}{50}\right)$, $\left(34,\frac{2824}{25}\right)$, and $\left(36,\frac{531}{5}\right)$. The coordinates are not explicitly labeled.
  1. Determine which of the following graphs contains the line of best fit.

    The scatter plot above with the line $y=-\frac{1}{2}\left(\frac{92}{25}\right)x+160+\frac{16}{2}\left(\frac{92}{25}\right)$ plotted but not explicitly labeled. Some points are above and some points are below the line, but the line does not follow the trend of the data points.
    A
    The scatter plot above with the line $y=-\frac{92}{25}x+\frac{5972}{25}-5$ plotted but not explicitly labeled. The line follows the trend of the data points, but all points are above the line.
    B
    The scatter plot above with the line $y=-\frac{92}{25}x+\frac{5972}{25}+5$ plotted but not explicitly labeled. The line follows the trend of the data points, but all points are below the line.
    C
    The scatter plot above with the line $y=-\frac{92}{25}x+\frac{5972}{25}$ plotted but not explicitly labeled. The line follows the trend of the data points, and all the data points are plotted very close to the line.
    D
  2. Use the line of best fit to find the number of books that will be sold when the price is $\$33$.

    The scatter plot above with the line of best fit plotted; the equation of the line is not explicitly indicated. The line follows the trend of the data points, and all the data points are plotted very close to it.

    $112$

    A

    $123$

    B

    $105$

    C

    $117$

    D
  3. Use the line of best fit to find the number of books that will be sold when the price is $\$18$.

    $173$

    A

    $181$

    B

    $186$

    C

    $166$

    D
  4. Consider the statements below.

    Which of the two is most correct?

    The relationship between the price of the book and the number of copies sold is positive.

    A

    The relationship between the price of the book and the number of copies sold is negative.

    B

 

Interpreting the line of best fit

The line of best fit will be of the form $y=mx+c$. We can find the equation of the line approximately by reading the gradient and $y$-intercept from a line we have visually fit to the data, or we can use technology. Alternatively, we may simply be given the equation. From the equation we can ascertain the direction (positive/negative) of the relationship and can also interpret the gradient and vertical intercept in terms of the variables involved. When we are analysing data, it is important that we consider the context.
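
One way to read an approximate equation from a line fitted by eye is to pick two points that the drawn line passes through and work out the gradient and intercept from them. The sketch below assumes this two-point approach; `line_through` and the two sample points are hypothetical and chosen only for illustration.

```python
def line_through(p1, p2):
    """Return (gradient, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line cannot be written as y = mx + c")
    gradient = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - gradient * x1     # rearranged from y1 = m*x1 + c
    return gradient, intercept


# Two hypothetical points read off a hand-drawn line.
m, c = line_through((2, 7), (6, 15))
print(f"y = {m}x + {c}")  # y = 2.0x + 3.0
```

Choosing two points that are far apart on the line usually reduces the effect of small reading errors on the gradient.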

Worked example

The average number of pages read to a child each day and the child’s growing vocabulary are measured, and the data set is given below. Here $y$ represents the vocabulary (the response variable) and $x$ represents the number of pages read per day (the explanatory variable).

Pages read per day ($x$) | $25$ | $27$ | $29$ | $3$ | $13$ | $31$ | $18$ | $29$ | $29$ | $5$
Total vocabulary ($y$) | $402$ | $440$ | $467$ | $76$ | $220$ | $487$ | $295$ | $457$ | $460$ | $106$

The line of best fit, in the form $y=mx+c$, is found to be $y=14.87x+30.26$.

(a) Interpret the value of the gradient of the line of best fit.

Think: The gradient is the coefficient of $x$, hence $m=14.87$. Since this is a positive number, it indicates that there is a positive relationship between the variables. It also tells us that for each increase of $1$ in the independent variable, the dependent variable increases by approximately $15$.

Do: In the context of this example, this tells us that for each additional page of reading per day, a child's vocabulary increases by approximately $15$ words.

(b) Interpret the vertical intercept of the line of best fit. 

Think: The $c$ value provided is the vertical intercept of the line. This value predicts the outcome when the independent variable is zero.

Do: In the context of this example, the vertical intercept tells us that a child who is read no pages per day would be predicted to have a vocabulary of approximately $30$ words.
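
The two interpretations above can be checked numerically with the equation from the worked example. This is a small sketch, not part of the original solution: `predict_vocabulary` and `is_interpolation` are hypothetical helper names, and the range limits are simply the smallest and largest $x$ values in the table.

```python
M = 14.87   # gradient from the worked example
C = 30.26   # vertical intercept from the worked example
X_MIN, X_MAX = 3, 31   # smallest and largest "pages read per day" in the table


def predict_vocabulary(pages_per_day):
    """Predicted vocabulary from the line y = 14.87x + 30.26."""
    return M * pages_per_day + C


def is_interpolation(pages_per_day):
    """True when the prediction is made inside the range of the observed data."""
    return X_MIN <= pages_per_day <= X_MAX


# One extra page per day changes the prediction by the gradient.
change = predict_vocabulary(11) - predict_vocabulary(10)
print(round(change, 2))                 # 14.87

# Zero pages per day gives the vertical intercept.
print(round(predict_vocabulary(0), 2))  # 30.26

# x = 0 lies outside the observed data, so this prediction is an extrapolation.
print(is_interpolation(0))              # False
```

Because no child in the data set had fewer than $3$ pages read per day, the intercept is a prediction outside the observed range and should be interpreted with caution.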

 

Interpreting the least-squares regression line

Given a least squares regression line of the form $y=mx+c$:

The $m$ value shows the gradient:

  • if the gradient is positive, when the explanatory variable increases by $1$ unit, the response variable increases by $m$ units.
  • if the gradient is negative, when the explanatory variable increases by $1$ unit, the response variable decreases by $|m|$ units.

The $c$ value shows the vertical intercept (also known as the $y$-intercept):

  • when the explanatory variable is $0$, the value of the response variable is $c$.

When we interpret the vertical intercept, we need to consider if it makes sense for the explanatory variable to be zero and the response variable to have a value indicated by $c$.

Did you know?

Regression is the process of examining the relationship between two or more variables. The most common method for finding a linear model to fit data is called the least squares method. A line of best fit found using this method is called the least squares regression line. To find the equation from data we can use technology, such as a calculator or spreadsheet.
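
As a rough illustration of what such technology computes, the sketch below applies the standard closed-form formulas for a least squares line (gradient = $S_{xy}/S_{xx}$, intercept = $\bar{y}-m\bar{x}$) to the data from the worked example. The function name `least_squares_line` is hypothetical, and a calculator or spreadsheet may organise the calculation differently.

```python
def least_squares_line(xs, ys):
    """Return (gradient, intercept) of the least squares regression line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations in x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    gradient = s_xy / s_xx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


# Data from the worked example table.
pages = [25, 27, 29, 3, 13, 31, 18, 29, 29, 5]
vocabulary = [402, 440, 467, 76, 220, 487, 295, 457, 460, 106]

m, c = least_squares_line(pages, vocabulary)
print(f"y = {m:.2f}x + {c:.2f}")  # y = 14.87x + 30.26
```

Rounded to two decimal places, the result agrees with the equation quoted in the worked example.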

 

Practice questions

Question 3

A least squares regression line is given by $y=3.59x+6.72$.

  1. State the gradient of the line.

  2. Which of the following is true?

    The gradient of the line indicates that the bivariate data set has a positive correlation.

    A

    The gradient of the line indicates that the bivariate data set has a negative correlation.

    B
  3. Which of the following is true?

    If $x$ increases by $1$ unit, then $y$ increases by $3.59$ units.

    A

    If $x$ increases by $1$ unit, then $y$ decreases by $3.59$ units.

    B

    If $x$ increases by $1$ unit, then $y$ decreases by $6.72$ units.

    C

    If $x$ increases by $1$ unit, then $y$ increases by $6.72$ units.

    D
  4. State the value of the $y$-intercept.

Question 4

Scientists record the number of aphids ($A$) in areas with different numbers of ladybeetles ($L$) in the scatterplot below.

They calculate the line of best fit to be $A=-3.82L+3865.21$.


  1. How much does the average aphid population change by with each extra ladybeetle? Give your answer to the nearest aphid.

  2. What is the average aphid population of a region with no ladybeetles? Give your answer to the nearest aphid.

Question 5

The heights (in cm) and the weights (in kg) of $8$ primary school children are shown on the scattergraph below.

  1. State the $y$-value of the $y$-intercept.

  2. The $y$-intercept indicates that when a child is ____ cm, their average weight is ____ kg.

  3. Does the interpretation in the previous part make sense in this context?

    Yes, when the independent variable has a value of zero, this is still within the data range and the value of the dependent variable makes sense.

    A

    No, when the independent variable has a value of zero, this is outside the data range and the value of the dependent variable does not make sense.

    B

 

Outcomes

4.1.3.2

find the line of best fit by eye

4.1.3.4

interpret relationships in terms of the variables [complex]

4.1.3.6

use the line of best fit to make predictions, both by interpolation and extrapolation [complex]
