When we display bivariate data that appears to have a linear relationship, we often wish to find a line that best models the relationship so we can see the trend and make predictions. We call this the line of best fit.
We want to draw a line of best fit for the following scatterplot:
Let's try drawing three lines across the data and consider which is most appropriate.
We can tell straight away that $A$ is not the right line. This data appears to have a positive linear relationship, but $A$ has a negative gradient. $B$ has the correct sign for its gradient, and it passes through three points! However, there are many more points above the line than below it, and we should try to make sure the line of best fit passes through the centre of all the points. This means that line $C$ is the best fit for this data out of the three lines.
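The idea of a line passing through the centre of the points can be sketched by counting how many points lie above and below a candidate line. The points and the two line equations below are hypothetical values chosen for illustration, since the coordinates in the scatterplot are not given.

```python
# Hypothetical points with a positive linear trend (not the lesson's data).
points = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 6), (6, 7)]

def count_sides(points, m, c, tol=1e-9):
    """Count points above, below, and on the line y = m*x + c."""
    above = below = on = 0
    for x, y in points:
        residual = y - (m * x + c)  # vertical distance from the line
        if residual > tol:
            above += 1
        elif residual < -tol:
            below += 1
        else:
            on += 1
    return above, below, on

# Two hypothetical candidate lines, both with a positive gradient.
first_line = count_sides(points, 0.5, 1.5)   # y = 0.5x + 1.5
second_line = count_sides(points, 1.0, 1.0)  # y = x + 1

print("first line  (above, below, on):", first_line)
print("second line (above, below, on):", second_line)
```

In this made-up example the first line leaves every remaining point above it, while the second has the same number of points above as below, so the second is the better visual fit.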
Below is an example of what a good line of best fit might look like.
The following scatter plot shows the data for two variables, $x$ and $y$.
Determine which of the following graphs contains the line of best fit.
The following scatter plot graphs data for the number of copies of a particular book sold at various prices.
Determine which of the following graphs contains the line of best fit.
Use the line of best fit to find the number of books that will be sold when the price is $\$33$.
$112$
$123$
$105$
$117$
Use the line of best fit to find the number of books that will be sold when the price is $\$18$.
$173$
$181$
$186$
$166$
Consider the statements below.
Which of the two is correct?
The relationship between the price of the book and the number of copies sold is positive.
The relationship between the price of the book and the number of copies sold is negative.
The line of best fit will be of the form $y=mx+c$. We can find the equation of the line approximately by reading the gradient and $y$-intercept from a line we have visually fitted to the data, or we can use technology. Alternatively, we may simply be given the equation. From the equation we can ascertain the direction (positive/negative) of the relationship and can also interpret the gradient and vertical intercept in terms of the variables involved. When we are analysing data, it is important that we consider the context.
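As a rough sketch of reading an equation from a visually fitted line, we can pick two points that lie on the drawn line and calculate the gradient and vertical intercept from them. The two points below are hypothetical, and `line_through` is an illustrative helper name rather than part of any library.

```python
def line_through(p1, p2):
    """Return (m, c) for the line y = m*x + c through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)  # gradient = rise / run
    c = y1 - m * x1            # vertical intercept
    return m, c

# Hypothetical points read off a hand-drawn line of best fit.
m, c = line_through((2, 7), (6, 15))

# The sign of the gradient gives the direction of the relationship.
direction = "positive" if m > 0 else "negative" if m < 0 else "none"

print(f"y = {m}x + {c} ({direction} relationship)")
```

The points should be taken from the line itself, not from the data, because the line need not pass through any of the data points.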
The average number of pages read to a child each day and the child’s growing vocabulary are measured, and the data set is given below. Here $y$ represents the vocabulary (the response variable) and $x$ represents the number of pages read per day (the explanatory variable).
Pages read per day ($x$) | $25$ | $27$ | $29$ | $3$ | $13$ | $31$ | $18$ | $29$ | $29$ | $5$ |
---|---|---|---|---|---|---|---|---|---|---|
Total vocabulary ($y$) | $402$ | $440$ | $467$ | $76$ | $220$ | $487$ | $295$ | $457$ | $460$ | $106$ |
The line of best fit, in the form $y=mx+c$, is found to be $y=14.87x+30.26$.
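Assuming this line was found by the least squares method, the calculation can be sketched with the standard formulas $m=\dfrac{\sum(x-\bar{x})(y-\bar{y})}{\sum(x-\bar{x})^2}$ and $c=\bar{y}-m\bar{x}$. The function name `least_squares_line` is illustrative. In practice a calculator or statistics package does this work.

```python
# Data from the table above.
pages = [25, 27, 29, 3, 13, 31, 18, 29, 29, 5]
vocab = [402, 440, 467, 76, 220, 487, 295, 457, 460, 106]

def least_squares_line(xs, ys):
    """Return (m, c) of the least squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx
    c = mean_y - m * mean_x
    return m, c

m, c = least_squares_line(pages, vocab)
print(f"y = {m:.2f}x + {c:.2f}")
```

Rounded to two decimal places, the result should agree with the equation given above.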
(a) Interpret the value of the gradient of the line of best fit.
Think: The gradient is the coefficient of $x$, hence $m=14.87$. Since this is a positive number, it indicates that there is a positive relationship between the variables. It also tells us that for each increase of $1$ in the independent variable, the dependent variable increases by approximately $15$.
Do: In the context of this example, this tells us that for each additional page of reading per day, a child's vocabulary increases by approximately $15$ words.
(b) Interpret the vertical intercept of the line of best fit.
Think: The $c$ value provided is the vertical intercept of the line. This value predicts the outcome when the independent variable is zero.
Do: In the context of this example, the vertical intercept tells us that a child that does no reading would have a vocabulary of approximately $30$ words.
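Both interpretations can be checked numerically with a small sketch that evaluates the line $y=14.87x+30.26$. The function name `predict_vocabulary` and the sample inputs are illustrative choices.

```python
M = 14.87  # gradient of the line of best fit
C = 30.26  # vertical intercept of the line of best fit

def predict_vocabulary(pages_per_day):
    """Predicted vocabulary for a given number of pages read per day."""
    return M * pages_per_day + C

# Vertical intercept: the prediction when no pages are read.
at_zero = predict_vocabulary(0)

# Gradient: the change in the prediction for one extra page per day.
per_page = predict_vocabulary(11) - predict_vocabulary(10)

print(f"Predicted vocabulary at 0 pages: {at_zero:.2f}")
print(f"Change per extra page: {per_page:.2f}")
print(f"Predicted vocabulary at 20 pages: {predict_vocabulary(20):.2f}")
```

The change per extra page is the same whichever two consecutive values of $x$ we compare, because the model is linear.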
Given a least squares regression line of the form $y=mx+c$:
The $m$ value shows the gradient: the change in $y$ for each increase of $1$ in $x$.
The $c$ value shows the vertical intercept (also known as the $y$-intercept): the value of $y$ when $x$ is zero.
When we interpret the vertical intercept, we need to consider if it makes sense for the explanatory variable to be zero and the response variable to have a value indicated by $c$c.
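One simple check, sketched here for the pages-read data from the earlier example, is whether zero lies inside the range of the observed values of the explanatory variable. The helper name `within_data_range` is illustrative.

```python
# Pages read per day, from the vocabulary example.
pages = [25, 27, 29, 3, 13, 31, 18, 29, 29, 5]

def within_data_range(x, xs):
    """True if x lies between the smallest and largest observed values."""
    return min(xs) <= x <= max(xs)

intercept_in_range = within_data_range(0, pages)

print(f"Observed range: {min(pages)} to {max(pages)}")
print(f"Is x = 0 inside the data range? {intercept_in_range}")
```

Here zero falls below the smallest observed value, so the vertical intercept describes a situation the data does not cover and should be interpreted with caution.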
Regression is the process of examining the relationship between two or more variables. The most common method for finding a linear model to fit data is called the least squares method. A line of best fit found using this method is called the least squares regression line. To find the equation from data we can use technology.
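The least squares method chooses the line that makes the sum of the squared vertical distances (residuals) between the points and the line as small as possible. As a minimal sketch, we can compare that sum for the line $y=14.87x+30.26$ from the vocabulary example with the sum for another line. The alternative line $y=15x+20$ is a hypothetical choice made only for comparison.

```python
# Data from the vocabulary example.
pages = [25, 27, 29, 3, 13, 31, 18, 29, 29, 5]
vocab = [402, 440, 467, 76, 220, 487, 295, 457, 460, 106]

def sum_squared_residuals(xs, ys, m, c):
    """Sum of squared vertical distances from the points to y = m*x + c."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

# Line given in the example (coefficients rounded to two decimal places).
sse_given = sum_squared_residuals(pages, vocab, 14.87, 30.26)

# Hypothetical alternative line, for comparison only.
sse_other = sum_squared_residuals(pages, vocab, 15.0, 20.0)

print(f"Given line:       {sse_given:.1f}")
print(f"Alternative line: {sse_other:.1f}")
```

The given line produces the smaller sum. No other straight line can do better than the exact least squares regression line for this data.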
A least squares regression line is given by $y=3.59x+6.72$.
State the gradient of the line.
Which of the following is true?
The gradient of the line indicates that the bivariate data set has a positive correlation.
The gradient of the line indicates that the bivariate data set has a negative correlation.
Which of the following is true?
If $x$ increases by $1$ unit, then $y$ increases by $3.59$ units.
If $x$ increases by $1$ unit, then $y$ decreases by $3.59$ units.
If $x$ increases by $1$ unit, then $y$ decreases by $6.72$ units.
If $x$ increases by $1$ unit, then $y$ increases by $6.72$ units.
State the value of the $y$-intercept.
Scientists record the number of aphids ($A$) in areas with different numbers of ladybeetles ($L$) in the scatterplot below.
They calculate the line of best fit to be $A=-3.82L+3865.21$.
By how much does the average aphid population change with each extra ladybeetle? Give your answer to the nearest aphid.
What is the average aphid population of a region with no ladybeetles? Give your answer to the nearest aphid.
The heights (in cm) and the weights (in kg) of $8$ primary school children are shown on the scattergraph below.
State the $y$-value of the $y$-intercept.
The $y$-intercept indicates that when a child is $\editable{}$ cm, their average weight is $\editable{}$ kg.
Does the interpretation in the previous part make sense in this context?
Yes, when the independent variable has a value of zero, this is still within the data range and the value of the dependent variable makes sense.
No, when the independent variable has a value of zero, this is outside the data range and the value of the dependent variable does not make sense.