In the previous chapter, we learnt to describe the relationship between numerical variables. If there is a relationship, we can recognise when it is linear or non-linear, strong or weak, and positive or negative.
We can draw a line, by eye, that best fits the points in a scatter plot. The images below show some examples of a line of best fit. The line of best fit does not necessarily go through the data points. It is positioned so that it minimises the overall distance from the points.
Constructing a line of best fit in this way can be very useful because the line summarises all of the data points in a way that allows us to predict the value of the response (dependent) variable that corresponds to a given value of the explanatory (independent) variable, or vice-versa. However, it takes a certain amount of judgement to choose the position and angle of the line. Constructing the line by eye is not very reliable, especially if the scatter plot does not show a strong relationship.
Fortunately, there are mathematical calculations that we can use to determine a line of best fit with more consistency. In this lesson, we will learn to construct a least-squares regression line.
The least-squares regression line is a straight line, so it can be represented by a linear function, in gradient-intercept form:
$y=mx+c$
$y$ is the predicted value of the response (dependent) variable
$x$ is our explanatory (independent) variable
$m$ is the gradient of our line
$c$ is the vertical intercept (or $y$-intercept) of our line.
It is common practice to use the symbol $\hat{y}$ (which is pronounced "y-hat"), instead of $y$, as the variable for the predicted value of the response variable.
Unless we are working with a very small data set, the calculations to determine the least-squares regression line are too tedious to do by hand. In this course, we will use technology for these calculations.
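For reference, the calculation that technology carries out can be written compactly. This is the standard least-squares result, given here as background rather than as a method used in this course. The notation $\bar{x}$ and $\bar{y}$ (the means of the $x$ and $y$ values) and the index $i$ are introduced here only for illustration.

$$m=\frac{\sum_{i}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i}(x_i-\bar{x})^2},\qquad c=\bar{y}-m\bar{x}$$

The second formula shows that the least-squares regression line always passes through the point $(\bar{x},\bar{y})$.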
Select the brand of calculator you use below to work through an example of using a calculator to generate a scatter plot and find the equation of the least-squares regression line.
Casio Classpad
How to use the CASIO Classpad to complete the following tasks regarding scatterplots and linear regression.
The average number of pages read to a child each day and the child’s growing vocabulary are measured. Consider the data set given below:
Pages read per day ($x$) | $25$ | $27$ | $29$ | $3$ | $13$ | $31$ | $18$ | $29$ | $29$ | $5$ |
---|---|---|---|---|---|---|---|---|---|---|
Total vocabulary ($y$) | $402$ | $440$ | $467$ | $76$ | $220$ | $487$ | $295$ | $457$ | $460$ | $106$ |
Use your calculator to generate a scatterplot of the data.
Using your calculator, find an equation for the least-squares regression line of $y$ on $x$.
Give your answer in the form $y=ax+b$. Give all values to two decimal places.
TI Nspire
How to use the TI Nspire to complete the following tasks regarding scatterplots and linear regression.
The average number of pages read to a child each day and the child’s growing vocabulary are measured. Consider the data set given below:
Pages read per day ($x$) | $25$ | $27$ | $29$ | $3$ | $13$ | $31$ | $18$ | $29$ | $29$ | $5$ |
---|---|---|---|---|---|---|---|---|---|---|
Total vocabulary ($y$) | $402$ | $440$ | $467$ | $76$ | $220$ | $487$ | $295$ | $457$ | $460$ | $106$ |
Use your calculator to generate a scatterplot of the data.
Using your calculator, find an equation for the least-squares regression line of $y$ on $x$.
Give your answer in the form $y=ax+b$. Give all values to two decimal places.
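As a cross-check on the calculator result, the same line can be computed in a few lines of Python. This is a minimal sketch and not part of the calculator instructions: the function name `least_squares_line` is made up for illustration, and the code assumes the standard least-squares formulas for the gradient and the vertical intercept.

```python
# Pages-read data from the table above.
pages = [25, 27, 29, 3, 13, 31, 18, 29, 29, 5]
vocabulary = [402, 440, 467, 76, 220, 487, 295, 457, 460, 106]


def least_squares_line(xs, ys):
    """Return (gradient, intercept) of the least-squares line of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    gradient = s_xy / s_xx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


a, b = least_squares_line(pages, vocabulary)
print(f"y = {a:.2f}x + {b:.2f}")
```

For this data the gradient comes out close to $15$ and the vertical intercept close to $30$, which should agree with the calculator output once rounded.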
Use technology to find the line of best fit for the data below. Write the equation with the coefficient and constant term rounded to two decimal places.
$x$ | $24$ | $37$ | $19$ | $31$ | $32$ | $22$ | $14$ | $30$ | $23$ | $40$ |
---|---|---|---|---|---|---|---|---|---|---|
$y$ | $-7$ | $-8$ | $-3$ | $-6$ | $-9$ | $-8$ | $-2$ | $-8$ | $-8$ | $-12$ |
Now that we can determine the equation of the least-squares regression line, we are able to make important observations about the original data. When we are analysing data, it is important that we consider the context.
In the pages-read example above, $y$ represents the vocabulary (the response variable) and $x$ represents the number of pages read per day (the explanatory variable).
The $a$ value displayed is the gradient of the least-squares regression line. Since this is a positive number, it indicates that there is a positive relationship between the variables. In the context of this example, this tells us that for each additional page read per day, a child's vocabulary is predicted to increase by approximately $15$ words, on average.
The $b$ value provided is the vertical intercept of the line of best fit and, in the context of this example, tells us that a child who is read no pages per day is predicted to have a vocabulary of approximately $30$ words.
The $a$ value shows the gradient:
The $b$ value shows the vertical intercept (also known as the $y$-intercept):
When we interpret the vertical intercept, we need to consider whether it makes sense for the explanatory variable to be zero and for the response variable to have the value indicated by $b$.
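The interpretation above can be sketched in code. This is an illustrative example only: it uses the rounded values from the pages-read example (a gradient of about $15$ and a vertical intercept of about $30$), and the names `predict_vocabulary` and `in_data_range` are hypothetical helpers rather than calculator functions.

```python
# Rounded line from the pages-read example: y = 15x + 30 (approximate values).
GRADIENT = 15
INTERCEPT = 30

# Observed values of the explanatory variable (pages read per day).
pages = [25, 27, 29, 3, 13, 31, 18, 29, 29, 5]


def predict_vocabulary(pages_per_day):
    """Predicted vocabulary from the rounded regression line."""
    return GRADIENT * pages_per_day + INTERCEPT


def in_data_range(pages_per_day):
    """True if the x-value lies between the smallest and largest observed x."""
    return min(pages) <= pages_per_day <= max(pages)


for x in (0, 10, 20):
    note = "within the data range" if in_data_range(x) else "outside the data range"
    print(f"x = {x}: predicted vocabulary {predict_vocabulary(x)} ({note})")
```

Here $x=0$ lies outside the observed data, since the smallest number of pages in the table is $3$, so the vertical intercept should be interpreted with caution.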
A least-squares regression line is given by $y=3.59x+6.72$.
State the gradient of the line.
Which of the following is true?
The gradient of the line indicates that the bivariate data set has a positive correlation.
The gradient of the line indicates that the bivariate data set has a negative correlation.
Which of the following is true?
If $x$ increases by $1$ unit, then $y$ increases by $3.59$ units.
If $x$ increases by $1$ unit, then $y$ decreases by $3.59$ units.
If $x$ increases by $1$ unit, then $y$ decreases by $6.72$ units.
If $x$ increases by $1$ unit, then $y$ increases by $6.72$ units.
State the value of the $y$-intercept.
The prices of various second-hand Mitsubishi Lancers are shown below.
Age | $1$ | $2$ | $0$ | $5$ | $7$ | $4$ | $3$ | $4$ | $8$ | $2$ |
---|---|---|---|---|---|---|---|---|---|---|
Value (dollars) | $16000$ | $13000$ | $21990$ | $10000$ | $8600$ | $12500$ | $11000$ | $11000$ | $4500$ | $14500$ |
Find the equation of the least-squares regression line for the price ($y$) in terms of age ($x$).
Round all values to the nearest integer.
State the value of the $y$-intercept.
The value of the $y$-intercept indicates that when a Mitsubishi Lancer is brand new, its value is, on average, $\editable{}$ dollars.
Does the interpretation in the previous part make sense in this context?
Yes, when the explanatory variable has a value of zero, this is still within the data range and the value of the dependent variable makes sense.
No, when the explanatory variable has a value of zero, this is outside the data range and the value of the dependent variable does not make sense.
State the gradient of the line.
Which of the following is true?
If the explanatory variable increases by $1$ unit, then the dependent variable increases by $1694$ units.
If the explanatory variable increases by $1$ unit, then the dependent variable increases by $18407$ units.
If the explanatory variable increases by $1$ unit, then the dependent variable decreases by $18407$ units.
If the explanatory variable increases by $1$ unit, then the dependent variable decreases by $1694$ units.
The correlation coefficient was introduced in the previous chapter as a measure that tells us the strength of a relationship between two variables. It is denoted by the letter $r$. The sign of $r$ also tells us the direction of the relationship.
Some key aspects of the correlation coefficient are summarised below, followed by examples that show how we can calculate the correlation coefficient with our CAS calculator.
The strength of the correlation depends on the size of the $r$ value, so we can ignore the positive or negative sign:
Even when two variables have a strong relationship and $r$ is close to $1$ or $-1$, we cannot conclude from this alone that one variable causes change in the other variable.
Select the brand of calculator you use below to work through an example of using a calculator to find the correlation coefficient.
Casio Classpad
How to use the CASIO Classpad to complete the following task to find the correlation coefficient.
A café records the temperature and number of hot chocolates sold on the same day of the week for 10 weeks. Consider the data set given below:
Temperature ($x$ °C) | $14$ | $25$ | $18$ | $19$ | $27$ | $24$ | $16$ | $12$ | $26$ | $22$ |
---|---|---|---|---|---|---|---|---|---|---|
Hot chocolates sold ($y$) | $35$ | $8$ | $20$ | $22$ | $5$ | $23$ | $32$ | $34$ | $8$ | $14$ |
Calculate the value of the correlation coefficient ($r$).
Give your answer to three decimal places.
TI Nspire
How to use the TI Nspire to complete the following task to find the correlation coefficient.
A café records the temperature and number of hot chocolates sold on the same day of the week for 10 weeks. Consider the data set given below:
Temperature ($x$ °C) | $14$ | $25$ | $18$ | $19$ | $27$ | $24$ | $16$ | $12$ | $26$ | $22$ |
---|---|---|---|---|---|---|---|---|---|---|
Hot chocolates sold ($y$) | $35$ | $8$ | $20$ | $22$ | $5$ | $23$ | $32$ | $34$ | $8$ | $14$ |
Calculate the value of the correlation coefficient ($r$).
Give your answer to three decimal places.
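The calculator result can be checked with a short Python sketch. This is only an illustration: the function name `correlation` is an assumption, and the code uses the standard formula for Pearson's correlation coefficient, which is the $r$ value the calculators report.

```python
from math import sqrt

# Hot chocolate data from the table above.
temperature = [14, 25, 18, 19, 27, 24, 16, 12, 26, 22]
hot_chocolates = [35, 8, 20, 22, 5, 23, 32, 34, 8, 14]


def correlation(xs, ys):
    """Pearson's correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


r = correlation(temperature, hot_chocolates)
print(f"r = {r:.3f}")
```

The value is negative because fewer hot chocolates tend to be sold on warmer days, and its size (ignoring the sign) is what indicates the strength of the relationship.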
Given the following data:
$x$ | $1$ | $4$ | $7$ | $10$ | $13$ | $16$ | $19$ |
---|---|---|---|---|---|---|---|
$y$ | $4$ | $4.25$ | $4.55$ | $4.4$ | $4.45$ | $4.75$ | $4.2$ |
Calculate the correlation coefficient and give your answer to two decimal places.
Choose the best description of this correlation.
Moderate negative
Strong positive
Weak negative
Moderate positive
Strong negative
Weak positive
The coefficient of determination is the square of the correlation coefficient ($r$), so it is written $r^2$. Like $r$, it measures the relationship between two variables.
$r^2$ tells us the proportion of the variation in the response variable ($y$) that can be explained by the variation in the explanatory variable ($x$).
For example, if $r^2=0.92$ then we can say that $92\%$ of the variation in the response variable is explained by the variation in the explanatory variable.
Alternatively, we can say that $8\%$ of the variation in the response variable is not explained by the variation in the explanatory variable.
Unlike the correlation coefficient, the coefficient of determination is never negative (it lies between $0$ and $1$), so it does not indicate whether the relationship between the variables is positive or negative.
The closer $r^2$ is to $1$, the more of the variation in the response variable is explained by the variation in the explanatory variable.
If we already have the value of $r$, we can square the value to get $r^2$.
So if $r=0.8$, then $r^2=0.64$.
If $r=-0.9$, then $r^2=0.81$.
If we don't already have the value of $r$, our calculators will calculate $r^2$ for us. In fact, the $r^2$ value is given on the same screen as the $r$ value.
Select the brand of calculator you use below to work through an example of using a calculator to find the coefficient of determination.
Casio Classpad
How to use the CASIO Classpad to complete the following task to find the coefficient of determination.
A café records the temperature and number of hot chocolates sold on the same day of the week for $10$ weeks. Consider the data set given below:
Temperature ($x$ °C) | $14$ | $25$ | $18$ | $19$ | $27$ | $24$ | $16$ | $12$ | $26$ | $22$ |
---|---|---|---|---|---|---|---|---|---|---|
Hot chocolates sold ($y$) | $35$ | $8$ | $20$ | $22$ | $5$ | $23$ | $32$ | $34$ | $8$ | $14$ |
Calculate the value of the coefficient of determination ($r^2$).
Give your answer to three decimal places.
TI Nspire
How to use the TI Nspire to complete the following task to find the coefficient of determination.
A café records the temperature and number of hot chocolates sold on the same day of the week for $10$ weeks. Consider the data set given below:
Temperature ($x$ °C) | $14$ | $25$ | $18$ | $19$ | $27$ | $24$ | $16$ | $12$ | $26$ | $22$ |
---|---|---|---|---|---|---|---|---|---|---|
Hot chocolates sold ($y$) | $35$ | $8$ | $20$ | $22$ | $5$ | $23$ | $32$ | $34$ | $8$ | $14$ |
Calculate the value of the coefficient of determination ($r^2$).
Give your answer to three decimal places.
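One way to see why $r^2$ is described as the proportion of variation explained is to compute it directly from the regression line. The sketch below is illustrative only: the variable names are assumptions, and it relies on the standard identity that, for a least-squares line, $r^2$ equals $1$ minus the ratio of the unexplained variation to the total variation, with both measured as sums of squared deviations.

```python
# Hot chocolate data from the table above.
temperature = [14, 25, 18, 19, 27, 24, 16, 12, 26, 22]
hot_chocolates = [35, 8, 20, 22, 5, 23, 32, 34, 8, 14]

n = len(temperature)
mean_x = sum(temperature) / n
mean_y = sum(hot_chocolates) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temperature, hot_chocolates))
s_xx = sum((x - mean_x) ** 2 for x in temperature)
s_yy = sum((y - mean_y) ** 2 for y in hot_chocolates)

# Least-squares regression line of y on x.
gradient = s_xy / s_xx
intercept = mean_y - gradient * mean_x
predicted = [gradient * x + intercept for x in temperature]

# Total variation in y, and the part the line leaves unexplained.
total_variation = s_yy
unexplained = sum((y - p) ** 2 for y, p in zip(hot_chocolates, predicted))

r_squared = 1 - unexplained / total_variation
print(f"r^2 = {r_squared:.3f}")

# Squaring the correlation coefficient gives the same value.
r_squared_from_r = s_xy ** 2 / (s_xx * s_yy)
```

Both routes give the same number, which is why squaring $r$ is enough once $r$ is known.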
Remember, a value of $r^2$ close to $1$ does not imply that a change in $x$ is causing a change in $y$.
A scientist investigated the link between the number of cancer cells killed by a certain drug and the strength of the drug used. The results were recorded and the coefficient of determination $r^2$ was found to be $0.92$.
Which of the following is true?
Select all that apply.
There is a strong relationship between the strength of the drug used and the cancer cells killed.
The number of cancer cells killed causes the strength of the drug used.
We cannot infer a causal relationship between strength of the drug used and the cancer cells killed.
The strength of the drug used causes the cancer cells to be killed.
There is a weak relationship between the strength of the drug used and the cancer cells killed.
A linear association between two data sets is such that the coefficient of determination $r^2$ is $0.80$.
Calculate the correlation coefficient if the relationship is negative.
Give your answer to four decimal places.
Hence choose the option that best describes the strength of the relationship.
Weak
Moderate
Strong
A linear association between two data sets is such that the correlation coefficient is $-0.72$.
What proportion of the variation can be explained by the linear relationship?
Give your answer to the nearest percent.
The heights (in cm) and the weights (in kg) of $8$ primary school children are shown on the scatter graph below.
Calculate the value of the coefficient of determination.
Give your answer to two decimal places.
Hence or otherwise calculate the value of the correlation coefficient.
Give your answer to two decimal places.
What percentage of the variation in weight is accounted for by the height of the child?
Give your answer to the nearest whole percent.
Consider these two comments on the claim “The weight of a child is primarily influenced by their height.”
Which do you think is most correct?
This claim is valid and is supported by the strong relationship between the two variables.
While this claim is supported by a strong relationship between the two variables, we cannot state causality as there may be other factors influencing the outcome.