When we examine a set of bivariate data, remember that we first examine the data and its graph for the strength of the relationship between the two variables. We do this by calculating the correlation coefficient between the two variables. That is, we want to see how well the dependent variable correlates with the independent variable.
Why do we bother to examine the strength of the relationship between two variables? Usually because we want to use this information to formulate predictions.
For example, we might want to study the effectiveness of spending money on Facebook advertising to attract new clients. We might collect data on a business similar to ours and then use our findings to predict how many clients we could expect to sign up for a monthly advertising spend of $500.
The first thing we need to do in order to use our data to make predictions is to fit some sort of function to the shape of our data. For our purposes, we'll be looking at scatter graphs whose pattern is linear, and we will therefore fit a linear function to our data.
You might remember this as drawing in the Line of Best Fit. The more mathematically correct name is the Least Squares Regression Line, but effectively they're the same thing.
So what is the Least Squares Regression Line?
The best way to understand it is through a demonstration.
Experiment with this GeoGebra applet.
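One way to state the idea in symbols, as a sketch of the standard definition rather than anything taken from the applet: for data points $(x_i, y_i)$ and a candidate line $\hat{y}=ax+b$, each residual is the vertical gap $y_i-(ax_i+b)$ between an observed value and the value the line predicts. The Least Squares Regression Line is the choice of $a$ and $b$ that makes the sum of the squared residuals as small as possible. The symbols $x_i$, $y_i$ and $n$ (the number of data points) are notation introduced here for illustration.

$$\min_{a,\,b}\ \sum_{i=1}^{n}\left(y_i-(ax_i+b)\right)^2$$

Squaring the residuals stops positive and negative gaps from cancelling each other out, and it penalises points that sit far from the line more heavily than points that sit close to it.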
Most of the time, you will be required to calculate the equation of the Least Squares Regression Line using technology.
Here's a great video on how to use the TI-Nspire to create a scatter graph and calculate the equation of the Least Squares Regression Line.
The equation of the Least Squares Regression Line is written in the form $\hat{y}=ax+b$, where:
$\hat{y}$ means the predicted value of $y$
$x$ is our independent variable
$a$ is the slope of our line
$b$ is the $y$-intercept of our line.
Once we have our line, we're ready to start making predictions. Since our line is the best possible fit for the data we have, we can use it as a model to predict the likely value for the dependent variable based on a value for the independent variable that we'd like to predict for.
The process of predicting is two-fold.
Let's watch this next video and see how we can make a prediction once we've input our data and calculated the Least Squares Regression Line.
After you have used your linear model to make a prediction for a particular value for the independent variable, you now need to determine how valid or reliable it is.
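To do this, you need to consider two things:
The strength of the correlation between the two variables.
Whether the prediction is an interpolation or an extrapolation.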
We already know all about correlation and the strength of a relationship. Let's talk about the second point.
Interpolation means you have used an $x$ value in your prediction that is within the available range of data that you were working with. In the video example, the $x$ values range between 35 and 98, so any $x$ value you choose within this range would be considered an interpolation.
Extrapolation means you have used an $x$ value in your prediction that is outside the available range of data. So in the video example, anything below 35 or above 98 would be considered an extrapolation.
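The distinction above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_prediction` and the test values are hypothetical, while the range 35 to 98 is the one quoted for the video example.

```python
def classify_prediction(x_value, x_min, x_max):
    """Label a prediction by whether x_value lies inside the observed x range."""
    if x_min <= x_value <= x_max:
        return "interpolation"
    return "extrapolation"


# Range of x values quoted for the video example in the text.
X_MIN, X_MAX = 35, 98

# Hypothetical x values, chosen only to show both outcomes.
for x_value in (50, 98, 20, 120):
    print(x_value, classify_prediction(x_value, X_MIN, X_MAX))
```

The endpoints are treated as interpolation here, since they are themselves observed data values.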
Let's put it all together.
During an alcohol education program, $10$ adults were offered up to $6$ drinks and were then given a simulated driving test where they scored a result out of a possible $100$.
Number of drinks ($x$) | $3$ | $2$ | $6$ | $4$ | $4$ | $1$ | $6$ | $3$ | $4$ | $2$ |
---|---|---|---|---|---|---|---|---|---|---|
Driving score ($y$) | $66$ | $61$ | $43$ | $58$ | $56$ | $73$ | $31$ | $64$ | $55$ | $62$ |
Using a graphics calculator (or other technology), calculate the correlation coefficient between these variables.
Give your answer to two decimal places.
Choose the description which best describes the statistical relationship between these two variables.
Strong negative linear relationship
Moderate negative linear relationship
Weak relationship
Moderate positive linear relationship
Strong positive linear relationship
Use your graphing calculator to form an equation for the least squares regression line of $y$ on $x$.
Give your answer in the form $y=mx+b$. Give all values to one decimal place.
Use your regression line to predict the driving score of a young adult who consumed $5$ drinks.
Give your answer to one decimal place.
Choose the description which best describes the validity of the prediction in part (d).
Despite a strong correlation, unreliable due to extrapolation.
Despite an interpolated prediction, unreliable due to a moderate to weak correlation.
Very unreliable due to extrapolation and a moderate to weak correlation.
Reliable due to interpolation and a strong correlation.
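If you would like to check your calculator work for this question, the following Python is one possible sketch. It computes the correlation coefficient, the least squares line and the prediction directly from the table. The function name `least_squares` and the variable names are illustrative choices, not part of the original exercise. The prediction uses the coefficients rounded to one decimal place, as the question asks.

```python
import math

# Data copied from the table in the question.
drinks = [3, 2, 6, 4, 4, 1, 6, 3, 4, 2]
scores = [66, 61, 43, 58, 56, 73, 31, 64, 55, 62]


def least_squares(xs, ys):
    """Return (slope, intercept, r) for the least squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = least_squares(drinks, scores)

# Round the coefficients to one decimal place before predicting.
rounded_slope = round(slope, 1)
rounded_intercept = round(intercept, 1)
prediction = rounded_slope * 5 + rounded_intercept

print(f"r = {r:.2f}")
print(f"y = {rounded_slope}x + {rounded_intercept}")
print(f"predicted score for 5 drinks = {prediction:.1f}")
```

Because $5$ drinks lies between the smallest and largest $x$ values in the table, the prediction is an interpolation.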
A bivariate data set contains $10$ data points with the following summary statistics:
$\overline{x}=5.13$ | $s_x=2.85$ | $\overline{y}=18.81$ | $s_y=7.54$ | $r=0.993$ |
Calculate the slope of the least squares regression line.
Give your answer to two decimal places.
Using the rounded value of the previous part, calculate the vertical intercept of the least squares regression line.
Give your answer to two decimal places.
Hence state the equation of the least squares regression line.
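The text does not state the formulas needed here, so the sketch below assumes the standard relationships for a least squares line: the slope is $r\times\frac{s_y}{s_x}$, and the intercept is $\overline{y}$ minus the slope times $\overline{x}$. The variable names are illustrative.

```python
# Summary statistics copied from the question.
mean_x, s_x = 5.13, 2.85
mean_y, s_y = 18.81, 7.54
r = 0.993

# Assumed standard formula: slope = r * s_y / s_x, rounded to two decimal places.
slope = round(r * s_y / s_x, 2)

# The line passes through (mean_x, mean_y); use the rounded slope as instructed.
intercept = round(mean_y - slope * mean_x, 2)

print(f"y = {slope}x + {intercept}")
```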
The equation of the least squares regression line for a data set is given by $y=bx-0.44$.
Given that the mean of $x$ is $33$ and the mean of $y$ is $108.46$, solve for the value of $b$.
Given that $s_x=114$ and $s_y=396$, solve for the correlation coefficient $r$.
What is the strength of the relationship?
moderate negative linear relationship
weak relationship
strong negative linear relationship
strong positive linear relationship
moderate positive linear relationship
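One possible way to check this question is sketched below. It assumes two standard facts that the text does not state: the least squares line always passes through the point $(\overline{x},\overline{y})$, and its slope equals $r\times\frac{s_y}{s_x}$. The variable names are illustrative.

```python
# Values copied from the question.
mean_x, mean_y = 33, 108.46
intercept = -0.44
s_x, s_y = 114, 396

# The line passes through (mean_x, mean_y): mean_y = slope * mean_x + intercept.
slope = (mean_y - intercept) / mean_x

# Rearranging slope = r * s_y / s_x gives r.
r = slope * s_x / s_y

print(f"b = {slope:.2f}")
print(f"r = {r:.2f}")
```

The sign and size of `r` then indicate which description of the relationship applies.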