Standard Level

8.07 Bivariate data with technology

Lesson

This lesson introduces how to use a GDC (graphic display calculator) to carry out bivariate statistical analysis. You could also use a statistical program to work through the exercises. We will look at how to use technology to:

  • Find the least-squares regression line
  • Calculate Pearson's correlation coefficient
  • Make predictions using the least squares regression line

The least-squares regression line

A line of best fit does not necessarily pass through the data points. It is positioned so that the overall distance between the line and the points is as small as possible.

Constructing a line of best fit in this way can be very useful because the line summarises all of the data points in a way that allows us to predict the value of the response (dependent) variable that corresponds to a given value of the explanatory (independent) variable, or vice-versa.  However, it takes a certain amount of judgement to choose the position and angle of the line. Constructing the line by eye is not very reliable, especially if the scatter plot does not show a strong relationship.

Fortunately, there are mathematical calculations that we can use to determine a line of best fit with more consistency.  In this lesson, we will learn to find the line of best fit called the least-squares regression line using technology.


 

Calculating the least-squares regression line

The least-squares regression line is a straight line, so it can be represented by a linear function, in gradient-intercept form:

Least squares regression line notation

$y=mx+c$

$y$ is the predicted value of the response (dependent) variable

$x$ is our explanatory (independent) variable

$m$ is the gradient of our line

$c$ is the vertical intercept (or $y$-intercept) of our line.

 

Unless we are working with a very small data set, the calculations to determine the least-squares regression line are too tedious to do by hand.  
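For reference, the standard formulas for the gradient and intercept of the least-squares regression line are sketched here. The symbols $\bar{x}$ and $\bar{y}$ (the means of the $x$ and $y$ values) and the index $i$ are notation assumed for this sketch and are not used elsewhere in the lesson.

$$m=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2},\qquad c=\bar{y}-m\bar{x}$$

The second formula shows that the least-squares regression line always passes through the point of means $(\bar{x},\bar{y})$.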

Select the brand of calculator you use below to work through an example of using a calculator to generate a scatter plot and find the equation of the least squares regression line.

 

Casio Classpad

How to use the CASIO Classpad to complete the following tasks regarding scatterplots and linear regression.

The average number of pages read to a child each day and the child’s growing vocabulary are measured. Consider the data set given below:

Pages read per day ($x$) | 25 | 27 | 29 | 3 | 13 | 31 | 18 | 29 | 29 | 5
Total vocabulary ($y$) | 402 | 440 | 467 | 76 | 220 | 487 | 295 | 457 | 460 | 106

  1. Use your calculator to generate a scatterplot of the data.

  2. Using your calculator, find an equation for the least-squares regression line of $y$ on $x$.

    Give your answer in the form $y=ax+b$. Give all values to two decimal places.

 

TI Nspire

How to use the TI Nspire to complete the same scatterplot and linear regression tasks, using the pages-read and vocabulary data set from the Casio Classpad example above.

Note: Depending on your calculator, the line of best fit may use different letters for the gradient, $m$, and the $y$-intercept, $c$. The gradient is always the coefficient of $x$, and the $y$-intercept is always the constant term in the equation.
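If you are using a statistical program instead of a calculator, the same regression line can be found with a few lines of code. The following is a minimal Python sketch, not a prescribed method. It applies the standard least-squares formulas (gradient = sum of products of deviations divided by the sum of squared deviations of $x$; the line passes through the point of means) to the pages-read and vocabulary data from the example above. The names `pages`, `vocabulary` and `least_squares_line` are illustrative choices, not part of any calculator or library.

```python
# Data from the pages-read / vocabulary example in this lesson.
pages = [25, 27, 29, 3, 13, 31, 18, 29, 29, 5]
vocabulary = [402, 440, 467, 76, 220, 487, 295, 457, 460, 106]


def least_squares_line(xs, ys):
    """Return (gradient, intercept) of the least-squares line of y on x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    gradient = s_xy / s_xx
    # The line passes through the point of means (x_mean, y_mean).
    intercept = y_mean - gradient * x_mean
    return gradient, intercept


gradient, intercept = least_squares_line(pages, vocabulary)
# Print the equation with both values rounded to two decimal places.
print(f"y = {gradient:.2f}x + {intercept:.2f}")
```

Compare the printed equation with the one from your calculator. The two should agree to two decimal places, although the calculator may label the coefficients differently.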

 

Practice questions

Question 1

Use technology to find the line of best fit for the data below. Write the equation with the coefficient and constant term to two decimal places.

$x$ | 24 | 37 | 19 | 31 | 32 | 22 | 14 | 30 | 23 | 40
$y$ | -7 | -8 | -3 | -6 | -9 | -8 | -2 | -8 | -8 | -12

Question 2

The prices of various second-hand Mitsubishi Lancers are shown below.

Age | 1 | 2 | 0 | 5 | 7 | 4 | 3 | 4 | 8 | 2
Value (dollars) | 16000 | 13000 | 21990 | 10000 | 8600 | 12500 | 11000 | 11000 | 4500 | 14500

  1. Find the equation of the least-squares regression line for the price ($y$) in terms of age ($x$).

    Round all values to the nearest integer.

  2. State the value of the $y$-intercept.

  3. The value of the $y$-intercept indicates that when a Mitsubishi Lancer is brand new, its value is, on average, ______ dollars.

  4. Does the interpretation in the previous part make sense in this context?

    A. Yes, when the explanatory variable has a value of zero, this is still within the data range and the value of the dependent variable makes sense.

    B. No, when the explanatory variable has a value of zero, this is outside the data range and the value of the dependent variable does not make sense.

  5. State the gradient of the line.

  6. Which of the following is true?

    A. If the explanatory variable increases by $1$ unit, then the dependent variable increases by $1694$ units.

    B. If the explanatory variable increases by $1$ unit, then the dependent variable increases by $18407$ units.

    C. If the explanatory variable increases by $1$ unit, then the dependent variable decreases by $18407$ units.

    D. If the explanatory variable increases by $1$ unit, then the dependent variable decreases by $1694$ units.

 

Correlation coefficient ($r$r)

The correlation coefficient was introduced in a previous lesson as a measure that tells us the strength of a relationship between two variables.  It is denoted by the letter $r$r.  The sign of $r$r also tells us the direction of the relationship.

Some key aspects of the correlation coefficient are summarised below, followed by examples that show how we can calculate the correlation coefficient with our GDC.

 

Correlation coefficient
  • A perfect positive correlation has a value of $1$. If we graphed the variables on a scatter plot, all the data points would lie exactly on a straight line with a positive gradient.
  • A perfect negative correlation has a value of $-1$, in which case all data points lie exactly on a straight line with a negative gradient.
  • A value of $0$ indicates that there is no linear relationship between the variables.
  • Values between these can be described in more detail using terms like "weak", "moderate" and "strong".

The strength of a correlation depends on the size of $r$, so we can ignore the positive or negative sign:

  • if the size is less than $0.5$, then we have a weak correlation;
  • if the size is between $0.5$ and $0.8$, then we have a moderate correlation;
  • if the size is greater than $0.8$, then we have a strong correlation.
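The cut-offs above can be sketched as a small Python function. This is an illustrative sketch only: the function name `describe_correlation` is hypothetical, and treating a size of exactly $0.5$ or $0.8$ as "moderate" is an assumption, since the lesson does not say which category the boundary values belong to.

```python
def describe_correlation(r):
    """Describe r using this lesson's cut-offs of 0.5 and 0.8."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "no linear correlation"
    size = abs(r)  # strength ignores the sign of r
    if size < 0.5:
        strength = "weak"
    elif size <= 0.8:
        # Assumption: the boundary values 0.5 and 0.8 count as moderate.
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


print(describe_correlation(-0.92))
print(describe_correlation(0.3))
```

The sign of $r$ is only used for the direction, which matches the idea that strength depends on the size of $r$ alone.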

 

Important - correlation is not causation! 

Even when two variables have a strong relationship and $r$ is close to $1$ or $-1$, we cannot say that one variable causes change in the other variable.

 

Select the brand of calculator you use below to work through an example of using a calculator to find the correlation coefficient.

Casio Classpad

How to use the CASIO Classpad to complete the following task to find the correlation coefficient.

A café records the temperature and number of hot chocolates sold on the same day of the week for 10 weeks. Consider the data set given below:

Temperature ($x$, in $^\circ$C) | 14 | 25 | 18 | 19 | 27 | 24 | 16 | 12 | 26 | 22
Hot chocolates sold ($y$) | 35 | 8 | 20 | 22 | 5 | 23 | 32 | 34 | 8 | 14

  1. Calculate the value of the correlation coefficient ($r$).

    Give your answer to three decimal places.

 

TI Nspire

How to use the TI Nspire to find the correlation coefficient for the same temperature and hot chocolate data set used in the Casio Classpad example above.
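As an alternative to a calculator, Pearson's correlation coefficient for the café data can be computed directly. The sketch below is a minimal Python version using the standard formula (sum of products of deviations divided by the square root of the product of the two sums of squared deviations). The names `temperature`, `hot_chocolates` and `pearson_r` are illustrative, not taken from any calculator or library.

```python
import math

# Data from the café example in this lesson.
temperature = [14, 25, 18, 19, 27, 24, 16, 12, 26, 22]
hot_chocolates = [35, 8, 20, 22, 5, 23, 32, 34, 8, 14]


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for paired data."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(temperature, hot_chocolates)
# Print r rounded to three decimal places, as the task asks.
print(f"r = {r:.3f}")
```

The sign of the result gives the direction of the relationship, and its size gives the strength.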

 

Practice question

Question 3

Given the following data:

$x$ | 1 | 4 | 7 | 10 | 13 | 16 | 19
$y$ | 4 | 4.25 | 4.55 | 4.4 | 4.45 | 4.75 | 4.2

  1. Calculate the correlation coefficient and give your answer to two decimal places.

  2. Choose the best description of this correlation.

    A. Moderate negative

    B. Strong positive

    C. Weak negative

    D. Moderate positive

    E. Strong negative

    F. Weak positive

 

Making predictions

Once we have determined the least-squares regression line, we can use it as a model to predict the likely value of the response variable based on a given value of the explanatory variable. Let's look at some examples to see how we can make a prediction efficiently using a GDC once we've input our data and calculated the least-squares regression line. Select the brand of calculator you use below to work through an example.

 

Casio Classpad

How to use the CASIO Classpad to complete the following tasks regarding linear regression and making predictions.

Consider the data obtained from a chemical process where the yield of the process is thought to be linearly related to the reaction temperature.

Temperature ($x$, in $^\circ$C) | 50 | 53 | 54 | 56 | 59 | 62 | 65 | 67 | 71 | 74
Yield ($y$, in g) | 122 | 118 | 128 | 125 | 136 | 144 | 142 | 149 | 161 | 168

  1. Using your calculator, find an equation for the least-squares regression line of $y$ on $x$.

    Give your answer in the form $y=ax+b$. Give all values to two decimal places.

  2. What yield is predicted at a reaction temperature of $60^\circ$C?

    Give your answer to the nearest gram.

  3. What yield is predicted at a reaction temperature of $85^\circ$C?

    Give your answer to the nearest gram.

  4. What temperature gives a predicted yield of $155$ g?

    Give your answer to one decimal place.

 

TI Nspire

How to use the TI Nspire to complete the same linear regression and prediction tasks, using the temperature and yield data set from the Casio Classpad example above.
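The prediction steps for the chemical process data can also be sketched in Python. This is a hedged illustration rather than the calculator's own procedure: it fits the line with the standard least-squares formulas, substitutes a temperature to predict a yield, and rearranges the equation to find the temperature for a target yield. The helper names (`least_squares_line`, `predict_yield`, `temperature_for_yield`, `is_interpolation`) are hypothetical. The interpolation check simply tests whether a temperature lies within the range of the recorded temperatures.

```python
# Data from the chemical process example in this lesson.
temperature = [50, 53, 54, 56, 59, 62, 65, 67, 71, 74]
yield_grams = [122, 118, 128, 125, 136, 144, 142, 149, 161, 168]


def least_squares_line(xs, ys):
    """Return (gradient, intercept) of the least-squares line of y on x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    gradient = s_xy / s_xx
    return gradient, y_mean - gradient * x_mean


gradient, intercept = least_squares_line(temperature, yield_grams)


def predict_yield(temp):
    """Predicted yield (g) at a given temperature: y = gradient * x + intercept."""
    return gradient * temp + intercept


def temperature_for_yield(target_yield):
    """Rearrange the line to find x for a given y."""
    return (target_yield - intercept) / gradient


def is_interpolation(temp):
    """True if temp lies within the range of the recorded temperatures."""
    return min(temperature) <= temp <= max(temperature)


for temp in (60, 85):
    kind = "interpolation" if is_interpolation(temp) else "extrapolation"
    print(f"{temp} degrees C: {predict_yield(temp):.0f} g ({kind})")

print(f"155 g: {temperature_for_yield(155):.1f} degrees C")
```

The sketch keeps the unrounded gradient and intercept when predicting, so rounding happens only once, in the final answer.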

 

Practice questions

Question 4

Research on the number of cigarettes smoked during pregnancy and the birth weights of the newborn babies was conducted.

Average number of cigarettes per day ($x$) | 46.30 | 13.00 | 21.40 | 25.00 | 8.60 | 36.50 | 1.00 | 17.90 | 10.60 | 13.40 | 37.30 | 18.50
Birth weight in kilograms ($y$) | 3.90 | 5.80 | 5.00 | 4.80 | 5.50 | 4.50 | 7.00 | 5.10 | 5.50 | 5.10 | 3.80 | 5.70

  1. Using a graphics calculator (or other technology), calculate the correlation coefficient between the average number of cigarettes per day and birth weight.

    Give your answer to three decimal places.

  2. Choose the description which best describes the statistical relationship between these two variables.

    A. Strong negative linear relationship

    B. Moderate positive linear relationship

    C. Weak relationship

    D. Strong positive linear relationship

    E. Moderate negative linear relationship

  3. Use your graphics calculator to form an equation for the least-squares regression line of $y$ on $x$.

    Give all values to two decimal places. Give the equation of the line in the form $y=mx+b$.

  4. Use your regression line to predict the birth weight of a newborn whose mother smoked on average $5$ cigarettes per day.

    Give your answer to two decimal places.

  5. Choose the description which best describes the validity of the prediction in part 4.

    A. Despite an interpolated prediction, unreliable due to a moderate to weak correlation.

    B. Reliable due to interpolation and a strong correlation.

    C. Very unreliable due to extrapolation and a moderate to weak correlation.

    D. Despite a strong correlation, unreliable due to extrapolation.


 
