5.08 Bivariate data with technology

Lesson

In this chapter's Investigation we learnt how to find the correlation coefficient and the line of best fit using spreadsheets. In this lesson you will use those processes to answer more questions.
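For readers working outside a spreadsheet, the line-of-best-fit calculation can be sketched in Python. This is a minimal illustration under stated assumptions, not the method from the Investigation: the function name `line_of_best_fit` and the sample data are made up for the example, and the line is assumed to be the usual least-squares line that a spreadsheet linear trendline reports.

```python
from statistics import mean

def line_of_best_fit(xs, ys):
    """Return (gradient, intercept) of the least-squares line y = m*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of products of deviations, and sum of squared x deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx            # gradient
    b = y_bar - m * x_bar    # the line passes through (x_bar, y_bar)
    return m, b

# Hypothetical data, not taken from the lesson.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = line_of_best_fit(xs, ys)
print(f"y = {m:.2f}x + {b:.2f}")  # prints: y = 0.60x + 2.20
```

Rounding is applied only when the equation is displayed, so the stored gradient and intercept keep full precision for later calculations.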

 

Practice question

Question 1

Use technology to find the line of best fit for the data below. Write the equation with the coefficient of $x$ and the constant term rounded to two decimal places.

$x$ | $24$ | $37$ | $19$ | $31$ | $32$ | $22$ | $14$ | $30$ | $23$ | $40$
$y$ | $-7$ | $-8$ | $-3$ | $-6$ | $-9$ | $-8$ | $-2$ | $-8$ | $-8$ | $-12$

Correlation coefficient ($r$)

The correlation coefficient is a measure of the strength of the linear relationship between two variables. It is denoted by the letter $r$. The sign of $r$ also tells us the direction of the relationship.

Some key aspects of the correlation coefficient are summarised below.

 

Correlation coefficient
  • A perfect positive correlation has a value of $1$. If we graphed the variables on a scatter plot, all the data points would lie exactly on a straight line with a positive gradient.
  • A perfect negative correlation has a value of $-1$, in which case all data points lie exactly on a straight line with a negative gradient.
  • A value of $0$ indicates that there is no linear relationship between the variables.
  • To be more descriptive about the relationship between variables, we can describe values of $r$ between these as "weak", "moderate" or "strong".
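The two perfect cases in the list above can be checked numerically. The sketch below assumes $r$ is the standard (Pearson) correlation coefficient; the function name `correlation_coefficient` and both data sets are invented for illustration.

```python
from math import sqrt
from statistics import mean

def correlation_coefficient(xs, ys):
    """Correlation coefficient r of paired data (assumed Pearson formula)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical points lying exactly on straight lines.
xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]     # positive gradient
falling = [10 - 3 * x for x in xs]   # negative gradient

print(round(correlation_coefficient(xs, rising), 6))   # 1.0
print(round(correlation_coefficient(xs, falling), 6))  # -1.0
```

Real data rarely lies exactly on a line, so in practice $r$ falls strictly between $-1$ and $1$.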

The strength of the correlation depends on the size of the $r$ value, so we can ignore the positive or negative sign:

  • if the size is less than $0.5$, then we have a weak correlation;
  • if the size is between $0.5$ and $0.8$, then we have a moderate correlation;
  • if the size is greater than $0.8$, then we have a strong correlation.
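These thresholds translate directly into a small helper. This is only a sketch: the name `describe_correlation` is hypothetical, and placing the boundary values $0.5$ and $0.8$ themselves in the "moderate" band is an assumption, since the lesson does not say which band they belong to.

```python
def describe_correlation(r):
    """Describe r using the lesson's cut-offs of 0.5 and 0.8."""
    if r == 0:
        return "no linear relationship"
    size = abs(r)  # strength ignores the sign
    if size < 0.5:
        strength = "weak"
    elif size <= 0.8:  # assumption: boundary values count as moderate
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"

print(describe_correlation(-0.9))   # strong negative
print(describe_correlation(0.3))    # weak positive
print(describe_correlation(-0.65))  # moderate negative
```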

 

Important - correlation is not causation!

Even when two variables have a strong relationship and $r$ is close to $1$ or $-1$, we cannot say that one variable causes change in the other variable.

 

Practice question

Question 2

Given the following data:

$x$ | $1$ | $4$ | $7$ | $10$ | $13$ | $16$ | $19$
$y$ | $4$ | $4.25$ | $4.55$ | $4.4$ | $4.45$ | $4.75$ | $4.2$
  1. Calculate the correlation coefficient and give your answer to two decimal places.

  2. Choose the best description of this correlation.

    A. Moderate negative
    B. Strong positive
    C. Weak negative
    D. Moderate positive
    E. Strong negative
    F. Weak positive

 

Making predictions

Once we have determined the equation of the line of best fit, we can use it as a model to predict the likely value of the response variable based on a given value of the explanatory variable. 
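As a rough sketch of this idea, the code below substitutes a value of the explanatory variable into a fitted line and reports whether the prediction is an interpolation (inside the range of the observed $x$ values) or an extrapolation (outside it). The gradient, intercept, data and the names `predict` and `is_interpolation` are all hypothetical.

```python
def predict(m, b, x):
    """Predicted response for explanatory value x on the line y = m*x + b."""
    return m * x + b

def is_interpolation(x, observed_xs):
    """True if x lies within the range of the observed explanatory values."""
    return min(observed_xs) <= x <= max(observed_xs)

# Hypothetical fitted line and observed x values, not taken from the lesson.
m, b = 0.6, 2.2
observed_xs = [1, 2, 3, 4, 5]

for x in (3.5, 8):
    kind = "interpolation" if is_interpolation(x, observed_xs) else "extrapolation"
    print(f"x = {x}: predicted y = {predict(m, b, x):.2f} ({kind})")
```

A prediction is generally more trustworthy when it is an interpolation and the correlation is strong; an extrapolation assumes the linear pattern continues beyond the data that was collected.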

Practice question

Question 3

A study examined the average number of cigarettes smoked per day during pregnancy and the birth weights of the newborn babies.

Average number of cigarettes per day ($x$) | $46.20$ | $13.60$ | $21.60$ | $25.00$ | $9.20$ | $37.50$ | $1.20$ | $17.60$ | $10.60$ | $13.00$ | $36.80$ | $19.40$
Birth weight in kilograms ($y$) | $4.00$ | $5.90$ | $4.90$ | $4.90$ | $5.70$ | $4.40$ | $7.10$ | $5.10$ | $5.40$ | $5.20$ | $3.90$ | $5.90$
  1. Using technology, calculate the correlation coefficient between the average number of cigarettes per day and birth weight.

    Give your answer to three decimal places.

  2. Choose the description which best describes the statistical relationship between these two variables.

    A. Strong negative linear relationship
    B. Weak relationship
    C. Strong positive linear relationship
    D. Moderate positive linear relationship
    E. Moderate negative linear relationship

  3. Use your spreadsheet to form an equation for the line of best fit of $y$ on $x$.

    Give all values to two decimal places. Give the equation of the line in the form $y=mx+b$.

  4. Use your equation to predict the birth weight of a newborn whose mother smoked on average $5$ cigarettes per day.

    Give your answer to two decimal places.

  5. Choose the description which best describes the validity of the prediction in part 4.

    A. Despite an interpolated prediction, unreliable due to a moderate to weak correlation.
    B. Despite a strong correlation, unreliable due to extrapolation.
    C. Very unreliable due to extrapolation and a moderate to weak correlation.
    D. Reliable due to interpolation and a strong correlation.

 

Outcomes

4.1.3.3

use technology to find the line of best fit [complex]

4.1.3.5

use technology to find the correlation coefficient (an indicator of the strength of linear association) [complex]
