
1.05 Correlation using r-values

Lesson

Introduction

In addition to describing the correlation between two variables using words, we can also calculate the correlation as a number, which we call the r-value. By calculating this value, we can be more precise with our description of correlation.

Pearson's correlation coefficient

Pearson's correlation coefficient is a value that tells you the strength of the linear relationship between two variables. It is denoted by the letter r. It indicates how closely a scatterplot conforms to a straight line.
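The lesson does not state a formula for r, so the sketch below uses the standard definition: the sum of the products of the deviations from the means, divided by the square root of the product of the sums of squared deviations. The function name pearson_r and the hours/scores data are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient r for paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: hours studied (x) and test score (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]
r = pearson_r(hours, scores)
print(round(r, 3))
```

In practice a calculator or statistics package would normally do this calculation; the point here is only to show what goes into the number.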

The value of r ranges from -1 to 1 on a continuum like this.

Figure: A number line from -1 to 1 describing correlation.

If the r-value is 0, we say there is no correlation. If the r-value is 1 or -1 we say the correlation is perfect.

We looked at examples of the different descriptions of correlation, such as positive, negative, weak and strong, in the previous lesson.

A weak correlation indicates there is some correlation, but it is not considered to be very significant. Values from 0 to 0.5 or from -0.5 to 0 are generally considered weak.

A strong correlation indicates that the connection between the variables is quite significant. Values from approximately 0.8 to 1 or from -1 to -0.8 are strong.

A moderate correlation falls between weak and strong. Values from approximately 0.5 to 0.8 or from -0.8 to -0.5 are considered moderate.

Figure: A number line with numbers indicating different strengths and directions of correlation.
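The bands above can be turned into a small lookup. This is a minimal sketch: the function names are hypothetical, and because the lesson gives the cut-offs only approximately, the choice to count exactly 0.5 as moderate and exactly 0.8 as strong is an assumption.

```python
def describe_strength(r):
    """Classify an r-value using the approximate bands from this lesson."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # strength depends only on the distance from 0
    if size == 0:
        return "no correlation"
    if size == 1:
        return "perfect"
    # Assumption: the boundary values 0.5 and 0.8 go to the stronger band.
    if size < 0.5:
        return "weak"
    if size < 0.8:
        return "moderate"
    return "strong"


def describe_direction(r):
    """Return the direction of the relationship from the sign of r."""
    if r > 0:
        return "positive"
    if r < 0:
        return "negative"
    return "none"


for value in (0.93, -0.68, 0.1):
    print(value, describe_strength(value), describe_direction(value))
```

Note that the sign of r only affects the direction, so 0.6 and -0.6 are equally strong.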

Exploration

Play with this applet to see how the correlation coefficient changes. Move each point. Try having one outlier and see how much that can change the correlation coefficient. Try moving the points so they are in a perfect straight line. What happens to the correlation coefficient value?


The closer the points are to being in a straight line, the closer r is to 1 or -1.

If the points are trending upwards from left to right, the correlation coefficient is positive. If the points are trending downwards from left to right, the correlation coefficient is negative.

Three key observations when commenting on the relationship between bivariate data:

  1. State the direction of the relationship. Use the words positive or negative. (Think about the gradient of the line).

  2. Describe the strength of the relationship. Use the r-value to determine if the relationship is perfect, weak, moderate, strong or no correlation.

  3. State the shape of the relationship. Pearson's correlation coefficient gives a measure of how close the points are to being a straight line, so we almost always use the word linear. It is possible for two variables to be related in a non-linear way. For example, the scatterplot may resemble a parabola more than it resembles a line. If there seems to be a pattern but it does not look like a line we say the relationship appears to be non-linear.

Examples

Example 1

A pair of data sets have a correlation coefficient of \dfrac{1}{10} while a second pair of data sets have a correlation coefficient of \dfrac{3}{5}.

Choose the correct statement:

A
The first pair of data sets have a stronger correlation.
B
The second pair of data sets have a stronger correlation.
Worked Solution
Create a strategy

Compare the correlation coefficients. The pair of data sets whose coefficient is closer to 1 or -1 has the stronger correlation.

Apply the idea

\dfrac{3}{5} is closer to 1 than \dfrac{1}{10}. So the second pair of data sets have a stronger correlation.

The answer is option B.

Example 2

The scatter diagram shows data of the height of an object after it is pushed off a rooftop as a function of time.

Figure: Scatter diagram with x-axis ticks from 1 to 9 and y-axis ticks from 100 to 900.

a

Which type of model is appropriate for the data?

A
Linear
B
Quadratic
Worked Solution
Create a strategy

Consider the shape the points lie in.

Apply the idea

The points in the diagram lie in the shape of a parabola, so a quadratic model is appropriate. The answer is option B.

b

The most likely value of Pearson’s correlation coefficient (r) for this set of data is

A
0.93
B
-0.68
C
-0.11
D
0.34
Worked Solution
Create a strategy

Consider the direction of the data from left to right, and how closely the points lie to a straight line.

Apply the idea

While this data set appears to be non-linear, it is quite close to lying in a straight line. Since the data trends downwards from left to right, the correlation coefficient should be negative. So the value should be between -0.5 and -1.

So option B, -0.68, is the answer.

Idea summary

Three key observations when commenting on the relationship between bivariate data:

  1. State the direction of the relationship.

  2. Describe the strength of the relationship.

  3. State the shape of the relationship, either linear or non-linear.

Correlation and causation

If we determine that there is some correlation between variables, we can make conclusions about the scenario that is being modelled. However, we can only draw conclusions based on the data and do not want to assume anything about the relationship itself.

For this reason, when we make conclusions we should be careful to use wording that describes the data. For example, if there is a strong negative correlation between two variables, we can draw the conclusion that: "As the explanatory variable increases, the response variable decreases".

Even when two variables have a strong relationship and r is close to 1 or -1, we cannot say from the correlation alone that one variable causes change in the other variable. If asked "does change in the explanatory variable cause change in the response variable?" we always write "No - correlation is not causation".

A strong correlation might seem to indicate a cause-and-effect relationship between the variables. However, we need to be careful to understand the situation, as this is not always the case.

These are common reasons for correlation between variables without a causal relationship:

  • Confounding due to a common response to another variable (also described as contributing variables), e.g. sales of ice-creams and sunscreens have a strong positive correlation because they both increase in response to hot summer weather.

  • Coincidence. It is possible that the data we are analysing shows a correlation purely by chance. Correlations of this kind are often called spurious correlations.

  • The causation is in the opposite direction, e.g. strong winds are correlated with tree branches waving. The waving branches don't cause the strong winds; it's the other way around.

When we are asked to analyse a relationship between variables, we should consider whether a causal relationship can be justified. If not, we should say so, and identify possible non-causal reasons for the association.

Examples

Example 3

A research study determines that there is a causal relationship between smoking and getting cancer. Will there be correlation between smoking and getting cancer?

A
Yes
B
No
C
Not enough information
Worked Solution
Create a strategy

If there is a causal relationship between two variables, it means that the explanatory variable has a direct effect on the response variable.

Apply the idea

Since there is a causal relationship between smoking and getting cancer, there will be a correlation between them. The answer is option A.

Example 4

A study found a strong correlation between the approximate number of pirates out at sea and the average world temperature.

a

Does this mean that the number of pirates out at sea has an impact on world temperature?

Worked Solution
Apply the idea

Just because there is a correlation does not mean there is causation. It would not make sense that pirates being out at sea would affect world temperature. The answer is no.

b

Which of the following is the most likely explanation for the strong correlation?

A
Contributing variables - there are other causal relationships and variables that come into play and these may lead to an indirect positive association between the approximate number of pirates out at sea and the average world temperature.
B
Coincidence - there are no other contributing factors or reasonable arguments to be made for the strong positive association between the approximate number of pirates out at sea and the average world temperature.
Worked Solution
Create a strategy

Think about whether there are any common factors between the variables.

Apply the idea

There do not seem to be any common factors that have a direct relationship with both the number of pirates out at sea and the average world temperature. It is more reasonable to put it down to coincidence. The answer is option B.

c

Which of the following is demonstrated by the strong correlation between the approximate number of pirates out at sea and the average world temperature?

A
If there is correlation between two variables, then there must be causation.
B
If there is correlation between two variables, there isn't necessarily causation.
C
If there is correlation between two variables, then there is no causation.
Worked Solution
Apply the idea

It demonstrates that if there is correlation between two variables, there isn't necessarily causation. The answer is option B.

Idea summary

These are common reasons for correlation between variables without a causal relationship:

  • The variables have a common response to another variable.

  • Coincidence.

  • The causation is in the opposite direction.

Outcomes

ACMGM053

describe an association between two numerical variables in terms of direction (positive/negative), form (linear/non-linear) and strength (strong/moderate/weak)

ACMGM056

use a scatterplot to identify the nature of the relationship between variables

ACMGM064

recognise that an observed association between two variables does not necessarily mean that there is a causal relationship between them

ACMGM065

identify possible non-causal explanations for an association, including coincidence and confounding due to a common response to another variable, and communicate these explanations in a systematic and concise manner
