In addition to describing the correlation between two variables using words, we can also calculate the correlation as a number, which we call the r-value. By calculating this value, we can be more precise with our description of correlation.
Pearson's correlation coefficient is a value that tells you the strength of the linear relationship between two variables. It is denoted by the letter r. It indicates how closely a scatterplot conforms to a straight line.
The value of r lies on a continuum from -1 to 1.
If the r-value is 0, we say there is no correlation. If the r-value is 1 or -1, we say the correlation is perfect.
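The lesson does not state how r is calculated, so the following is only a minimal sketch of the standard definition: the sum of the products of the deviations from the means, divided by the square root of the product of the sums of squared deviations. The function name pearson_r and the data (hours and scores) are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations from each mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of the squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for this example
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]

r = pearson_r(hours, scores)
print(round(r, 2))  # about 0.98, a value close to 1
```

In practice a calculator or statistics package would normally be used to find r; the sketch is only meant to show where the number comes from.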
We looked at examples of the different descriptions of correlation, such as positive, negative, weak and strong, in the previous lesson.
A weak correlation indicates there is some correlation but it is not considered to be very significant. Values from 0 to 0.5 or from -0.5 to 0 are generally considered weak.
A strong correlation indicates that the connection between the variables is quite significant. Values from approximately 0.8 to 1 or from -1 to -0.8 are strong.
A moderate correlation falls between weak and strong. Values from approximately 0.5 to 0.8 or from -0.8 to -0.5 are considered moderate.
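The approximate bands above can be turned into a small helper. This is a sketch only: the function name describe_strength is made up, and because the lesson's bands share their endpoints, the choice to label exactly 0.5 as moderate and exactly 0.8 as strong is an assumption.

```python
def describe_strength(r):
    """Label an r-value using the approximate bands from this lesson."""
    size = abs(r)  # strength ignores the sign; the sign gives the direction
    if size == 0:
        return "no correlation"
    if size == 1:
        return "perfect"
    if size < 0.5:
        return "weak"
    if size < 0.8:  # assumption: exactly 0.5 is treated as moderate
        return "moderate"
    return "strong"  # assumption: exactly 0.8 is treated as strong

print(describe_strength(1 / 10))  # weak
print(describe_strength(3 / 5))   # moderate
print(describe_strength(-0.9))    # strong
```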
Play with this applet to see how the correlation coefficient changes. Move each point. Try having one outlier and see how much that can change the correlation coefficient. Try moving the points so they are in a perfect straight line. What happens to the correlation coefficient value?
The closer the points are to being in a straight line, the closer r is to 1 or -1.
If the points are trending upwards from left to right, the correlation coefficient is positive. If the points are trending downwards from left to right, the correlation coefficient is negative.
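These observations can also be checked numerically. The sketch below uses made-up points and a hypothetical helper pearson_r (the standard formula for r, which the lesson itself does not state). It compares a perfect upward line, a perfect downward line, and the upward line with one point dragged far away as an outlier.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Standard formula for Pearson's r (helper written for this example)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6]
line_up = [2 * x + 1 for x in xs]     # points exactly on an upward line
line_down = [10 - 3 * x for x in xs]  # points exactly on a downward line
with_outlier = line_up[:-1] + [0]     # last point moved far below the line

r_up = pearson_r(xs, line_up)
r_down = pearson_r(xs, line_down)
r_outlier = pearson_r(xs, with_outlier)

print(round(r_up, 2))       # 1.0
print(round(r_down, 2))     # -1.0
print(round(r_outlier, 2))  # much closer to 0 than to 1
```

With only six points, a single outlier is enough to take r from 1 to a value near 0, which is why it is worth looking at the scatterplot as well as the number.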
Three key observations when commenting on the relationship between bivariate data:
State the direction of the relationship. Use the words positive or negative. (Think about the gradient of the line.)
Describe the strength of the relationship. Use the r-value to decide whether the correlation is perfect, strong, moderate or weak, or whether there is no correlation.
State the shape of the relationship. Pearson's correlation coefficient measures how close the points are to lying on a straight line, so we almost always use the word linear. It is possible for two variables to be related in a non-linear way. For example, the scatterplot may resemble a parabola more than it resembles a line. If there seems to be a pattern but it does not look like a line, we say the relationship appears to be non-linear.
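The point about shape can be illustrated with a short sketch. The data are made up: the points lie exactly on the parabola y = x squared, so the two variables are clearly related, yet r comes out as 0 because the relationship is not linear. The helper pearson_r is a hypothetical name for the standard formula, which is not given in this lesson.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Standard formula for Pearson's r (helper written for this example)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # every point lies exactly on a parabola

r_parabola = pearson_r(xs, ys)
print(round(r_parabola, 2))  # 0.0, even though y is completely determined by x
```

An r-value near 0 therefore means there is no linear relationship; it does not rule out a non-linear one.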
A pair of data sets have a correlation coefficient of \dfrac{1}{10} while a second pair of data sets have a correlation coefficient of \dfrac{3}{5}.
Choose the correct statement:
The scatter diagram shows data of the height of an object after it is pushed off a rooftop as a function of time.
Which type of model is appropriate for the data?
The most likely value of Pearson’s correlation coefficient (r) for this set of data is
Three key observations when commenting on the relationship between bivariate data:
State the direction of the relationship.
Describe the strength of the relationship.
State the shape of the relationship, either linear or non-linear.
If we determine that there is some correlation between variables, we can make conclusions about the scenario that is being modelled. However, we can only draw conclusions based on the data and do not want to assume anything about the relationship itself.
For this reason, when we make conclusions we should be careful to use wording that describes the data. For example, if there is a strong negative correlation between two variables, we can draw the conclusion that: "As the explanatory variable increases, the response variable decreases".
Even when two variables have a strong relationship and r is close to 1 or -1, we cannot say that one variable causes change in the other variable. If asked "does change in the explanatory variable cause change in the response variable?" we always write "No - correlation is not causation".
A strong correlation might seem to indicate a cause-and-effect relationship between the variables. However, we need to be careful to understand the situation, as this is not always the case.
These are common reasons for correlation between variables without a causal relationship:
Confounding due to a common response to another variable (also described as contributing variables), e.g. sales of ice-creams and sunscreens have a strong positive correlation because they both increase in response to hot summer weather.
Coincidence. It is possible that the data we are analysing shows a correlation purely by chance. Collections of graphs of variables with such spurious correlations can be found online.
The causation is in the opposite direction, e.g. strong winds are correlated with tree branches waving. But the waving branches don't cause the strong winds; it's the other way around.
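The first reason in the list above can be illustrated with a small sketch. All of the numbers are invented for this example: daily temperatures and the ice-cream and sunscreen sales that might go with them. Both sales figures rise with temperature, so they end up strongly correlated with each other even though neither causes the other. The helper pearson_r is a hypothetical name for the standard formula for r.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Standard formula for Pearson's r (helper written for this example)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: each list has one value per day
temperature = [18, 21, 24, 27, 30, 33, 36]  # degrees Celsius
ice_cream = [40, 47, 50, 61, 64, 75, 78]    # ice-creams sold
sunscreen = [12, 13, 19, 20, 26, 27, 33]    # bottles of sunscreen sold

r_sales = pearson_r(ice_cream, sunscreen)
print(round(r_sales, 2))  # strong positive correlation between the two sales figures

# Each sales figure is also strongly correlated with the common variable
print(round(pearson_r(temperature, ice_cream), 2))
print(round(pearson_r(temperature, sunscreen), 2))
```

The strong r-value between the two sales figures describes the data only; the more plausible explanation is the shared response to temperature.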
When we are asked to analyse a relationship between variables, we should consider whether a causal relationship can be justified. If not, we should say so, and identify possible non-causal reasons for the association.
A researcher determines that there is a causal relationship between smoking and getting cancer. Will there be a correlation between smoking and getting cancer?
A study found a strong correlation between the approximate number of pirates out at sea and the average world temperature.
Does this mean that the number of pirates out at sea has an impact on world temperature?
Which of the following is the most likely explanation for the strong correlation?
Which of the following is demonstrated by the strong correlation between the approximate number of pirates out at sea and the average world temperature?
These are common reasons for correlation between variables without a causal relationship:
The variables have a common response to another variable.
Coincidence.
The causation is in the opposite direction.