As we have seen, correlation is a way of expressing the relationship between two variables, in particular how strongly the two variables are related.
In this chapter we will be looking at linear relationships and measuring the strength of a linear correlation between variables by a quantity called the Pearson correlation coefficient (or just correlation coefficient). This coefficient is given the symbol r, and takes a value between -1 and +1.
The correlation coefficient is a value that describes the strength and direction of a linear relationship between two variables.
The value of the correlation coefficient ranges from -1 to +1, where -1 describes a perfect negative linear correlation and +1 describes a perfect positive linear correlation. Every other case corresponds to a value between these two extremes, with 0 describing no linear correlation.
We further divide up this range of values to indicate other strengths of correlation, using descriptions of weak, moderate and strong (for both positive and negative correlations).
If the correlation coefficient takes a value between 0 and 1, then it describes a positive correlation:
A value of r close to +1 indicates a strong positive linear correlation
A value of r that is positive but closer to 0 indicates a weak positive linear correlation
If the correlation coefficient takes a value between -1 and 0, then it describes a negative correlation:
A value of r close to -1 indicates a strong negative linear correlation
A value of r that is negative but closer to 0 indicates a weak negative linear correlation
If the correlation coefficient is 0, or very close to 0, it indicates that there is no linear correlation between the variables. This may be because the variables are unrelated, or it might be that they have a non-linear relation instead.
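To illustrate the second possibility, here is a minimal sketch that computes r directly from its standard definition for made-up points lying exactly on the curve y = x². The helper name `pearson_r` and the data are illustrative assumptions, not part of this chapter.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: y is completely determined by x, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = pearson_r(xs, ys)
print(round(r_curve, 2))
```

Here r comes out as 0 even though y depends entirely on x, because the positive and negative deviations cancel. A value of r near 0 rules out a linear pattern only; a scatterplot is still needed to check for a curved one.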
Even when two variables have a strong relationship and r is close to 1 or -1, the correlation alone does not let us say that one variable causes change in the other. If asked "Does a change in the explanatory variable cause a change in the response variable?", we cannot answer "yes" on the strength of r alone: correlation is not causation.
For example, it has been shown that there is a strong, positive, linear relationship between sunglasses sales and ice-cream cone sales. But we cannot say that sunglasses sales cause ice-cream cone sales. There is a third variable at work here: as the temperature rises, both sales figures rise with it. Temperature is called a confounding variable.
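The effect of a confounding variable can be sketched with invented numbers. In the example below, every figure is hypothetical: both sales lists are built from the temperature list plus a small fixed wobble, and neither sales list is computed from the other. The helper `pearson_r` is an illustrative name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical daily temperatures (degrees Celsius), invented for illustration.
temperature = [18, 21, 24, 27, 30, 33]

# Each sales figure depends only on temperature plus a small fixed wobble.
sunglasses = [3 * t + w for t, w in zip(temperature, [2, -3, 1, -2, 3, -1])]
ice_creams = [5 * t + w for t, w in zip(temperature, [-4, 3, -2, 4, -1, 2])]

r_sales = pearson_r(sunglasses, ice_creams)
print(round(r_sales, 2))
```

The two sales lists end up strongly positively correlated even though, by construction, the only thing linking them is temperature.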
Coincidence is another plausible reason for an association. It is possible to find unrelated variables that happen to be strongly correlated. For example, per capita consumption of cheese and deaths from being strangled by a bedsheet have been shown to have a strong correlation, but we cannot say that one causes the other. The most plausible explanation is coincidence. Correlations like this are called spurious correlations.
Explore this applet to see how the correlation coefficient changes. Move each point. Try having one outlier and see how much that can change the correlation coefficient. Try moving the points so they are in a perfect straight line. What happens to the correlation coefficient value?
The closer the points are to being in a straight line, the closer r is to 1 or -1.
If the points are trending upwards from left to right, the correlation coefficient is positive. If the points are trending downwards from left to right, the correlation coefficient is negative.
Identify the correlation between the temperature and the number of heaters sold.
A study found a strong correlation between the approximate number of pirates out at sea and the average world temperature.
Does this mean that the number of pirates out at sea has an impact on world temperature?
Which of the following is the most likely explanation for the strong correlation?
Which of the following is demonstrated by the strong correlation between the approximate number of pirates out at sea and the average world temperature?
The calculation required to determine r is tedious to do by hand, but is easily done using technology. To do so, we enter the raw data into two separate lists, one for each variable, then perform a linear regression analysis. This produces a number of values, though the only one we are interested in right now is the r value.
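As a rough sketch of what such a linear regression analysis produces, the Python code below takes two lists and returns the slope and intercept of the least-squares line together with r. The function name `linear_regression_summary`, the list names and the data are all made up for illustration; a calculator or spreadsheet will present the same quantities in its own layout.

```python
import math

def linear_regression_summary(xs, ys):
    """Return (slope, intercept, r) for the least-squares line
    y = slope * x + intercept fitted to two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Two lists of made-up raw data, one per variable.
list_1 = [5, 10, 15, 20, 25]
list_2 = [40, 33, 24, 18, 10]

slope, intercept, r = linear_regression_summary(list_1, list_2)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, r = {r:.3f}")
```

Of the three values printed, only r is needed to describe the strength and direction of the linear correlation.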
For the graph depicted, choose the correlation coefficient that best represents it.
Given the following data:
| x | 1 | 4 | 7 | 10 | 13 | 16 | 19 |
|---|---|---|---|---|---|---|---|
| y | 4 | 4.25 | 4.55 | 4.4 | 4.45 | 4.75 | 4.2 |
Calculate the correlation coefficient and give your answer to two decimal places.
Choose the best description of this correlation.
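If a calculator is not to hand, one possible way to check the calculation for the table above is the short Python sketch below. It uses the x and y values from the table and shows the intermediate sums that go into r; the variable names are illustrative choices.

```python
import math

# Data copied from the table above.
x = [1, 4, 7, 10, 13, 16, 19]
y = [4, 4.25, 4.55, 4.4, 4.45, 4.75, 4.2]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Intermediate sums built from deviations about the means.
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 2))
```

The sign of the printed value gives the direction of the correlation, and its distance from 0 gives the strength.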
The correlation coefficient, r, tells us the strength and direction of the linear correlation between two variables.
If r is negative, the direction of the correlation is negative. If r is positive, the direction of the correlation is positive.