As we have seen, a correlation is a way of expressing a relationship between two variables - in particular, how strongly pairs of variables are related.
In this chapter we will be looking at linear relationships and measuring the strength of a linear correlation between variables by a quantity called the Pearson correlation coefficient (or just correlation coefficient). This coefficient is given the symbol $r$, and takes a value between $-1$ and $+1$.
The correlation coefficient is a value that describes the strength and direction of a linear relationship between two variables.
The value of the correlation coefficient varies from $-1$ to $+1$, where $-1$ describes perfect negative correlation and $+1$ describes perfect positive correlation. Any other type of correlation corresponds to a value between these two extremes, with $0$ describing no correlation.
We further divide up this range of values to indicate other strengths of correlation, using descriptions of weak, moderate and strong (for both positive and negative correlations).
If the correlation coefficient takes a value between $0$ and $1$, then it describes a positive correlation.
If the correlation coefficient takes a value between $-1$ and $0$, then it describes a negative correlation.
If the correlation coefficient is $0$, or very close to $0$, it indicates that there is no linear correlation between the variables. This may be because the variables are unrelated, or it might be that they have a non-linear relation instead.
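To see how a non-linear relationship can still give $r = 0$, here is a minimal Python sketch. The data (values of $y = x^2$ for $x$ symmetric about zero) and the helper name `pearson_r` are illustrative assumptions, not part of the lesson.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Illustrative data: y is completely determined by x, but not linearly
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)
```

Here the positive and negative contributions cancel, so $r$ comes out as zero even though $y$ depends entirely on $x$. A value of $r$ near $0$ rules out a linear pattern only, which is why a scatter plot is still worth checking.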
Even when two variables have a strong relationship and $r$ is close to $1$ or $-1$, we cannot say that one variable causes change in the other variable. If asked "does change in the independent variable cause change in the dependent variable?" we always write "No - correlation is not causation".
For example, it has been shown that there is a strong, positive, linear relationship between sunglasses sales and ice-cream cone sales. But we cannot say that sunglasses sales cause ice-cream cone sales. There is a third variable at work here: an increase in temperature causes both variables to increase. Temperature is called a confounding variable.
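The effect of a confounding variable can be sketched in Python. Every number below is hypothetical and made up for illustration: both sales lists are built from temperature plus a small fixed wobble, and neither list is computed from the other. The helper name `pearson_r` is also an assumption.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical daily temperatures (the confounding variable)
temps = [15, 18, 21, 24, 27, 30, 33]

# Hypothetical sales: each depends on temperature only, plus a small wobble
wobble_a = [1, -2, 2, -1, 0, 2, -2]
wobble_b = [-3, 2, -1, 3, -2, 1, 0]
sunglasses = [2 * t + w for t, w in zip(temps, wobble_a)]
ice_creams = [3 * t + w for t, w in zip(temps, wobble_b)]

r = pearson_r(sunglasses, ice_creams)
print(round(r, 2))
```

The two sales lists end up strongly positively correlated even though the only thing linking them is temperature.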
Coincidence is also a plausible reason an association occurs. It's possible to find variables with unlikely strong correlations. For example, per capita consumption of cheese and deaths from being strangled by a bedsheet have been shown to have a strong correlation. But we cannot say that one causes the other! It is a coincidence.
Identify the correlation between the temperature and the number of heaters sold.
A positive correlation
A negative correlation
A study found a strong correlation between the approximate number of pirates out at sea and the average world temperature.
Does this mean that the number of pirates out at sea has an impact on world temperature?
Which of the following is the most likely explanation for the strong correlation?
Contributing variables - there are other causal relationships and variables that come into play and these may lead to an indirect positive association between the approximate number of pirates out at sea and the average world temperature.
Coincidence - there are no other contributing factors or reasonable arguments to be made for the strong positive association between the approximate number of pirates out at sea and the average world temperature.
Which of the following is demonstrated by the strong correlation between the approximate number of pirates out at sea and the average world temperature?
If there is correlation between two variables, then there must be causation.
If there is correlation between two variables, there isn't necessarily causation.
If there is correlation between two variables, then there is no causation.
Explore this applet to see how the correlation coefficient changes. Move each point. Try having one outlier and see how much that can change the correlation coefficient. Try moving the points so they are in a perfect straight line. What happens to the correlation coefficient value?
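The same experiment can be sketched in Python for readers without the applet. The points and the helper name `pearson_r` are illustrative assumptions: six points are placed on a straight line, then the last point is dragged far off the line to act as an outlier.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5, 6]
ys_line = [2 * x + 1 for x in xs]   # points exactly on a straight line
ys_outlier = ys_line[:-1] + [0]     # same points, last one moved down to 0

r_line = pearson_r(xs, ys_line)
r_outlier = pearson_r(xs, ys_outlier)
print(round(r_line, 2), round(r_outlier, 2))
```

With these particular points, $r$ falls from $1$ to roughly $0.07$ when a single point is moved, which shows how sensitive the correlation coefficient can be to an outlier in a small data set.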
The calculation required to determine $r$ is very tricky to do by hand, but can be easily done using technology. To do so, we enter the raw data into two separate lists, then perform a linear regression analysis. This will calculate a number of values, though the only one we are interested in right now is the $r$ value.
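For reference, the quantity the technology computes is $r = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$, where $\bar{x}$ and $\bar{y}$ are the means of the two lists. As a rough sketch of the "two lists" workflow, the Python below applies this formula directly. The two lists are a small made-up data set, and the function name `pearson_r` is an assumption; a calculator or spreadsheet would report the same value.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys):
        raise ValueError("both lists must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the summed squared deviations
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up raw data entered as two separate lists
list_1 = [1, 2, 3, 4, 5]
list_2 = [2, 4, 5, 4, 5]

r = pearson_r(list_1, list_2)
print(round(r, 2))
```

For this made-up data the deviations give a numerator of $6$ and a denominator of $\sqrt{10 \times 6}$, so $r \approx 0.77$.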
The table shows the number of fans sold at a store during days of various temperatures.
|Number of fans sold|$12$|$13$|$14$|$17$|$18$|$19$|$21$|$23$|
Consider the correlation coefficient $r$ for temperature and number of fans sold. In what range will $r$ be?
Is there a causal relationship?
For the graph depicted, choose the correlation coefficient that best represents it.
A study found that the correlation coefficient between the heights of women and the probability of being turned down for a promotion was $-0.90$.
Which is the most appropriate statement?
There is no evidence of a linear relationship between heights of women and probability of being turned down for a promotion.
As the height of women increases, the probability of being turned down for a promotion increases.
As the height of women increases, the probability of being turned down for a promotion decreases.
Given the following data:
Calculate the correlation coefficient and give your answer to two decimal places.
Choose the best description of this correlation.
Pose and solve problems involving rates, percentages, and proportions in various contexts, including contexts connected to real-life applications of data, measurement, geometry, linear relations, and financial literacy.
Create a scatter plot to represent the relationship between two variables, determine the correlation between these variables by testing different regression models using technology, and use a model to make predictions when appropriate.