
6.08 Correlation and causation

Lesson

Correlation

This chapter revisits ideas about correlation that were discussed in a previous chapter, Scatter plots and lines of fit. We now want to describe correlation with a numerical value instead of a worded description. We will do this using the correlation coefficient.

Correlation Applet

Play with this applet to see how the correlation coefficient changes. Move each point. Try having one outlier and see how much that can change the correlation coefficient. Try moving the points so they are in a perfect straight line. What happens to the correlation coefficient value?
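Without the applet, the same experiment can be sketched in a few lines of Python. The points below are made up for illustration, and `pearson_r` is a hypothetical helper name that applies the standard formula for the correlation coefficient. Treat it as a sketch rather than part of the lesson's own materials.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient r of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up points that lie exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5, 6]
ys = [2 * x + 1 for x in xs]
r_line = pearson_r(xs, ys)

# The same points plus one made-up outlier far below the line.
r_outlier = pearson_r(xs + [7], ys + [0])

print(round(r_line, 2), round(r_outlier, 2))
```

The first value is 1 because the points sit on a straight line with positive slope. Adding the single outlier pulls the value well below 0.5, even though the other six points have not moved.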

Guiding questions

  1. Does association imply causation? Which of the following are association, correlation, or causation?
    1. Smoking and lung cancer
    2. Vending machines in schools and obesity
    3. Taking a placebo pill (inactive/fake treatment) and weight loss
  2. Ask students for their own examples of association, correlation, and causation.
  3. What is reasonable?

A correlation coefficient is a value that tells you the strength and direction of a linear relationship between two variables. It is denoted by the letter $r$.

A perfect positive correlation has a value of $r=1$. That means that if we graphed the variables on the $xy$-plane, the points would show a perfect, positive linear relationship. A perfect negative correlation has a value of $r=-1$. It's a perfect negative linear relationship. No correlation has a value of $r=0$, indicating there is no linear relationship between the variables.

So far, so reasonable. What if I have a correlation coefficient of $0.6$? $-0.53$? What do they show?

Consider the full range of correlation values, from $-1$ to $1$, as a continuum.

Right in the middle is $0$, which we call no correlation.

We further divide up the line to indicate other values with descriptions like Weak, Moderate and Strong (positive or negative). 

Where we place these divisions can, in some ways, be a little arbitrary. Ultimately, the larger $|r|$ gets, the closer the correlation is to perfect, and the closer $|r|$ is to $0$, the more it reflects no correlation.

A weak correlation indicates there is some correlation but it is not considered to be very significant. Values of $|r|$ less than $0.5$ are generally considered weak.

A strong correlation indicates that the connection between the variables is quite significant. The exact value at which 'strong' begins differs slightly in different parts of the world: some sources say values of $|r|$ larger than $0.7$ are strong, others say larger than $0.8$. But ultimately it's the idea that the larger the value of $|r|$, the stronger the relationship, that really matters here!

A moderate correlation falls between weak and strong.

 

Remember!

Remember to always state if the correlation is positive or negative by using phrases like "weak negative", "moderate positive", or "strong positive" to describe the relationships between variables. 
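The descriptions above can be sketched as a small Python function. The function name `describe_correlation` and its parameters are hypothetical. The cut-offs of 0.5 and 0.7 are taken from the text above, but as noted they are somewhat arbitrary, so they are left adjustable.

```python
def describe_correlation(r, weak_below=0.5, strong_above=0.7):
    """Return a worded description such as 'moderate positive' for a value of r.

    Assumed cut-offs: |r| < weak_below is weak, |r| > strong_above is strong,
    and anything in between is moderate.
    """
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    if r == 0:
        return "no correlation"

    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1:
        strength = "perfect"
    elif size < weak_below:
        strength = "weak"
    elif size > strong_above:
        strength = "strong"
    else:
        strength = "moderate"
    return f"{strength} {direction}"


print(describe_correlation(0.6))    # moderate positive
print(describe_correlation(-0.53))  # moderate negative
```

Passing `strong_above=0.8` gives the description used in places where 'strong' starts at 0.8 instead.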

 

Calculating the correlation coefficient

For this course, we will only calculate the correlation coefficient ($r$) using technology. As you study more mathematics, you might learn how to calculate the correlation coefficient on your own.

There are lots of tools we can use to calculate the value of $r$. We can use Excel, Google Sheets, a TI-calculator or many other options. This investigation on the line of best fit touches on how to calculate it using Google Sheets.

If you are using a TI-83 or TI-84 here are the instructions:

  1. Ensure your calculator is set to DiagnosticOn by pressing [2nd] and then [0], scrolling to DiagnosticOn and pressing [Enter].
  2. Enter all of your data by pressing [STAT] and then selecting 1:Edit. Remember that your independent variable should go in L1 and your dependent variable in L2.
  3. Once your data is in, press [STAT], then select CALC and 4:LinReg(ax+b). The calculator displays $r$ along with the regression equation.
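If a programming language is the technology at hand, the same value can be computed directly. The sketch below assumes the standard formula $r=\dfrac{\sum (x-\bar{x})(y-\bar{y})}{\sqrt{\sum (x-\bar{x})^2 \sum (y-\bar{y})^2}}$, which this lesson does not require you to use by hand. The helper name `pearson_r` and the data are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient r, computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator of the formula).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (under the square root in the denominator).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data: the independent variable goes in x, the dependent variable in y.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 2))  # 0.77
```

Python 3.10 and later also include `statistics.correlation` in the standard library, which should give the same result for the same lists.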
     

Practice questions

QUESTION 1

Identify the correlation between the temperature and the number of heaters sold.

    A. A positive correlation
    B. A negative correlation
    C. No correlation

QUESTION 2

For the graph depicted, choose the correlation coefficient that best represents it.

A scatter plot graph with a Cartesian coordinate system. The x-axis is labeled 'x' and the y-axis is labeled 'y'. The scatter plot displays data points that suggest a decreasing linear trend, indicating that as the x-value increases, the y-value tends to decrease. The points are $(0,13)$, $(1,9)$, $(2,5)$, $(3,1)$, $(4,-3)$, $(5,-7)$, $(6,-11)$, $(7,-15)$, $(8,-19)$, and $(9,-23)$.

    A. $-1$
    B. $1$
    C. $0.67$
    D. $0$

QUESTION 3

Sean is a hotdog vendor. He records the maximum temperature of the day and the number of hotdogs sold. The results are in the table given.

Maximum temperature ($^\circ$C) | $30$ | $34$ | $33$ | $35$ | $33$ | $28$ | $27$ | $31$ | $37$ | $29$
Number of hotdogs | $18$ | $38$ | $26$ | $40$ | $24$ | $8$ | $20$ | $35$ | $43$ | $38$

  1. Plot the information on a scatter plot.

  2. Calculate the correlation coefficient.

    Give your answer to two decimal places.

  3. Using the correlation coefficient you calculated in part 2 and the graph you created in part 1, which of the following statements is correct?

    A. There is no evidence of a linear relationship between sales and temperature
    B. As the temperature increases the sales tend to decrease
    C. As the temperature increases the sales increase
    D. As the temperature increases the sales tend to increase
    E. As the temperature increases the sales decrease

Correlation versus causation

When a change in the value of one variable tends to be accompanied by a consistent change in another variable, we say there is a correlation (or a relationship) between the two variables.

A correlation between variables may be discovered in the course of an experiment or through an analysis of observational data.

In a typical experiment, a researcher sets one variable, called the independent or explanatory variable, to various levels and observes the corresponding values of the other variable, called the dependent or response variable.

In the case of an observational study, more so than in an experiment, care must be taken not to assume that correlation implies causation.

Remember!

Association or correlation does not imply causation.

In an experiment, it is usually reasonable to think that if values of the independent variable are deliberately chosen and the dependent variable is observed to change accordingly, then there is a causal relation between the independent and dependent variables. However, in an observational study, the values of both variables in the pair are merely observed, not chosen.

Contributing variable: When two variables have an association, they may be connected through a third variable. For example, it was found that there was a strong, positive correlation between ice cream sales and the number of drownings. Does this mean that ice cream causes drowning? Absolutely not. There is a third variable, temperature, which would likely increase both ice cream sales and trips to the beach, and hence drownings.
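A small simulation can make the role of a third variable concrete. Everything in the sketch below is invented for illustration: the temperatures, the coefficients, the noise levels, and the helper names `pearson_r` and `simulate_days`. In this made-up model, ice cream sales and drownings each depend on temperature and on independent random noise, and neither depends on the other.

```python
import math
import random


def pearson_r(xs, ys):
    """Correlation coefficient r of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_days(temperatures, rng):
    """Invented model: both outcomes respond to temperature, not to each other."""
    sales = [10 * t + rng.gauss(0, 20) for t in temperatures]
    drownings = [0.5 * t + rng.gauss(0, 2) for t in temperatures]
    return sales, drownings


rng = random.Random(0)  # fixed seed so the run is repeatable

# Days with varying temperature: the shared cause links the two outcomes.
varied_temps = [rng.uniform(15, 35) for _ in range(200)]
sales, drownings = simulate_days(varied_temps, rng)
r_varied = pearson_r(sales, drownings)

# Days that all have the same temperature: the shared cause is held fixed.
fixed_temps = [25.0] * 200
sales_fixed, drownings_fixed = simulate_days(fixed_temps, rng)
r_fixed = pearson_r(sales_fixed, drownings_fixed)

print(round(r_varied, 2), round(r_fixed, 2))
```

When temperature varies, sales and drownings come out strongly correlated even though the model contains no link between them. When temperature is held fixed, the correlation falls to around zero, which is what controlling for a variable is meant to reveal.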

Coincidence: It is possible in an observational study for variables to be correlated purely by chance, such as the example below.

Credit: http://tylervigen.com/spurious-correlations
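Chance correlations of this kind are easy to produce when many pairs of short data sets are compared. The sketch below is an illustration with purely random numbers, not data from the source credited above. The helper name `pearson_r`, the seed, and the sizes are all arbitrary choices.

```python
import math
import random


def pearson_r(xs, ys):
    """Correlation coefficient r of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the run is repeatable
n_points = 10           # each data set has only ten values
n_pairs = 1000          # number of unrelated pairs we compare

best_r = 0.0
strong_count = 0
for _ in range(n_pairs):
    # Two lists of random numbers with no connection to each other.
    xs = [rng.random() for _ in range(n_points)]
    ys = [rng.random() for _ in range(n_points)]
    r = pearson_r(xs, ys)
    if abs(r) > 0.7:
        strong_count += 1
    if abs(r) > abs(best_r):
        best_r = r

print(strong_count, round(best_r, 2))
```

Some of the unrelated pairs reach $|r|>0.7$ by luck alone. Searching through enough data sets will therefore always turn up a few "strong" correlations that mean nothing.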

Thus, care is needed lest a correlation be wrongly taken to imply a causal relationship. To move from the discovery of a correlation to the claim that a causal effect has been found, researchers need to gather evidence external to the data and control for as many other variables as possible.

 

Practice questions

QUESTION 4

The table shows the number of fans sold at a store during days of various temperatures.

Temperature ($^\circ$C) | $6$ | $8$ | $10$ | $12$ | $14$ | $16$ | $18$ | $20$
Number of fans sold | $12$ | $13$ | $14$ | $17$ | $18$ | $19$ | $21$ | $23$

  1. Consider the correlation coefficient $r$ for temperature and number of fans sold. In what range will $r$ be?

    A. $r=0$
    B. $r>0$
    C. $r<0$

  2. Is there a causal relationship?

    A. Yes
    B. No

QUESTION 5

A study found a strong correlation between the approximate number of pirates out at sea and the average world temperature.

  1. Does this mean that the number of pirates out at sea has an impact on world temperature?

    A. Yes
    B. No

  2. Which of the following is the most likely explanation for the strong correlation?

    A. Contributing variables - there are other causal relationships and variables that come into play, and these may lead to an indirect positive association between the approximate number of pirates out at sea and the average world temperature.
    B. Coincidence - there are no other contributing factors or reasonable arguments to be made for the strong positive association between the approximate number of pirates out at sea and the average world temperature.

  3. Which of the following is demonstrated by the strong correlation between the approximate number of pirates out at sea and the average world temperature?

    A. If there is correlation between two variables, then there must be causation.
    B. If there is correlation between two variables, there isn't necessarily causation.
    C. If there is correlation between two variables, then there is no causation.

Outcomes

S-ID.9

Distinguish between correlation and causation.
