8. Two Variable Data Analysis

Lesson

Apparently, eating chocolate makes you smarter. Not only that, but the more you eat, the smarter you become.

This might sound far-fetched, but it is essentially the conclusion of an article in the New England Journal of Medicine discussing the apparent relationship between chocolate and Nobel Prizes. The article is based on the data in the chart below, which shows an obvious relationship between the amount of chocolate consumed per capita in a particular country, and the number of Nobel Prizes awarded to that particular country per ten million people:

Notice the "$P$ value" in the top left corner. In statistics, a $P$ value is the probability of seeing a relationship at least this strong by chance alone, if the two things were in fact unrelated. It is expressed as a decimal, with $0.1$ being a $10\%$ chance, $0.01$ being a $1\%$ chance, and so on. Generally, anything with $P<0.05$ is considered a **statistically significant** finding, which means that it is unlikely to be just a coincidence. This particular chart has $P<0.0001$, which means there is less than a $0.01\%$ chance that a relationship this strong would appear by chance alone. So the link between higher chocolate consumption and a larger number of Nobel Prizes is very unlikely to be a fluke.
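To get a feel for what "by chance alone" means, here is a minimal sketch of one way a $P$ value can be estimated: shuffle one of the variables many times and count how often the shuffled data looks at least as strongly related as the real data. The numbers are made up for illustration (they are not the data from the article), and the function names `pearson_r` and `permutation_p_value` are our own, not part of any library.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def permutation_p_value(xs, ys, trials=2000, seed=0):
    """Fraction of random shuffles whose correlation is at least as strong."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return hits / trials

# Made-up numbers, for illustration only.
chocolate = [1.0, 2.0, 3.0, 4.5, 5.0, 6.5, 8.0, 9.0, 10.0, 11.5]
prizes = [0.5, 1.0, 3.0, 2.5, 6.0, 7.0, 11.0, 13.0, 18.0, 25.0]

r = pearson_r(chocolate, prizes)
p = permutation_p_value(chocolate, prizes)
print(f"r = {r:.2f}, estimated P = {p:.4f}")
```

If the estimate comes out as zero, that only means none of the shuffles matched the real data, so the true $P$ value is probably smaller than one in the number of trials.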

Sounds great, right? Eat chocolate before every exam, and your marks should go up in no time!

Actually, it's not that simple. All that this graph shows is that countries which eat a lot of chocolate also have a lot of Nobel Prizes. It **does not** show that eating chocolate actually causes Nobel Prizes. The link between chocolate (variable $A$) and Nobel Prizes (variable $B$) could happen in a number of different ways:

- $A$ causes $B$: Chocolate consumption causes Nobel Prizes. We might suggest that substances found in chocolate improve brain activity, thus leading to more research breakthroughs by scientists. In this way, we get the relationship depicted in the graph: countries like Switzerland eat a lot of chocolate and win a lot of Nobel Prizes.
- $B$ causes $A$: Nobel Prizes cause chocolate consumption. We might suggest that when people from a particular country win a Nobel Prize, the people in that country celebrate by eating lots of chocolate. We would still see the same relationship depicted in the graph: countries like Switzerland win a lot of Nobel Prizes and eat lots of chocolate.
- $C$ causes both $A$ and $B$: Some other factor causes both Nobel Prizes and chocolate consumption. For example, we might notice that Nobel Prizes are more frequent in countries which are quite cold (like Norway and Sweden). It is easier to store and eat chocolate when it is cold than when it is hot, so people living in cold countries might eat more chocolate. It is also easier to concentrate in cold weather than in hot weather, so people living in cold countries might be able to do better research and win more Nobel Prizes. In this way, we would still get the same kind of graph: Switzerland is a cold country, and so people eat lots of chocolate and win lots of Nobel Prizes.

As we can see, just because two things are correlated **does not** mean that one definitely causes the other. To sum up this important message, statisticians try to teach anyone who will listen that *correlation does not imply causation*. It is one of the most important lessons in all of statistics, so make sure to share this knowledge with your friends!

The link between chocolate and Nobel Prizes is most likely a case where $C$ causes both $A$ and $B$. The variable $C$ in this case is wealth: people in richer countries have more money to spend on luxury foods like chocolate, and also have more money to spend on research. To show this, the authors of this article performed a similar study on sales of luxury cars, and found that this too was well correlated with Nobel Prizes:

In this case, it seems that the link between chocolate and Nobel Prizes was really just because rich countries eat more chocolate and do more research than poorer countries.
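The "$C$ causes both $A$ and $B$" situation can be sketched with a small simulation. Everything here is an assumption made for illustration: the wealth scale, the formulas, and the amount of random noise are invented, and chocolate is deliberately given no direct effect on prizes. The two still end up strongly correlated, and the link mostly vanishes once we compare only countries of similar wealth.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)
n_countries = 2000  # unrealistically many, so the pattern is easy to see

# Hypothetical model: wealth (C) drives both chocolate (A) and prizes (B).
# Neither A nor B appears in the other's formula.
wealth = [rng.uniform(0, 10) for _ in range(n_countries)]
chocolate = [w + rng.gauss(0, 1) for w in wealth]
prizes = [2 * w + rng.gauss(0, 2) for w in wealth]

r_all = pearson_r(chocolate, prizes)

# Now look only at countries with nearly the same wealth.
similar = [(c, p) for w, c, p in zip(wealth, chocolate, prizes)
           if 4.5 <= w <= 5.5]
r_similar = pearson_r([c for c, _ in similar], [p for _, p in similar])

print(f"All countries:           r = {r_all:.2f}")
print(f"Similar-wealth countries: r = {r_similar:.2f}")
```

Comparing like with like in this way is one simple method of checking whether a third factor is doing the work.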

Have a look at the following graphs, and see if you can explain the different possibilities for the relationship between them.

We see a very strong correlation here: as lemon imports increase, the number of car crashes decreases. There are a number of ways that this correlation could happen:

- $A$ causes $B$: Could lemons prevent car crashes? (Be creative!)
- $B$ causes $A$: Could reduced car crashes cause the number of lemons imported to increase?
- $C$ causes both $A$ and $B$: Could some third factor cause both increased lemon imports and reduced car crashes?

- Which of the above three explanations do you think is the most likely explanation for the correlation?

Again, this graph shows a very clear correlation: as pirate numbers have decreased, the global temperature has increased. Once again, this could be explained in a number of ways:

- $A$ causes $B$: Could low numbers of pirates cause global warming?
- $B$ causes $A$: Could global warming reduce the number of pirates?
- $C$ causes both $A$ and $B$: Could some third factor cause both reduced numbers of pirates and increased global temperatures?

- Which of the above three explanations do you think is the most likely explanation for the correlation?

Hopefully the above examples have illustrated the way in which correlation does not imply causation. Therefore, when reading news articles which declare that "scientists have discovered a link between" two things, one needs to be skeptical. A "link" usually only means a correlation, and the jump from this correlation to causation can often be flawed.

We could easily publish an article with the title "Scientists discover a link between ice cream and drowning". Both ice cream sales and drownings go up during hot weather, and so a "link" will definitely exist. We could also say "Scientists discover a link between walking sticks and heart attacks", since both are more common among elderly people.

- Make up your own news headline based on a "link" that you have discovered, to mislead your readers into thinking that one causes the other. Who came up with the most believable fake headline? Who has the most extreme?

So should you distrust every link you read about? You could become a die-hard skeptic, but there is no need to be too pessimistic. There are ways to establish causation, which scientists often use to check that the relationship between two things really is a case of one causing the other. The most commonly used method is a randomised controlled trial, which is used frequently in medicine to show that a particular treatment really does cause patients to get better.

The most important lesson to take away from this is that "links" that you read about in the popular press may not be as scientific as they seem, and often need a skeptical eye in evaluating their claims. This is particularly true when the two things which are "linked" have an obvious relation to wealth, such as "Scientists discover that eating lobster makes you live longer".

If you're really curious about whether chocolate will improve your test results, there's only one way to find out. Run a randomised controlled trial with your friends: randomly select half to eat chocolate before the test and half to not eat chocolate, then compare the average test results of the two groups!
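The key step in such a trial is that chance, not you, decides who eats the chocolate. Here is a minimal sketch of that random assignment; the function name `randomly_assign` and the list of names are made up for illustration.

```python
import random

def randomly_assign(participants, seed=None):
    """Shuffle the participants and split them into two groups."""
    rng = random.Random(seed)
    shuffled = list(participants)  # copy, so the original list is untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical names, for illustration only.
friends = ["Ali", "Bea", "Cho", "Dev", "Eli", "Fay", "Gus", "Hana"]
chocolate_group, control_group = randomly_assign(friends, seed=1)

print("Chocolate:   ", chocolate_group)
print("No chocolate:", control_group)
```

Because the groups are chosen at random, other factors such as how much each person studied should be spread roughly evenly between them, so a large difference in average results is harder to blame on anything except the chocolate. With only a handful of friends, though, chance differences between the groups can still be large.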

Create a scatter plot to represent the relationship between two variables, determine the correlation between these variables by testing different regression models using technology, and use a model to make predictions when appropriate.
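As a rough sketch of what "testing different regression models" might look like, the code below fits a straight line and an exponential curve to the same small data set, compares how well each fits using $R^2$, and uses the linear model to make a prediction. The data (hours studied and test scores) is invented for illustration, and the helper names `fit_line` and `r_squared` are our own. In class you would more likely use a graphing tool or spreadsheet to do the same thing.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, predict):
    """Share of the variation in ys that the model's predictions explain."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up data, for illustration only.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 73]

# Model 1: straight line, y = m * x + b.
m, b = fit_line(hours, scores)
def linear(x):
    return m * x + b

# Model 2: exponential, y = a * e^(k * x), fitted as a line through (x, ln y).
k, ln_a = fit_line(hours, [math.log(s) for s in scores])
def exponential(x):
    return math.exp(ln_a + k * x)

r2_linear = r_squared(hours, scores, linear)
r2_exponential = r_squared(hours, scores, exponential)

print(f"Linear:      R^2 = {r2_linear:.3f}")
print(f"Exponential: R^2 = {r2_exponential:.3f}")
print(f"Linear model's prediction for 7 hours: {linear(7):.1f}")
```

Note that 7 hours lies just outside the data, so the prediction is an extrapolation and should be treated with caution. And as the rest of this lesson shows, a model that fits well still says nothing about whether one variable causes the other.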