# 1.04 Associations between numerical variables


### Displaying bivariate data with a scattergraph

Bivariate data is numerical data in which two values are recorded for each individual, giving two related sets of data. We are often interested in whether there seems to be any connection between the two sets of data. A scattergraph (or scatterplot) provides a visual representation of the numerical data which can help to determine whether there is a relationship between the two sets.

The explanatory variable is plotted on the horizontal axis and the response variable is plotted on the vertical axis. A single data point in a bivariate data set is written in the form $\left(x,y\right)$, with the first number $x$ being the explanatory variable and the second number $y$ being the response variable.

#### Worked example

Scientists want to see how quickly a plant grows under controlled conditions. They start with ten seedlings of the same height and give each a different measure of weekly fertiliser. They then measure the height of the plants after $6$ weeks and record the data in the table below.

| Weekly amount of fertiliser (in cups, $1$ cup $=250$ ml) | Height (cm) |
| --- | --- |
| $1$ | $1.55$ |
| $2$ | $2.32$ |
| $3$ | $3.32$ |
| $4$ | $4.51$ |
| $5$ | $5.75$ |
| $6$ | $6.91$ |
| $7$ | $7.86$ |
| $8$ | $8.58$ |
| $9$ | $9.09$ |
| $10$ | $9.43$ |

Create a scatterplot and describe the relationship between the two variables.

Think: We are interested in what happens to the height as the number of cups of fertiliser increases. In other words, the fertiliser explains the change in height. So fertiliser is the explanatory variable (plotted on the $x$ axis) and height is the response variable (plotted on the $y$ axis).

We can write these data points as ordered pairs, $\left(1,1.55\right),\left(2,2.32\right),\dots$
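As a minimal sketch, the pairing of the two columns into ordered pairs can be written in Python. The variable names `cups`, `heights` and `points` are illustrative and are not part of the lesson.

```python
# Fertiliser amounts (explanatory variable, x) and plant heights in cm
# (response variable, y), copied from the table in the worked example.
cups = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
heights = [1.55, 2.32, 3.32, 4.51, 5.75, 6.91, 7.86, 8.58, 9.09, 9.43]

# Pair each x value with its y value to form the points (x, y).
points = list(zip(cups, heights))

print(points[:2])  # [(1, 1.55), (2, 2.32)]
```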

Do: To make a scatterplot we plot each of the data points on a Cartesian plane.

For example, to plot the first data point, $\left(1,1.55\right)$, we plot the point where $x=1$ and $y=1.55$.

By creating this scatterplot, we can more easily see the relationship between the number of cups of fertiliser and the height of the plant. As the number of cups of fertiliser increases, the height of the plant also increases. We could draw an approximate line with a positive gradient that shows the general trend of the points.

When two variables have a relationship we say they correlate.

Just by observation we can describe the relationship shown in the scattergraph above in three ways.
We say there is a strong, positive, linear correlation between the two variables.

### Pearson's correlation coefficient

Pearson's correlation coefficient is a value that tells you the strength of the linear relationship between two variables. It is denoted by the letter $r$. It indicates how closely a scatterplot conforms to a straight line.

The value of $r$ ranges from $-1$ to $1$ on a continuum.

If the $r$ value is $0$, we say there is no correlation. If the $r$ value is $1$ or $-1$, we say the correlation is perfect.
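The lesson does not state a formula for $r$, but a standard definition is $r=\dfrac{\sum\left(x-\bar{x}\right)\left(y-\bar{y}\right)}{\sqrt{\sum\left(x-\bar{x}\right)^2\sum\left(y-\bar{y}\right)^2}}$, where $\bar{x}$ and $\bar{y}$ are the means of the two variables. The sketch below applies that definition to the fertiliser data from the worked example. The function name `pearson_r` and the variable names are illustrative only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient from the standard definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Data from the worked example: cups of fertiliser and height in cm.
cups = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
heights = [1.55, 2.32, 3.32, 4.51, 5.75, 6.91, 7.86, 8.58, 9.09, 9.43]

r = pearson_r(cups, heights)
print(round(r, 2))  # 0.99
```

A value this close to $1$ is consistent with the strong, positive, linear description given for the worked example.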

Correlation can be described as perfect, weak, moderate or strong (positive or negative).

#### Describing the strength of a correlation

We can use Pearson's correlation coefficient to describe the strength of a correlation as follows:

• A weak correlation indicates that there is some correlation, but it is not considered to be very significant. Values between $-0.5$ and $0.5$ are generally considered weak. Values particularly close to $0$ are considered to indicate no correlation.
• A moderate correlation falls between weak and strong. Values from approximately $0.5$ to $0.8$ or from $-0.8$ to $-0.5$ are considered moderate.
• A strong correlation indicates that the connection between the variables is quite significant. Values from approximately $0.8$ to $1$ or from $-1$ to $-0.8$ are strong.
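The approximate cut-offs above can be sketched as a small Python function. The function name `describe_strength` is illustrative. How the boundary values $0.5$ and $0.8$ are assigned, and the choice to report "no correlation" only when $r$ is exactly $0$, are assumptions made for this sketch, since the lesson gives approximate ranges only.

```python
def describe_strength(r):
    """Label the strength of r using the approximate cut-offs in this lesson."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)  # strength ignores the sign (direction) of r
    if size == 1:
        return "perfect"
    if size >= 0.8:  # assumption: 0.8 itself counts as strong
        return "strong"
    if size >= 0.5:  # assumption: 0.5 itself counts as moderate
        return "moderate"
    if size > 0:
        return "weak"
    return "no correlation"


print(describe_strength(0.99))   # strong
print(describe_strength(-0.65))  # moderate
print(describe_strength(0.3))    # weak
```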

#### Correlation applet

Play with this applet to see how the correlation coefficient changes. Move each point. Try having one outlier and see how much that can change the correlation coefficient. Try moving the points so they are in a perfect straight line. What happens to the correlation coefficient value?

Three key observations when commenting on the relationship between bivariate data.

1. State the direction of the relationship. Use the words positive or negative. (Think about the gradient of the line).

2. Describe the strength of the relationship. Use the $r$ value to determine whether the relationship is perfect, strong, moderate or weak, or whether there is no correlation.

3. State the shape of the relationship. Pearson's correlation coefficient gives a measure of how close the points are to being a straight line, so we almost always use the word linear. It is possible for two variables to be related in a non-linear way. For example, the scatterplot may resemble a parabola more than it resembles a line. If there seems to be a pattern but it does not look like a line we say the relationship appears to be non-linear.

Important - correlation is not causation!

Even when two variables have a strong relationship and $r$ is close to $1$ or $-1$, we cannot say that one variable causes change in the other variable. If asked to assess "does change in the explanatory variable cause change in the response variable?" based solely on a strong correlation, we can respond "No - correlation does not imply causation".

For example, it has been shown that there is a strong, positive, linear relationship between sunglasses sold and ice-cream cone sales. But we cannot say that sunglasses sales cause ice-cream cone sales. There is a third variable at work here; increase in temperature causes both variables to also increase. Increase in temperature is called a confounding variable.

Coincidence is also a plausible reason an association occurs. It's possible to find variables with unlikely strong correlations. For example, per capita consumption of cheese and deaths from being strangled by a bedsheet have been shown to have a strong correlation. But we cannot say that one causes the other! It is a coincidence. A quick search of the internet for "correlation vs. causation" will result in many websites containing examples of variables with spurious correlations.

To conclude that there is a causal relationship between variables with a strong correlation, confounding factors must be eliminated and a causal mechanism found - such as the carcinogenic ingredients of cigarettes providing a causal link to cancer.

#### Practice questions

##### Question 1

The scatter plot shows the relationship between sea temperatures and the amount of healthy coral.

1. Describe the correlation between sea temperature and the amount of healthy coral.

Select all descriptions that apply.

• A. Negative
• B. Strong
• C. Positive
• D. Weak

2. Which variable is the response variable?

• A. Sea temperature
• B. Level of healthy coral

3. Which variable is the explanatory variable?

• A. Level of healthy coral
• B. Sea temperature

##### Question 2

The following table has data results from an experiment.

| $X$ | $2$ | $4$ | $7$ | $9$ | $12$ | $15$ | $17$ | $20$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $Y$ | $2$ | $4$ | $6$ | $8$ | $12$ | $18$ | $28$ | $38$ |

1. Plot the data from the table on the graph below.

2. What is the type of correlation between the data points? Select the best answer.

• A. Linear Positive
• B. Linear Negative
• C. Nonlinear
• D. No Correlation

##### Question 3

The following table shows the number of traffic accidents associated with a sample of drivers of different age groups.

| Age | Accidents |
| --- | --- |
| $20$ | $41$ |
| $25$ | $44$ |
| $30$ | $39$ |
| $35$ | $34$ |
| $40$ | $30$ |
| $45$ | $25$ |
| $50$ | $22$ |
| $55$ | $18$ |
| $60$ | $19$ |
| $65$ | $17$ |

1. Which of the following scatter plots correctly represents the above data?

[Scatter plots A, B and C]

2. Is the correlation between a person's age and the number of accidents they are involved in positive or negative?

• A. Positive
• B. Negative

3. Is the correlation between a person's age and the number of accidents they are involved in strong or weak?

• A. Strong
• B. Weak

4. Which age group's data represent an outlier?

• A. 30-year-olds
• B. None of them
• C. 65-year-olds
• D. 20-year-olds

### Outcomes

#### ACMGM052

construct a scatterplot to identify patterns in the data suggesting the presence of an association

#### ACMGM053

describe an association between two numerical variables in terms of direction (positive/negative), form (linear/non-linear) and strength (strong/moderate/weak)

#### ACMGM056

use a scatterplot to identify the nature of the relationship between variables