6.02 Scatterplots and correlation

Lesson

Most of the time, we want to use scatterplots of bivariate data to find patterns in data and make inferences about a possible relationship between the two variables. To do this, we look at both the form (or shape) and the strength of the association. We use the words association, relationship or correlation to describe the pattern in the data.

 

Linear and non-linear relationships

The first way to analyse a scatterplot is to describe the form (shape) that the bivariate data takes. When the data clusters around some kind of curve, the relationship is either:

  • linear (the curve is a straight line), or
  • non-linear (the curve is not a straight line). Non-linear data could have a quadratic (parabolic), exponential or hyperbolic shape.

[Figures: linear relationship, quadratic relationship, exponential relationship, hyperbolic relationship]

 

Positive and negative relationships

We can further describe linear relationships by whether they are increasing (positive gradient) or decreasing (negative gradient).

  • A linear relationship where the dependent variable increases as the independent variable increases is called a positive linear relationship or positive linear correlation.
  • A linear relationship where the dependent variable decreases as the independent variable increases is called a negative linear relationship or negative linear correlation.

Positive relationship/correlation

Negative relationship/correlation

What does it mean if the line is close to horizontal (zero gradient)? In terms of data, the dependent variable does not change as the independent variable increases. In other words, the dependent variable doesn't actually depend on the other variable, so we say there is likely no relationship.
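
As a rough illustration of the ideas above, the direction of a linear relationship can be read from the sign of the gradient of a least-squares line through the data. The sketch below is only one way to do this. The function names `gradient` and `direction` and the small data sets are made up for illustration and do not come from the lesson.

```python
def gradient(xs, ys):
    """Gradient of the least-squares line through the points (x, y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx  # assumes the x-values are not all equal


def direction(xs, ys):
    """Describe the direction using only the sign of the gradient."""
    m = gradient(xs, ys)
    if m > 0:
        return "positive"
    if m < 0:
        return "negative"
    return "no linear trend"


# Illustrative (made-up) data, not taken from the lesson.
xs = [1, 2, 3, 4, 5]
print(direction(xs, [2, 4, 5, 4, 6]))  # y tends to rise as x increases
print(direction(xs, [9, 7, 6, 4, 3]))  # y tends to fall as x increases
print(direction(xs, [3, 3, 3, 3, 3]))  # y does not change at all
```

With real data the gradient is rarely exactly zero, so deciding that a line is "close to horizontal" takes a judgement about scale that this sketch does not attempt.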

Careful!

The words positive and negative only apply when describing linear relationships. For non-linear relationships (like a quadratic relationship) there can be a mix of gradients - positive in one part and negative in the other.
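
To see the mix of gradients in a quadratic relationship, we can compute the gradient between each pair of neighbouring points on a parabola. This is a minimal sketch using the made-up rule y = x² on a handful of x-values chosen for illustration.

```python
# Illustrative points on the parabola y = x^2 (not data from the lesson).
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

# Gradient of the segment joining each point to the next one.
gradients = [
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in zip(zip(xs, ys), zip(xs[1:], ys[1:]))
]

print(gradients)
```

The gradients are negative on the left-hand side of the parabola and positive on the right-hand side, so no single word such as "positive" or "negative" describes the whole relationship.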

 

Strong and weak relationships

The second way to analyse scatterplots is to describe the strength of the relationship. If the data points cluster very closely around a curve, we say that there is evidence of a strong relationship. If the data points are very spread out but there is still an overall curve, we say that there is evidence of a weak relationship. If the spread is somewhere in between, we say that there is evidence of a moderate relationship.

For example, almost all data points of a strong linear relationship will lie on or very close to a straight line. If the data points are arbitrarily spread out, then there is probably no linear relationship at all. This could mean that there is a non-linear relationship, or that the two variables are completely unrelated.
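
Strength can also be given a number. One widely used measure, which this lesson does not define, is Pearson's correlation coefficient r: values near 1 or -1 mean the points lie close to a straight line, and values near 0 mean there is little linear association. The sketch below computes r from scratch. The cut-off values in `strength` and the data sets are assumptions chosen for illustration, since different courses draw the boundaries between strong, moderate and weak in different places.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # Assumes neither variable is constant (otherwise this divides by zero).
    return sxy / math.sqrt(sxx * syy)


def strength(r, strong=0.75, moderate=0.5, weak=0.25):
    """Label the size of r using assumed (not official) cut-offs."""
    size = abs(r)
    if size >= strong:
        return "strong"
    if size >= moderate:
        return "moderate"
    if size >= weak:
        return "weak"
    return "no linear relationship"


# Illustrative (made-up) data, not taken from the lesson.
xs = [1, 2, 3, 4, 5, 6]
tight = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]  # points close to a straight line
scattered = [5, 1, 6, 2, 4, 3]           # points with no obvious pattern

print(strength(pearson_r(xs, tight)))
print(strength(pearson_r(xs, scattered)))
```

A value of r near 0 only rules out a linear pattern. As noted above, the variables could still have a non-linear relationship, so the scatterplot should always be inspected as well.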

[Figures: strong relationship/correlation, moderate relationship/correlation, weak relationship/correlation, no relationship/correlation]

 

Careful!

We can never be completely certain from a scatterplot that there is a linear relationship between two variables. This is why we say that "there is evidence of a linear relationship" or that "there is probably no relationship". To be brief, we often say "there is a linear relationship" and "there is no relationship", but it is important to keep in mind what is meant by this.

 

Practice questions

Question 1

Identify the type of correlation in the following scatter plot.

The data points are plotted in a coordinate plane. The scatterplot shows a negative direction with data points closely clustering in a manner that suggests a linear relationship.

    A. Weak positive correlation
    B. Weak negative correlation
    C. No correlation
    D. Strong negative correlation
    E. Strong positive correlation

Question 2

Describe the relationship between the variables observed in the scatterplot below.

[Scatterplot not available]

    A. Strong positive linear relationship
    B. Strong negative linear relationship
    C. No linear relationship
    D. Weak positive linear relationship
    E. Weak negative linear relationship

Question 3

Describe the relationship between the variables observed in the scatterplot below.

[Scatterplot not available]

    A. Weak parabolic relationship
    B. Strong parabolic relationship
    C. No relationship
    D. Weak linear relationship
    E. Strong linear relationship

Question 4

Which of the following scatterplots demonstrates no relationship?

    A. [Scatterplot not available]
    B. [Scatterplot not available]
    C. [Scatterplot not available]
    D. [Scatterplot not available]
    E. [Scatterplot not available]

Question 5

Identify the correlation between the temperature and the number of heaters sold.

    A. A positive correlation
    B. A negative correlation
    C. No correlation

 

Outliers in bivariate data

 

In bivariate data, an outlier is a point that lies far from the general trend of the data. The point highlighted in red above is an example of an outlier: it sits far above the general trend of the data. To identify outliers, inspect the scatter plot and imagine a dotted line following the trend of the majority of the data.

The point highlighted in purple is away from the main cluster of data but is close to the general trend of the data. This point is not an outlier, but it will be highly influential on the line of best fit, so you should check that it has been recorded correctly.

Outliers should always be checked carefully as they will be highly influential on the line of best fit and may highlight a special case.
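
One simple way to support a visual check is to fit a least-squares line and look for the point with the largest vertical distance (residual) from it. This is a heuristic sketch rather than a formal test for outliers, and it has a known weakness: the outlier itself pulls the fitted line towards it. The function names and the data below are made up for illustration.

```python
def fit_line(xs, ys):
    """Gradient and intercept of the least-squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx  # assumes the x-values are not all equal
    c = mean_y - m * mean_x
    return m, c


def largest_residual(xs, ys):
    """Return (x, y, residual) for the point furthest from the fitted line."""
    m, c = fit_line(xs, ys)
    residuals = [y - (m * x + c) for x, y in zip(xs, ys)]
    i = max(range(len(xs)), key=lambda k: abs(residuals[k]))
    return xs[i], ys[i], residuals[i]


# Illustrative (made-up) data: every point follows y = 2x except (5, 20).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 20, 12, 14, 16]

x, y, residual = largest_residual(xs, ys)
print(f"Candidate outlier: ({x}, {y}), residual {residual:.2f}")
```

A positive residual means the point sits above the fitted line and a negative residual means it sits below. The point flagged here is only a candidate, so it still needs to be checked against the scatterplot and the original records.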

 

Practice questions

Question 6

The following table shows the number of traffic accidents associated with a sample of drivers of different age groups.

Age    Accidents
20     41
25     44
30     39
35     34
40     30
45     25
50     22
55     18
60     19
65     17

  1. Which of the following scatter plots correctly represents the above data?

    A. [Scatterplot not available]
    B. [Scatterplot not available]
    C. [Scatterplot not available]

  2. Is the correlation between a person's age and the number of accidents they are involved in positive or negative?

    A. Positive
    B. Negative

  3. Is the correlation between a person's age and the number of accidents they are involved in strong or weak?

    A. Strong
    B. Weak

  4. Which age group's data represents an outlier?

    A. 30-year-olds
    B. None of them
    C. 65-year-olds
    D. 20-year-olds

Question 7

The table lists the time taken to sprint 400 metres by runners who each ran at a different temperature as part of a study.

Temperature (°C)    Sprint time (s)
5                   60
2                   67
10                  48
8                   69
1                   65
7                   49
6                   57
4                   53
3                   59
9                   52

  1. Which of the following scatter plots correctly represents the data in the table?

    A. [Scatterplot not available]
    B. [Scatterplot not available]
    C. [Scatterplot not available]

  2. How many runners were tested in the study?

  3. Is the correlation between temperature and sprint time positive or negative?

    A. Positive
    B. Negative

  4. Is the correlation between temperature and sprint time strong or weak?

    A. Strong
    B. Weak

  5. Which combination of temperature and sprint time represents an outlier?

    A. 8°C, 69 seconds
    B. 2°C, 67 seconds
    C. 7°C, 49 seconds
    D. None of the data points.

Outcomes

3.4.12

describe the patterns and features of bivariate data

3.4.13

describe the association between two numerical variables in terms of direction (positive/negative), form (linear/non-linear) and strength (strong/moderate/weak)

3.4.16

interpret relationships in terms of the variables, for example, describe trend as increasing or decreasing

3.4.19

distinguish between causality and association through examples
