
6.02 Scatterplots and correlation

Lesson

Most of the time, we want to use scatterplots of bivariate data to find patterns in data and make inferences about a possible relationship between the two variables. To do this, we look at both the form (or shape) and the strength of the association. We use the words association, relationship or correlation to describe the pattern in the data.

 

Linear and non-linear relationships

The first way to analyse a scatterplot is to describe the shape that the bivariate data takes. When the data clusters around a line or curve, the relationship is either:

  • linear (the points follow a straight line), or
  • non-linear (the points follow a curve rather than a straight line). Non-linear data could have a quadratic (parabolic), exponential or hyperbolic shape.

Linear relationship

Quadratic relationship

Exponential relationship

Hyperbolic relationship

 

Positive and negative relationships

We can further describe linear relationships by whether they are increasing (positive gradient) or decreasing (negative gradient).

  • A linear relationship where the dependent variable increases as the independent variable increases is called a positive linear relationship or positive linear correlation.
  • A linear relationship where the dependent variable decreases as the independent variable increases is called a negative linear relationship or negative linear correlation.

Positive relationship/correlation

Negative relationship/correlation

What does it mean if the line is close to horizontal (zero gradient)? In terms of data, the dependent variable does not change as the independent variable increases. In other words, the dependent variable doesn't actually depend on the other variable, so we say there is likely no relationship.
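
The idea of reading direction from the gradient can be sketched in code. The example below is only an illustration: it assumes we summarise the trend with a least-squares line, and the function names, the `flat_tolerance` value and the data are all hypothetical rather than part of this lesson. A sensible tolerance depends on the units of the data.

```python
def least_squares_gradient(xs, ys):
    """Gradient of the least-squares line through the points (xs[i], ys[i])."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def describe_direction(xs, ys, flat_tolerance=0.05):
    """Label the direction of a linear trend from the sign of its gradient.

    flat_tolerance is an assumed cut-off for "close to horizontal".
    """
    gradient = least_squares_gradient(xs, ys)
    if abs(gradient) < flat_tolerance:
        return "no clear linear relationship"
    return "positive" if gradient > 0 else "negative"


# Hypothetical data: test score tends to rise with hours studied.
hours_studied = [1, 2, 3, 4, 5]
test_score = [52, 58, 61, 70, 74]
print(describe_direction(hours_studied, test_score))  # prints "positive"
```

This only makes sense when the scatterplot already looks roughly linear, so check the form of the data before interpreting the sign of the gradient.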

Careful!

The words positive and negative only apply when describing linear relationships. For non-linear relationships (like a quadratic relationship) there can be a mix of gradients - positive in one part and negative in the other.

 

Strong and weak relationships

The second way to analyse scatterplots is to describe the strength of the relationship. If the data points cluster very closely around a curve, we say that there is evidence of a strong relationship. If the data points are very spread out but there is still an overall curve we say there is evidence of a weak relationship. If the data points are somewhere in the middle we say that there is evidence of a moderate relationship.

For example, almost all data points of a strong linear relationship will lie on or very close to a straight line. If the data points are arbitrarily spread out, then there is probably no linear relationship at all. This could mean that there is a non-linear relationship, or that the two variables are completely unrelated.
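
One common way to put a number on the strength of a linear relationship is Pearson's correlation coefficient, r, which ranges from -1 to 1. This lesson describes strength by eye, so treat the sketch below as an optional aside: the cut-off values in `describe_strength` are assumptions chosen for illustration, not standard definitions, and the data is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def describe_strength(r):
    """Map r to a verbal label using assumed, illustrative cut-offs."""
    size = abs(r)
    if size >= 0.8:
        return "strong"
    if size >= 0.5:
        return "moderate"
    if size >= 0.2:
        return "weak"
    return "no linear relationship"


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
r = pearson_r(xs, ys)
print(describe_strength(r))  # prints "strong"
```

A value of r near zero only rules out a linear relationship. As noted above, the data could still follow a curve, so the scatterplot itself should always be inspected.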

Strong relationship/correlation

Moderate relationship/correlation

Weak relationship/correlation

No relationship/correlation

 

Careful!

We can never be completely certain that there's a linear relationship between any two variables from a scatterplot. This is why we say that "there is evidence of a linear relationship" or that "there is probably no relationship". To be brief, we often say "there is a linear relationship" and "there is no relationship", but it is important to keep in mind what is meant by this.

 

Practice questions

Question 1

Which of the following graphs has a linear correlation?

    A
    B
    C
    D

Question 2

Identify the type of correlation in the following scatter plot.

The data points are plotted in a coordinate plane. The scatterplot shows a negative direction with data points closely clustering in a manner that suggests a linear relationship.

    A. Weak positive correlation
    B. Weak negative correlation
    C. No correlation
    D. Strong negative correlation
    E. Strong positive correlation

Question 3

Identify the type of relationship in the following scatter plot.

    A. Positive linear
    B. Negative linear
    C. No relationship

Question 4

The scatter plot shows the relationship between air and sea temperatures.

Data points are plotted in a coordinate plane. The x-axis represents the air temperature and the y-axis represents the sea temperature. The scatterplot has a positive direction with data points closely clustering in a manner that suggests a linear relationship.

  1. The graph shows that:

    A. Air and sea temperatures are the same.
    B. As air temperature increases, sea temperature increases.
    C. As air temperature increases, sea temperature decreases.
  2. How could we describe the correlation between air and sea temperatures?

    Select all that apply.

    A. Negative
    B. Strong
    C. Weak
    D. Positive

Question 5

Identify the correlation between the temperature and the number of heaters sold.

    A. A positive correlation
    B. A negative correlation
    C. No correlation

 

Outliers in bivariate data

 

In bivariate data an outlier is a point that is away from the general trend of the data. The point highlighted in red above is an example of an outlier where we see the point is far above the general trend of the data. You will have to inspect the scatter plot and imagine the dotted line following the trend of the majority of the data to try and identify outliers.

The point highlighted in purple is away from the main cluster of data but is close to the general trend of the data. This point is not an outlier but will be highly influential on the line of best fit and you should ensure it is correct.

Outliers should always be checked carefully as they will be highly influential on the line of best fit and may highlight a special case.
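
The visual check described above can be approximated in code. The sketch below is one possible approach rather than a standard rule: it assumes the general trend is a least-squares line and flags any point whose vertical distance from that line is more than `cutoff` times the typical distance. The function names, the cut-off of 2 and the data are all hypothetical.

```python
import math


def fit_line(xs, ys):
    """Return (gradient, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    return gradient, mean_y - gradient * mean_x


def find_outliers(xs, ys, cutoff=2.0):
    """Return points lying unusually far above or below the fitted line.

    "Unusually far" means more than `cutoff` times the root-mean-square
    residual. This threshold is an assumption made for illustration.
    """
    gradient, intercept = fit_line(xs, ys)
    residuals = [y - (gradient * x + intercept) for x, y in zip(xs, ys)]
    spread = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > cutoff * spread]


# Hypothetical data: every point follows y = 2x except (6, 25).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2, 4, 6, 8, 10, 25, 14, 16]
print(find_outliers(xs, ys))  # prints [(6, 25)]
```

Because the outlier itself pulls the fitted line towards it, a check like this can miss points in small data sets. It also only flags points that sit away from the trend, so a point far from the main cluster but close to the line is not reported, which matches the distinction made above.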

 

Practice questions

Question 6

The following table shows the number of traffic accidents associated with a sample of drivers of different age groups.

Age    Accidents
20     41
25     44
30     39
35     34
40     30
45     25
50     22
55     18
60     19
65     17

  1. Which of the following scatter plots correctly represents the above data?

    A
    B
    C
  2. Is the correlation between a person's age and the number of accidents they are involved in positive or negative?

    A. Positive
    B. Negative
  3. Is the correlation between a person's age and the number of accidents they are involved in strong or weak?

    A. Strong
    B. Weak
  4. Which age group's data represent an outlier?

    A. 30-year-olds
    B. None of them
    C. 65-year-olds
    D. 20-year-olds

Question 7

The table lists the times taken to sprint 400 metres by runners who each ran at a different temperature as part of a study.

Temperature (°C)    Sprint time (s)
5                   60
2                   67
10                  48
8                   69
1                   65
7                   49
6                   57
4                   53
3                   59
9                   52

  1. Which of the following scatter plots correctly represents the data in the table?

    A
    B
    C
  2. How many runners were tested in the study?

  3. Is the correlation between temperature and sprint time positive or negative?

    A. Positive
    B. Negative
  4. Is the correlation between temperature and sprint time strong or weak?

    A. Strong
    B. Weak
  5. Which combination of temperature and sprint time represents an outlier?

    A. 8°C, 69 seconds
    B. 2°C, 67 seconds
    C. 7°C, 49 seconds
    D. None of the data points.

Outcomes

ACMEM138

describe the patterns and features of bivariate data

ACMEM139

describe the association between two numerical variables in terms of direction (positive/negative), form (linear/non-linear) and strength (strong/moderate/weak)
