Most of the time, we use scatterplots of bivariate data to find patterns and make inferences about a possible relationship between the two variables. To do this, we look at both the form (or shape) and the strength of the association. We use the words association, relationship or correlation to describe the pattern in the data.
The first way to analyse scatterplots is to describe the shape that the bivariate data takes. The data often clusters around some kind of curve, and we name the relationship after that curve:
Linear relationship | Quadratic relationship |
Exponential relationship | Hyperbolic relationship |
We can further describe linear relationships by whether they are increasing (positive gradient) or decreasing (negative gradient).
What does it mean if the line is close to horizontal (zero gradient)? In terms of data, the dependent variable does not change as the independent variable increases. In other words, the dependent variable doesn't actually depend on the other variable, so we say there is likely no relationship.
The words positive and negative only apply when describing linear relationships. For non-linear relationships (like a quadratic relationship) there can be a mix of gradients - positive in one part and negative in the other.
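The idea of reading the direction from the gradient can be sketched in code. The following is a minimal illustration, not a definitive method: it fits a least-squares line to the points and reports the sign of its gradient. The function names, the sample data and the `flat_tolerance` cut-off are all assumptions made for this example, and a sensible cut-off depends on the units of the data.

```python
def gradient(xs, ys):
    """Gradient of the least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def direction(xs, ys, flat_tolerance=0.05):
    """Describe a linear pattern as positive, negative or flat.

    flat_tolerance is a hypothetical cut-off chosen for this sketch.
    """
    m = gradient(xs, ys)
    if abs(m) < flat_tolerance:
        return "no clear relationship"
    return "positive" if m > 0 else "negative"


# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 8, 10]
falling = [10, 8, 5, 4, 2]
flat = [5, 6, 5, 6, 5]

print(direction(xs, rising))
print(direction(xs, falling))
print(direction(xs, flat))
```

This check only makes sense when the form is linear. For a quadratic pattern, a single fitted gradient hides the fact that the data rises in one part and falls in another.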
The second way to analyse scatterplots is to describe the strength of the relationship. If the data points cluster very closely around a curve, we say that there is evidence of a strong relationship. If the data points are very spread out but there is still an overall curve, we say there is evidence of a weak relationship. If the data points are somewhere in the middle, we say that there is evidence of a moderate relationship.
For example, almost all data points of a strong linear relationship will lie on or very close to a straight line. If the data points are arbitrarily spread out, then there is probably no linear relationship at all. This could mean that there is a non-linear relationship, or that the two variables are completely unrelated.
Strong relationship/correlation | Moderate relationship/correlation |
Weak relationship/correlation | No relationship/correlation |
We can never be completely certain that there's a linear relationship between any two variables from a scatterplot. This is why we say that "there is evidence of a linear relationship" or that "there is probably no relationship". To be brief with our words, we often say "there is a linear relationship" and "there is no relationship", but it is important to keep in mind what is meant by this.
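Strength is judged by eye in this lesson, but it can also be summarised with a number. One common choice, which goes beyond what is covered here, is Pearson's correlation coefficient $r$, which lies between $-1$ and $1$ and measures how closely the points follow a straight line. The sketch below computes it directly from its definition. The cut-offs used to label a value as strong, moderate or weak are assumptions made for this example (conventions vary between textbooks), and the data is invented.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def describe_strength(r):
    """Label the strength of a linear relationship.

    The cut-offs 0.75, 0.5 and 0.25 are hypothetical choices for this sketch.
    """
    size = abs(r)
    if size >= 0.75:
        return "strong"
    if size >= 0.5:
        return "moderate"
    if size >= 0.25:
        return "weak"
    return "none"


# Hypothetical data: points close to a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 8, 10]
r = pearson_r(xs, ys)
print(round(r, 2), describe_strength(r))  # 0.99 strong

# Hypothetical data: points exactly on a parabola, symmetric about x = 0.
curve_xs = [-2, -1, 0, 1, 2]
curve_ys = [4, 1, 0, 1, 4]
print(describe_strength(pearson_r(curve_xs, curve_ys)))  # none
```

The second data set shows why "no linear relationship" is not the same as "no relationship": the points lie exactly on a curve, yet the linear measure reports nothing.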
Identify the type of correlation in the following scatter plot.
Weak positive correlation
Weak negative correlation
No correlation
Strong negative correlation
Strong positive correlation
Describe the relationship between the variables observed in the scatterplot below.
Strong positive linear relationship
Strong negative linear relationship
No linear relationship
Weak positive linear relationship
Weak negative linear relationship
Describe the relationship between the variables observed in the scatterplot below.
Weak parabolic relationship
Strong parabolic relationship
No relationship
Weak linear relationship
Strong linear relationship
Which of the following scatterplots demonstrates no relationship?
Identify the correlation between the temperature and the number of heaters sold.
A positive correlation
A negative correlation
No correlation
In bivariate data, an outlier is a point that lies away from the general trend of the data. The point highlighted in red above is an example of an outlier: it lies far above the general trend of the data. To identify outliers, inspect the scatter plot and imagine the dotted line following the trend of the majority of the data.
The point highlighted in purple is away from the main cluster of data but is close to the general trend of the data. This point is not an outlier, but it will be highly influential on the line of best fit, so you should check that it has been recorded correctly.
Outliers should always be checked carefully, as they can be highly influential on the line of best fit and may highlight a special case.
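The difference between an outlier and a point that is merely far from the main cluster can be illustrated with a short sketch. It fits a least-squares line, measures how far each point sits above or below that line (its residual), and flags points whose residual is unusually large. The data is invented, the function names are hypothetical, and the "more than 2 times the typical residual size" rule is an assumption made for this example rather than a standard definition.

```python
import math


def line_of_best_fit(xs, ys):
    """Return (gradient, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


def find_outliers(xs, ys, threshold=2.0):
    """Flag points far above or below the fitted line.

    threshold is a hypothetical rule of thumb chosen for this sketch.
    """
    gradient, intercept = line_of_best_fit(xs, ys)
    residuals = [y - (gradient * x + intercept) for x, y in zip(xs, ys)]
    # Root-mean-square residual: the typical vertical distance from the line.
    spread = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))
    return [
        (x, y)
        for x, y, r in zip(xs, ys, residuals)
        if abs(r) > threshold * spread
    ]


# Hypothetical data: most points follow y = 2x.
# (4, 20) sits far above the trend; (20, 40) is far from the cluster but on trend.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 20]
ys = [2, 4, 6, 20, 10, 12, 14, 16, 40]

print(find_outliers(xs, ys))  # [(4, 20)]
```

Only the point away from the trend is flagged. The point at $x = 20$ follows the trend, so it is not an outlier, even though removing or mistyping it would noticeably move the fitted line.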
The following table shows the number of traffic accidents associated with a sample of drivers of different age groups.
Age | Accidents |
---|---|
$20$ | $41$ |
$25$ | $44$ |
$30$ | $39$ |
$35$ | $34$ |
$40$ | $30$ |
$45$ | $25$ |
$50$ | $22$ |
$55$ | $18$ |
$60$ | $19$ |
$65$ | $17$ |
Which of the following scatter plots correctly represents the above data?
Is the correlation between a person's age and the number of accidents they are involved in positive or negative?
Positive
Negative
Is the correlation between a person's age and the number of accidents they are involved in strong or weak?
Strong
Weak
Which age group's data represent an outlier?
30-year-olds
None of them
65-year-olds
20-year-olds
The table lists the time taken to sprint $400$ metres by runners who each ran at a different temperature as part of a study.
Temperature (°C) | Sprint time (s) |
---|---|
$5$ | $60$ |
$2$ | $67$ |
$10$ | $48$ |
$8$ | $69$ |
$1$ | $65$ |
$7$ | $49$ |
$6$ | $57$ |
$4$ | $53$ |
$3$ | $59$ |
$9$ | $52$ |
Which of the following scatter plots correctly represents the data in the table?
How many runners were tested in the study?
Is the correlation between temperature and sprint time positive or negative?
Positive
Negative
Is the correlation between temperature and sprint time strong or weak?
Strong
Weak
Which combination of temperature and sprint time represents an outlier?
$8$°C, $69$ seconds
$2$°C, $67$ seconds
$7$°C, $49$ seconds
None of the data points.