To help us identify any correlation between the two variables, there are three things we focus on when analysing a scatterplot:
Form: linear or non-linear, what shape the data has
If it is linear:
Direction: positive or negative, whether a line drawn through the data has a positive or negative gradient
Strength: strong, moderate, weak - how tightly the points model a line
When we are looking at the form of a scatterplot we are looking to see if the data points show a pattern that has a linear form. If the data points lie on or close to a straight line, we can say the scatterplot has a linear form.
Forms other than a line may be apparent in a scatterplot. If the data points lie on or close to a curve, it may be appropriate to infer a non-linear form between the variables. We will only be using linear models in this course.
The direction of the scatterplot refers to the pattern shown by the data points. We can describe the direction of the pattern as having positive correlation, negative correlation, or no correlation:
Positive correlation
From a graphical perspective, this occurs when the y-coordinate increases as the x-coordinate increases, which is similar to a line with a positive gradient.
Negative correlation
From a graphical perspective, this occurs when the y-coordinate decreases as the x-coordinate increases, which is similar to a line with a negative gradient.
No correlation
No correlation describes a data set that has no relationship between the variables.
The strength of a linear correlation relates to how closely the points resemble a straight line.
If the points lie exactly on a straight line then we can say that there is a perfect correlation.
If the points are scattered randomly then we can say there is no correlation.
Most scatterplots will fall somewhere in between these two extremes and will display a weak, moderate or strong correlation.
Consider the table of values that shows four excerpts from a database comparing the income per capita of a country and the child mortality rate of the country. If a scatter plot were created from the entire database, what relationship would you expect it to have?
Income per capita | Child mortality rate |
---|---|
1465 | 67 |
11 428 | 16 |
2621 | 35 |
32 468 | 9 |
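One rough way to check our expectation is to compare every pair of rows and count how often the child mortality rate falls as income rises. The sketch below does this for the four rows in the table. The function name `direction` and the pair-counting approach are our own illustration, not a method from this course, and four rows can only suggest what the full database might show.

```python
# The four rows from the table: (income per capita, child mortality rate)
rows = [(1465, 67), (11428, 16), (2621, 35), (32468, 9)]


def direction(points):
    """Classify the trend by comparing every pair of points.

    A pair counts as "rising" if y goes up when x goes up,
    and as "falling" if y goes down when x goes up.
    """
    rising = 0
    falling = 0
    for i, (x1, y1) in enumerate(points):
        for x2, y2 in points[i + 1:]:
            product = (x2 - x1) * (y2 - y1)
            if product > 0:
                rising += 1
            elif product < 0:
                falling += 1
    if rising > falling:
        return "positive"
    if falling > rising:
        return "negative"
    return "no clear direction"


print(direction(rows))  # negative
```

Every pair of rows shows a lower child mortality rate alongside a higher income, so these excerpts point towards a negative relationship.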
The scatter plot shows the relationship between sea temperature and the amount of healthy coral.
Describe the correlation between sea temperature and the amount of healthy coral.
Which variable is the dependent variable?
Which variable is the independent variable?
There are three things we focus on when analysing a scatterplot:
Form: linear or non-linear, what shape the data has
If it is linear:
Direction: positive or negative, whether a line drawn through the data has a positive or negative gradient
Strength: strong, moderate, weak - how tightly the points model a line
If there is no connection between the two variables we say there is no correlation.
If our data does appear linear, to make it easier to analyse and to have numeric answers to the strength and direction of the correlation, we need a line to compare the data against.
A line of best fit (or "trend" line) is a straight line that best represents the data on a scatter plot. This line may pass through some of the points, none of the points, or all of the points. However, it always represents the general trend of the points, which then determines whether there is a positive, negative or no linear relationship between the two variables.
Lines of best fit are really handy as they help determine whether there is a relationship between two variables, which can then be used to make predictions.
To draw a line of best fit, we want to minimise the vertical distances from the points to the line. This will roughly create a line that passes through the centre of the points.
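When drawing by eye we judge these distances visually. Technology usually makes the idea precise by minimising the sum of the squared vertical distances (the least squares method). The sketch below assumes that method, and the data values and the function name `line_of_best_fit` are made up for illustration only.

```python
def line_of_best_fit(xs, ys):
    """Return (m, c) for the least squares line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The least squares line always passes through (mean_x, mean_y)
    c = mean_y - m * mean_x
    return m, c


# Hypothetical data, not taken from any example in this lesson
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 7, 8]

m, c = line_of_best_fit(xs, ys)
print(f"y = {m:.2f}x + {c:.2f}")  # y = 1.20x + 1.80
```

Notice that the line passes through the point made from the mean of the x-values and the mean of the y-values, which matches the idea of a line through the centre of the points.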
Given a set of data relating two variables x and y, it may be possible to form a linear model. This model can then be used to understand the relationship between the variables and make predictions about other possible ordered pairs that fit this relationship.
The following scatter plot shows the data for two variables, x and y.
Draw a line of best fit for the data.
To draw a line of best fit, we want to minimise the vertical distances from the points to the line. This will roughly create a line that passes through the centre of the points. There should be roughly the same number of points above the line as below it.
Given the equation of the line of best fit, we are able to make important observations about the original data. When we are analysing data, it is important that we consider the context.
When we interpret the vertical intercept, we need to consider if it makes sense for the independent variable to be zero and the dependent variable to have a value of c.
Here is the line of best fit from the first example. The equation of the line is y=1.2x+2 where y represents the height of the plant and x represents the time passed in weeks.
Interpret the gradient and y-intercept of the line for this situation.
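To see what the gradient and y-intercept mean here, we can evaluate the line at a few times. The sketch below uses the equation y=1.2x+2 from this example; the function name `plant_height` is our own, and the height is left in whatever unit the original data used.

```python
def plant_height(weeks):
    """Predicted plant height from the line y = 1.2x + 2."""
    return 1.2 * weeks + 2


# y-intercept: the predicted height when no time has passed (x = 0)
start_height = plant_height(0)

# gradient: the change in predicted height over one week
growth_per_week = plant_height(1) - plant_height(0)

print(round(start_height, 1))     # 2.0
print(round(growth_per_week, 1))  # 1.2
```

So the y-intercept tells us the plant was 2 units tall at the start, and the gradient tells us the height increases by 1.2 units for each week that passes.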
A line of best fit has an equation of the form: y=mx+c
The m value is the gradient. Its sign shows whether the correlation is positive or negative:
If the gradient is positive, when the independent variable increases by 1 unit, the dependent variable increases by m units. The correlation is positive too.
If the gradient is negative, when the independent variable increases by 1 unit, the dependent variable decreases by |m| units (the size of the gradient). The correlation is negative too.
The c value shows the vertical intercept (also known as the y-intercept):
When the independent variable is 0, the value of the dependent variable is c.
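The rules in this summary can be sketched as a small helper that turns m and c into sentences. The function name `interpret_line` and the wording of its output are hypothetical, and whether the intercept statement makes sense still depends on the context of the data.

```python
def interpret_line(m, c):
    """Describe a line of best fit y = m*x + c in words."""
    if m > 0:
        trend = (f"Positive correlation: for each 1 unit increase in x, "
                 f"y increases by {abs(m)} units.")
    elif m < 0:
        # Report the size of the change, since "decreases by" already
        # carries the negative sign
        trend = (f"Negative correlation: for each 1 unit increase in x, "
                 f"y decreases by {abs(m)} units.")
    else:
        trend = "The line is horizontal: y does not change as x increases."
    intercept = f"When x is 0, y is {c}."
    return trend, intercept


for sentence in interpret_line(-0.5, 10):
    print(sentence)
```

For a gradient of -0.5 the helper reports a decrease of 0.5 units, not a decrease of -0.5 units.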