To help us identify any correlation between the two variables, there are three things we focus on when analysing a scatterplot: its form and, if the form is linear, its direction and its strength.
When we are looking at the form of a scatterplot we are looking to see if the data points show a pattern that has a linear form. If the data points lie on or close to a straight line, we can say the scatterplot has a linear form.
Forms other than a line may be apparent in a scatterplot. If the data points lie on or close to a curve, it may be appropriate to infer a non-linear form between the variables. We will only be using linear models in this course.
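The direction of the scatterplot refers to the way one variable changes as the other increases. We can describe the direction of the pattern as a positive correlation (as one variable increases, the other tends to increase), a negative correlation (as one variable increases, the other tends to decrease) or no correlation (no apparent connection between the variables).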
The strength of a linear correlation relates to how closely the points resemble a straight line.
Most scatterplots will fall somewhere between the two extremes of a perfect correlation, where every point lies exactly on a straight line, and no correlation at all, and will display a weak, moderate or strong correlation.
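The direction and strength of a linear pattern can also be estimated numerically. The sketch below uses Pearson's correlation coefficient $r$, which is an assumption on our part since this section does not define it: the sign of $r$ gives the direction, and its size (between $0$ and $1$) indicates the strength. The data values and the cut-offs of $0.5$ and $0.8$ are hypothetical and chosen only for illustration; different courses use different cut-offs.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient r for paired data.

    Assumes both lists have the same length and that neither
    variable is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def describe_correlation(r, weak=0.5, strong=0.8):
    """Describe direction and strength of r.

    The cut-offs `weak` and `strong` are illustrative assumptions,
    not values taken from the lesson.
    """
    if r == 0:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size >= strong:
        strength = "strong"
    elif size >= weak:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength}, {direction} correlation"


# Hypothetical data: points that rise together and sit close to a line.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

r = pearson_r(xs, ys)
print(round(r, 3), describe_correlation(r))
```

A value of $r$ close to $1$ or $-1$ means the points lie close to a straight line, while a value close to $0$ means there is little or no linear pattern.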
Identify the type of correlation in the following scatter plot.
Think: This dataset looks to be linear so we can examine its direction and strength.
Do: The points all lie quite close to a straight line, so it is a strong correlation. A line drawn through these points would have a positive gradient. In summary, we would say that this scatterplot indicates a strong, positive correlation.
There are three things we focus on when analysing a scatterplot: its form and, if the form is linear, its direction and its strength.
If there is no connection between the two variables we say there is no correlation.
If our data does appear linear, to make it easier to analyse and to have numeric answers to the strength and direction of the correlation, we need a line to compare the data against.
A line of best fit (or "trend" line) is a straight line that best represents the data on a scatter plot. This line may pass through some of the points, none of the points, or all of the points. However, it always represents the general trend of the points, which then determines whether there is a positive, negative or no linear relationship between the two variables.
Lines of best fit are really handy as they help determine whether there is a relationship between two variables, which can then be used to make predictions.
To draw a line of best fit, we want to minimise the vertical distances from the points to the line. This will roughly create a line that passes through the centre of the points.
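With digital technology, "minimising the vertical distances" is commonly done by the least-squares method, which chooses the gradient and vertical intercept that make the sum of the squared vertical distances as small as possible. The lesson does not name a method, so treating it as least squares is an assumption here. The sketch below is minimal, and the data values are hypothetical.

```python
def line_of_best_fit(xs, ys):
    """Return (m, c) for the least-squares line y = m*x + c.

    Assumes the x values are not all the same.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx                # gradient
    c = mean_y - m * mean_x      # vertical intercept
    return m, c


def sum_squared_vertical_distances(xs, ys, m, c):
    """Total of the squared vertical gaps between each point and the line."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))


# Hypothetical measurements.
xs = [0, 1, 2, 3, 4]
ys = [1.0, 2.9, 5.2, 6.8, 9.1]

m, c = line_of_best_fit(xs, ys)
best = sum_squared_vertical_distances(xs, ys, m, c)
print(f"y = {m:.2f}x + {c:.2f}, squared distance total = {best:.3f}")

# Nudging the line away from the fitted values makes the total larger.
print(sum_squared_vertical_distances(xs, ys, m + 0.1, c) > best)
print(sum_squared_vertical_distances(xs, ys, m, c + 0.1) > best)
```

The fitted line always passes through the point made from the mean of the $x$ values and the mean of the $y$ values, which matches the idea of a line through the centre of the points.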
Given a set of data relating two variables $x$ and $y$, it may be possible to form a linear model. This model can then be used to understand the relationship between the variables and make predictions about other possible ordered pairs that fit this relationship.
Say we gathered several measurements on the height of a plant $h$ over an $8$-week period, where $t$ is time measured in weeks. We can then plot the data on the $xy$-plane as shown below.
Figure: Height of a plant measured at several instances.
We can fit a model through the observed data to make predictions about the height at certain times after planting.
Figure: Linear graph modelling height of a plant over time.
Notice how the line of the model minimises its total distance from all the points and follows the trend of the data, even though it cannot pass through every point. This dataset has a positive correlation, as the line has a positive gradient, and the correlation is strong, as the points lie quite close to the line.
Given the equation of the line of best fit, we are able to make important observations about the original data. When we are analysing data, it is important that we consider the context. Below is the equation of the line of best fit for the graph above:
$y=1.2x+2$
In this example, $y$ represents the height of the plant (the response variable) and $x$ represents the time passed in weeks (the explanatory variable).
The value $1.2$, displayed in front of the $x$, is the gradient of the line. Since this is a positive number, it indicates that there is a positive relationship between the variables. In the context of this example, this tells us that for each week of time, the plant's height increases by approximately $1.2$ cm.
The value $2$ is the vertical intercept of the line of best fit and, in the context of this example, tells us that the plant was initially about $2$ cm high.
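Using the gradient of $1.2$ and the vertical intercept of $2$ described above, the model can be turned into a small function for making predictions. This is a minimal sketch; the function and constant names are our own and are not part of the lesson.

```python
GRADIENT_CM_PER_WEEK = 1.2   # growth per week, from the line of best fit
INITIAL_HEIGHT_CM = 2.0      # vertical intercept: predicted height at week 0


def predict_height(weeks):
    """Predicted plant height in cm after `weeks` weeks, using y = 1.2x + 2."""
    return GRADIENT_CM_PER_WEEK * weeks + INITIAL_HEIGHT_CM


for weeks in (0, 4, 8):
    print(f"week {weeks}: about {predict_height(weeks):.1f} cm")
```

Predictions for times within the $8$ weeks of measurements are supported by the data. Predictions well beyond that period assume the plant keeps growing at the same rate, so they should be treated with caution.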
A line of best fit in the form $y=mx+c$
The $m$ value shows the gradient, which tells us whether the correlation is positive or negative: a positive gradient indicates a positive correlation and a negative gradient indicates a negative correlation.
The $c$ value shows the vertical intercept (also known as the $y$-intercept), which is the value of $y$ when $x$ is zero.
When we interpret the vertical intercept, we need to consider if it makes sense for the explanatory variable to be zero and the response variable to have a value indicated by $c$.
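The way we read a line of best fit can be summarised in a short helper. This is an illustrative sketch only; the function names are hypothetical, and whether the intercept is meaningful still has to be judged from the context, which code cannot do for us.

```python
def describe_gradient(m):
    """Return the direction of the relationship implied by the gradient m."""
    if m > 0:
        return "positive"
    if m < 0:
        return "negative"
    return "no linear"


def interpret_line(m, c):
    """Put the gradient and vertical intercept of y = m*x + c into words."""
    return (
        f"{describe_gradient(m)} relationship: y changes by {m} "
        f"for each increase of 1 in x, and y is {c} when x is 0"
    )


# The plant model from earlier in the lesson, then a hypothetical falling line.
print(interpret_line(1.2, 2))
print(interpret_line(-0.5, 10))
```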
Consider the table of values that show four excerpts from a database comparing the income per capita of a country and the child mortality rate of the country. If a scatter plot was created from the entire database, what relationship would you expect it to have?
| Income per capita | Child mortality rate |
The following scatter plot shows the data for two variables, $x$ and $y$.
Determine which of the following graphs contains the line of best fit.
The scatter plot shows the relationship between sea temperatures and the amount of healthy coral.
Describe the correlation between sea temperature and the amount of healthy coral.
Select all that apply.
Which variable is the dependent variable?
Which variable is the independent variable?
Use digital technology to investigate bivariate numerical data sets. Where appropriate, use a straight line to describe the relationship, allowing for variation; make predictions based on this straight line; and discuss limitations.