AustraliaNSW
Stage 5.1-2

9.06 Fitting lines to bivariate data

Lesson

Describe correlation

To help us identify any correlation between the two variables, there are three things we focus on when analysing a scatterplot:

  • Form: linear or non-linear, what shape the data has

If it is linear:

  • Direction: positive or negative, whether a line drawn through the data has a positive or negative gradient

  • Strength: strong, moderate, weak - how tightly the points model a line

When we are looking at the form of a scatterplot we are looking to see if the data points show a pattern that has a linear form. If the data points lie on or close to a straight line, we can say the scatterplot has a linear form.

Forms other than a line may be apparent in a scatterplot. If the data points lie on or close to a curve, it may be appropriate to infer a non-linear form between the variables. We will only be using linear models in this course.

The direction of the scatterplot refers to the pattern shown by the data points. We can describe the direction of the pattern as having positive correlation, negative correlation, or no correlation:

  • Positive correlation

    • From a graphical perspective, this occurs when the y-coordinate increases as the x-coordinate increases, which is similar to a line with a positive gradient.

  • Negative correlation

    • From a graphical perspective, this occurs when the y-coordinate decreases as the x-coordinate increases, which is similar to a line with a negative gradient.

  • No correlation

    • No correlation describes a data set that has no relationship between the variables.

The strength of a linear correlation relates to how closely the points resemble a straight line.

  • If the points lie exactly on a straight line then we can say that there is a perfect correlation.

  • If the points are scattered randomly then we can say there is no correlation.

Most scatterplots will fall somewhere in between these two extremes and will display a weak, moderate or strong correlation.

A perfect positive correlation graph where the data points line up on a straight line with a positive gradient.
A perfect negative correlation graph where the data points line up on a straight line with a negative gradient.
A strong positive correlation graph where the points are close to a straight line with a positive gradient.
A strong negative correlation graph where points are close to a straight line with a negative gradient.
A weak positive correlation graph where the relationship is still positive but the points do not lie on a line.
A weak negative correlation graph where the relationship is still negative but the points do not lie on a line.
A no correlation graph where data points are randomly scattered in the graph.
A no correlation graph where data points are closely clustered and resemble a horizontal line.

Examples

Example 1

Consider the table of values, which shows four excerpts from a database comparing the income per capita of a country with the child mortality rate of that country. If a scatter plot were created from the entire database, what relationship would you expect it to show?

Income per capita | Child mortality rate
1465 | 67
11 428 | 16
2621 | 35
32 468 | 9

A. Strongly positive
B. No relationship
C. Strongly negative

Worked Solution
Create a strategy

Consider how the mortality rate changes as the income increases.

Apply the idea

As the income per capita increases, the child mortality rate decreases, so there is a negative correlation between the two sets of data. The correct answer is C.
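
The reasoning above can be checked with a short Python sketch. It uses only the four rows from the table; the variable names are illustrative and not part of the original lesson. Four rows cannot establish how strong the relationship is across the whole database, so the sketch only checks the direction.

```python
# The four (income per capita, child mortality rate) pairs from the table.
rows = [(1465, 67), (11428, 16), (2621, 35), (32468, 9)]

# Sort by income so the mortality rates can be read from lowest to highest income.
rows_by_income = sorted(rows)
mortality_in_income_order = [mortality for _, mortality in rows_by_income]

# Check whether each mortality rate is lower than the one before it.
always_decreasing = all(
    later < earlier
    for earlier, later in zip(mortality_in_income_order, mortality_in_income_order[1:])
)

print(mortality_in_income_order)  # [67, 35, 16, 9]
print(always_decreasing)          # True
```

Because the mortality rate falls every time the income rises, the direction in these four rows is negative, which agrees with answer C.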

Example 2

The scatter plot shows the relationship between sea temperature and the amount of healthy coral.

A scatter plot showing a negative correlation between sea temperature on the x axis and coral on the y axis.
a

Describe the correlation between sea temperature and the amount of healthy coral.

Worked Solution
Create a strategy

Describe what happens to the coral (dependent variable) as the sea temperature (independent variable) increases.

Apply the idea

The sea temperature increases from left to right. We can see from the graph that the coral decreases (falls) from left to right.

So as the sea temperature increases, the amount of healthy coral decreases. There is a negative linear relationship between the variables.

b

Which variable is the dependent variable?

A. Level of healthy coral
B. Sea temperature

Worked Solution
Create a strategy

The dependent variable is placed on the vertical axis and is affected by the independent variable.

Apply the idea

The level of healthy coral is determined by sea temperature and is on the vertical axis, making it the dependent variable. So, the correct answer is A.

c

Which variable is the independent variable?

A. Level of healthy coral
B. Sea temperature

Worked Solution
Create a strategy

An independent variable is a variable that stands alone and is not changed by the other variables you are measuring.

Apply the idea

From the previous problem, we know that the level of healthy coral is the dependent variable, so this means the sea temperature is the independent variable. The correct answer is B.

Idea summary

There are three things we focus on when analysing a scatterplot:

  • Form: linear or non-linear, what shape the data has

If it is linear:

  • Direction: positive or negative, whether a line drawn through the data has a positive or negative gradient

  • Strength: strong, moderate, weak - how tightly the points model a line

If there is no connection between the two variables we say there is no correlation.

Line of best fit

If our data appears linear, we need a line to compare the data against. This makes the data easier to analyse and gives us numeric answers about the strength and direction of the correlation.

A line of best fit (or "trend" line) is a straight line that best represents the data on a scatter plot. This line may pass through some of the points, none of the points, or all of the points. However, it always represents the general trend of the points, which then determines whether there is a positive, negative or no linear relationship between the two variables.

Lines of best fit are really handy as they help determine whether there is a relationship between two variables, which can then be used to make predictions.

To draw a line of best fit, we want to minimise the vertical distances from the points to the line. This will roughly create a line that passes through the centre of the points.
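
One common way to make "minimise the vertical distances" precise is the least-squares method, which minimises the sum of the squared vertical distances. The lesson does not name a specific method, so treat the following as a sketch under that assumption. The data points and function names are made up for illustration.

```python
def least_squares_line(points):
    """Return (gradient, intercept) of the least-squares line through points.

    Assumes the x-values are not all equal.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Gradient = sum of (x - mean_x)(y - mean_y) divided by sum of (x - mean_x)^2.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    denominator = sum((x - mean_x) ** 2 for x, _ in points)
    gradient = numerator / denominator
    # The least-squares line passes through the point (mean_x, mean_y).
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


def sum_squared_vertical_distances(points, gradient, intercept):
    """Add up the squared vertical gaps between each point and the line."""
    return sum((y - (gradient * x + intercept)) ** 2 for x, y in points)


# Hypothetical data, invented for this example.
points = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 6)]

m, c = least_squares_line(points)
best = sum_squared_vertical_distances(points, m, c)
# A different line, y = x + 1, drawn "by eye" for comparison.
other = sum_squared_vertical_distances(points, 1.0, 1.0)

print(m, c)
print(best <= other)
```

For this made-up data the fitted line is approximately y = 0.8x + 1.8, and its total squared vertical distance is smaller than that of the hand-drawn line y = x + 1.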

Given a set of data relating two variables x and y, it may be possible to form a linear model. This model can then be used to understand the relationship between the variables and make predictions about other possible ordered pairs that fit this relationship.

A scatter plot with weekly growth on the x axis and height (cm) on the y axis.

Say we gathered several measurements on the height of a plant h over an 8 week period, where t is time measured in weeks. We can then plot the data on the xy-plane as shown.

A scatter plot with weekly growth on the x axis and height (cm) on the y axis, with a line drawn through the data.

We can fit a model through the observed data to make predictions about the height at certain times after planting.

The graph shows a linear function modelling height of a plant over time.

Notice how the line minimises its total distance from all the points and follows the trend of the data, even though it cannot pass through every point. This data set has a positive correlation, as the line has a positive gradient, and the correlation is strong, as the points lie quite close to the line.

Examples

Example 3

The following scatter plot shows the data for two variables, x and y.

A scatter plot with x on the horizontal axis and y on the vertical axis, each marked from 1 to 10.

Draw a line of best fit for the data.

Worked Solution
Create a strategy

Draw a line that follows the trend of the points and has the same number of points above and below it.

Apply the idea
A scatter plot of x and y, each marked from 1 to 10, with a line of best fit drawn through the points.

Here is an example of a line of best fit that follows the trend of the data and has the same number of points above and below the line.
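
The "same number of points above and below" check can be sketched in Python. The points and the candidate line below are hypothetical, since the plotted values for this example are not listed in the text, and the function name is illustrative only.

```python
def count_above_and_below(points, gradient, intercept):
    """Count points strictly above and strictly below y = gradient * x + intercept."""
    above = sum(1 for x, y in points if y > gradient * x + intercept)
    below = sum(1 for x, y in points if y < gradient * x + intercept)
    return above, below


# Hypothetical data and candidate line y = x + 0.5, invented for this example.
points = [(1, 3), (2, 2), (4, 5), (5, 4), (7, 8), (8, 7)]
above, below = count_above_and_below(points, gradient=1, intercept=0.5)

print(above, below)  # 3 3
```

Equal counts are only a quick check that the line runs through the middle of the data. The line should also follow the trend of the points.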

Idea summary

To draw a line of best fit, we want to minimise the vertical distances from the points to the line. This will roughly create a line that passes through the centre of the points. The number of points above the line should be about equal to the number below it.

Interpret line of best fit

Given the equation of the line of best fit, we are able to make important observations about the original data. When we are analysing data, it is important that we consider the context.

When we interpret the vertical intercept c of a line of best fit y=mx+c, we need to consider whether it makes sense for the independent variable to be zero and the dependent variable to have a value of c.

Examples

Example 4

Here is the line of best fit for the plant height data from earlier in this lesson. The equation of the line is y=1.2x+2 where y represents the height of the plant in centimetres and x represents the time passed in weeks.

A scatter plot with weekly growth on the x axis and height (cm) on the y axis, with the line of best fit drawn through the points.

Interpret the gradient and y-intercept of the line for this situation.

Worked Solution
Create a strategy

Read the gradient and y-intercept from the equation of the line, which has the form y=mx+c.

Apply the idea

The value 1.2 is the gradient of the line. Since this is a positive number, it indicates that there is a positive relationship between the variables. This tells us that each week, the plant's height increases by about 1.2 cm.

The value 2 is the vertical intercept of the line. It tells us that, according to the model, the plant was initially 2 cm high.
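
As a quick check of this interpretation, the equation y=1.2x+2 from the example can be written as a small Python function. The function name is illustrative and not part of the original lesson.

```python
def predicted_height(weeks):
    """Height in cm predicted by the line of best fit y = 1.2x + 2."""
    return 1.2 * weeks + 2


# At week 0 the model gives the vertical intercept.
print(predicted_height(0))  # 2.0

# Each extra week adds the gradient, about 1.2 cm.
weekly_change = predicted_height(4) - predicted_height(3)
print(round(weekly_change, 1))  # 1.2
```

The same function can be used to predict the height at other times, keeping in mind that predictions far outside the 8 weeks of measurements may not be reliable.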

Idea summary

A line of best fit has an equation of the form: y=mx+c

The m value is the gradient, which shows whether the correlation is positive or negative:

  • If the gradient is positive, when the independent variable increases by 1 unit, the dependent variable increases by m units. The correlation is positive too.

  • If the gradient is negative, when the independent variable increases by 1 unit, the dependent variable decreases by |m| units. The correlation is negative too.

The c value shows the vertical intercept (also known as the y-intercept):

  • When the independent variable is 0, the value of the dependent variable is c.
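
The summary above can be turned into a small Python sketch that describes a line of best fit in words. The function name and the wording of its output are assumptions made for illustration. The zero-gradient case is not discussed in the summary and is included only so the function handles every input.

```python
def describe_line(m, c):
    """Describe a line of best fit y = m*x + c in words."""
    if m > 0:
        direction = f"positive: y increases by {m} for each 1 unit increase in x"
    elif m < 0:
        # A negative gradient means y falls by the size of m, written |m|.
        direction = f"negative: y decreases by {abs(m)} for each 1 unit increase in x"
    else:
        direction = "zero gradient: y does not change as x changes"
    return f"{direction}; when x = 0, y = {c}"


print(describe_line(1.2, 2))
print(describe_line(-3, 10))
```

For the plant example, `describe_line(1.2, 2)` reports a positive direction, an increase of 1.2 per unit of x, and a value of 2 when x is 0.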

Outcomes

MA5.2-16SP

investigates relationships between two statistical variables, including their relationship over time
