Victorian Curriculum Year 10A - 2020 Edition 11.08 Fitting lines to bivariate data

## Describing correlation

To help us identify any correlation between the two variables, there are three things we focus on when analysing a scatterplot:

• Form: linear or non-linear, what shape the data has

If it is linear:

• Direction: positive or negative, whether a line drawn through the data has a positive or negative gradient
• Strength: strong, moderate, weak - how tightly the points model a line

### Form

When we look at the form of a scatterplot, we check whether the data points show a linear pattern. If the data points lie on or close to a straight line, we can say the scatterplot has a linear form.

Forms other than a line may be apparent in a scatterplot. If the data points lie on or close to a curve, it may be appropriate to infer a non-linear form between the variables. We will only be using linear models in this course.

### Direction

The direction of the scatterplot refers to the pattern shown by the data points. We can describe the direction of the pattern as having positive correlation, negative correlation or no correlation:

• Positive correlation
  • From a graphical perspective, this occurs when the $y$-coordinate increases as the $x$-coordinate increases, which is similar to a line with a positive gradient.
• Negative correlation
  • From a graphical perspective, this occurs when the $y$-coordinate decreases as the $x$-coordinate increases, which is similar to a line with a negative gradient.
• No correlation
  • No correlation describes a data set that has no relationship between the variables.

### Strength

The strength of a linear correlation relates to how closely the points resemble a straight line.

• If the points lie exactly on a straight line then we can say that there is a perfect correlation.
• If the points are scattered randomly then we can say there is no correlation.

Most scatterplots will fall somewhere in between these two extremes and will display a weak, moderate or strong correlation.

#### Worked example

Identify the type of correlation in the following scatter plot.

Think: This dataset looks to be linear so we can examine its direction and strength.

Do: The line fits quite closely to all of the points, so it is a strong correlation. A line drawn through these points would have a positive gradient. In summary, we would say that this scatterplot indicates a strong, positive correlation.

Remember!

There are three things we focus on when analysing a scatterplot:

• Form: linear or non-linear - whether the shape of the data is linear (a straight line)

• Direction: positive or negative - whether a line drawn through the data has a positive or negative gradient
• Strength: strong, moderate, weak - how closely the data is grouped around a straight line

If there is no connection between the two variables we say there is no correlation.

## Line of best fit

If our data does appear linear, we need a line to compare the data against. This makes the data easier to analyse and lets us describe the strength and direction of the correlation numerically.

A line of best fit (or "trend" line) is a straight line that best represents the data on a scatter plot. This line may pass through some of the points, none of the points, or all of the points. However, it always represents the general trend of the points, which then determines whether there is a positive, negative or no linear relationship between the two variables.

Lines of best fit are really handy as they help determine whether there is a relationship between two variables, which can then be used to make predictions.

To draw a line of best fit, we want to minimise the vertical distances from the points to the line. This will roughly create a line that passes through the centre of the points.
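One common way to make "minimise the vertical distances" precise is the least-squares method, which minimises the sum of the squared vertical distances. The lesson does not name a method, so treat the following as a minimal sketch under that assumption. The function name `fit_line` and the data values are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (gradient, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Gradient: how y varies together with x, relative to the spread of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    gradient = sxy / sxx
    # The least-squares line always passes through the point (mean_x, mean_y).
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


# Made-up illustrative data (not taken from the lesson).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
gradient, intercept = fit_line(xs, ys)
print(f"y = {gradient:.2f}x + {intercept:.2f}")
```

For this made-up data the fitted line is roughly $y=1.99x+0.05$, which passes close to every point without necessarily passing through any of them.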

### Using a linear model to make predictions

Given a set of data relating two variables $x$x and $y$y, it may be possible to form a linear model. This model can then be used to understand the relationship between the variables and make predictions about other possible ordered pairs that fit this relationship.

#### Exploration

Say we gathered several measurements on the height of a plant $h$ over an $8$ week period, where $t$ is time measured in weeks. We can then plot the data on the $xy$-plane as shown below.

Height of a plant measured at several instances.

We can fit a model through the observed data to make predictions about the height at certain times after planting.

Linear graph modelling height of a plant over time.

Notice how the line follows the trend of the data and keeps its total distance from all the points as small as possible, even though it is impossible for a straight line to pass through every point. This dataset has a positive correlation as the line has a positive gradient, and the correlation is strong as the points lie quite close to the line.

## Interpreting the line of best fit

Given the equation of the line of best fit, we are able to make important observations about the original data.  When we are analysing data, it is important that we consider the context. Below is the equation of the line of best fit for the graph above:

$y=1.2x+2$

In this example, $y$ represents the height of the plant (the response variable) and $x$ represents the time passed in weeks (the explanatory variable).

The value $1.2$, displayed in front of the $x$, is the gradient of the line. Since this is a positive number, it indicates that there is a positive relationship between the variables. In the context of this example, this tells us that for each week of time, the plant's height increases by approximately $1.2$ cm.

The value $2$ is the vertical intercept of the line of best fit and, in the context of this example, tells us that the plant was initially about $2$ cm high.
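To see how the equation is used to make predictions, the sketch below substitutes a few values of $x$ (weeks) into $y=1.2x+2$. The function name `predicted_height` is hypothetical, and the chosen weeks are only examples within the $8$ week measurement period.

```python
def predicted_height(weeks):
    """Height in cm predicted by the line of best fit y = 1.2x + 2."""
    gradient = 1.2   # cm of growth per week
    intercept = 2    # predicted height in cm at week 0
    return gradient * weeks + intercept


for weeks in (0, 5, 8):
    print(f"week {weeks}: about {predicted_height(weeks):.1f} cm")
```

At week $5$, for example, the model gives $1.2\times5+2=8$ cm. Predictions for times well outside the measured period are less reliable, since the linear trend may not continue.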

Interpreting the line of best fit

For a line of best fit in the form $y=mx+c$:

The $m$ value is the gradient, which shows whether the correlation is positive or negative:

• if the gradient is positive, when the explanatory variable increases by $1$ unit, the response variable increases by $m$ units. The correlation is positive too.
• if the gradient is negative, when the explanatory variable increases by $1$ unit, the response variable decreases by $|m|$ units. The correlation is negative too.

The $c$ value shows the vertical intercept (also known as the $y$-intercept):

• when the explanatory variable is $0$, the value of the response variable is $c$.

When we interpret the vertical intercept, we need to consider if it makes sense for the explanatory variable to be zero and the response variable to have a value indicated by $c$.
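The interpretation rules above can be sketched as a small helper that reads off the direction from the sign of $m$ and states what $m$ and $c$ mean. The function name `describe_line` and the second example line are hypothetical. Whether the intercept is meaningful in context still has to be judged by the reader.

```python
def describe_line(m, c):
    """Return (direction, summary) for a line of best fit y = m*x + c."""
    if m > 0:
        direction = "positive"
        change = f"increases by {m} units"
    elif m < 0:
        direction = "negative"
        # Report the size of the decrease as a positive number.
        change = f"decreases by {abs(m)} units"
    else:
        direction = "no linear"
        change = "does not change"
    summary = (
        f"{direction} correlation: when the explanatory variable increases "
        f"by 1 unit, the response variable {change}; when the explanatory "
        f"variable is 0, the response variable is {c}"
    )
    return direction, summary


print(describe_line(1.2, 2)[1])    # the plant example from this lesson
print(describe_line(-0.5, 10)[1])  # a made-up line with negative gradient
```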

#### Practice questions

##### Question 1

Consider the table of values that shows four excerpts from a database comparing the income per capita of a country and the child mortality rate of the country. If a scatter plot was created from the entire database, what relationship would you expect it to have?

| Income per capita | Child mortality rate |
|---|---|
| $3041$ | $80$ |
| $10841$ | $20$ |
| $12997$ | $33$ |
| $32262$ | $8$ |

• A. No relationship
• B. Strongly positive
• C. Strongly negative

##### Question 2

The following scatter plot shows the data for two variables, $x$ and $y$.

1. Determine which of the following graphs contains the line of best fit.

Options A, B, C and D (graphs).

##### Question 3

The scatter plot shows the relationship between sea temperatures and the amount of healthy coral.

1. Describe the correlation between sea temperature and the amount of healthy coral.

Select all that apply.

• A. Positive
• B. Negative
2. Which variable is the dependent variable?

• A. Level of healthy coral
• B. Sea temperature
3. Which variable is the independent variable?

• A. Level of healthy coral
• B. Sea temperature

### Outcomes

#### VCMSP373 (10a)

Use digital technology to investigate bivariate numerical data sets. Where appropriate use a straight line to describe the relationship allowing for variation, make predictions based on this straight line and discuss limitations