6.07 Scatter plots and lines of fit

Lesson

Creating scatter plots

When trying to determine if there is a relationship between two variables, we will collect data for various values of the independent variable and the corresponding values for the dependent variable.

The first step in determining the presence and type of relationship is to plot the data on a scatter plot. The independent variable goes on the horizontal $x$-axis and the dependent variable goes on the vertical $y$-axis.

Once we have a scatter plot, we can start to perform analysis such as determining correlation and a line of best fit.
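
As a rough illustration of the axis convention, here is a minimal Python sketch that builds a text-only scatter plot. The function name `text_scatter` and the data (hours studied and a quiz score out of 10) are hypothetical, and the sketch assumes small non-negative whole-number values. In practice you would normally use graph paper or a plotting tool.

```python
def text_scatter(xs, ys):
    """Return a text scatter plot for small non-negative integer data.

    The independent variable (xs) runs left to right and the
    dependent variable (ys) runs bottom to top.
    """
    points = set(zip(xs, ys))
    rows = []
    # Build rows from the largest y down to 0 so the plot reads like a graph.
    for y in range(max(ys), -1, -1):
        cells = "".join(
            "*" if (x, y) in points else "." for x in range(max(xs) + 1)
        )
        rows.append(f"{y:>2} {cells}")
    # Label the horizontal axis with the last digit of each x value.
    rows.append("   " + "".join(str(x % 10) for x in range(max(xs) + 1)))
    return "\n".join(rows)


# Hypothetical data: hours studied (independent), quiz score (dependent).
hours = [1, 2, 3, 4, 5]
score = [2, 3, 5, 6, 8]
print(text_scatter(hours, score))
```

Each `*` marks one data pair, with the hours read along the bottom and the score read up the side.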

Correlations

A correlation is a way of expressing a relationship between two variables and, more specifically, how strongly pairs of data are related. We describe the correlation in data using terms like positive correlation, negative correlation, or no correlation. We might also say that two variables have a strong or weak correlation.

Watch out!

Just because two variables correlate, even to a high degree, it does not imply that one causes the other. For example, there is a high degree of correlation between height and stride length. However, it doesn't mean that if you take big steps you'll grow taller!

 

Linear patterns and scatter plots

A pattern in a scatter plot can reveal whether or not two measurements are connected to each other. In other words, the presence of a pattern signals that the two sets of data correlate. We will focus on linear patterns in this chapter.

This linear relationship can be seen through close and consistent grouping in a scatter plot. The more closely the dots resemble a straight line, the stronger the correlation between the variables.
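
How closely the points follow a straight line can also be summarized with a single number. One common measure, which this lesson does not otherwise use, is Pearson's correlation coefficient $r$. It ranges from $-1$ to $1$: values near $1$ or $-1$ mean the points lie close to a line, the sign gives the direction, and values near $0$ mean there is little linear pattern. The sketch below is only an illustration under that assumption; the function name `pearson_r` and all of the data are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data sets sharing the same x values.
xs = [1, 2, 3, 4, 5]
perfect_up = [3, 5, 7, 9, 11]    # exactly on the line y = 2x + 1
perfect_down = [11, 9, 7, 5, 3]  # exactly on the line y = -2x + 13
noisy_up = [2, 5, 4, 8, 7]       # upward trend with some scatter

for name, ys in [("perfect_up", perfect_up),
                 ("perfect_down", perfect_down),
                 ("noisy_up", noisy_up)]:
    print(name, round(pearson_r(xs, ys), 2))
```

The two data sets that lie exactly on a line give values of $1$ and $-1$, while the scattered set gives a positive value below $1$.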

Positive Correlations

A positive correlation is when the data points gather around an upward trend, similar to a straight line with a positive slope.

In other words, as one variable increases, the other variable also increases.

There are varying degrees of positive correlation:

  • Perfect positive correlation, where the points lie exactly on a straight line with a positive slope.
  • Strong positive correlation, where the points closely resemble a straight line with a positive slope.
  • Weak positive correlation, where the points are more loosely scattered but the overall trend is still positive.

For example, the scatter plot below shows a strong positive correlation between a person's height and arm span. You can see that as the first variable increases, the second increases too. 

[Figure: scatter plot of a person's height and arm span. Source: http://www.learner.org/courses/learningmath/data/session7/part_c/using.html]

Here is a table of positive correlations.

 

Negative Correlations

A negative correlation is when the data points gather around a downward trend, similar to a straight line with a negative slope.

In other words, as one variable increases, the other one decreases.

Like positive correlation, there are varying degrees of negative correlation:

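  • Perfect negative correlation, where the points lie exactly on a straight line with a negative slope.
  • Strong negative correlation, where the points closely resemble a straight line with a negative slope.
  • Weak negative correlation, where the points are more loosely scattered but the overall trend is still negative.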

The next scatter plot shows a strong negative correlation. You can see that as the first variable increases, the second variable decreases.

 

Here is a table of negative correlations.

 

No Correlation

No correlation is when there is no relationship between the variables.

This means that the points are scattered randomly, with no overall upward or downward trend.

Here is a diagram of no correlation

Practice questions

Question 1

Identify the type of correlation in the following scatter plot.

The data points are plotted in a coordinate plane. The scatter plot shows a negative direction, with data points clustering closely in a manner that suggests a linear relationship.

    A. Weak positive correlation
    B. Weak negative correlation
    C. No correlation
    D. Strong negative correlation
    E. Strong positive correlation

Question 2

Consider the two variables: time spent studying and exam performance.

  1. Is there likely to be a relationship between the two?

    A. Yes
    B. No
  2. Do you think the correlation is positive or negative?

    A. Positive
    B. Negative

Question 3

The following table shows the number of traffic accidents associated with a sample of drivers of different age groups.

Age    Accidents
$20$   $41$
$25$   $44$
$30$   $39$
$35$   $34$
$40$   $30$
$45$   $25$
$50$   $22$
$55$   $18$
$60$   $19$
$65$   $17$
  1. Which of the following scatter plots correctly represents the above data?

    A

    B

    C
  2. Is the correlation between a person's age and the number of accidents they are involved in positive or negative?

    A. Positive
    B. Negative
  3. Is the correlation between a person's age and the number of accidents they are involved in strong or weak?

    A. Strong
    B. Weak
  4. Which age group's data represent an outlier?

    A. 30-year-olds
    B. None of them
    C. 65-year-olds
    D. 20-year-olds

 

Lines of best fit

A line of best fit (or trend line) is a straight line that best represents the data on a scatter plot. Depending on the strength of the correlation, this line may pass exactly through all of the points, some of the points, or none of the points. However, it always represents the general trend of the data.

Lines of best fit are really handy as we can use them to help us make predictions or conclusions about the data. 

To draw a line of best fit by eye, balance the number of points above the line with the number of points below the line. You should generally ignore outliers as they can skew the line of best fit. Later we will look at how we can calculate a line of best fit's equation.
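
As a preview of calculating the equation, one standard approach is the least-squares method, which chooses the slope and intercept that minimize the sum of the squared vertical distances from the points to the line. The lesson does not say which method it will use later, so treat this as an assumption. The function name `least_squares_line` and the data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data with an upward trend.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = least_squares_line(xs, ys)

# Count how many points sit above and below the fitted line.
above = sum(y > slope * x + intercept for x, y in zip(xs, ys))
below = sum(y < slope * x + intercept for x, y in zip(xs, ys))
print(round(slope, 2), round(intercept, 2), above, below)
```

For this data the fitted line is roughly $y=0.6x+2.2$, with two points above it and three below, which is close to the balance you aim for when drawing the line by eye.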

Below is an example of what a good line of best fit might look like.

Making predictions

If the points appear to lie close to a line, we conclude that a relationship probably exists and it is safe to make predictions using a line of best fit.  Making predictions inside the range of the data is called interpolation.

In a well-designed experiment, a researcher is careful not to use the fitted line to predict the response for values of the independent variable that lie outside the range of values used in the experiment. For example, if the smallest value of the independent variable in the experiment was 10 and the largest was 85, it would be unwise to predict the response when the independent variable is smaller than 10 or larger than 85.

Making predictions beyond the range of the data is called extrapolation and is considered unsafe, because the trend may not continue outside the observed range.
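
The distinction can be sketched in a few lines of Python. The range of 10 to 85 comes from the example above; the function name `predict` and the line $y=2x+5$ are hypothetical and chosen only for illustration.

```python
def predict(slope, intercept, x, x_min, x_max):
    """Return (predicted y, kind) for a line of best fit.

    kind is "interpolation" when x lies inside the observed range
    [x_min, x_max] and "extrapolation" otherwise.
    """
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return slope * x + intercept, kind


# Observed range from the text: 10 to 85. The line y = 2x + 5 is hypothetical.
print(predict(2, 5, 40, 10, 85))  # x = 40 is inside the range
print(predict(2, 5, 90, 10, 85))  # x = 90 is outside the range
```

The formula produces a number in both cases, so it is up to the person using the line to check whether the input falls inside the range of the data.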

Example

The data points illustrated in the graph below show the sale price of an item of goods measured against the age of the item. Given the value of either the independent or dependent variable, we can determine the corresponding value using the line of best fit as seen below.

The regression line can be used to predict that at 23 months the value of the goods will be approximately $\$1250$. Considering how far the data points lie above and below the regression line, the estimated value of the goods at 23 months could easily be $\$100$ too low or too high.

Practice questions

Question 4

The following scatter plot shows the data for two variables, $x$ and $y$.

A scatter plot with an x-axis labeled from 0 to 10 and a y-axis labeled from 0 to 10. Both axes are in increments of 1. Gray gridlines divide the plane into square units. Nine points are plotted on the grid: $(1,2)$, $(2,1)$, $(3,3)$, $(4,5)$, $(5,6)$, $(6,5)$, $(7,7)$, and $(8,7)$. The coordinates are not explicitly labeled.
  1. Determine which of the following graphs contains the line of best fit.

    Each graph shows the scatter plot above with a green line added. The green line slopes upward, starting near the origin and extending near the top-right corner.

    A. The green line follows the trend of the points. The points are near the line, with some above and some below it.
    B. Most of the points are below the green line.
    C. Most of the points are above the green line.
    D. The green line does not follow the trend of the points. Points are plotted above and below the line.
  2. Use the line of best fit to estimate the value of $y$ when $x=4.5$.

    The scatter plot above with the line of best fit drawn in green. The line slopes upward, starting near the origin and extending near the top-right corner. The points are near the line of best fit, but their coordinates are not explicitly labeled.

    A. $4.5$
    B. $5$
    C. $5.5$
    D. $6$
  3. Use the line of best fit to estimate the value of $y$ when $x=9$.

    A. $6.5$
    B. $7$
    C. $8.4$
    D. $9.5$

Question 5

The number of fish in a river is measured over a five-year period.

The results are shown in the following table and plotted below.

Time in years ($t$)     $0$      $1$      $2$      $3$      $4$      $5$
Number of fish ($F$)    $1903$   $1998$   $1900$   $1517$   $1693$   $1408$

[Graph: number of fish $F$ plotted against time $t$ in years]

  1. Which line best approximates the data?

    A. [Graph not shown]
    B. [Graph not shown]
    C. [Graph not shown]
    D. [Graph not shown]
  2. Use this line to predict the number of years until there are no fish left in the river.

  3. Now predict the number of fish remaining in the river after $7$ years.

  4. Predict how long it will be before there are $900$ fish left in the river.

Outcomes

I.S.ID.6

Represent data on two quantitative variables on a scatter plot, and describe how the variables are related.

I.S.ID.6.a

Fit a function to the data; use functions fitted to data to solve problems in the context of the data. Use given functions or choose a function suggested by the context. Emphasize linear, quadratic, and exponential models.

I.S.ID.6.c

Fit a linear function for a scatter plot that suggests a linear association.

I.S.ID.7

Interpret the slope (rate of change) and the intercept (constant term) of a linear model in the context of the data.
