7.02 Scatterplots and lines of fit

When looking at bivariate data, a scatterplot can be used to display the relationship between the two variables.

If the data has a linear trend, a line of best fit can be used to model the relationship between the variables. We can use technology to find the line of best fit for a scatterplot, then use the line to help us make predictions or draw conclusions about the data.

There are mathematical calculations we can use to measure the strength of the linear correlation between two variables.

Exploration

Move each point in the applet to see how the correlation coefficient changes.

  1. Arrange these points so the value of r is as large as possible. What do you notice?

  2. Arrange these points so the value of r is as close to zero as possible. What do you notice?

  3. Move the points so they are in a straight line, then move one point so it is an outlier. What happens to the correlation coefficient value?

The correlation coefficient, r, is a statistic that describes both the strength and direction of a linear correlation.
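As a rough illustration of the kind of calculation behind r, the sketch below computes it from the usual definition: the sum of the products of the deviations from the means, divided by the square root of the product of the sums of squared deviations. The function name `pearson_r` and the sample points are made up for this illustration and do not come from the applet.

```python
import math

def pearson_r(xs, ys):
    """Return the correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical points that lie exactly on the line y = 2x + 1
xs = [1, 2, 3, 4, 5]
on_line = [3, 5, 7, 9, 11]
# The same points, but with the last one moved far off the line (an outlier)
with_outlier = [3, 5, 7, 9, 1]

r_line = pearson_r(xs, on_line)
r_outlier = pearson_r(xs, with_outlier)
print(round(r_line, 4), round(r_outlier, 4))
```

In this made-up example, the points on the line give r=1, and moving a single point far enough off the line pulls r all the way down to 0, which shows how sensitive r can be to one outlier in a small data set.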

A correlation coefficient model. Ask your teacher for more information.

Example scatterplots (each with axes x and y):

  • Perfect positive correlation, r=1

  • Perfect negative correlation, r=-1

  • Strong negative correlation, r=-0.974

  • Weak positive correlation, r=0.306

  • Moderate negative correlation, r=-0.684

  • No correlation, r=0.072

It is important to distinguish between causation (when changes in one variable cause changes in the other variable) and correlation (when the two variables are related, but one variable does not necessarily influence the other).

Correlation

A relationship between two variables

Causation

A relationship between two events where one event causes the other

Even when two variables have a strong relationship and r is close to -1 or 1, we cannot say that one variable causes change in the other variable. Causation can only be determined from an appropriately designed statistical experiment.

When the correlation coefficient is close to -1 or 1, we can have more confidence in using the model to make predictions and draw conclusions.

A large sample can also give us more confidence in our conclusions because a large sample is more likely to be representative of the population. However, some types of data can be hard to collect and we will have to do the best we can with a smaller sample, knowing that our conclusions may not be as valid.

Examples

Example 1

Data was collected on the number of concert tickets sold and the gross revenue generated by those ticket sales. The data is given in the table.

Tickets sold    Gross revenue (in million USD)
75\,980         8.7
71\,714         8.3
66\,517         7.9
63\,027         7.7
74\,000         9.1
68\,000         8
72\,805         8.6
70\,500         8.4
73\,117         9
65\,500         7.6
69\,200         8.2
71\,300         8.5
76\,012         9.2

a

Formulate a question that could be answered by the data.

Worked Solution
Create a strategy

We are given the number of tickets sold and the corresponding gross revenue from selling those tickets. Think of a question that could be answered by analyzing the relationship between these two variables.

Apply the idea

One example would be, "What is the relationship between the number of tickets sold and the gross revenue from the tickets?" or more specifically, "Does a high number of tickets sold correspond to a high gross revenue?"

Reflect and check

We could also formulate questions about predicting the gross revenue, such as "What is the predicted gross revenue if 80\,000 tickets are sold?"

b

Create a scatterplot of the data.

Worked Solution
Create a strategy

We can use technology to construct the scatterplot by following these steps:

  1. In the GeoGebra Statistics calculator, enter the data into two columns, one column for the independent variable and one column for the dependent variable.

  2. Select all of the cells containing data and choose "Two Variable Regression Analysis."

  3. Use the settings to adjust the scatterplot as needed.

In this scenario, the number of tickets sold is the independent variable and gross revenue is the dependent variable.

Apply the idea
  1. Enter the data into two columns.

    A screenshot of the GeoGebra statistics tool showing how to enter a given set of data. Speak to your teacher for more details.
  2. Select all of the cells containing data and choose "Two Variable Regression Analysis."

    A screenshot of the GeoGebra statistics tool showing how to select the Two Variable Regression Analysis option. Speak to your teacher for more details.
A screenshot of the GeoGebra statistics tool showing how to construct the scatter plot of a given set of data. Speak to your teacher for more details.
Reflect and check

After the scatterplot is generated, we can adjust the scales by clicking the settings icon.

  • Select "Graph" in the upper right corner, then deselect the box that says "Automatic Dimensions".

  • Checking the box that says "Show Grid" will add grid lines to the graph.

  • We could set the X Step (horizontal scale) to 1000 and the Y step (vertical scale) to 0.1, then select "Show Grid" to easily identify which point represents which pair of values.

A screenshot of the GeoGebra statistics tool showing how to adjust the scales used in a scatter plot. Speak to your teacher for more details.
c

Find the line of best fit.

Worked Solution
Create a strategy

We can use technology to find the line of best fit. After following the steps in part (b), we can find the equation of the best fit line by changing the regression model at the bottom of the screen to "Linear".

Apply the idea

Find the Regression model dropdown list at the bottom part of the screen, and select "Linear".

A screenshot of the GeoGebra statistics tool showing how to display the equation of the line of best fit. Speak to your teacher for more details.

The equation of the line of best fit is y=0.0001x-0.0701.

Reflect and check

Although the coefficients look very small, remember that the dependent variable is measured in millions of dollars. If revenue were written out in dollars instead, both coefficients would be one million times larger: a slope of about 100 dollars per ticket and a y-intercept of about -70\,100 dollars.
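To check the technology's output by hand, here is a minimal sketch of a least-squares calculation in Python using the table from this example. It assumes the calculator's "Linear" model is an ordinary least-squares fit, and the variable names are chosen only for illustration.

```python
# Data from the table: tickets sold and gross revenue (million USD)
tickets = [75980, 71714, 66517, 63027, 74000, 68000, 72805,
           70500, 73117, 65500, 69200, 71300, 76012]
revenue = [8.7, 8.3, 7.9, 7.7, 9.1, 8.0, 8.6,
           8.4, 9.0, 7.6, 8.2, 8.5, 9.2]

n = len(tickets)
mean_x = sum(tickets) / n
mean_y = sum(revenue) / n

# Least-squares slope: sum of products of deviations
# divided by the sum of squared x-deviations
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(tickets, revenue))
s_xx = sum((x - mean_x) ** 2 for x in tickets)
slope = s_xy / s_xx

# A least-squares line always passes through the point of means
intercept = mean_y - slope * mean_x

print(f"slope = {slope:.6f}, intercept = {intercept:.4f}")
```

Under that assumption, the slope comes out near 0.00012, which displays as 0.0001 when rounded to four decimal places, and the intercept is close to -0.0701.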

d

Use the correlation coefficient to evaluate the strength of the model.

Worked Solution
Create a strategy

We can use technology to find the correlation coefficient. After following the steps in part (c), select the \Sigma x icon.

We can use this diagram to evaluate the strength of the model.

A  correlation coefficient model. Ask your teacher for more information.
Apply the idea

Select "Show statistics". The correlation coefficient is represented by r.

A screenshot of the GeoGebra statistics tool showing how to display the statistics of a given set of data. Speak to your teacher for more details.

The correlation coefficient is r=0.9253, which means there is strong positive correlation between the variables. This tells us that the model can make relatively accurate predictions.
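The value of r reported by the calculator can be reproduced with a short calculation. The sketch below applies the standard formula for the correlation coefficient to the table from this example; the variable names are illustrative, and the calculator's internal method is assumed rather than known.

```python
import math

# Data from the table: tickets sold and gross revenue (million USD)
tickets = [75980, 71714, 66517, 63027, 74000, 68000, 72805,
           70500, 73117, 65500, 69200, 71300, 76012]
revenue = [8.7, 8.3, 7.9, 7.7, 9.1, 8.0, 8.6,
           8.4, 9.0, 7.6, 8.2, 8.5, 9.2]

n = len(tickets)
mean_x = sum(tickets) / n
mean_y = sum(revenue) / n

# Sum of products of deviations and sums of squared deviations
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(tickets, revenue))
s_xx = sum((x - mean_x) ** 2 for x in tickets)
s_yy = sum((y - mean_y) ** 2 for y in revenue)

# r is positive when the variables tend to increase together
r = s_xy / math.sqrt(s_xx * s_yy)
print(round(r, 4))
```

Because r depends only on how the points cluster around a line, it would be the same whether revenue is recorded in dollars or in millions of dollars.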

Reflect and check

Even when two variables have a strong relationship and r is close to 1 or -1, we cannot say that one variable causes change in the other variable. Causation can only be determined from an appropriately designed statistical experiment.

e

Use the model to answer the question you formulated in part (a).

Worked Solution
Create a strategy

The statistical question from part (a) is, "Does a high number of tickets sold correspond to a high gross revenue?" To answer this question, we can describe the correlation in context and discuss the strength of the correlation that we found in the previous part.

Apply the idea

The scatterplot shows a strong, positive linear correlation between the variables. This means that a high number of tickets sold does correspond to a high gross revenue.

Reflect and check

Note that the term "high" is relative, so a more specific description of the number of tickets sold or the amount of gross revenue may be better when communicating the results of the analysis.

For example, we might say, "When the number of tickets sold increased by about 13\,000 tickets, the gross revenue from those ticket sales increased by about \$1.6 million."

f

Predict the gross revenue if a concert sells 77\,000 tickets.

Worked Solution
Create a strategy

We can use technology and the line of best fit from part (c) to predict the gross revenue given the number of tickets sold. Remember that the number of tickets will be the input, and the output will be the gross revenue in millions of dollars.

To use the graph to predict the gross revenue, we can draw a vertical line at x=77\,000 until it intersects the line of best fit. Then, we can trace a horizontal line from that intersection across to the y-axis to read off the corresponding y-value.

Apply the idea

At the bottom of the graph, we can enter x=77\,000, and it will use the line of best fit to calculate the y-value.

A screenshot of the GeoGebra statistics tool showing how to use the scatter plot to predict the value of y given a value of x. Speak to your teacher for more details.

The predicted gross revenue if a concert sells 77\,000 tickets is \$9.17 \text{ million}.

Reflect and check

If we had used the line of best fit from part (c) and calculated the gross revenue by hand, we would have gotten a very different result.

\begin{aligned}
y &= 0.0001x - 0.0701 && \text{Equation of the line of best fit} \\
&= 0.0001\left(77\,000\right) - 0.0701 && \text{Substitute } x=77\,000 \\
&= 7.7 - 0.0701 && \text{Evaluate the multiplication} \\
&= 7.6299 && \text{Evaluate the subtraction}
\end{aligned}

This differs from the value we got using technology because the coefficients in our written line of best fit were rounded. The calculator uses the unrounded coefficients, so its result is more accurate. Working backward from the calculator's prediction, the unrounded slope is close to 0.00012 rather than 0.0001, and over 77\,000 tickets that small difference adds up to about \$1.5 million.
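The effect of rounding can be seen by computing both predictions side by side. The sketch below refits the line from the table, assuming an ordinary least-squares fit like the calculator's, and compares the prediction from the unrounded coefficients with the one from the rounded equation. The variable names are illustrative.

```python
# Data from the table: tickets sold and gross revenue (million USD)
tickets = [75980, 71714, 66517, 63027, 74000, 68000, 72805,
           70500, 73117, 65500, 69200, 71300, 76012]
revenue = [8.7, 8.3, 7.9, 7.7, 9.1, 8.0, 8.6,
           8.4, 9.0, 7.6, 8.2, 8.5, 9.2]

n = len(tickets)
mean_x = sum(tickets) / n
mean_y = sum(revenue) / n

# Least-squares slope and intercept, kept at full precision
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(tickets, revenue))
s_xx = sum((x - mean_x) ** 2 for x in tickets)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

x_new = 77000
# Prediction using the unrounded coefficients
pred_unrounded = slope * x_new + intercept
# Prediction using the rounded equation y = 0.0001x - 0.0701
pred_rounded = 0.0001 * x_new - 0.0701

print(round(pred_unrounded, 2), round(pred_rounded, 4))
```

When x is in the tens of thousands, the slope needs more significant figures than the intercept does. Reporting the slope as 0.00012, or in scientific notation, would bring the hand calculation much closer to the calculator's value.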

Example 2

A school principal was investigating the effect of class size on the amount of time a teacher can spend with small groups of students, where each student belonged to a group of 4 or fewer students. Their statistical question was, "What size should a class be for a teacher to be able to spend at least 10 minutes with students in small groups?"

Scatterplot with class size on the horizontal axis and time per small group (minutes) on the vertical axis.

a

Describe a possible method that the principal could use to collect the data.

Worked Solution
Create a strategy

The data collection methods we have studied are surveys, observations, scientific experiments, polls, or questionnaires. Another possible method is researching existing data on class sizes and attention time per small group.

Apply the idea

Because the investigator is a school principal, the data could have been collected firsthand. Teachers are most likely not tracking the time they spend with small groups since they are focused on teaching the material, so the principal probably would not use a survey, poll, or questionnaire.

One way the principal may have collected the data is through observation. They may have observed class periods with varying class sizes and tracked the amount of time each teacher was able to spend with small groups of students.

Reflect and check

Have you ever noticed a principal or assistant principal observing one of your classes? What kind of data do you think they were collecting?

b

The equation of the line of best fit shown is y=-0.401x+18.3, and the correlation coefficient is r=-0.95. Could this line of best fit be used to make reasonable predictions? Explain.

Worked Solution
Create a strategy

When considering the reasonableness of a model, we want to analyze the correlation coefficient. The closer the correlation coefficient is to -1 or 1, the stronger the linear correlation between the variables.

Apply the idea

The correlation coefficient of this model is -0.95, which indicates a strong, negative linear correlation between the variables. Because the correlation is strong, the actual data values are close to the line, which makes it a relatively reliable model for making predictions.

c

Describe the relationship between the variables based on the model. Include the values of the domain for which the model is appropriate.

Worked Solution
Create a strategy

To describe the relationship between the variables, we will describe the strength and direction of the correlation in context.

To determine the domain for which the model is appropriate, we will look at the domain of the actual data values and the clustering of the points to the line and consider what is reasonable within the given context.

Apply the idea

The model shows that as the class size increases, the time a teacher is able to spend with small groups of students decreases.

This model is based on data where the class sizes range from 6 students to 26 students, so it is most appropriate to use for predictions within this domain.

Reflect and check

The model can still be used to make predictions for class sizes outside of this domain, but the predictions become less reliable. Notice that the data on class sizes smaller than 10 are further from the line, making any prediction on smaller class sizes less reliable.

And while class sizes beyond 26 are likely to be clustered near the line based on the given data, we cannot be sure that this trend will continue without collecting more data.

For example, there may be a maximum class size for which this trend applies. Once a class gets large enough, the amount of time a teacher can spend may drop more sharply and be better modeled by a different type of function, like an exponential function.
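One way to keep track of where a model applies is to flag any prediction made outside the range of the data. The sketch below does this for the line of best fit from part (b), using the class sizes of 6 to 26 noted above as the domain. The function name `predict_time` and its return format are illustrative choices, not part of the lesson.

```python
def predict_time(class_size, smallest=6, largest=26):
    """Predict minutes per small group using y = -0.401x + 18.3.

    Returns the prediction and whether class_size lies inside the
    range of class sizes the model was built from.
    """
    minutes = -0.401 * class_size + 18.3
    within_data = smallest <= class_size <= largest
    return minutes, within_data

for size in (15, 40):
    minutes, within_data = predict_time(size)
    note = "within the data" if within_data else "outside the data, less reliable"
    print(size, round(minutes, 2), note)
```

The equation also shows why the domain matters: the line reaches zero minutes at a class size of about 45.6, so beyond that point it would predict a negative amount of time, which is impossible in this context.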

d

Use the graph to answer the principal's statistical question.

Worked Solution
Create a strategy

The principal's question is, "What size should a class be for a teacher to be able to spend at least 10 minutes with students in small groups?"

Apply the idea

A teacher is able to spend at least 10 minutes per small group when the class has about 21 students or fewer.

Scatterplot with class size on the horizontal axis and time per small group (minutes) on the vertical axis.

Reflect and check

We could also have used the equation of the line that was given in part (b).

\begin{aligned}
y &= -0.401x + 18.3 && \text{Equation of the line of best fit} \\
10 &= -0.401x + 18.3 && \text{Substitute } y=10 \\
-8.3 &= -0.401x && \text{Subtract } 18.3 \text{ from both sides} \\
20.698 &\approx x && \text{Divide both sides by } -0.401
\end{aligned}

This shows us that the value is actually slightly lower than 21, so a class size of 20 or fewer students may be a better prediction.
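The same algebra can be sketched in a few lines of Python. The variable names are illustrative; the only values taken from the lesson are the slope, the y-intercept, and the 10-minute target.

```python
import math

slope = -0.401        # minutes per additional student
intercept = 18.3      # predicted minutes at a class size of zero
target_minutes = 10

# Solve target_minutes = slope * x + intercept for x
boundary = (target_minutes - intercept) / slope

# The slope is negative, so predicted time stays at or above the target
# only for class sizes at or below the boundary. Class size must be a
# whole number, so round down.
max_class_size = math.floor(boundary)

print(round(boundary, 3), max_class_size)
```

Rounding down rather than to the nearest whole number matters here: a class of 21 would give a predicted time just under 10 minutes.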

Idea summary

A line of best fit for a set of data can be used to interpret a given situation and make predictions about values not represented by the data. We can use technology to perform the linear regression analysis.

The correlation coefficient, r, is a statistic that describes both the strength and direction of a linear correlation.

A  correlation coefficient model. Ask your teacher for more information.

Correlation does not imply causation.

Outcomes

A2.ST.2

The student will apply the data cycle (formulate questions; collect or acquire data; organize and represent data; and analyze data and communicate results) with a focus on representing bivariate data in scatterplots and determining the curve of best fit using linear, quadratic, exponential, or a combination of these functions.

A2.ST.2a

Formulate investigative questions that require the collection or acquisition of bivariate data and investigate questions using a data cycle.

A2.ST.2b

Collect or acquire bivariate data through research, or using surveys, observations, scientific experiments, polls, or questionnaires.

A2.ST.2c

Represent bivariate data with a scatterplot using technology.

A2.ST.2e

Determine the equation(s) of the function(s) that best models the relationship between two variables using technology. Curves of best fit may include a combination of linear, quadratic, or exponential (piecewise-defined) functions.

A2.ST.2f

Use the correlation coefficient to designate the goodness of fit of a linear function using technology.

A2.ST.2g

Make predictions, decisions, and critical judgments using data, scatterplots, or the equation(s) of the mathematical model.

A2.ST.2h

Evaluate the reasonableness of a mathematical model of a contextual situation.
