9.03 Linear regression

Exploration

Consider the graph shown:

A scatterplot titled "Car value over time" showing time since purchase in years (x) against value in thousands of dollars (y), with several candidate lines drawn.

  1. Is there a relationship between the number of years since purchase and the value in thousands of dollars? Explain.

  2. Which of the lines on the graph is the line of best fit?

A line of best fit (or trend line) is a straight line that best represents the data on a scatterplot. We can use lines of best fit to help us make predictions or conclusions about the data.

We previously approximated a line of best fit by trying to balance the number of points above the line with the number of points below the line. This can result in multiple different models.

Three scatterplots of plant height in centimeters against weekly growth, each with a different candidate line of best fit. One is labeled "3 points above, 3 points below" and another is labeled "5 points above, 4 points below".

We get a more accurate line of best fit when we use technology to calculate it. This process is referred to as linear regression analysis.

Once we have found the line of best fit for a scatterplot, we can interpret the key features and use the line to predict values that don't appear in the data set.

In the context of a line of best fit, the slope-intercept form y=mx+b represents:

  • m: the rate of change of y with respect to x

  • b: the starting value of y when x is 0

For example, this graph models a plant's growth over several weeks.

A scatterplot of plant height in centimeters against weekly growth, with a line of best fit drawn.

The slope of the line y=1.21x+2.14 means that the plant is growing at a rate of 1.21 centimeters per week.

The y-intercept of 2.14 means the plant was 2.14 centimeters tall at week 0. This is feasible if the plant was not a seed when measurements began.

These terms describe the range in which we make predictions:

  • Interpolation: Prediction within the range of x-values in the data

  • Extrapolation: Prediction outside the range of x-values in the data

A scatterplot with interpolation from a line of good fit. Ask your teacher for more information.
A scatterplot with extrapolation from a line of good fit. Ask your teacher for more information.

Using the previous example of the plant height over time:

The same scatterplot of plant height in centimeters against weekly growth, with a line of best fit drawn.

To interpolate the week in which the plant was 9 centimeters tall, we solve 9=1.21x+2.14. The plant was 9 centimeters tall at about 5.67 weeks.

To extrapolate the plant's height at 10 weeks, we evaluate y=1.21\left(10\right)+2.14. The model predicts that the plant will be 14.24 centimeters tall at 10 weeks.
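
The two calculations above can be sketched in a few lines of Python. This is only an illustration: the function names `predict_height` and `week_for_height` are made up for this sketch, and the coefficients are the ones given for the plant model.

```python
# Plant-growth model from the text: height (cm) = 1.21 * week + 2.14
SLOPE = 1.21      # growth rate in centimeters per week
INTERCEPT = 2.14  # height in centimeters at week 0


def predict_height(week):
    """Evaluate the line of best fit at a given week."""
    return SLOPE * week + INTERCEPT


def week_for_height(height):
    """Solve height = SLOPE * week + INTERCEPT for week."""
    return (height - INTERCEPT) / SLOPE


print(round(week_for_height(9), 2))  # 5.67
print(round(predict_height(10), 2))  # 14.24
```

Solving for the week means undoing the two steps of the model in reverse order: subtract the y-intercept, then divide by the slope.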

The reliability of predictions depends on the strength of the relationship, whether the prediction is interpolated or extrapolated, and the number of points in the data set.

  • A larger sample size increases reliability.

  • Interpolation with a strong correlation implies a reliable prediction.

  • Interpolation with a moderate or weak correlation leads to a less reliable prediction.

  • Extrapolation generally leads to an unreliable prediction. The further outside the range of known values, the less reliable it is.

Examples

Example 1

Natalia collected data to answer the question, "What is the relationship between the years since purchasing a car and its value?" Her data is shown in the table.

Time since purchase (years) | 0.5 | 0.8 | 1.2 | 1.3 | 1.5 | 1.7 | 1.8 | 2.1 | 2 | 2.5
Value (thousands of dollars) | 29 | 28.5 | 28.5 | 27.4 | 28.5 | 27 | 25.9 | 25.9 | 24.7 | 26.4

Time since purchase (years) | 2.6 | 2.8 | 3.1 | 3.4 | 3.6 | 3.9 | 4.05 | 4.6 | 4.8
Value (thousands of dollars) | 24.6 | 23.5 | 24.6 | 23.3 | 21 | 21 | 22 | 21 | 20.1

a

Find the equation of the line of best fit.

Worked Solution
Create a strategy

To find the equation using technology, we can follow these steps:

  1. Enter the x-values and y-values in two separate columns.

  2. Highlight the data and select Two Variable Regression Analysis.

  3. Under the Regression Model drop down menu, choose Linear.

Apply the idea
  1. Enter the x-values and y-values in two separate columns.

    A screenshot of the GeoGebra statistics tool showing how to enter a given set of data. Speak to your teacher for more details.
  2. Highlight the data and select Two Variable Regression Analysis.

    A screenshot of the GeoGebra statistics tool showing how to select the Two Variable Regression Analysis option. Speak to your teacher for more details.
  3. Under the Regression Model drop down menu, choose Linear.

    A screenshot of the GeoGebra statistics tool showing how to select the linear regression model option. Speak to your teacher for more details.

If we round the coefficients to two decimal places, the equation of the line of best fit is y=-2.2x+30.46.
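
For readers who want to see what the regression tool is calculating, here is a minimal sketch in plain Python. It assumes the tool uses the standard least-squares method, which is the usual meaning of linear regression. The function name `fit_line` is made up for this sketch, and the data lists are Natalia's values as read from the table.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Natalia's data: time since purchase (years) and value (thousands of dollars)
years = [0.5, 0.8, 1.2, 1.3, 1.5, 1.7, 1.8, 2.1, 2, 2.5,
         2.6, 2.8, 3.1, 3.4, 3.6, 3.9, 4.05, 4.6, 4.8]
values = [29, 28.5, 28.5, 27.4, 28.5, 27, 25.9, 25.9, 24.7, 26.4,
          24.6, 23.5, 24.6, 23.3, 21, 21, 22, 21, 20.1]

slope, intercept = fit_line(years, values)
print(round(slope, 2), round(intercept, 2))  # -2.2 30.46
```

The intercept is chosen so that the line passes through the point made from the mean of the x-values and the mean of the y-values.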

Reflect and check

The points are tightly clustered around the line, indicating that the relationship between the years since the car was purchased and the value of the car is strong. This means the line of best fit can be used to make relatively reliable predictions.

Remember, a strong relationship does not imply that one variable causes changes in the other. We cannot say that the number of years since the car was purchased causes the value of the car to decrease, as there may be other factors that affect the value of the car.

b

Interpret the slope and y-intercept of the line.

Worked Solution
Create a strategy

Use the independent and dependent variables to determine the units of the slope and y-intercept.

A scatterplot titled "Car value over time" showing time since purchase in years against value in thousands of dollars, with the line of best fit drawn.

To help us visualize the relationship better, we can sketch the scatterplot and line of best fit, and add labels on the axes of the graph.

Remember that the y-values are in thousands of dollars. This means we will need to multiply the slope and the y-value of the y-intercept by 1000 when interpreting them in context.

Apply the idea

The y-intercept of \left(0,30.46\right) means that at the time of purchasing the car, it would have a value of \$30\,460.

The slope of -2.2 means that each year, the car's value would decrease by \$2200.
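
The unit conversion used in this interpretation can be written out as a short sketch. The variable names are made up for illustration, and the coefficients are the rounded ones from part (a).

```python
# Rounded coefficients of the line of best fit, with y in thousands of dollars
slope = -2.2
intercept = 30.46

DOLLARS_PER_UNIT = 1000  # one unit of y represents one thousand dollars

yearly_change_dollars = round(slope * DOLLARS_PER_UNIT)
initial_value_dollars = round(intercept * DOLLARS_PER_UNIT)

print(yearly_change_dollars)  # -2200
print(initial_value_dollars)  # 30460
```

The negative sign on the yearly change is what tells us the value is decreasing.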

c

Make a prediction about the value of a car after 3 years.

Worked Solution
Create a strategy

We are given the years since the car was purchased, which is the independent variable \left(x\right), and we are looking for the value of the car, which is the dependent variable \left(y\right).

We can use the graph to estimate the y-value at x=3 or use the line of best fit to get a more accurate prediction.

Apply the idea

When we substitute x=3 into the equation, we get

\begin{aligned}y&=-2.2\left(3\right)+30.46\\&=23.86\end{aligned}

Based on the equation of the line of best fit, a car that is initially valued at \$30\,460 will be worth \$23\,860 three years after it was purchased.

Reflect and check

When using technology to evaluate the model at x=3, we will get a slightly different answer. This is because the coefficients were rounded in our line of best fit. The calculator does not round the coefficients, making its result more accurate.

A screenshot of the GeoGebra statistics tool showing how to use the scatter plot to predict the value of y given a value of x. Speak to your teacher for more details.
d

Make a prediction about the value of a car after 10 years.

Worked Solution
Create a strategy

Since 10 years after purchase is not shown on the graph, we can use the equation of the line of best fit to determine the value of a car at that time.

Apply the idea

We can use technology to find the value of y when x=10.

A screenshot of the GeoGebra statistics tool showing how to use the scatter plot to predict the value of y given a value of x. Speak to your teacher for more details

A car that is initially valued at \$30\,460 will be worth \$8513 ten years after it was purchased.
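
To see why technology and the rounded equation give slightly different predictions, we can compare the two directly. The sketch below assumes the tool fits a least-squares line. The names `fit_line` and `predictions` are made up for illustration, and the data is Natalia's table from this example.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


years = [0.5, 0.8, 1.2, 1.3, 1.5, 1.7, 1.8, 2.1, 2, 2.5,
         2.6, 2.8, 3.1, 3.4, 3.6, 3.9, 4.05, 4.6, 4.8]
values = [29, 28.5, 28.5, 27.4, 28.5, 27, 25.9, 25.9, 24.7, 26.4,
          24.6, 23.5, 24.6, 23.3, 21, 21, 22, 21, 20.1]

slope, intercept = fit_line(years, values)
rounded_slope, rounded_intercept = round(slope, 2), round(intercept, 2)

# For each x, store (prediction with full precision, prediction with rounding)
predictions = {}
for x in (3, 10):
    unrounded = slope * x + intercept
    rounded = rounded_slope * x + rounded_intercept
    predictions[x] = (unrounded, rounded)
    print(f"x = {x}: unrounded {unrounded:.2f}, rounded {rounded:.2f}")
```

At 3 years the two predictions are roughly 23.88 and 23.86 thousand dollars. At 10 years they are roughly 8.51 and 8.46 thousand dollars, so the effect of rounding the slope grows as x gets larger.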

e

Is the prediction for the car's value after 3 years or after 10 years more reliable?

Worked Solution
Create a strategy

To determine the reliability of the predictions, consider whether interpolation or extrapolation was used to make the prediction. Interpolation leads to a more reliable outcome than extrapolation.

Apply the idea

The given data ranges between x=0.5 and x=4.8. This means the prediction of the car's value after 3 years falls within the range of known data, while the prediction after 10 years falls outside of that range.

The prediction of the car's value after 3 years is more reliable.
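
The check used here can be sketched as a small helper. The function name `prediction_type` is made up for illustration, and the range endpoints are the ones stated above.

```python
def prediction_type(x, xs):
    """Label a prediction at x as interpolation or extrapolation for x-values xs."""
    if min(xs) <= x <= max(xs):
        return "interpolation"
    return "extrapolation"


# The car data ranges from x = 0.5 to x = 4.8, so the two endpoints
# are enough to classify a prediction.
observed_range = [0.5, 4.8]

print(prediction_type(3, observed_range))   # interpolation
print(prediction_type(10, observed_range))  # extrapolation
```

This only tells us which kind of prediction we are making. Judging reliability also depends on the strength of the relationship and the size of the data set.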

Reflect and check

Interpolation is more reliable than extrapolation because the predictions follow the same pattern as the known data values. With extrapolation, we assume that the trend continues beyond the known data values. Realistically, the trend may not continue, which makes extrapolation less reliable.

Example 2

A teacher recorded the number of days since a student last studied for an exam and their score out of a possible 80 points on the exam.

Days since studying | 3 | 2 | 6 | 4 | 4 | 1 | 6 | 3 | 4 | 2
Exam score | 64 | 59 | 42 | 57 | 58 | 72 | 33 | 63 | 55 | 62

a

Formulate an investigative question that can be answered by the data.

Worked Solution
Create a strategy

The question should be focused on the relationship between the variables represented by the data. The independent variable is the number of days since studying, and the dependent variable is the score on the exam.

Apply the idea

One possible question is, "How does the number of days since a student last studied impact their exam score?"

Reflect and check

Other possible questions are:

  • What is the relationship between the number of days since a student last studied and their exam score?

  • How many days prior to the exam should a student study to increase their exam score?

  • If a student studies on the same day as the exam, what is their expected score on the exam?

b

Was the data most likely collected through measurement, observation, a survey or an experiment?

Worked Solution
Apply the idea

The teacher did not measure, observe, or control the time since a student studied. Instead, it is more likely that the teacher asked the students how many days it has been since they last studied.

The data was most likely collected through a survey.

Reflect and check

Although the teacher may have had access to the students' exam scores (assuming the teacher was the one that assigned the exam), they could have still included a survey question about the exam score to keep the data organized.

For example, their survey questions could have been, "How many days has it been since you last studied for this subject?" and "What was your score on the exam?"

c

Describe the relationship between the number of days since studying and the exam score.

Worked Solution
Create a strategy

To describe the relationship, we should construct a scatterplot to get a visual of the data. Then, we will consider the form (linear or nonlinear), strength (strong, moderate or weak), and direction (positive or negative).

Apply the idea

A scatterplot of exam score against days since studying.

The data appears to have a strong, negative, linear relationship.

Relating this back to the context, we can say that as the number of days since a student last studied increases, their score on the exam tends to decrease.
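
One common way to back up a visual judgment of strength and direction is the correlation coefficient r, which goes beyond what this example requires. The sketch below assumes the usual Pearson formula, and the function name `correlation` is made up for illustration.

```python
from math import sqrt


def correlation(xs, ys):
    """Return the Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Data from the table: days since studying and exam score
days = [3, 2, 6, 4, 4, 1, 6, 3, 4, 2]
scores = [64, 59, 42, 57, 58, 72, 33, 63, 55, 62]

r = correlation(days, scores)
print(round(r, 2))  # -0.91
```

The negative sign matches the negative direction, and a value close to -1 is consistent with describing the relationship as strong.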

d

Calculate the line of best fit using technology.

Worked Solution
Create a strategy

To find the equation using technology, we can follow these steps:

  1. Enter the x-values and y-values in two separate columns.

  2. Highlight the data and select Two Variable Regression Analysis.

  3. Under the Regression Model drop down menu, choose Linear.

Apply the idea
  1. Enter the x- and y-values in two separate columns:

    A screenshot of the GeoGebra statistics tool showing the numbers 3, 2, 6, 4, 4, 1, 6, 3, 4, and 2 entered in column A, rows 1 to 10 and the numbers 64, 59, 42, 57, 58, 72, 33, 63, 55, and 62 entered in column B, rows 1 to 10. Speak to your teacher for more details.
  2. Highlight the data and select Two Variable Regression Analysis:

    A screenshot of the GeoGebra statistics tool showing the numbers 3, 2, 6, 4, 4, 1, 6, 3, 4, and 2 in column A, rows 1 to 10 and the numbers 64, 59, 42, 57, 58, 72, 33, 63, 55, and 62 in column B, rows 1 to 10. The cells from column A, rows 1 to 10, and column B, rows 1 to 10, are selected. The menu from the second leftmost icon is shown. Speak to your teacher for more details.
  3. Choose Linear under the Regression Model drop down menu to find the line of best fit:

    A screenshot of the GeoGebra statistics tool showing the following: On the left side: the numbers 3, 2, 6, 4, 4, 1, 6, 3, 4, and 2 in column A, rows 1 to 10 and the numbers 64, 59, 42, 57, 58, 72, 33, 63, 55, and 62 in column B, rows 1 to 10. The cells from column A, rows 1 to 10, and column B, rows 1 to 10, are selected. On the right side: a scatterplot and the line of best fit are shown. Speak to your teacher for more details.

The equation of the line of best fit is y=-6.2245x+78.2857.
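
As a cross-check on the tool's output, the same line can be computed in plain Python. This sketch assumes the tool uses the standard least-squares method. The function name `fit_line` is made up for illustration, and the data comes from the table.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


days = [3, 2, 6, 4, 4, 1, 6, 3, 4, 2]
scores = [64, 59, 42, 57, 58, 72, 33, 63, 55, 62]

slope, intercept = fit_line(days, scores)
print(round(slope, 4), round(intercept, 4))  # -6.2245 78.2857
```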

Reflect and check

If the instructions do not specify to round the coefficients, it is best to include all the digits given by the calculator. This increases the accuracy of the model and the predictions.

e

Answer the question formulated in part (a).

Worked Solution
Create a strategy

To answer the question, "How does the number of days since a student last studied impact their exam score?", we can describe the direction of the linear relationship. To be more specific, we can interpret the slope of the line in context.

In the previous part, we found the equation of the line of best fit to be y=-6.2245x+78.2857, which tells us the slope is -6.2245.

Apply the idea

As the number of days since a student last studied increases, their exam score decreases. More specifically, for each additional day since a student last studied, their exam score is expected to decrease by about 6 points.

Reflect and check

Matching the rise and run of the slope to their respective units can help us interpret its meaning in context.

\text{slope}=\dfrac{\text{rise}}{\text{run}}=\dfrac{\text{change in }y}{\text{change in }x}=\dfrac{-6.2245}{1}

A scatterplot of exam score against days since studying, with the line of best fit drawn.

The y-values represent the exam score, which is the "rise" of the slope. The x-values represent the number of days since studying, which is the "run" of the slope.

Since the slope is negative, it represents a decrease of 6.2245 in the exam score for every 1 day since studying.

f

If a student studied the same day as the exam, what would we expect their score to be?

Worked Solution
Create a strategy

If the number of days since a student last studied is 0, then their exam score is the y-value of the y-intercept.

Apply the idea

The y-intercept tells us that a student who has studied on the day of the exam has a predicted score of 78.2857, according to the linear model.

Reflect and check

Although this value was found through extrapolation, x=0 is not very far outside of the range of known values. Since the relationship is strong, this prediction is relatively reliable.
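
Both points in this part can be checked with a short sketch: the prediction at x=0 is the y-intercept, and x=0 lies just below the smallest x-value in the data. The names `predict_score` and `is_within_data` are made up for illustration, and the coefficients are the ones found in part (d).

```python
# Line of best fit from part (d): score = -6.2245 * days + 78.2857
SLOPE = -6.2245
INTERCEPT = 78.2857

days = [3, 2, 6, 4, 4, 1, 6, 3, 4, 2]  # x-values from the table


def predict_score(days_since_studying):
    """Evaluate the linear model at a given number of days."""
    return SLOPE * days_since_studying + INTERCEPT


def is_within_data(x):
    """Return True when x lies inside the range of observed x-values."""
    return min(days) <= x <= max(days)


print(predict_score(0))      # 78.2857
print(min(days), max(days))  # 1 6
print(is_within_data(0))     # False
```

Since the smallest observed value is 1 day, predicting at 0 days is an extrapolation of only one day beyond the data.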

Idea summary

A line of best fit for a set of data can be used to interpret a given situation and make predictions about values not represented by the data.

A line of best fit has an equation of the form y=mx+b. We can use technology to perform the linear regression analysis.

In the context of a line of best fit, the slope-intercept form y=mx+b represents:

  • m: the rate of change of y with respect to x

  • b: the starting value of y when x is 0

These terms describe the range in which we make predictions:

  • Interpolation: Prediction within the range of x-values in the data

  • Extrapolation: Prediction outside the range of x-values in the data

The reliability of predictions depends on the strength of the relationship, whether the prediction is interpolated or extrapolated, and the number of points in the data set. In general, interpolation is more reliable than extrapolation.

Outcomes

A.ST.1

The student will apply the data cycle (formulate questions; collect or acquire data; organize and represent data; and analyze data and communicate results) with a focus on representing bivariate data in scatterplots and determining the curve of best fit using linear and quadratic functions.

A.ST.1a

Formulate investigative questions that require the collection or acquisition of bivariate data.

A.ST.1b

Determine what variables could be used to explain a given contextual problem or situation or answer investigative questions.

A.ST.1c

Determine an appropriate method to collect a representative sample, which could include a simple random sample, to answer an investigative question.

A.ST.1d

Given a table of ordered pairs or a scatter plot representing no more than 30 data points, use available technology to determine whether a linear or quadratic function would represent the relationship, and if so, determine the equation of the curve of best fit.

A.ST.1e

Use linear and quadratic regression methods available through technology to write a linear or quadratic function that represents the data where appropriate and describe the strengths and weaknesses of the model.

A.ST.1f

Use a linear model to predict outcomes and evaluate the strength and validity of these predictions, including through the use of technology.

A.ST.1g

Investigate and explain the meaning of the rate of change (slope) and y-intercept (constant term) of a linear model in context.

A.ST.1h

Analyze relationships between two quantitative variables revealed in a scatterplot.

A.ST.1i

Make conclusions based on the analysis of a set of bivariate data and communicate the results.
