2. Bivariate Data & Linear Models

2.04 Bivariate data - calculator free

Year 12

Calculator free exam style questions

Bivariate data questions involving numeric data in the calculator-free paper tend to focus on interpreting, in context, the least-squares regression line, the correlation coefficient $r$, the coefficient of determination $r^2$, and the residual plot.

The worked examples below are typical of questions from Western Australian non-calculator exam papers.

Worked example

Example 1

An international health agency has collected data for infant mortality rates ($M$, deaths per $1000$ infants) and literacy rates ($L$, as a percentage) in $25$ countries. The data is presented as the scatter graph below.

a) Describe the relationship between literacy and infant mortality rates.

Think: When we describe a relationship we should refer to strength, direction, and linearity.

Solution: Moderate negative linear relationship.

b) It is determined that $49\%$ of the variation in infant mortality rates is explained by the variation in literacy rates. What is the correlation coefficient? (Write your answer as a decimal value.)

Think: The statement indicates that the coefficient of determination is $49\%$, which means that $r^2=0.49$. This is the square of the correlation coefficient.

Do:

Express the coefficient of determination as a decimal;
Calculate the square root to determine the size of the correlation coefficient;
Using the answer to part (a), decide if the relationship is positive or negative to determine the sign of the correlation coefficient.

Solution:

$r^2 = 49\% = 0.49$

Since the scatter plot shows a negative relationship, the correlation coefficient will be negative.

$r = -\sqrt{0.49} = -0.7$
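The steps in part (b) can be checked afterwards with a short Python sketch. The function name `correlation_from_r_squared` and its `direction` argument are illustrative assumptions, not part of the exam question; the direction still has to be read from the scatter plot by eye.

```python
import math


def correlation_from_r_squared(r_squared, direction):
    """Recover r from r^2, taking the sign from the direction of the relationship.

    `direction` is assumed to be "positive" or "negative", as judged from the scatter plot.
    """
    if not 0 <= r_squared <= 1:
        raise ValueError("r_squared must be between 0 and 1")
    if direction not in ("positive", "negative"):
        raise ValueError("direction must be 'positive' or 'negative'")
    magnitude = math.sqrt(r_squared)  # size of the correlation coefficient
    return magnitude if direction == "positive" else -magnitude


# 49% of the variation explained, negative relationship from part (a)
r = correlation_from_r_squared(0.49, "negative")
print(round(r, 2))  # -0.7
```

The square root alone only gives the size of $r$, which is why the sign is passed in separately.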

c) The equation of the least-squares line for the graph is $M=-2L+250$.

i) Explain the significance of the slope of the least-squares regression line.

Think: The slope (or gradient) of the line is the coefficient of the explanatory variable $L$ in the equation of the line. This indicates how much the vertical variable will change (increase or decrease) for every $1$ unit of change in the horizontal variable.

Do: Interpret the slope (gradient) in the context of the question. Take care to use appropriate units.

Solution: For every $1\%$ increase in literacy rates, the infant mortality rate decreases by $2$ deaths per $1000$ infants.

ii) Explain the significance of the vertical intercept of the least-squares regression line.

Think: The vertical intercept is the point where the line crosses over the vertical axis and is represented by the constant term in the equation of the line.

Do: Interpret the vertical intercept in the context of the question. The vertical intercept gives us an answer to the question "According to the least-squares regression line, what is the estimated infant mortality rate ($M$) when the literacy rate ($L$) is $0$?"

Solution: In a country where the literacy rate is $0\%$, the expected infant mortality rate is $250$ deaths per $1000$ infants.

iii) Is your answer to part ii) reasonable in the given context? Explain your answer.

Think: Is it reasonable for the explanatory variable $L$ to have a value of $0\%$? Is it reasonable for the response variable $M$ to have a value of $250$?

Do: Answer the question with YES/NO, and provide an explanation.

Solution: YES, because the vertical intercept corresponds to valid values of both variables: a literacy rate of $0\%$ is possible (nobody in a population is literate), and an infant mortality rate of $250$ deaths per $1000$ infants is also possible.

iv) Predict the infant mortality rate in a country with a literacy rate of $80\%$.

Think: The least-squares regression line can be used to predict the value of the response variable for the given explanatory variable value.

Do: Substitute $L=80$ into the equation for the least-squares regression line. State the answer in context, with appropriate units.

Solution:

$M = -2\times80+250 = 90$

The predicted infant mortality rate is $90$ deaths per $1000$ infants.
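As a check on parts (i), (ii) and (iv), the regression line $M=-2L+250$ can be written as a small Python function. This is only a sketch for verifying hand working; the function name `predict_mortality` is an assumption made for illustration.

```python
def predict_mortality(literacy_percent):
    """Predicted infant mortality rate (deaths per 1000 infants) from M = -2L + 250."""
    return -2 * literacy_percent + 250


# Part (iv): prediction for a literacy rate of 80%
print(predict_mortality(80))  # 90

# Part (ii): the vertical intercept is the prediction when L = 0
print(predict_mortality(0))  # 250

# Part (i): the slope is the change in M for a 1 unit increase in L
print(predict_mortality(81) - predict_mortality(80))  # -2
```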

v) Comment on the reliability of the prediction in part (iv).

Think: Reliability is affected by the strength of the relationship and interpolation vs extrapolation.

Do: State the reliability, providing reasons based on the strength of the relationship and interpolation or extrapolation.

Solution: The prediction is reliable because it is an interpolation. However, the correlation coefficient indicates that the relationship is only moderately strong so we should use this result with some caution.

vi) One expert claims that the low literacy rates in 3rd world countries cause high infant mortality rates. Discuss this statement.

Think: Correlation does not imply causation. Even if there is a high correlation, we need to consider if a change in the explanatory variable, literacy rates, will actually cause a change in the response variable, infant mortality.

Do: Provide a YES/NO answer and explain with a possible non-causal reason for the correlation if you answer NO.

Solution: No, the correlation alone does not show that low literacy rates cause high infant mortality rates. The correlation could be due to a third, confounding variable, such as low income levels.

Example 2

The table shows a company's profit $P$ (in \$ millions) for total monthly sales $S$ (in \$ millions). The equation $P=0.4S-10$ is being used to model the data.

a) Complete the table with the predicted profit and residuals, based on the linear model.

Sales $S$	Profit $P$	Predicted profit $\hat{P}$	Residual $P-\hat{P}$
30	-8
80	24
50	12
100	23
60	17
70	23
90	24
40	3

Think: Calculate the predicted value and residual value of $P$ for each of the given $S$ values.

Do: The residual is calculated using the formula $\text{residual}=y-\hat{y}$.

Calculate the predicted values $\hat{P}$:
- substitute each value of $S$ into the equation of the least-squares regression line.
Calculate the residual values:
- subtract the predicted value $\hat{P}$ from the corresponding actual $P$ value.

Solution:

The required substitutions and calculations for the first row are:

$\hat{P} = 0.4\times30-10 = 2$
$\text{residual} = P-\hat{P} = -8-2 = -10$

The remaining values are shown in the completed table:

Sales $S$	Profit $P$	Predicted profit $\hat{P}$	Residual $P-\hat{P}$
30	-8	2	-10
80	24	22	2
50	12	10	2
100	23	30	-7
60	17	14	3
70	23	18	5
90	24	26	-2
40	3	6	-3
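The completed table can be reproduced with a short Python sketch, which may be useful for checking hand calculations. The variable names are assumptions for illustration; the data and the model $P=0.4S-10$ come from the question. Rounding to two decimal places is used only to keep floating-point noise out of the results.

```python
sales = [30, 80, 50, 100, 60, 70, 90, 40]   # S, in $ millions
profit = [-8, 24, 12, 23, 17, 23, 24, 3]    # P, in $ millions


def predicted_profit(s):
    """Predicted profit from the linear model P = 0.4S - 10."""
    return round(0.4 * s - 10, 2)


predicted = [predicted_profit(s) for s in sales]
# residual = actual value - predicted value
residuals = [round(p - p_hat, 2) for p, p_hat in zip(profit, predicted)]

for row in zip(sales, profit, predicted, residuals):
    print(*row, sep="\t")
```

Each printed row lists $S$, $P$, $\hat{P}$ and the residual in the same order as the table columns.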

b) Construct a residual plot for the data in part (a).

Think: Each value of $S$S and the corresponding residual value will make up the coordinates for each point on the residual plot.

Do: Construct the graph, choosing appropriate scales and labelling the axes. Take care to place each point accurately.

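Solution: Plot sales $S$ on the horizontal axis and the residual on the vertical axis. From the completed table in part (a), the points are $(30,-10)$, $(40,-3)$, $(50,2)$, $(60,3)$, $(70,5)$, $(80,2)$, $(90,-2)$ and $(100,-7)$.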

c) Is this model a good fit for the data? Justify your answer.

Think: If the linear model is a good fit, the residual plot should show a random scattering of points above and below $0$, with no obvious pattern.

Solution: No, a linear model is not a good fit for this data, as there is a clear pattern in the residual plot: the residuals are negative for low sales, positive for mid-range sales, and negative again for high sales.
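One informal way to see the pattern without drawing the plot is to order the residuals by sales and look at their signs. The sketch below does this in Python using the residuals from the completed table in part (a). It is a rough visual aid under that assumption, not a formal test of fit.

```python
# (sales, residual) pairs taken from the completed table in part (a)
points = [(30, -10), (80, 2), (50, 2), (100, -7),
          (60, 3), (70, 5), (90, -2), (40, -3)]

# Order the points by sales, as they would appear left to right on the residual plot
ordered = sorted(points)

# Record whether each residual lies above (+) or below (-) the horizontal axis
signs = "".join("+" if r > 0 else "-" if r < 0 else "0" for _, r in ordered)
print(signs)  # --++++--
```

Long runs of the same sign, rather than a random mix, suggest a curved pattern in the residuals.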

d) A scatter plot produced from the data includes a single point with the coordinates $(100,23)$. What does this point represent?

Think: Each point represents the explanatory variable (first) and the response variable (second) for one piece of data.

Do: State the interpreted values, in context with appropriate units.

Solution: In the month when total sales were \$$100$ million, the company's profit was \$$23$ million.
