8. Two Variable Data Analysis

Worksheet

1

Find the equation of the least squares regression line if:

a

An x-value of 5 gives a predicted value of y = 9, and an x-value of 8 gives a predicted value of y = 3.

b

An x-value of 4 gives a predicted value of y = 3, and an x-value of 6 gives a predicted value of y = 7.
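Because a regression line is a straight line, two predicted points are enough to determine it. The sketch below shows one way to check such a calculation in Python. The function name `line_through` and the two points are made up for illustration and are not taken from this worksheet.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("x-values must differ to define a slope")
    slope = (y2 - y1) / (x2 - x1)      # rise over run
    intercept = y1 - slope * x1        # solve y1 = slope * x1 + intercept
    return slope, intercept

# Illustrative points only (hypothetical, not from the worksheet).
slope, intercept = line_through((1, 3), (3, 7))
print(f"y = {slope:g}x + {intercept:g}")
```

The same two steps (slope first, then intercept by substitution) apply to any pair of predicted points.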

2

Consider the following set of data:

x | 0.8 | 1.77 | 2.7 | 3.62 | 4.9 | 5.7 | 7 | 8.6 | 10.1 |
---|---|---|---|---|---|---|---|---|---|

y | 3.4 | 2.9 | 2.5 | 2.4 | 1.9 | 1.9 | 1.8 | 1.7 | 1.54 |

The equation of the least squares line fitted to this data is y = 3.15 - 0.18 x.

a

i

Predict the value of y when x = 3.

ii

Is this an interpolation or extrapolation?

b

i

Predict the value of y when x = 30.

ii

Is this an interpolation or extrapolation?
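A prediction is an interpolation when the x-value lies inside the range of the observed x-values, and an extrapolation otherwise. As a rough sketch, the check can be written in Python using the fitted line and x-data from Question 2. The helper names `predict` and `prediction_type` are assumptions, and the x-values tested are chosen for illustration rather than being the ones asked above.

```python
def predict(x, intercept=3.15, slope=-0.18):
    """Predicted y from the fitted line y = 3.15 - 0.18x (Question 2)."""
    return intercept + slope * x

def prediction_type(x, x_values):
    """Classify a prediction by whether x lies inside the observed x-range."""
    if min(x_values) <= x <= max(x_values):
        return "interpolation"
    return "extrapolation"

# Observed x-values from the Question 2 table.
x_data = [0.8, 1.77, 2.7, 3.62, 4.9, 5.7, 7, 8.6, 10.1]

for x in (5, 12):  # illustrative x-values, not the ones asked in the question
    print(x, round(predict(x), 2), prediction_type(x, x_data))
```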

3

Consider the following set of data:

Number of tests | 2 | 5 | 8 | 11 | 14 | 17 | 20 | 23 | 26 |
---|---|---|---|---|---|---|---|---|---|

Average test score | 72.9 | 60.8 | 56.6 | 41.8 | 38.3 | 35.5 | 32.9 | 27.4 | 25 |

The equation of the least squares line fitted to this data is

\text{Average test score} = 70.34 - 1.92 \times \text{Number of tests}

a

Predict the average test score when the number of tests is 4.

b

Is this an interpolation or extrapolation?

4

A least squares regression line is given by y = 3.59 x + 6.72.

a

State the slope of the regression line.

b

Interpret the meaning of the slope.

c

State the value of the y-intercept.

5

A least squares regression line is given by y = - 3.67 x + 8.42.

a

State the slope of the regression line.

b

Interpret the meaning of the slope.

c

State the value of the y-intercept.

6

The prices of various new and second-hand Mitsubishi Lancers are shown in the table:

a

Find the equation of the Least Squares Regression Line for the price \left( y \right) in terms of age \left( x \right). Round all values to the nearest integer.

b

State the value of the y-intercept.

c

Interpret the meaning of the y-intercept.

d

State the slope of the line.

e

Interpret the meaning of the slope in this context.

\text{Age} | \text{Price (\$)} |
---|---|

1 | 16\,000 |

2 | 13\,000 |

0 | 21\,990 |

5 | 10\,000 |

7 | 8600 |

4 | 12\,500 |

3 | 11\,000 |

4 | 11\,000 |

8 | 4500 |

2 | 14\,500 |
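Questions of this kind are normally answered with a calculator's regression function. For readers who prefer to see the arithmetic, here is a minimal Python sketch of the standard least squares formulas (slope = S_xy / S_xx, intercept = mean of y minus slope times mean of x). The function name `least_squares` and the three data points are hypothetical and are not worksheet data.

```python
from math import sqrt

def least_squares(xs, ys):
    """Return (slope, intercept, r) for the least squares line of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squares and cross-products about the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar   # the line passes through (x_bar, y_bar)
    r = s_xy / sqrt(s_xx * s_yy)        # Pearson correlation coefficient
    return slope, intercept, r

# Made-up data for illustration (not from the worksheet).
slope, intercept, r = least_squares([1, 2, 3], [2, 3, 5])
print(round(slope, 2), round(intercept, 2), round(r, 3))
```

Substituting a table from the worksheet for the made-up lists gives a way to check a calculator result, with rounding applied as each question specifies.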

7

Concern over student use of the social media app SnappyChatty leads to a study of student marks in Mathematics versus minutes spent using the app. Data collected from ten students is displayed in the table below:

\text{Minutes} | 292 | 153 | 354 | 253 | 11 | 42 | 195 | 7 | 162 | 254 |
---|---|---|---|---|---|---|---|---|---|---|

\text{Mark } (\%) | 26 | 63 | 13 | 37 | 97 | 89 | 51 | 98 | 59 | 36 |

a

Find the equation of the Least Squares Regression Line for the mark as a percentage \left( y \right) in terms of minutes spent using SnappyChatty \left( x \right). Round all values to one decimal place.

b

State the value of the y-intercept.

c

Interpret the meaning of the y-intercept in this context.

d

State the slope of the line.

e

Interpret the meaning of the slope in this context.

8

The average number of pages read to a child each day and the child’s total vocabulary are measured. Data collected from ten children is displayed in the table below:

Pages read per day | 25 | 27 | 29 | 3 | 13 | 31 | 18 | 29 | 29 | 5 |
---|---|---|---|---|---|---|---|---|---|---|

Total vocabulary | 402 | 440 | 467 | 76 | 220 | 487 | 295 | 457 | 460 | 106 |

a

Find the equation of the Least Squares Regression Line for the Total vocabulary \left( y \right) in terms of Pages read per day \left( x \right). Round all values to one decimal place.

b

State the value of the y-intercept.

c

Interpret the meaning of the y-intercept in this context.

d

State the slope of the line.

e

Interpret the meaning of the slope in this context.

9

The amount of money households spend on dining out each week, D, is measured against their weekly income, I. The following linear regression model is fitted to the data:

D = 0.1 I + 25

a

Interpret the meaning of the y-intercept in this model.

b

State the slope of the line.

c

If the weekly income of a family increases by \$100, what effect would we expect this to have on the amount of money spent on dining out?

10

The number of hours spent watching TV each evening, h, is measured against the percentage results, m, achieved in the Economics exam.

The following linear regression model is fitted to the data:

m = - 15 h + 97

a

Interpret the meaning of the y-intercept in this model.

b

Does the interpretation in the previous part make sense in this context?

c

State the slope of the line.

d

If a student increases the amount of TV they watch by 3.5 hours, what effect do we expect this to have on their Economics exam mark?

11

Consider the following scatter plots of bivariate data. Given the coefficient of determination, calculate the correlation coefficient, r, to two decimal places.

a

The coefficient of determination is 0.77.

b

The coefficient of determination is 0.94.
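The coefficient of determination only fixes the size of r. The sign has to come from the direction of the trend in the scatter plot. A small Python sketch of that step follows. The function name `r_from_r_squared`, the `trend` argument and the value 0.64 are assumptions made for illustration.

```python
from math import sqrt

def r_from_r_squared(r_squared, trend):
    """Return r given r^2 and the trend direction ("positive" or "negative")."""
    if not 0 <= r_squared <= 1:
        raise ValueError("r^2 must lie between 0 and 1")
    if trend not in ("positive", "negative"):
        raise ValueError("trend must be 'positive' or 'negative'")
    magnitude = sqrt(r_squared)
    # r takes the same sign as the slope of the trend.
    return magnitude if trend == "positive" else -magnitude

# Illustrative value only (not from the worksheet).
print(round(r_from_r_squared(0.64, "negative"), 2))
```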

12

Consider the following set of data:

x | - 1.2 | 0.5 | 1.9 | 0.1 | 1.1 | 20 | 9 | 10.5 | 1.1 |
---|---|---|---|---|---|---|---|---|---|

y | 33.3 | 35.6 | 34.6 | 41.5 | 21.2 | 42.3 | 36.5 | 40.2 | 32.1 |

a

Calculate r^{2}. Round your answer to two decimal places.

b

Interpret your result.

13

For each of the following sets of data, calculate the percentage of variation in y that can be explained by the variation in x. Round your answers to the nearest percent.

a

x | - 1.3 | 0.1 | 8.1 | 3.2 | 3.1 | 6.7 | 15.9 | 11.2 | 10.9 |
---|---|---|---|---|---|---|---|---|---|

y | 66.9 | 49.7 | 29.6 | 37.7 | 26 | 19.9 | 16 | 13.2 | 7.7 |

b

x | - 1.5 | 0.3 | 7.2 | 4.2 | 4.1 | 7.5 | 17.7 | 12.3 | 10.2 |
---|---|---|---|---|---|---|---|---|---|

y | 54.3 | 48.3 | 28.9 | 35.8 | 27.2 | 19.8 | 16.1 | 13 | 7.8 |

14

For each set of summary statistics, along with the equation of the least squares line:

i

Find the coefficient of determination. Round your answer to two decimal places.

ii

Interpret your result.

a

\overline{x} = 180, \quad s_{x} = 5.3, \quad \overline{y} = 169, \quad s_{y} = 4.8, \quad y = 30.44 + 0.77 x

b

\overline{x} = 180,\quad s_{x} = 4.7,\quad \overline{y} = 169,\quad s_{y} = 3.8,\quad y = 21.54 + 0.79 x
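When only the slope b and the two standard deviations are known, the relationship b = r s_y / s_x can be rearranged to r = b s_x / s_y, and squaring gives the coefficient of determination. The Python sketch below illustrates this under made-up inputs. The function name `r_squared_from_summary` and the numbers passed to it are hypothetical.

```python
def r_squared_from_summary(slope, s_x, s_y):
    """Coefficient of determination from the slope and the standard deviations."""
    r = slope * s_x / s_y   # rearranged from slope = r * s_y / s_x
    return r ** 2

# Illustrative summary statistics (not from the worksheet).
print(round(r_squared_from_summary(slope=0.5, s_x=4, s_y=2.5), 2))
```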

15

The following tables show the sets of data \left( x, y \right) and the predicted \hat{y} values based on a least-squares regression line. Complete the tables by finding the residuals. Round all values to one decimal place.

a

x\text{-values} | 1 | 3 | 5 | 7 | 9 |
---|---|---|---|---|---|

y\text{-values} | 22.7 | 22.3 | 24.2 | 21.8 | 21.5 |

\hat{y} | 25.2 | 23.4 | 21.6 | 19.8 | 18 |

\text{Residuals} | | | | | |

b

x\text{-values} | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|

y\text{-values} | 37.7 | 37.2 | 21.1 | 27.1 | 44 |

\hat{y} | 28.9 | 30.4 | 31.9 | 33.4 | 34.9 |

\text{Residuals} | | | | | |
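A residual is the observed value minus the predicted value, y - \hat{y}. As a quick sketch of that rule in Python, with a hypothetical function name `residuals` and made-up numbers rather than the table values above:

```python
def residuals(observed, predicted):
    """Return the residuals (observed minus predicted), to one decimal place."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return [round(y - y_hat, 1) for y, y_hat in zip(observed, predicted)]

# Illustrative values only (not from the worksheet).
print(residuals([12.0, 15.5, 9.0], [11.5, 16.0, 9.0]))
```

A positive residual means the point lies above the regression line, and a negative residual means it lies below.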

16

For each residual plot, comment on whether the association between the variables can be described as linear:

a

b

17

The table shows a company's revenue, y in millions, in week x. The equation y = 2 x + 14 can be used to model the data.

a

Complete the table.

b

Plot the residuals as a scatter plot.

c

Comment on the suitability of this model for the data.

x | y | \text{Value generated by model} | \text{Residual} |
---|---|---|---|

2 | 19 | ||

3 | 21 | ||

5 | 22 | ||

6 | 27 | ||

7 | 30 | ||

10 | 32 | ||

13 | 39 | ||

14 | 40 | ||

18

The table shows a company's costs, y in millions, in week x. The equation y = 5 x + 12 can be used to model the data.

a

Complete the table.

b

Plot the residuals as a scatter plot.

c

Comment on the suitability of this model for the data.

x | y | \text{Value generated by model} | \text{Residual} |
---|---|---|---|

1 | 22 | ||

2 | 25 | ||

4 | 33 | ||

6 | 39 | ||

9 | 53 | ||

12 | 69 | ||

14 | 81 | ||

17 | 99 | ||

19

The results (as percentages) for a practice spelling test and the real spelling test were collected for 8 students:

\text{Practice } (x) | 56.30 | 79.00 | 59.40 | 77.00 | 71.60 | 64.40 | 61.60 | 68.20 |
---|---|---|---|---|---|---|---|---|

\text{Real } (y) | 48.90 | 69.30 | 51.90 | 66.00 | 62.50 | 56.10 | 54.50 | 59.20 |

a

Calculate the correlation coefficient for the scores. Round your answer to three decimal places.

b

Describe the statistical relationship between these two variables.

c

Using technology, find the equation for the least squares regression line of y on x. Round all values to two decimal places.

d

Use your regression line to predict the real spelling test result of a student who scored 60\% in their practice spelling test. Round the answer to two decimal places.

e

Comment on the validity of this prediction.

20

The forecast maximum temperature, in degrees Celsius, and the observed maximum temperature are recorded to determine the accuracy in the temperature prediction models used by the weather bureau.

a

Calculate the correlation coefficient for these temperatures. Round your answer to two decimal places.

b

Describe the statistical relationship between these two variables.

c

Use your graphing calculator to find the equation for the least squares regression line of y on x.

d

Use your regression line to predict the observed maximum temperature on a day in the same month when the forecast was 25 \degree \text{C}.

\text{Forecast } (x) | \text{Observed } (y) |
---|---|

30.10 | 31.20 |

27.20 | 27.80 |

26.90 | 24.30 |

29.00 | 27.60 |

30.20 | 31.60 |

34.20 | 31.50 |

33.20 | 33.70 |

36.90 | 34.30 |

27.10 | 25.00 |

30.10 | 31.90 |

e

Which of the following forecast temperatures would give the most reliable predicted temperature? Explain your answer.

39 \degree \text{C}, 25 \degree \text{C}, \text{ or } 32 \degree \text{C}

21

During an alcohol education programme, 10 adults were offered up to 6 drinks and were then given a simulated driving test where they scored a result out of a possible 100. The results are displayed in the following table:

\text{Number of drinks } (x) | 3 | 2 | 6 | 4 | 4 | 1 | 6 | 3 | 4 | 2 |
---|---|---|---|---|---|---|---|---|---|---|

\text{Driving score } (y) | 66 | 61 | 43 | 58 | 56 | 73 | 31 | 64 | 55 | 62 |

a

Calculate the correlation coefficient for the data. Round your answer to two decimal places.

b

Describe the statistical relationship between the two variables.

c

Use your graphing calculator to find the equation of the least squares regression line of y on x. Round all values to one decimal place.

d

Use your regression line to predict the driving score of a young adult who consumed 5 drinks. Round your answer to one decimal place.

e

Comment on the validity of your prediction.

22

Research on the number of cigarettes smoked during pregnancy and the birth weights of the newborn babies was conducted and results displayed in the table below:

a

Calculate the correlation coefficient for the data. Round your answer to three decimal places.

b

Describe the statistical relationship between these two variables.

c

Use your graphing calculator to find the equation of the least squares regression line of y on x. Round all values to two decimal places where necessary.

d

Use your regression line to predict the birth weight of a newborn whose mother smoked on average 5 cigarettes per day.

e

Comment on the reliability of your prediction.

\text{Average number of cigarettes per day } (x) | \text{Birth weight in kilograms } (y) |
---|---|

46.30 | 3.90 |

13.00 | 5.80 |

21.40 | 5.00 |

25.00 | 4.80 |

8.60 | 5.50 |

36.50 | 4.50 |

1.00 | 7.00 |

17.90 | 5.10 |

10.60 | 5.50 |

13.40 | 5.10 |

37.30 | 3.80 |

18.50 | 5.70 |

23

A sample of families were interviewed about their annual family income, x, and their average monthly expenditure, y. Results are displayed in the table below:

a

Calculate the correlation coefficient between the two variables. Round your answer to two decimal places.

b

Describe the statistical relationship between the two variables.

c

Use your graphing calculator to find the equation for the least squares regression line of y on x. Round all values to three decimal places.

d

Use your regression line to predict the monthly expenditure for a family whose annual income is \$80\,000.

e

Which of the following annual incomes would give the most reliable prediction of average monthly expenditure? Explain your answer.

\$99\,000, \$51\,000 \text{ or } \$80\,000

\text{Annual income } (x) | \text{Average monthly expenditure } (y) |
---|---|

66\,000 | 1100 |

75\,000 | 1700 |

65\,000 | 1400 |

73\,000 | 1300 |

54\,000 | 600 |

90\,000 | 1800 |

87\,000 | 1100 |

87\,000 | 1500 |

94\,000 | 1800 |

96\,000 | 2200 |

24

The data in the table and scatter plot below show the frequencies per month of online marketing emails sent out to subscribers, compared with the proportion of subscribers who click and open the email:

\text{Frequency } (x) | \text{Proportion click and open } (y) |
---|---|

3 | 0.41 |

4 | 0.59 |

1 | 1.28 |

5 | 0.47 |

4 | 0.26 |

7 | 0.62 |

10 | 0.75 |

a

Calculate the correlation coefficient between these two variables. Round your answer to two decimal places.

b

Which piece of data appears to be an outlier?

c

Remove the outlier and recalculate the correlation coefficient.

d

Describe the statistical relationship between the two variables.

e

Use your graphing calculator to find the equation for the least squares regression line of y on x (with the outlier removed). Round all values to two decimal places.

f

Use your regression line to predict the proportion of emails opened if they're sent 20 times a month.

g

Comment on the validity of your prediction.
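A single outlier can change the correlation coefficient noticeably, which is why it is recalculated after the outlier is removed. The Python sketch below illustrates the effect on a small made-up data set. The function name `correlation` and the data are hypothetical and are not the email data above.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Made-up data: the last point (4, 10) sits well away from the others.
xs = [1, 2, 3, 4]
ys = [1, 2, 3, 10]

r_all = correlation(xs, ys)                 # outlier included
r_trimmed = correlation(xs[:-1], ys[:-1])   # outlier removed
print(round(r_all, 2), round(r_trimmed, 2))
```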

25

Hospital patients aged between 18 and 65 years had their ages in years \left( x \right) and their blood pressures in millimetres of mercury \left( y \right) recorded. The following summary statistics are available:

\overline{x} = 51, \quad s_{x} = 11.53, \quad \overline{y} = 138, \quad s_{y} = 14.07, \quad r = 0.87

a

Calculate the slope of the least squares regression line. Round your answer to two decimal places.

b

Calculate the vertical intercept of the least squares regression line.

c

Hence, state the equation that can be used to predict a patient’s blood pressure from their age.

d

Predict the blood pressure for a patient who is 20 years old.

e

Comment on the validity of your prediction.
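With summary statistics, the slope is b = r s_y / s_x and the vertical intercept is a = \overline{y} - b \overline{x}, since the least squares line passes through the point of means. A minimal Python sketch of these two steps follows. The function name `line_from_summary` and the input values are made up for illustration and are not the statistics given above.

```python
def line_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (slope, intercept) of the least squares line from summary statistics."""
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar   # line passes through (x_bar, y_bar)
    return slope, intercept

# Illustrative summary statistics (not from the worksheet).
slope, intercept = line_from_summary(x_bar=10, s_x=2, y_bar=30, s_y=4, r=0.5)
print(f"y = {intercept:g} + {slope:g}x")
```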

26

12 states in the USA with populations ranging from 700\,000 to 10\,000\,000 residents were asked for their budget expenditure. Summary statistics on the expenditure (y in billions of dollars) and the population (x in millions) are presented below:

\overline{x} = 5.649, \quad s_{x} = 4.734, \quad \overline{y} = 7.227, \quad s_{y} = 6.633, \quad r = 0.963

a

Which variable is the independent variable?

b

Calculate the slope of the least squares regression line. Round your answer to two decimal places.

c

Calculate the vertical intercept of the least squares regression line.

d

Predict the expenditure for a state with 600\,000 residents.

e

Comment on the reliability of your prediction.

27

The climates of various cities were studied for their latitude (x in degrees), altitude (w in metres) and mean daily temperature (y in degrees Celsius).

a

The coefficient of determination for latitude against temperature was 0.565. Calculate the correlation coefficient. Round your answer to three decimal places.

b

Consider the following summary statistics:

\overline{w} = 201.9, \quad s_{w} = 103.519, \quad \overline{y} = 9.49, \quad s_{y} = 10.91, \quad r = - 0.827

Which is the better predictor of temperature, latitude or altitude?

c

Hence, calculate the slope of the best least squares regression line for predicting the temperature of a city. Round your answer to two decimal places.

d

Calculate the vertical intercept of this least squares regression line. Round your answer to two decimal places.

e

Predict the temperature for a city with an altitude of 100 \text{ m}. Round your answer to two decimal places.

f

Given the lowest recorded altitude in this data set was 90 \text{ m}, comment on the validity of the prediction.

E1

*Use the following information to answer Questions E1 - E3.*

The scatterplot below displays the *resting pulse rate*, in beats per minute, and the *time spent exercising*, in hours per week, of 16 students. A least squares line has been fitted to the data.

Using this least squares line to model the association between *resting pulse rate* and *time spent exercising*, the residual for the student who spent four hours per week exercising is closest to

A

-2.0 beats per minute.

B

-1.0 beats per minute.

C

-0.3 beats per minute.

D

1.0 beats per minute.

E

2.0 beats per minute.

Source: Q7 Part A, 2018 VCE Further Maths (1)

E2

The equation of this least squares line is closest to

A

\text{resting pulse rate } = 67.2 - 0.91 \times \text{ time spent exercising}

B

\text{resting pulse rate }= 67.2 - 1.10 \times \text{time spent exercising }

C

\text{resting pulse rate } = 68.3 - 0.91 \times \text{ time spent exercising}

D

\text{resting pulse rate } = 68.3 - 1.10 \times \text{ time spent exercising}

E

\text{resting pulse rate } = 67.2 + 1.10 \times \text{ time spent exercising}

Source: Q8 Part A, 2018 VCE Further Maths (1)

E3

The coefficient of determination is 0.8339.

The correlation coefficient r is closest to

A

-0.913

B

-0.834

C

-0.695

D

0.834

E

0.913

Source: Q9 Part A, 2018 VCE Further Maths (1)

E4

In a study of the association between a person’s height, in centimetres, and body surface area, in square metres, the following least squares line was obtained.

\text{body surface area}= -1.1+0.019 \times \text{height}

Which one of the following is a conclusion that can be made from this least squares line?

A

An increase of 1 \text{ m}^{2} in body surface area is associated with an increase of 0.019 \text{ cm} in height.

B

An increase of 1 \text{ cm} in height is associated with an increase of 0.019 \text{ m}^{2} in body surface area.

C

The correlation coefficient is 0.019.

D

A person’s body surface area, in square metres, can be determined by adding 1.1 \text{ cm} to their height.

E

A person’s height, in centimetres, can be determined by subtracting 1.1 from their body surface area, in square metres.

Source: Q10 Part A, 2018 VCE Further Maths (1)

E5

The congestion level in a city can also be recorded as the percentage increase in travel time due to traffic congestion in peak periods (compared to non-peak periods).

This is called the percentage congestion level.

The percentage congestion levels for the morning and evening peak periods for 19 large cities are plotted on the scatterplot below.

a

Determine the median percentage congestion level for the morning peak period and the evening peak period.

Write your answers in the appropriate boxes provided below.

Median percentage congestion level for morning peak period \enspace ⬚ \%

Median percentage congestion level for evening peak period \enspace ⬚ \%

A least squares line is to be fitted to the data with the aim of predicting evening congestion level from morning congestion level.

The equation of this line is

\text{evening congestion level} = 8.48 + 0.922 \times \text{morning congestion level}

b

Name the dependent variable in this equation.

c

Use the equation of the least squares line to predict the evening congestion level when the morning congestion level is 60\%.

d

Determine the residual value when the equation of the least squares line is used to predict the evening congestion level when the morning congestion level is 47\%.

Round your answer to one decimal place.

e

The value of the correlation coefficient r is 0.92.

What percentage of the variation in the evening congestion level can be explained by the variation in the morning congestion level?

Round your answer to the nearest whole number.

Source: Q2 Part A, 2018 VCE Further Maths (2)

Create a scatter plot to represent the relationship between two variables, determine the correlation between these variables by testing different regression models using technology, and use a model to make predictions when appropriate.

Describe the value of mathematical modelling and how it is used in real life to inform decisions.