7. Probability & Statistics

When looking at **bivariate data**, a **scatterplot** can be used to display the relationship between the two variables.

If the data has a linear trend, a **line of best fit** can be used to model the relationship between the variables. We can use technology to find the line of best fit for a scatterplot, then use the line to help us make predictions or draw conclusions about the data.

There are mathematical calculations we can use to measure the strength of the linear correlation between two variables.

Move each point in the applet to see how the correlation coefficient changes.

Arrange these points so the value of r is as large as possible. What do you notice?

Arrange these points so the value of r is as close to zero as possible. What do you notice?

Move the points so they are in a straight line, then move one point so it is an outlier. What happens to the correlation coefficient value?

The **correlation coefficient**, r, is a statistic that describes both the strength and direction of a linear correlation.

Perfect positive correlation, r=1

Perfect negative correlation, r=-1

Strong negative correlation, r=-0.974

Weak positive correlation, r=0.306

Moderate negative correlation, r=-0.684

No correlation, r=0.072
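To make the statistic concrete, here is a minimal sketch of how r could be computed from paired data. It assumes the usual Pearson formula (the sum of products of deviations from the means, divided by the square root of the product of the sums of squared deviations). The function name `pearson_r` and the sample points are hypothetical and chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Return the correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical points lying exactly on a rising line
xs = [1, 2, 3, 4, 5]
on_line = [2, 4, 6, 8, 10]
# The same points with the last one moved far below the line
with_outlier = [2, 4, 6, 8, 0]

r_line = pearson_r(xs, on_line)
r_outlier = pearson_r(xs, with_outlier)
print(r_line, r_outlier)
```

With these made-up points, the collinear set gives r=1, while moving a single point far from the line drops r to 0. This shows how sensitive the statistic can be to one outlier in a small data set.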

It is important to distinguish between **causal** relationships, where changes in one variable *cause* changes in the other, and **correlation**, where the two variables are related but one does not necessarily influence the other.

Even when two variables have a strong relationship and r is close to -1 or 1, we cannot say that one variable causes change in the other variable. Causation can only be determined from an appropriately designed statistical experiment.

When the correlation coefficient is close to -1 or 1, we can have more confidence in using the model to make predictions and draw conclusions.

A large sample can also give us more confidence in our conclusions because a large sample is more likely to be representative of the population. However, some types of data are hard to collect, and we may have to work with a smaller sample, knowing that our conclusions may be less reliable.

Data was collected on the number of concert tickets sold and the gross revenue generated by those ticket sales. The data is given in the table.

Tickets Sold | Gross Revenue (in million USD) |
---|---|
75,980 | 8.7 |
71,714 | 8.3 |
66,517 | 7.9 |
63,027 | 7.7 |
74,000 | 9.1 |
68,000 | 8.0 |
72,805 | 8.6 |
70,500 | 8.4 |
73,117 | 9.0 |
65,500 | 7.6 |
69,200 | 8.2 |
71,300 | 8.5 |
76,012 | 9.2 |

(a) Formulate a question that could be answered by the data.

(b) Create a scatterplot of the data.

(c) Find the line of best fit.

(d) Use the correlation coefficient to evaluate the strength of the model.

(e) Use the model to answer the question you formulated in part (a).

(f) Predict the gross revenue if a concert sells 77,000 tickets.
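One way to carry out parts (c), (d), and (f) with technology is sketched below. It assumes the line of best fit is the ordinary least-squares line, which is what most calculators and spreadsheets report. The helper name `fit_line` is hypothetical, and the data are copied from the table above.

```python
import math

# Data from the table: tickets sold and gross revenue (million USD)
tickets = [75980, 71714, 66517, 63027, 74000, 68000, 72805,
           70500, 73117, 65500, 69200, 71300, 76012]
revenue = [8.7, 8.3, 7.9, 7.7, 9.1, 8.0, 8.6,
           8.4, 9.0, 7.6, 8.2, 8.5, 9.2]

def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through the means
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

slope, intercept, r = fit_line(tickets, revenue)
predicted = slope * 77000 + intercept  # part (f)

print(f"revenue = {slope:.6f} * tickets + {intercept:.3f}")
print(f"r = {r:.3f}")
print(f"predicted revenue for 77,000 tickets: {predicted:.2f} million USD")
```

Note that 77,000 tickets is slightly above the largest value in the table (76,012), so the prediction in part (f) is a mild extrapolation and should be read with that in mind.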

A school principal was investigating the effect of class size on the amount of time a teacher can spend with small groups of students, where each student belonged to a group of 4 or fewer students. Their statistical question was, "What size should a class be for a teacher to be able to spend at least 10 minutes with students in small groups?"

(a) Describe a possible method that the principal could use to collect the data.

(b) The equation of the line of best fit shown is y=-0.401x+18.3, and the correlation coefficient is r=-0.95. Could this line of best fit be used to make reasonable predictions? Explain.

(c) Describe the relationship between the variables based on the model. Include the values of the domain for which the model is appropriate.

(d) Use the graph to answer the principal's statistical question.
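The principal's question can also be checked algebraically from the equation of the line. The sketch below assumes that x is the class size and y is the number of minutes the teacher spends with each small group. The graph is not reproduced here, so that reading of the axes, and the range of class sizes the data actually cover, are assumptions. The names `predicted_minutes`, `boundary`, and `largest_class` are hypothetical.

```python
import math

SLOPE = -0.401        # from the line of best fit y = -0.401x + 18.3
INTERCEPT = 18.3
TARGET_MINUTES = 10   # from the principal's statistical question

def predicted_minutes(class_size):
    """Minutes per small group predicted by the model (assumed meaning of y)."""
    return SLOPE * class_size + INTERCEPT

# Solve SLOPE * x + INTERCEPT >= TARGET_MINUTES for x.
# Dividing by the negative slope flips the inequality, so x <= boundary.
boundary = (TARGET_MINUTES - INTERCEPT) / SLOPE
largest_class = math.floor(boundary)  # class sizes are whole numbers

print(f"boundary: {boundary:.2f} students")
print(f"largest class size meeting the target: {largest_class}")
print(f"model at {largest_class}: {predicted_minutes(largest_class):.2f} minutes")
```

Under these assumptions the model suggests class sizes of about 20 or fewer. That answer is only meaningful if 20 lies within the range of class sizes in the principal's data.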

Idea summary

A line of best fit for a set of data can be used to interpret a given situation and make predictions about values not represented by the data. We can use technology to perform the linear regression analysis.

The **correlation coefficient**, r, is a statistic that describes both the strength and direction of a linear correlation.

Correlation does not imply causation.