6. Descriptive Statistics

Lesson

When trying to determine if there is a relationship between two variables, we will collect data for various values of the **independent variable** and the corresponding values for the **dependent variable**.

The first step in determining the presence and type of relationship is to plot the data on a scatter plot. The independent variable goes on the horizontal $x$-axis and the dependent variable goes on the vertical $y$-axis.

Once we have a scatter plot, we can start to perform analysis such as determining correlation and a line of best fit.

A correlation is a way of expressing a relationship between two variables and, more specifically, how strongly pairs of data are related. We describe the correlation from data using language like positive correlation, negative correlation or no correlation. We might even say that two variables have strong or weak correlation.

Watch out!

Just because two variables correlate, even to a high degree, it does not imply that one *causes* the other. For example, there is a high degree of correlation between height and stride length. However, it doesn't mean that if you take big steps you'll grow taller!

A pattern in a scatter plot can reveal whether or not two measurements are connected to each other. In other words, the presence of a pattern signals that the two sets of data correlate. We will focus on linear patterns in this chapter.

This linear relationship can be seen through close and consistent grouping in a scatter plot. The more closely the dots resemble a straight line, the stronger the correlation between the variables.

A positive correlation is when the data points gather around a pattern that rises from left to right, similar to a straight line with a positive slope.

In other words, as one variable increases, the other variable also increases.

There are varying degrees of positive correlation:

- Perfect positive correlation, where the points lie exactly on a straight line with a positive slope.
- Strong positive correlation, where the points closely resemble a straight line with a positive slope.
- Weak positive correlation, where the points are more widely scattered but the overall trend is still positive.

For example, the scatter plot below shows a strong positive correlation between a person's height and arm span. You can see that as the first variable increases, the second increases too.

Here is a table of the types of positive correlation.

A negative correlation is when the data points gather around a pattern that falls from left to right, similar to a straight line with a negative slope.

In other words, as one variable increases, the other one decreases.

Like positive correlation, there are varying degrees of negative correlation:

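- Perfect negative correlation, where the points lie exactly on a straight line with a negative slope.
- Strong negative correlation, where the points closely resemble a straight line with a negative slope.
- Weak negative correlation, where the points are more widely scattered but the overall trend is still negative.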

The next scatter plot shows a strong negative correlation. You can see that as the first variable increases, the second variable decreases.

Here is a table of the types of negative correlation.

No correlation is when there is no relationship between the variables.

This means that the points appear randomly scattered, with no pattern linking the two sets of data.

Here is a diagram of no correlation
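The direction and strength of a linear correlation can also be summarised with a single number. The sketch below computes Pearson's correlation coefficient $r$, a standard measure that this lesson does not introduce by name, so treat it as an optional aside. The function name `pearson_r` and both data sets are made up for illustration. Values of $r$ near $1$ suggest a strong positive linear correlation, values near $-1$ a strong negative one, and values near $0$ little or no linear correlation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up illustrative data (not taken from the lesson).
x = [1, 2, 3, 4, 5]
rising = [3, 5, 7, 9, 11]   # lies exactly on y = 2x + 1
falling = [9, 8, 7, 6, 5]   # lies exactly on y = 10 - x

r_rising = pearson_r(x, rising)    # perfect positive correlation
r_falling = pearson_r(x, falling)  # perfect negative correlation
print(r_rising, r_falling)
```

Because both made-up data sets lie exactly on a straight line, the results sit at the two extremes of the scale. Real data will usually fall somewhere in between.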

Identify the type of correlation in the following scatter plot.

- A. Weak positive correlation
- B. Weak negative correlation
- C. No correlation
- D. Strong negative correlation
- E. Strong positive correlation

Consider the two variables: time spent studying and exam performance.

Is there likely to be a relationship between the two?

- A. Yes
- B. No

Do you think the correlation is positive or negative?

- A. Positive
- B. Negative

The following table shows the number of traffic accidents associated with a sample of drivers of different age groups.

| Age | Accidents |
|---|---|
| $20$ | $41$ |
| $25$ | $44$ |
| $30$ | $39$ |
| $35$ | $34$ |
| $40$ | $30$ |
| $45$ | $25$ |
| $50$ | $22$ |
| $55$ | $18$ |
| $60$ | $19$ |
| $65$ | $17$ |

Which of the following scatter plots correctly represents the above data?

[Scatter plot options A, B and C]

Is the correlation between a person's age and the number of accidents they are involved in positive or negative?

- A. Positive
- B. Negative

Is the correlation between a person's age and the number of accidents they are involved in strong or weak?

- A. Strong
- B. Weak

Which age group's data represent an outlier?

- A. 30-year-olds
- B. None of them
- C. 65-year-olds
- D. 20-year-olds

A line of best fit (or trend line) is a straight line that best represents the data on a scatter plot. Depending on the strength of the correlation, this line may pass exactly through all of the points, some of the points, or none of the points. However, it always represents the general trend of the data.

Lines of best fit are really handy as we can use them to help us make predictions or conclusions about the data.

To draw a line of best fit by eye, balance the number of points above the line with the number of points below the line. You should generally ignore outliers as they can skew the line of best fit. Later we will look at how we can calculate a line of best fit's equation.
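As a preview of calculating the equation, the sketch below fits a line to the age and accident table from earlier in this lesson. It assumes the least-squares method, which is one common way to calculate a line of best fit and may not be the method used later in the course. The function name `least_squares_line` is made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = (sum of products of deviations) / (sum of squared x deviations).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The least-squares line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Data from the age and accidents table in this lesson.
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
accidents = [41, 44, 39, 34, 30, 25, 22, 18, 19, 17]

slope, intercept = least_squares_line(ages, accidents)
print(f"accidents = {slope:.2f} * age + {intercept:.2f}")
```

The negative slope agrees with what the scatter plot suggests: in this sample, the number of accidents tends to fall as age increases.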

Below is an example of what a good line of best fit might look like.

If the points appear to lie close to a line, we conclude that a relationship probably exists and it is safe to make predictions using a line of best fit. Making predictions inside the range of the data is called interpolation.

In a well-designed experiment, a researcher is careful not to use the fitted line to predict the response for values of the independent variable that lie outside the range of values used in the experiment. For example, if the smallest value of the independent variable in the experiment was 10 and the largest was 85, then it would be unwise to try to predict the response when the independent variable is smaller than 10 or larger than 85.

Making such predictions beyond the range of the data is called extrapolation and is considered unsafe.
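The distinction between the two kinds of prediction comes down to a range check. The sketch below is a minimal illustration: the function name `prediction_type` and the list of observed values are hypothetical, chosen only so that the smallest value is 10 and the largest is 85, as in the example above.

```python
def prediction_type(x, observed_xs):
    """Label a prediction at x as interpolation or extrapolation."""
    # Inside the observed range (endpoints included) counts as interpolation.
    if min(observed_xs) <= x <= max(observed_xs):
        return "interpolation"
    return "extrapolation"


# Hypothetical observed values of the independent variable, from 10 to 85.
observed = [10, 25, 40, 60, 85]

for x in (50, 5, 90):
    print(x, prediction_type(x, observed))
```

A check like this only says whether a prediction falls inside the data range. It does not say how reliable the prediction is, which still depends on how closely the points follow the line.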

The data points illustrated in the graph below show the sale price of an item of goods measured against the age of the item. Given the value of either the independent or dependent variable, we can determine the corresponding value using the line of best fit as seen below.

The regression line can be used to predict that at 23 months the value of the goods will be approximately $\$1250$. Considering the amounts by which the data points are above and below the regression line, it could easily happen that the estimated value of the goods at 23 months is $\$100$ too low or too high.

The following scatter plot shows the data for two variables, $x$ and $y$.

Determine which of the following graphs contains the line of best fit.

[Graph options A, B, C and D]

Use the line of best fit to estimate the value of $y$ when $x=4.5$.

- A. $4.5$
- B. $5$
- C. $5.5$
- D. $6$

Use the line of best fit to estimate the value of $y$ when $x=9$.

- A. $6.5$
- B. $7$
- C. $8.4$
- D. $9.5$

The number of fish in a river is measured over a five-year period.

The results are shown in the following table and plotted below.

| Time in years ($t$) | $0$ | $1$ | $2$ | $3$ | $4$ | $5$ |
|---|---|---|---|---|---|---|
| Number of fish ($F$) | $1903$ | $1994$ | $1995$ | $1602$ | $1695$ | $1311$ |

[Graph: number of fish $F$ plotted against time $t$]

Which line best fits the data?

[Graph options A, B, C and D]

Predict the number of years until there are no fish left in the river.

Predict the number of fish remaining in the river after $7$7 years.

According to the line of best fit, how many years are there until there are $900$900 fish left in the river?
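The three predictions above can be checked numerically once a line has been chosen. The sketch below assumes a least-squares line fitted to the table of fish counts, which may differ slightly from the line drawn in the answer graphs. The names `least_squares_line`, `predict_fish` and `years_until` are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Data from the fish table in this lesson.
years = [0, 1, 2, 3, 4, 5]
fish = [1903, 1994, 1995, 1602, 1695, 1311]

slope, intercept = least_squares_line(years, fish)


def predict_fish(t):
    """Estimated number of fish after t years, read off the fitted line."""
    return slope * t + intercept


def years_until(target):
    """Solve slope * t + intercept = target for t."""
    return (target - intercept) / slope


years_to_zero = years_until(0)    # no fish left
fish_after_7 = predict_fish(7)    # fish remaining after 7 years
years_to_900 = years_until(900)   # 900 fish left

print(round(years_to_zero, 1), round(fish_after_7), round(years_to_900, 1))
```

All three values lie beyond the five years of measurements, so they are extrapolations and should be treated with the caution described earlier in the lesson.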

Represent data on two quantitative variables on a scatter plot, and describe how the variables are related. [Linear focus; discuss general principle.]

Fit a function to the data; use functions fitted to data to solve problems in the context of the data. Use given functions or choose a function suggested by the context. Emphasize linear, quadratic, and exponential models. [Linear focus; discuss general principle.]

Fit a linear function for a scatter plot that suggests a linear association.

Interpret the slope (rate of change) and the intercept (constant term) of a linear model in the context of the data.