When trying to determine if there is a relationship between two variables, we will collect data for various values of the independent variable and the corresponding values for the dependent variable.
The first step in determining the presence and type of relationship is to plot the data on a scatter plot. The independent variable goes on the horizontal $x$-axis and the dependent variable goes on the vertical $y$-axis.
Once we have a scatter plot, we can start to perform analysis such as determining correlation and a line of best fit.
A correlation is a way of expressing a relationship between two variables and, more specifically, how strongly pairs of data are related. We describe the correlation from data using language like positive correlation, negative correlation or no correlation. We might even say that two variables have strong or weak correlation.
Even when two variables correlate to a high degree, this does not imply that one causes the other. For example, there is a high degree of correlation between height and stride length. However, it doesn't mean that if you take big steps you'll grow taller!
A pattern in a scatter plot can reveal whether or not two measurements are connected to each other. In other words, the presence of a pattern signals that the two sets of data correlate. We will focus on linear patterns in this chapter.
This linear relationship can be seen through close and consistent grouping in a scatter plot. The more closely the dots resemble a straight line, the stronger the correlation between the variables.
A positive correlation is when the data appears to gather in a positive relationship, similar to a straight line with a positive slope.
In other words, as one variable increases, the other variable also increases.
There are varying degrees of positive correlation:
For example, the scatter plot below shows a strong positive correlation between a person's height and arm span. You can see that as the first variable increases, the second increases too.
Here is a table of positive style correlations
A negative correlation is when the data appears to gather in a negative relationship, similar to a straight line with a negative slope.
In other words, as one variable increases, the other one decreases.
Like positive correlation, there are varying degrees of negative correlation:
The next scatter plot shows a strong negative correlation. You can see that as the first variable increases, the second variable decreases.
Here is a table of negative style correlations
No correlation is when there is no relationship between the variables.
This means that the points are scattered randomly, with no apparent pattern linking the two sets of data.
Here is a diagram of no correlation
Identify the type of correlation in the following scatter plot.
Weak positive correlation
Weak negative correlation
No correlation
Strong negative correlation
Strong positive correlation
Consider the two variables: time spent studying and exam performance.
Is there likely to be a relationship between the two?
Yes
No
Do you think the correlation is positive or negative?
Positive
Negative
The following table shows the number of traffic accidents associated with a sample of drivers of different age groups.
Age | Accidents |
---|---|
$20$ | $41$ |
$25$ | $44$ |
$30$ | $39$ |
$35$ | $34$ |
$40$ | $30$ |
$45$ | $25$ |
$50$ | $22$ |
$55$ | $18$ |
$60$ | $19$ |
$65$ | $17$ |
Which of the following scatter plots correctly represents the above data?
Is the correlation between a person's age and the number of accidents they are involved in positive or negative?
Positive
Negative
Is the correlation between a person's age and the number of accidents they are involved in strong or weak?
Strong
Weak
Which age group's data represent an outlier?
30-year-olds
None of them
65-year-olds
20-year-olds
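The direction and strength asked about in the age and accidents question above can also be checked numerically. The sketch below computes Pearson's correlation coefficient $r$ for the table data. Note that $r$ is not defined in this section, so treat it as an assumed measure: values near $1$ or $-1$ suggest a strong linear correlation, and values near $0$ suggest a weak one or none. The function name `pearson_r` is only illustrative.

```python
from math import sqrt

# Data copied from the age/accidents table above.
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
accidents = [41, 44, 39, 34, 30, 25, 22, 18, 19, 17]


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for paired data.

    Assumption: this measure is not introduced in the lesson text;
    it is used here only as one way to quantify linear correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(ages, accidents)
direction = "positive" if r > 0 else "negative"
print(f"r = {r:.3f} ({direction})")
```

The sign of `r` gives the direction of the correlation, and how close its size is to $1$ gives a rough sense of the strength.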
A line of best fit (or trend line) is a straight line that best represents the data on a scatter plot. Depending on the strength of the correlation, this line may pass exactly through all of the points, some of the points, or none of the points. However, it always represents the general trend of the data.
Lines of best fit are really handy as we can use them to help us make predictions or conclusions about the data.
To draw a line of best fit by eye, balance the number of points above the line with the number of points below the line. You should generally ignore outliers as they can skew the line of best fit. Later we will look at how we can calculate a line of best fit's equation.
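As a rough illustration of the balancing idea, the sketch below fits a line to the age and accidents table from earlier and counts how many points fall above and below it. It assumes the least-squares method as the way to calculate the line, which is one common choice and not necessarily the method used later in this chapter. The names `fit_line` and `count_sides` are hypothetical.

```python
# Data copied from the age/accidents table earlier in this section.
ages = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
accidents = [41, 44, 39, 34, 30, 25, 22, 18, 19, 17]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept.

    Assumption: least squares is used as the definition of "best fit" here.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def count_sides(xs, ys, slope, intercept):
    """Count the data points lying strictly above and strictly below the line."""
    above = sum(1 for x, y in zip(xs, ys) if y > slope * x + intercept)
    below = sum(1 for x, y in zip(xs, ys) if y < slope * x + intercept)
    return above, below


slope, intercept = fit_line(ages, accidents)
above, below = count_sides(ages, accidents, slope, intercept)
print(f"y = {slope:.3f}x + {intercept:.3f}")
print(f"points above: {above}, points below: {below}")
```

A least-squares line does not guarantee equal counts on each side, but for roughly linear data the two counts are usually close, which is what fitting by eye aims for.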
Below is an example of what a good line of best fit might look like.
If the points appear to lie close to a line, we conclude that a relationship probably exists and it is safe to make predictions using a line of best fit. Making predictions inside the range of the data is called interpolation.
In a well-designed experiment, a researcher is careful not to use the fitted line to predict the response for values of the independent variable that lie outside the range of values used in the experiment. For example, if the smallest value of the independent variable in the experiment was 10 and the largest was 85, then it would be unwise to try to predict the response when the independent variable is smaller than 10 or larger than 85.
Making such predictions beyond the range of the data is called extrapolation and is considered unsafe.
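The distinction can be expressed as a simple range check. The sketch below uses the range from the example above (smallest value 10, largest value 85). The function name `classify_prediction` is hypothetical, and treating the endpoints as inside the range is an assumption.

```python
def classify_prediction(x, x_min, x_max):
    """Label a prediction at x as interpolation or extrapolation.

    x_min and x_max are the smallest and largest values of the
    independent variable in the data. Endpoints count as inside
    the range (an assumption made for this sketch).
    """
    if x_min <= x <= x_max:
        return "interpolation"
    return "extrapolation"


# Range taken from the example in the text: data collected from 10 to 85.
for x in (5, 10, 50, 85, 90):
    print(x, classify_prediction(x, 10, 85))
```

A check like this could be run before using a line of best fit, so that any extrapolated prediction is treated with extra caution.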
The data points illustrated in the graph below show the sale price of an item of goods measured against the age of the item. Given the value of either the independent or dependent variable, we can determine the corresponding value using the line of best fit as seen below.
The regression line can be used to predict that at 23 months the value of the goods will be approximately $\$1250$. Considering the amounts by which the data points are above and below the regression line, it could easily happen that the estimated value of the goods at 23 months is $\$100$ too low or too high.
The following scatter plot shows the data for two variables, $x$ and $y$.
Determine which of the following graphs contains the line of best fit.
Use the line of best fit to estimate the value of $y$ when $x=4.5$.
$4.5$
$5$
$5.5$
$6$
Use the line of best fit to estimate the value of $y$ when $x=9$.
$6.5$
$7$
$8.4$
$9.5$
The number of fish in a river is measured over a five-year period.
The results are shown in the following table and plotted below.
Time in years ($t$) | $0$ | $1$ | $2$ | $3$ | $4$ | $5$ |
---|---|---|---|---|---|---|
Number of fish ($F$) | $1903$ | $1998$ | $1900$ | $1517$ | $1693$ | $1408$ |
Which line best approximates the data?
Use this line to predict the number of years until there are no fish left in the river.
Now predict the number of fish remaining in the river after $7$ years.
Predict how long it will be before there are $900$ fish left in the river.
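One way to explore these fish predictions is to fit a line to the table data and solve for the quantities asked. The sketch below assumes a least-squares fit, so its line may differ from the lines offered as options in the question, and its answers should be read as estimates only. The name `fit_line` is hypothetical.

```python
# Data copied from the fish table above.
years = [0, 1, 2, 3, 4, 5]
fish = [1903, 1998, 1900, 1517, 1693, 1408]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line.

    Assumption: least squares stands in for "the line that best
    approximates the data"; the question's own line may differ.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(years, fish)

# Solve F = slope * t + intercept for each quantity of interest.
years_until_zero = -intercept / slope        # t when F = 0
fish_after_7 = slope * 7 + intercept         # F when t = 7
years_until_900 = (900 - intercept) / slope  # t when F = 900

print(f"F = {slope:.1f}t + {intercept:.1f}")
print(f"no fish left after about {years_until_zero:.1f} years")
print(f"about {fish_after_7:.0f} fish after 7 years")
print(f"900 fish left after about {years_until_900:.1f} years")
```

All three predictions use values of $t$ beyond the five years of measured data, so they are extrapolations and should be treated with caution.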