We have already learned to create a scatter plot and to perform analysis such as determining correlation. Another type of analysis we can perform on a scatter plot is to identify a line of best fit.
We describe the correlation from data using language like positive correlation, negative correlation or no correlation. We might even say that two variables have strong or weak correlation.
Positive correlation - the data appears to gather in a positive relationship, similar to a straight line with a positive slope.
Negative correlation - the data appears to gather in a negative relationship, similar to a straight line with a negative slope.
No correlation - there is no apparent relationship between the variables.
Below are some examples of scatter plots with different correlations:
[Figure: three scatter plots showing positive correlation, negative correlation and no correlation]
The more closely the plotted data resembles a straight line, the stronger the correlation is between the variables.
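One common way to put a number on "how closely the data resembles a straight line" is the Pearson correlation coefficient, which this lesson does not define. The sketch below is only an illustration under that assumption: the function name `pearson_r` and both data sets are hypothetical and are not taken from the lesson's graphs.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, made up for illustration only
xs = [1, 2, 3, 4, 5]
tight = [2, 4, 6, 8, 10]   # points lie exactly on a straight line
loose = [2, 5, 3, 8, 6]    # points rise overall but are scattered

print(pearson_r(xs, tight))  # 1.0, the strongest possible positive correlation
print(pearson_r(xs, loose))  # a positive value smaller than 1
```

A value near $1$ or $-1$ corresponds to a strong positive or negative correlation, and a value near $0$ corresponds to little or no linear correlation.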
Just because two variables have a correlation, even a strong one, does not mean that one causes the other. For example, there is a strong correlation between height and stride length. However, it doesn't mean that if you take big steps you'll grow taller!
A line of best fit (sometimes called a trend line or regression line) is a straight line that best represents the data on a scatter plot. Depending on the strength of the correlation, this line may pass exactly through all of the points, some of the points, or none of the points. However, it always represents the general trend of the data.
Lines of best fit are really handy as we can use them to help us make predictions or conclusions about the data.
To draw a line of best fit by eye, balance the number of points above the line with the number of points below the line. You should generally ignore outliers (points that fall very far from the rest of the data) as they can skew the line of best fit. Later we will look at how we can calculate a line of best fit's equation.
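The balancing rule can be checked for any candidate line by counting the points on each side of it. The following is a minimal sketch, assuming the candidate line is written as $y = mx + c$. The helper name `count_above_below` and the sample points are hypothetical.

```python
def count_above_below(points, slope, intercept):
    """Count points above, below and exactly on the line y = slope*x + intercept."""
    above = below = on = 0
    for x, y in points:
        line_y = slope * x + intercept  # height of the line at this x
        if y > line_y:
            above += 1
        elif y < line_y:
            below += 1
        else:
            on += 1
    return above, below, on

# Hypothetical points, made up for illustration only
points = [(1, 2.5), (2, 3.5), (3, 6.5), (4, 7.5), (5, 10.5), (6, 11.5)]

# Candidate line y = 2x
print(count_above_below(points, 2, 0))  # (3, 3, 0): balanced above and below
```

A balanced count alone does not guarantee a good fit, so the line should still follow the direction of the data.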
Below is an example of what a good line of best fit might look like.
If the points appear to lie close to a line, we conclude that a relationship probably exists and it is safe to make predictions using a line of best fit. Making predictions inside the range of the data is called interpolation.
In a well-designed experiment, a researcher is careful not to use the fitted line to predict the response for values of the independent variable that are outside the range of the values used in the experiment. For example, if the smallest value of the independent variable in the experiment was 10 and the largest was 85, then it would be unwise to try to predict what the response would be when the independent variable was smaller than 10 or larger than 85.
Making predictions beyond the range of the data is called extrapolation and is considered unsafe.
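The distinction comes down to a range check. Here is a small sketch using the example range of 10 to 85 from above. The function name `prediction_type` is hypothetical, and the sketch assumes that values at the ends of the range count as inside it.

```python
def prediction_type(x, x_min, x_max):
    """Classify a prediction at x relative to the observed range [x_min, x_max]."""
    if x_min <= x <= x_max:
        return "interpolation"
    return "extrapolation"

# Observed values of the independent variable ran from 10 to 85
print(prediction_type(40, 10, 85))  # interpolation
print(prediction_type(95, 10, 85))  # extrapolation
print(prediction_type(5, 10, 85))   # extrapolation
```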
The data points illustrated in the graph below show the sale price of an item of goods measured against the age of the item. Given the value of either the independent or dependent variable, we can determine the corresponding value using the line of best fit as seen below.
The regression line can be used to predict that at 23 months the value of the goods will be approximately $\$1250$. Considering the amounts by which the data points are above and below the regression line, it could easily happen that the estimated value of the goods at 23 months is $\$100$ too low or too high.
The following scatter plot shows the data for two variables, $x$ and $y$.
Determine which of the following graphs contains the line of best fit.
Use the line of best fit to estimate the value of $y$ when $x=4.5$.
$4.5$
$5$
$5.5$
$6$
Use the line of best fit to estimate the value of $y$ when $x=9$.
$6.5$
$7$
$8.4$
$9.5$
The number of fish in a river is measured over a five-year period.
The results are shown in the following table and plotted below.
Time in years ($t$) | $0$ | $1$ | $2$ | $3$ | $4$ | $5$ |
---|---|---|---|---|---|---|
Number of fish ($F$) | $1903$ | $1998$ | $1900$ | $1517$ | $1693$ | $1408$ |
Which line best approximates the data?
Use this line to predict the number of years until there are no fish left in the river.
Now predict the number of fish remaining in the river after $7$ years.
Predict how long it will be before there are $900$ fish left in the river.
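The candidate lines for this question are given as graphs that are not reproduced here, so the sketch below does not select among them. Instead it calculates a least-squares line from the table, which is one standard way to compute a line of best fit and may differ slightly from a line drawn by eye. The function names are hypothetical. Note that all three predictions fall outside the measured range of $0$ to $5$ years, so they are extrapolations and should be treated with caution.

```python
# Data from the table above
years = [0, 1, 2, 3, 4, 5]
fish = [1903, 1998, 1900, 1517, 1693, 1408]

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # line passes through the mean point
    return slope, intercept

slope, intercept = least_squares_line(years, fish)

def predict_fish(t):
    """Predicted number of fish after t years."""
    return slope * t + intercept

def years_until(target):
    """Time at which the line reaches the target number of fish."""
    return (target - intercept) / slope

print(f"F = {slope:.1f} t + {intercept:.0f}")               # F = -107.8 t + 2006
print(f"Fish after 7 years: {predict_fish(7):.0f}")         # about 1251
print(f"Years until no fish remain: {years_until(0):.1f}")  # about 18.6
print(f"Years until 900 fish: {years_until(900):.1f}")      # about 10.3
```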