Bivariate data arises when a study aims to determine whether there is a relation between two variable quantities. The quantities under investigation are called the explanatory variable and the response variable or, equivalently, the independent variable and the dependent variable.
If the dependent variable is found to be related in a definite way to the values taken by the independent variable, then further research may show that there is a causal relationship between the two. This is not necessarily the case, however, because the two quantities may both be varying in response to changes in a third factor and so it could not be claimed that one of the variables being studied has had a causal effect on the other.
The results of bivariate data investigations can be displayed graphically using a scatter plot. The level of the independent variable corresponds to a position on the horizontal axis and the resulting value of the dependent variable corresponds to a distance along the vertical axis. In this way, each data point is displayed as a point in a two-dimensional coordinate system.
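The mapping from data values to plot positions described above can be sketched in a few lines. This is only an illustrative text-based plot, not a replacement for proper graphing software; the function name `ascii_scatter` and its `width` and `height` parameters are assumptions made for this example. The sample values are the $X$ and $Y$ pairs from the experiment table in the practice questions on this page.

```python
def ascii_scatter(points, width=40, height=12):
    """Return a text scatter plot.

    The independent variable (x) runs left to right and the
    dependent variable (y) runs bottom to top.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    # Avoid dividing by zero when all values on an axis are identical.
    x_span = (max(xs) - x_min) or 1
    y_span = (max(ys) - y_min) or 1

    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        col = round((x - x_min) / x_span * (width - 1))
        row = round((y - y_min) / y_span * (height - 1))
        # Row 0 of the grid is the top of the plot, so flip vertically.
        grid[height - 1 - row][col] = "*"
    return "\n".join("".join(line).rstrip() for line in grid)


# (X, Y) pairs taken from the experiment table in the practice questions.
data = [(2, 2), (4, 4), (7, 6), (9, 8), (12, 12), (15, 18), (17, 28), (20, 38)]
print(ascii_scatter(data))
```

Each data point becomes one `*`, placed by its horizontal position (independent variable) and vertical position (dependent variable).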
A correlation is a way of expressing a relationship between two variables and, more specifically, how strongly pairs of data are related. We describe the correlation in a data set using terms like positive correlation, negative correlation or no correlation. We can refine the description further with the qualifiers strong or weak.
The nature of a relation between variables can often be determined by looking at a scatter plot. If the data points lie on or close to a line, a linear relation is strongly suggested. An algebraic model of the form $y=ax+b$ may then be proposed as a summary of the relation. If the points are not close to a line but still display a generally linear trend, a weak linear relation may be said to exist. Techniques are available to find the best-fitting linear model for any bivariate data set, whether or not there is a genuine linear relation.
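One widely used technique for finding a best-fitting line is the method of least squares, which picks the $a$ and $b$ that minimise the sum of squared vertical distances from the points to the line. The text does not say which technique it has in mind, so treat the following as a minimal sketch of one common choice; the function name `fit_line` and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Hypothetical measurements that lie close to a rising line.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
a, b = fit_line(xs, ys)
print(f"y = {a:.2f}x + {b:.2f}")
```

Note that the function returns a line for any data set with at least two distinct $x$ values, which is exactly why a fitted line on its own is not evidence of a genuine linear relation.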
The slope of a line fitted to a data set may be positive or negative, depending on the sign of the coefficient $a$ in the formula $y=ax+b$. This corresponds to whether the response variable increases or decreases, respectively, in response to an increase in the explanatory variable.
A positive correlation is when the data points appear to gather along a pattern resembling a straight line with a positive gradient.
In other words, as one variable increases, the other variable also increases.
There are three types of positive correlation:
You may also come across a moderate correlation, which lies between weak and strong. Here are some examples of positive correlations.
A negative correlation is when the data points appear to gather along a pattern resembling a straight line with a negative gradient.
In other words, as one variable increases, the other one decreases.
Like positive correlation, there are three types of negative correlation:
Here are some examples of negative correlations.
No correlation is when there is no linear relationship between the variables.
This means that the data points are either scattered randomly or follow a nonlinear pattern.
Here is a diagram showing no correlation.
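The descriptions above are judged by eye, but the same vocabulary can be attached to a number. A common measure, not introduced in this lesson, is Pearson's correlation coefficient $r$, which ranges from $-1$ to $1$. The sketch below computes $r$ and turns it into a verbal description. The cutoff values used for weak, moderate and strong are arbitrary assumptions chosen for illustration, and the function names are hypothetical. Because $r$ only measures linear association, a value near zero can also come from a nonlinear pattern.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("one variable is constant; r is undefined")
    return sxy / math.sqrt(sxx * syy)


def describe(r):
    """Describe r in words, using illustrative (assumed) cutoffs."""
    size = abs(r)
    if size < 0.2:
        return "no correlation"
    if size < 0.5:
        strength = "weak"
    elif size < 0.8:
        strength = "moderate"
    else:
        strength = "strong"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


# Hypothetical data: y tends to fall as x rises.
xs = [1, 2, 3, 4, 5, 6]
ys = [9.5, 8.1, 7.4, 5.2, 4.8, 2.9]
r = pearson_r(xs, ys)
print(round(r, 2), describe(r))
```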
Shapes other than a line may be apparent in a scatter plot. If the data points lie on or near a curve, it may be appropriate to infer a non-linear relation between the variables. It is possible, for example, to find the best-fitting polynomial of a given degree, or some other function, that reasonably describes the observed effect. Non-linear relations would still be described as strong, moderate or weak, and as positive or negative, depending on how closely the points follow the chosen curve and whether the curve is increasing or decreasing.
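As an illustration of fitting a polynomial of a given degree, the sketch below finds the least-squares quadratic $y=c_2x^2+c_1x+c_0$ by solving the normal equations directly. Least squares is an assumption here, since the text does not name a fitting method, and the function name `fit_quadratic` is hypothetical. The sample values are the $X$ and $Y$ pairs from the experiment table in the practice questions on this page.

```python
def fit_quadratic(xs, ys):
    """Return (c2, c1, c0) for the least-squares curve y = c2*x^2 + c1*x + c0."""
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three (x, y) pairs of equal length")
    # Power sums: s[k] = sum of x^k, t[k] = sum of x^k * y.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix of the normal equations for unknowns (c0, c1, c2).
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]

    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("x values do not determine a unique quadratic")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, 3):
            factor = m[row][col] / m[col][col]
            for k in range(col, 4):
                m[row][k] -= factor * m[col][k]

    # Back substitution, from the last unknown to the first.
    coeffs = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        known = sum(m[i][j] * coeffs[j] for j in range(i + 1, 3))
        coeffs[i] = (m[i][3] - known) / m[i][i]
    c0, c1, c2 = coeffs
    return c2, c1, c0


# (X, Y) pairs taken from the experiment table in the practice questions.
xs = [2, 4, 7, 9, 12, 15, 17, 20]
ys = [2, 4, 6, 8, 12, 18, 28, 38]
c2, c1, c0 = fit_quadratic(xs, ys)
print(f"y = {c2:.3f}x^2 + {c1:.3f}x + {c0:.3f}")
```

As with a fitted line, a curve can be fitted to any data set, so the fit should be checked against the scatter plot before a non-linear relation is claimed.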
Other features that may appear in scatter plots include clustering and outliers.
In an observational study, a gap in the values available in the explanatory variable may create the appearance of clusters in the response values. Such a feature might arise when the population under consideration is made up of distinct sub-populations. Even without gaps in the values of the independent variable there may be distinct sets of data points in which different trends are apparent, indicating the existence of different groups within the population.
An outlier occurs where a single value of the response variable is very different from neighbouring values. An outlier might be due to measurement error, but not necessarily. An outlier should not be discarded from the data set before looking for a satisfactory explanation.
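A simple way to draw attention to candidate outliers is to measure how far each point sits from a fitted line and flag the points whose distance is unusually large compared with the rest. The sketch below does this under two assumptions that do not come from the text: the overall trend is roughly linear, and "unusually large" means more than `k` times the median absolute deviation of the residuals, with `k = 5` chosen arbitrarily. The function name `flag_outliers` and the data are hypothetical.

```python
from statistics import median


def flag_outliers(points, k=5.0):
    """Return the points whose residual from a least-squares line is unusual.

    A point is flagged when its residual differs from the median residual
    by more than k times the median absolute deviation (an assumed rule).
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; cannot fit a line")
    a = sum((x - mean_x) * (y - mean_y) for x, y in points) / sxx
    b = mean_y - a * mean_x

    # Vertical distance of each point from the fitted line.
    residuals = [y - (a * x + b) for x, y in points]
    centre = median(residuals)
    deviations = [abs(r - centre) for r in residuals]
    typical = median(deviations)
    if typical == 0:
        # No spread to compare against, so nothing can be singled out.
        return []
    return [p for p, d in zip(points, deviations) if d > k * typical]


# Hypothetical data following a rising trend, with one unusual response value.
points = [(1, 1.2), (2, 1.9), (3, 3.1), (4, 3.8), (5, 20),
          (6, 6.1), (7, 6.9), (8, 8.2), (9, 8.8)]
print(flag_outliers(points))
```

The function only reports candidates. In keeping with the advice above, a flagged point should be investigated rather than deleted automatically.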
Consider the following graph:
The correlation is: (select the best answer)
No Correlation
Linear Negative
Nonlinear
Linear Positive
The correlation features:
Gaps
Outliers
Clusters
None
The scatter plot shows the relationship between sea temperatures and the amount of healthy coral.
Describe the correlation between sea temperature and the amount of healthy coral.
Select all that apply.
Weak
Negative
Strong
Positive
Which variable is the dependent variable?
Sea temperature
Level of healthy coral
Which variable is the independent variable?
Sea temperature
Level of healthy coral
The following table shows the results of an experiment.
$X$ | $2$ | $4$ | $7$ | $9$ | $12$ | $15$ | $17$ | $20$ |
$Y$ | $2$ | $4$ | $6$ | $8$ | $12$ | $18$ | $28$ | $38$ |
Plot the data from the table on the graph below.
What is the type of correlation between the data points? Select the best answer.
Linear Positive
Linear Negative
Nonlinear
No Correlation