Bivariate data is not always linear. However, a transformation can often linearise it, or at least make it closer to linear, so that a least-squares linear regression model can be used. Three common approaches are the squared, logarithmic, and reciprocal transformations, and the most appropriate choice depends on the shape of the scatterplot of the bivariate data. The transformation circle below shows different possible shapes of data and some possible transformations to linearity:
The basic steps for performing a data transformation (this is most often done using technology) are:
1. Input the explanatory and response variables into lists.
2. Draw a scatterplot and decide whether the data is linear or non-linear.
3. If it is non-linear, use the circle of transformations to determine an appropriate transformation.
4. Transform the data accordingly, in a new list.
5. Apply a linear regression using the transformed data and calculate the Pearson correlation coefficient r, coefficient of determination r^2 and the regression equation (include the transformed variable as part of the equation).
6. Draw a scatterplot of the transformed data. Is it linear?
7. Draw a residual plot of the transformed data. Does this plot confirm the assumption of linearity?
8. If needed, repeat this process with other transformations to compare results and determine the best fit.
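The steps above can also be sketched in code rather than on a calculator. The following is a minimal illustration in plain Python, not a prescribed method: the helper names `linear_fit` and `residuals` are hypothetical, and the lists reuse the x and y values from the squared-transformation example in this section. It fits a line to each candidate version of the explanatory variable, reports r and r^2, and lists the residuals so their pattern can be inspected.

```python
import math

def linear_fit(ts, ys):
    """Least-squares line y = a + b*t; returns (a, b, r)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    stt = sum((t - mean_t) ** 2 for t in ts)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sty = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    b = sty / stt                    # slope
    a = mean_y - b * mean_t          # intercept
    r = sty / math.sqrt(stt * syy)   # Pearson correlation coefficient
    return a, b, r

def residuals(ts, ys, a, b):
    """Observed y minus fitted y for each data point."""
    return [y - (a + b * t) for t, y in zip(ts, ys)]

# Data from the squared-transformation example in this section.
x = [0, 1.1, 2.2, 4, 6, 8, 10]
y = [39.8, 44.3, 42.3, 38.7, 30.8, 20.8, 0.4]

# Candidate versions of the explanatory variable to compare.
candidates = {
    "x (no transformation)": x,
    "x^2 (squared)": [v ** 2 for v in x],
}

results = {}
for label, t in candidates.items():
    a, b, r = linear_fit(t, y)
    res = residuals(t, y, a, b)
    results[label] = {"a": a, "b": b, "r": r, "r2": r ** 2, "residuals": res}
    print(f"{label}: y = {a:.4f} + ({b:.4f})t, r = {r:.4f}, r^2 = {r ** 2:.4f}")
    print("  residuals:", [round(e, 2) for e in res])
```

Residuals that are negative at both ends and positive in the middle (or the reverse) indicate a curved pattern, which suggests that the linear model is not appropriate for that version of the data.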
Squared transformation
Consider the following data set:
x | 0 | 1.1 | 2.2 | 4 | 6 | 8 | 10 |
---|---|---|---|---|---|---|---|
y | 39.8 | 44.3 | 42.3 | 38.7 | 30.8 | 20.8 | 0.4 |
Enter this data into the calculator and examine the value of the correlation coefficient (for a linear fit) and the shape of the scatterplot:
The correlation coefficient suggests a strong negative linear relationship (r=-0.9145), but the scatterplot looks non-linear: the points follow a curve rather than a line. To become more linear, the data needs to be stretched out from the curve into a straight line. This can be done by squaring either the x-values or the y-values. For this example, transform the x-values by squaring them. This is a squared transformation.
The data now looks like this:
x | 0 | 1.1 | 2.2 | 4 | 6 | 8 | 10 |
---|---|---|---|---|---|---|---|
x^2 | 0 | 1.21 | 4.84 | 16 | 36 | 64 | 100 |
y | 39.8 | 44.3 | 42.3 | 38.7 | 30.8 | 20.8 | 0.4 |
Now fit a linear regression to this new data, analysing the correlation coefficient:
Notice how the data has now been stretched into a more linear shape. The correlation has become stronger (r\approx-0.987, compared with r=-0.9145 before the transformation), and this least-squares regression line can now be used to predict values of y.
The new transformed least-squares line could be written as y=43.9879-0.4090\left(x^2\right).
To predict a value of y for a given x-value, remember to square the x-value as this transformation uses x^2, then substitute into the least-squares regression. So for x=5:
\begin{aligned}y&=43.9879-0.4090\times 25 \\ y &= 33.7629\end{aligned}
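The worked example above can be checked with a short script. This is a sketch in plain Python under the assumption that an ordinary least-squares fit is what the calculator performs; the helper names `linear_fit` and `predict` are hypothetical. It squares the x-values, fits the line, and then predicts y for x=5 by squaring the raw x-value before substituting.

```python
import math

def linear_fit(ts, ys):
    """Least-squares line y = a + b*t; returns (a, b, r)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    stt = sum((t - mean_t) ** 2 for t in ts)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sty = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    b = sty / stt
    a = mean_y - b * mean_t
    r = sty / math.sqrt(stt * syy)
    return a, b, r

x = [0, 1.1, 2.2, 4, 6, 8, 10]
y = [39.8, 44.3, 42.3, 38.7, 30.8, 20.8, 0.4]

# Squared transformation: regress y on x^2 instead of x.
x_squared = [v ** 2 for v in x]
a, b, r = linear_fit(x_squared, y)

def predict(x_value):
    """Square the raw x-value first, then apply the fitted line."""
    return a + b * x_value ** 2

prediction = predict(5)
print(f"y = {a:.4f} + ({b:.4f})x^2, r = {r:.4f}")
print(f"predicted y at x = 5: {prediction:.4f}")
```

Because the script keeps the unrounded coefficients, its prediction can differ from the hand calculation in the last decimal places.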
Logarithmic transformation
A logarithmic transformation takes \log_{10} of each x-value and then follows the same procedure as above, so every x-value must be positive. This is a compressing transformation, that is, it "squeezes" the data points together into a more linear shape.
The equation of the transformed least-squares line would then be: y=a+b\log _{10}\left(x\right)
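As an illustration of the logarithmic transformation, the sketch below fits y=a+b\log_{10}(x) in plain Python. The data is synthetic, not taken from this lesson: the points were generated exactly from y=2+3\log_{10}(x) so that the recovered coefficients are easy to verify. The helper names `linear_fit` and `predict` are hypothetical.

```python
import math

def linear_fit(ts, ys):
    """Least-squares line y = a + b*t; returns (a, b)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    stt = sum((t - mean_t) ** 2 for t in ts)
    sty = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    b = sty / stt
    return mean_y - b * mean_t, b

# Synthetic points generated from y = 2 + 3*log10(x) (illustration only).
x = [1, 10, 100, 1000]
y = [2, 5, 8, 11]

# Logarithmic transformation: valid only when every x-value is positive.
log_x = [math.log10(v) for v in x]
a, b = linear_fit(log_x, y)

def predict(x_value):
    """Take log10 of the raw x-value first, then apply the fitted line."""
    return a + b * math.log10(x_value)

print(f"y = {a:.4f} + ({b:.4f})log10(x)")
print(f"predicted y at x = 50: {predict(50):.4f}")
```

As with the squared transformation, a prediction must apply the same transformation to the raw x-value before substituting it into the regression equation.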
Reciprocal transformation
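The reciprocal transformation replaces each x-value with \dfrac{1}{x} and then follows the same procedure as above, so it cannot be used if any x-value is 0. It is a stretching transformation. The equation of the transformed least-squares line would then be given by: y=a+b\dfrac{1}{x}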
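The reciprocal fit can be sketched the same way in plain Python. The data here is synthetic, not from this lesson: the points were generated exactly from y=1+\dfrac{4}{x}, so the fit should recover a=1 and b=4. The helper names `linear_fit`, `reciprocal` and `predict` are hypothetical, and the guard against x=0 is an added assumption about how invalid input should be handled.

```python
def linear_fit(ts, ys):
    """Least-squares line y = a + b*t; returns (a, b)."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    stt = sum((t - mean_t) ** 2 for t in ts)
    sty = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    b = sty / stt
    return mean_y - b * mean_t, b

def reciprocal(xs):
    """Return 1/x for each value; the transformation is undefined at x = 0."""
    if any(v == 0 for v in xs):
        raise ValueError("reciprocal transformation is undefined for x = 0")
    return [1 / v for v in xs]

# Synthetic points generated from y = 1 + 4/x (illustration only).
x = [1, 2, 4, 8]
y = [5, 3, 2, 1.5]

a, b = linear_fit(reciprocal(x), y)

def predict(x_value):
    """Take the reciprocal of the raw x-value, then apply the fitted line."""
    return a + b / x_value

print(f"y = {a:.4f} + ({b:.4f})(1/x)")
print(f"predicted y at x = 5: {predict(5):.4f}")
```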
Consider the data in the table below:
x | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
y | 3 | 5 | 8 | 12 | 22 | 30 | 50 | 80 |
Determine which transformation best linearises this data.
Consider the following data set.
x | 1 | 1.8 | 2.2 | 3 | 4 | 4.5 | 5.8 | 6.2 | 7 | 9 |
---|---|---|---|---|---|---|---|---|---|---|
y | 1 | 10 | 16.4 | 37 | 53 | 79 | 129.6 | 155.8 | 187 | 321 |
Use technology to create a scatterplot of this data. The shape of the data appears to be:
Which transformation should we perform to linearise the data?
Complete the transformation of the x-values of the data and fill in this table.
x^2 | | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|
y | 1 | 10 | 16.4 | 37 | 53 | 79 | 129.6 | 155.8 | 187 | 321 |
Find the correlation coefficient of the transformed data set. Give your answer to two decimal places.
Find the least-squares regression line of your transformed data set. Give your answer in the form y=ax^2+b, where a and b are given to one decimal place.
Predict the value for y when x=70. Enter your answer to two decimal places.
Choose the description which best describes the validity of this prediction.