VCE 12 General 2023

3.03 Data transformation

Lesson

Data transformation

Bivariate data isn't always linear. However, a transformation can be used to attempt to linearise it, or make it more "linear", so that a least-squares linear regression model can be used. Three common approaches are the squared, logarithmic, and reciprocal transformations, and the most appropriate choice depends on the shape (distribution) of the bivariate data. The transformation circle below shows different possible shapes of data and some possible transformations to linearity:

The image shows different possible transformations and their respective equations. Ask your teacher for more information.

The basic steps for performing a data transformation (this is most often done using technology) are:

  1. Input the explanatory and response variables into lists.

  2. Draw a scatterplot and decide whether the data is linear or non-linear.

  3. If it is non-linear, use the circle of transformations to determine an appropriate transformation.

  4. Transform the data accordingly, in a new list.

  5. Apply a linear regression using the transformed data and calculate the Pearson correlation coefficient r, coefficient of determination r^2 and the regression equation (include the transformed variable as part of the equation).

  6. Draw a scatterplot of the transformed data. Is it linear?

  7. Draw a residual plot of the transformed data. Does this plot confirm the assumption of linearity?

  8. If needed, repeat this process with other transformations to compare results and determine the best fit.
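The steps above can be sketched in Python as an alternative to a CAS calculator. This is a minimal illustration, not the lesson's prescribed method: it assumes NumPy is available, the helper name `fit_line` is hypothetical, and the data is the set used in the squared transformation section of this lesson. Steps 2, 3 and 6 are visual checks, so the sketch only prints the numbers that would support them.

```python
import numpy as np

def fit_line(u, y):
    """Least-squares line y = a + b*u, with r, r^2 and residuals (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    y = np.asarray(y, dtype=float)
    b, a = np.polyfit(u, y, 1)       # polyfit returns the slope first, then the intercept
    r = np.corrcoef(u, y)[0, 1]      # Pearson correlation coefficient
    residuals = y - (a + b * u)      # actual value minus predicted value
    return a, b, r, r**2, residuals

# Step 1: explanatory and response variables as lists
x = np.array([0, 1.1, 2.2, 4, 6, 8, 10])
y = np.array([39.8, 44.3, 42.3, 38.7, 30.8, 20.8, 0.4])

# Fit to the untransformed data, kept only for comparison
raw = fit_line(x, y)

# Step 4: transform the data in a new list (squaring x is one option from the circle)
x_squared = x**2

# Step 5: linear regression on the transformed data
transformed = fit_line(x_squared, y)

# Steps 6-7: compare r^2 and look at the residuals for any pattern
for label, (a, b, r, r2, res) in [("raw x", raw), ("x^2", transformed)]:
    print(f"{label}: y = {a:.4f} + ({b:.4f})*u, r = {r:.4f}, r^2 = {r2:.4f}")
    print("  residuals:", np.round(res, 2))
```

A residual list whose signs run in long blocks (all positive, then all negative, then all positive) suggests a curve remains. Signs that alternate without a pattern support the assumption of linearity.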

Squared transformation

Consider the following data set:

x    0     1.1   2.2   4     6     8     10
y    39.8  44.3  42.3  38.7  30.8  20.8  0.4

Enter this data into the calculator and examine the value of the correlation coefficient (for a linear fit) and the shape of the scatter plot:

A CAS calculator menu to calculate the value of the correlation coefficient. Ask your teacher for more information.

As can be seen, the correlation coefficient suggests a strong negative linear relationship (r=-0.9145), but the scatter plot looks nonlinear: less like a line and more like a curve. To become more linear, the data would need to be stretched out from the curve into a straight line. This can be done by squaring either the x-values or the y-values. For this example, try squaring the x-values. This is a squared transformation.

The data now looks like this:

x    0     1.1   2.2   4     6     8     10
x^2  0     1.21  4.84  16    36    64    100
y    39.8  44.3  42.3  38.7  30.8  20.8  0.4

Now fit a linear regression to this new data, analysing the correlation coefficient:

A CAS calculator menu to calculate the linear regression. Ask your teacher for more information.
A CAS calculator menu to calculate the linear regression and its line of best fit. Ask your teacher for more information.

Notice how the data has now been stretched into a more linear shape. The correlation has become stronger, so this least-squares regression line can now be used to predict values of y.

The new transformed least-squares line could be written as y=43.9879-0.4090\left(x^2\right).

To predict a value of y for a given x-value, remember to square the x-value as this transformation uses x^2, then substitute into the least-squares regression. So for x=5:

\begin{aligned}y&=43.9879-0.4090\times 25 \\ y &= 33.7629\end{aligned}
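The figures in this section can be checked outside a calculator. The sketch below assumes NumPy and uses the seven data points from the table above. The function name `predict` is illustrative only. The printed values should agree closely with the lesson's r, regression equation and prediction for x=5.

```python
import numpy as np

x = np.array([0, 1.1, 2.2, 4, 6, 8, 10])
y = np.array([39.8, 44.3, 42.3, 38.7, 30.8, 20.8, 0.4])

# Correlation for a straight-line fit to the untransformed data
r_raw = np.corrcoef(x, y)[0, 1]

# Squared transformation of the explanatory variable
x_sq = x**2
r_sq = np.corrcoef(x_sq, y)[0, 1]

# Least-squares line y = intercept + slope * x^2
slope, intercept = np.polyfit(x_sq, y, 1)

def predict(x_value):
    """Square the x-value first, then substitute into the regression line."""
    return intercept + slope * x_value**2

print(f"r before transformation: {r_raw:.4f}")
print(f"r after squaring x:      {r_sq:.4f}")
print(f"y = {intercept:.4f} + ({slope:.4f}) * x^2")
print(f"prediction at x = 5:     {predict(5):.4f}")
```

Note that `predict` takes the original x-value and squares it internally, which mirrors the reminder above to square x before substituting.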

Logarithmic transformation

In general, if data has the following shape, a logarithmic transformation will be appropriate:

A scatterplot showing a logarithmic shape. Ask your teacher for more information.

A logarithmic transformation requires taking \log_{10}(x) of all the x-values, and then following the same procedure as above. This transformation is a compressing transformation, that is, it "squeezes" the data points together into a more linear shape.

The equation of the transformed least-squares line would then be: y=a+b\log _{10}\left(x\right)
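As a rough illustration of the form y=a+b\log_{10}(x), the sketch below uses made-up values that are not from the lesson. The y-values are generated exactly from y=2+3\log_{10}(x), so the fit should recover a=2 and b=3. It assumes NumPy. All names are hypothetical.

```python
import numpy as np

# Hypothetical data: y is generated exactly from y = 2 + 3*log10(x)
x = np.array([1.0, 10.0, 100.0, 1000.0])
y = 2 + 3 * np.log10(x)

# Logarithmic transformation: replace each x-value with log10(x)
log_x = np.log10(x)

# Least-squares line through (log10(x), y)
b, a = np.polyfit(log_x, y, 1)
r = np.corrcoef(log_x, y)[0, 1]

def predict(x_value):
    """Take log10 of the x-value before substituting into y = a + b*log10(x)."""
    return a + b * np.log10(x_value)

print(f"y = {a:.2f} + {b:.2f} * log10(x), r = {r:.4f}")
print(f"prediction at x = 100: {predict(100.0):.2f}")
```

With real measurements the points will not sit exactly on the curve, so r will be below 1 and the residual plot still needs checking.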

Reciprocal transformation

Likewise, in general, if data has the following shape, a reciprocal transformation could be used, replacing each x-value with \dfrac{1}{x}.

A scatterplot showing a reciprocal shape. Ask your teacher for more information.

The reciprocal transformation is a stretching transformation. The equation of the transformed least-squares line would then be given by: y=a+b\dfrac{1}{x}
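The reciprocal form can be illustrated the same way. The values below are invented for the sketch and are not from the lesson. They are generated exactly from y=1+\dfrac{4}{x}, so the fitted line should return a=1 and b=4. NumPy is assumed.

```python
import numpy as np

# Hypothetical data: y is generated exactly from y = 1 + 4/x
x = np.array([1.0, 2.0, 4.0, 5.0, 8.0, 10.0])
y = 1 + 4 / x

# Reciprocal transformation: replace each x-value with 1/x
recip_x = 1 / x

# Least-squares line through (1/x, y)
b, a = np.polyfit(recip_x, y, 1)
r = np.corrcoef(recip_x, y)[0, 1]

def predict(x_value):
    """Take the reciprocal of the x-value before substituting into y = a + b*(1/x)."""
    return a + b * (1 / x_value)

print(f"y = {a:.2f} + {b:.2f} * (1/x), r = {r:.4f}")
print(f"prediction at x = 20: {predict(20.0):.2f}")
```

The reciprocal is undefined at x=0, so this transformation cannot be applied to a data set that contains a zero x-value.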

Examples

Example 1

Consider the data in the table below:

x    1   2   3   4   5   6   7   8
y    3   5   8   12  22  30  50  80

This data has the following scatterplot:

A scatterplot showing an increasing trend. Ask your teacher for more information.

Determine the best transformation for this data to linearity.

Worked Solution
Create a strategy

Use the data below to determine which transformation best suits the given scatterplot:

The image shows three examples of transformation summarised in the tables. Ask your teacher for more information.
Apply the idea

The tables show that the logarithmic transformation suits this data best: it gives the most linear scatterplot and the correlation coefficient r closest to 1. The coefficient of determination r^2 is also higher for this transformation than for the other two.

To test the assumption of linearity, check the residual plot for the logarithmic transformation.

A residual plot. Ask your teacher for more information.

Since there is no pattern to the residual plot, this supports the assumption of linearity. So, the logarithmic transformation is the best transformation to linearity for this data.
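The comparison can be reproduced numerically. The lesson's summary tables are images and are not reproduced here, so the candidate list below is an assumption. In particular, the sketch assumes the logarithmic transformation is applied to the y-values (\log_{10}(y)), which is the reading consistent with this data's rapidly increasing shape. NumPy is assumed, and the dictionary labels are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3, 5, 8, 12, 22, 30, 50, 80], dtype=float)

# Candidate transformations: (explanatory values, response values)
candidates = {
    "x^2 vs y": (x**2, y),
    "log10(x) vs y": (np.log10(x), y),
    "1/x vs y": (1 / x, y),
    "x vs log10(y)": (x, np.log10(y)),   # assumed reading of the "logarithmic" option
}

results = {}
for name, (u, v) in candidates.items():
    r = np.corrcoef(u, v)[0, 1]
    results[name] = r
    print(f"{name:15s} r = {r:7.4f}, r^2 = {r**2:.4f}")

# The strongest linear association has |r| closest to 1
best = max(results, key=lambda name: abs(results[name]))
print("strongest linear association:", best)

# Residuals for the best candidate, to check for a leftover pattern
u, v = candidates[best]
slope, intercept = np.polyfit(u, v, 1)
print("residuals:", np.round(v - (intercept + slope * u), 3))
```

A high r on its own is not enough, which is why the residuals are printed as well: they should scatter around zero without a clear curve.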

Example 2

Consider the following data set.

x    1   1.8  2.2   3   4   4.5  5.8    6.2    7    9
y    1   10   16.4  37  53  79   129.6  155.8  187  321
a

Use technology to create a scatterplot of this data. The shape of the data appears to be:

A
Parabolic
B
Reciprocal
C
Logarithmic
Worked Solution
Create a strategy

Use a graphing calculator or other technology to examine the shape of the scatterplot.

Apply the idea

Using the Spreadsheets mode, enter each x-value along with its y-value into a data table on your calculator, then find the graph.

A scatterplot of the data, with x from 1 to 9 on the horizontal axis and y from 50 to 300 on the vertical axis.

The shape of the data appears to be parabolic, as it has an increasing gradient. So the correct answer is A.

b

Which transformation should we perform to linearise the data?

A
Find the reciprocal of the x-values \left(\dfrac{1}{x}\right)
B
Find the log of the x-values \left(\log x\right)
C
Square the x-values \left(x^2\right)
Worked Solution
Create a strategy

Use the answer in part (a) to determine the required transformation.

Apply the idea

Since the data fits a parabolic curve \left(y=\pm x^{2} \right) then the x-values must be squared. So the correct answer is C.

c

Complete the transformation of the x-values of the data and fill in this table.

x^2
y    1   10   16.4  37  53  79   129.6  155.8  187  321
Worked Solution
Create a strategy

Square each value of x from the original data table.

Apply the idea

Square the first x-value:

\begin{aligned}x^2&=1^2 && \text{Square 1} \\ &=1 && \text{Evaluate}\end{aligned}

Square the second x-value:

\begin{aligned}x^2&=1.8^2 && \text{Square 1.8} \\ &=3.24 && \text{Evaluate}\end{aligned}

By squaring the rest of the x-values, we have the complete transformation of the data:

x^2  1   3.24  4.84  9   16  20.25  33.64  38.44  49   81
y    1   10    16.4  37  53  79     129.6  155.8  187  321
d

Find the correlation coefficient of the transformed data set. Give your answer to two decimal places.

Worked Solution
Create a strategy

Use the linear regression function on your calculator.

Apply the idea

Using the Statistics mode, enter the values in the table from part (c) on your calculator then find the linear regression.

Look for the correlation coefficient (r): r=1.00 (to two decimal places)

e

Find the least squares regression line of your transformed data set. Give your answer in the form y=ax^2+b, where a and b are to one decimal place.

Worked Solution
Create a strategy

The values of a and b should be available from the same place on your calculator where you found the correlation coefficient r from the previous part.

Apply the idea

The calculator gives the line in the form y=a+bx, where x here stands for the transformed variable x^2: y=-3.2+4x^2. In the requested form this is y=4.0x^2-3.2.
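Parts (c) to (e) can be cross-checked with a short script. The sketch assumes NumPy and reads the data table as ten (x, y) pairs. The variable names `a` and `b` follow the form y=ax^2+b requested in part (e), so `a` multiplies x^2 and `b` is the constant term.

```python
import numpy as np

x = np.array([1, 1.8, 2.2, 3, 4, 4.5, 5.8, 6.2, 7, 9])
y = np.array([1, 10, 16.4, 37, 53, 79, 129.6, 155.8, 187, 321])

# Part (c): squared transformation of the x-values
x_sq = x**2
print("x^2:", np.round(x_sq, 2))

# Part (d): correlation coefficient of the transformed data
r = np.corrcoef(x_sq, y)[0, 1]
print(f"r = {r:.2f}")

# Part (e): least-squares line y = a*x^2 + b (polyfit returns the slope first)
a, b = np.polyfit(x_sq, y, 1)
print(f"y = {a:.1f}x^2 + ({b:.1f})")
```

The rounding in the print statements matches the precision each part asks for: two decimal places for r and one decimal place for a and b.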

f

Predict the value for y when x=70. Enter your answer to two decimal places.

Worked Solution
Create a strategy

Use the regression line we calculated in part (e) to make the prediction.

Apply the idea

Substitute x=70 into the equation of the regression line y=-3.2+4x^2.

\begin{aligned}y&=-3.2+4(70)^2 && \text{Substitute } x=70 \\ &=19\,596.80 && \text{Evaluate using your calculator}\end{aligned}
g

Choose the description which best describes the validity of this prediction.

A
Despite an interpolated prediction, unreliable due to a moderate to weak correlation.
B
Very unreliable due to extrapolation and a moderate to weak correlation.
C
Reliable due to interpolation and a strong correlation.
D
Despite a strong correlation, unreliable due to extrapolation.
Worked Solution
Create a strategy

Check if x=70 lies within the original range of x-values.

Apply the idea

The original x-values range from 1 to 9. So, the prediction in part (f) is an extrapolation, since x=70 lies outside this range.

In part (d), we found that the correlation coefficient is close to 1, so there is a strong correlation between the two variables.

So the correct answer is D.
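The prediction in part (f) and the range check in part (g) can be written as a small sketch. The function names are hypothetical, and the coefficients are the rounded values from part (e). Only plain Python is used.

```python
# Observed x-values from the original data table
x_values = [1, 1.8, 2.2, 3, 4, 4.5, 5.8, 6.2, 7, 9]

def predict(x_value, a=4.0, b=-3.2):
    """Regression line from part (e): y = a*x^2 + b, using the rounded coefficients."""
    return a * x_value**2 + b

def is_extrapolation(x_value, observed):
    """True when x_value lies outside the range of the observed x-values."""
    return x_value < min(observed) or x_value > max(observed)

x_new = 70
print(f"predicted y at x = {x_new}: {predict(x_new):.2f}")
print(f"observed x range: {min(x_values)} to {max(x_values)}")
print("extrapolation" if is_extrapolation(x_new, x_values) else "interpolation")
```

The check is made on the original x-values rather than on x^2. Since every x-value here is positive, comparing the squared values would give the same verdict.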

Idea summary

The transformation circle below shows different possible shapes of data and some possible transformations to linearity:

The image shows different possible transformations and their respective equations. Ask your teacher for more information.

Outcomes

U3.AoS1.12

data transformation and its purpose

U3.AoS1.27

construct a residual analysis to test the assumption of linearity and, in the case of clear non-linearity, transform the data to achieve linearity and repeat the modelling process using the transformed data
