
1.03 Associations between categorical variables

Lesson

Row and column percentages

Two-way tables allow us to display and examine the relationship between two categorical variables. The categories are labelled along the top and down the left side of the table, and the frequency of each combination of categories appears in the interior of the table. Often the totals of each row and column are also shown.

The following are the statistics of the passengers and crew who sailed on the Titanic on its fateful maiden voyage in 1912.

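| | First Class | Second Class | Third Class | Crew | Total |
|---|---|---|---|---|---|
| Survived | 202 | 118 | 178 | 212 | 710 |
| Died | 123 | 167 | 528 | 696 | 1514 |
| Total | 325 | 285 | 706 | 908 | 2224 |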

Although it is interesting to know that 202 First Class passengers survived, it is far more useful to know the percentage breakdown of survivors by class. To do this we need to calculate row percentages. To find a row percentage, divide the value in the table by the row total and multiply by 100\%. The row percentages are shown in the table below, with working shown in some of the cells.

| | First Class | Second Class | Third Class | Crew | Total |
|---|---|---|---|---|---|
| Survived | \dfrac{202}{710} \times 100\% \approx 28\% | \dfrac{118}{710} \times 100\% \approx 17\% | 25\% | 30\% | 100\% |
| Died | \dfrac{123}{1514} \times 100\% \approx 8\% | 11\% | 35\% | 46\% | 100\% |

This percentage frequency table now gives us more useful information than the raw data. The first row gives us the percentage breakdown of survivors by class. We can now easily read that 46\% of the people who died were crew members whereas only 8\% of the people who died were in first class.
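As a rough check of the row percentages above, the calculation can be sketched in Python. The names `classes`, `counts` and `row_percentages` are chosen here for illustration only and are not part of the lesson.

```python
# Titanic counts from the two-way table (totals are computed, not stored).
classes = ["First Class", "Second Class", "Third Class", "Crew"]
counts = {
    "Survived": [202, 118, 178, 212],
    "Died": [123, 167, 528, 696],
}

def row_percentages(table):
    """Divide each count by its row total and multiply by 100."""
    result = {}
    for label, row in table.items():
        total = sum(row)
        # Round to the nearest whole percent, as in the lesson's table.
        result[label] = [round(100 * value / total) for value in row]
    return result

row_pct = row_percentages(counts)
for label, row in row_pct.items():
    print(label, dict(zip(classes, row)))
```

Each row of the result adds to 100 because every count is divided by the total of its own row.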

We can also calculate the percentage of each class that survived or died. To do this we calculate column percentages. To find a column percentage, divide the value in the table by the column total and multiply by 100\%. The column percentages are shown in the table below, with working shown in some of the cells.

| | First Class | Second Class | Third Class | Crew |
|---|---|---|---|---|
| Survived | \dfrac{202}{325} \times 100\% \approx 62\% | \dfrac{118}{285} \times 100\% \approx 41\% | 25\% | 23\% |
| Died | \dfrac{123}{325} \times 100\% \approx 38\% | 59\% | 75\% | 77\% |
| Total | 100\% | 100\% | 100\% | 100\% |

This percentage frequency table now gives us more useful information. The first column gives us the percentage breakdown of survivals and deaths in first class. From the raw data we can see that a similar number of first class (202) and third class (178) passengers survived. However this can be misleading. The percentage frequency table shows us that 62\% of first class passengers survived whereas only 25\% of third class passengers survived.
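The column percentages can be checked in the same way. This is a minimal sketch, and the helper name `column_percentages` is an assumption made for this example.

```python
# Titanic counts in the order First Class, Second Class, Third Class, Crew.
survived = [202, 118, 178, 212]
died = [123, 167, 528, 696]

def column_percentages(rows):
    """Divide each count by its column total and multiply by 100."""
    # zip(*rows) groups the counts column by column.
    totals = [sum(column) for column in zip(*rows)]
    return [
        [round(100 * value / total) for value, total in zip(row, totals)]
        for row in rows
    ]

survived_pct, died_pct = column_percentages([survived, died])
print("Survived:", survived_pct)
print("Died:    ", died_pct)
```

Here each column adds to 100, because every count is divided by the total of its own class.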

Examples

Example 1

Maria surveyed a group of people about the type of job they had. She recorded the data in the following graph.

A graph that shows the number of men and women working types of jobs. Ask your teacher for more information.
a

Complete the following two-way table displaying the row percentages. Give your answers correct to one decimal place.

| | None | Casual | Part-time | Full-time | Total |
|---|---|---|---|---|---|
| Men | 17.6\% | ⬚\% | 29.4\% | 41.2\% | 100\% |
| Women | 22.2\% | 27.8\% | ⬚\% | ⬚\% | 100\% |
Worked Solution
Create a strategy

Create a table for the data. Then for each missing value, divide the value in the table by the row total and multiply by 100\%.

Apply the idea

From the graph we can make the following two-way table:

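| | None | Casual | Part-time | Full-time | Total |
|---|---|---|---|---|---|
| Men | 15 | 10 | 25 | 35 | 85 |
| Women | 20 | 25 | 15 | 30 | 90 |
| Total | 35 | 35 | 40 | 65 | |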

From the above table we can see that there were 10 casual men employees, and the row total for men is 85.

\begin{aligned}
\text{Casual men percent} &= \dfrac{10}{85} \times 100\% && \text{Find the row percentage} \\
&\approx 11.8\% && \text{Evaluate}
\end{aligned}

From the table we can see that there were 15 women part-time employees, and the row total for women is 90.

\begin{aligned}
\text{Part-time women percent} &= \dfrac{15}{90} \times 100\% && \text{Find the row percentage} \\
&\approx 16.7\% && \text{Evaluate}
\end{aligned}

From the table we can see that there were 30 full-time women employees, and the row total for women is 90.

\begin{aligned}
\text{Full-time women percent} &= \dfrac{30}{90} \times 100\% && \text{Find the row percentage} \\
&\approx 33.3\% && \text{Evaluate}
\end{aligned}

So the row percentage table is:

| | None | Casual | Part-time | Full-time | Total |
|---|---|---|---|---|---|
| Men | 17.6\% | 11.8\% | 29.4\% | 41.2\% | 100\% |
| Women | 22.2\% | 27.8\% | 16.7\% | 33.3\% | 100\% |
b

Complete the following two-way table displaying the column percentages. Give your answers correct to one decimal place.

| | None | Casual | Part-time | Full-time |
|---|---|---|---|---|
| Men | ⬚\% | 28.6\% | 62.5\% | ⬚\% |
| Women | 57.1\% | ⬚\% | 37.5\% | ⬚\% |
| Total | 100\% | 100\% | 100\% | 100\% |
Worked Solution
Create a strategy

Use the table from part (a) and, for each missing value, divide the value in the table by the column total and multiply by 100\%.

Apply the idea
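| | None | Casual | Part-time | Full-time | Total |
|---|---|---|---|---|---|
| Men | 15 | 10 | 25 | 35 | 85 |
| Women | 20 | 25 | 15 | 30 | 90 |
| Total | 35 | 35 | 40 | 65 | |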

From the above table we can see that there were 15 men in the None category, and the None column total is 35.

\begin{aligned}
\text{None men percent} &= \dfrac{15}{35} \times 100\% && \text{Find the column percentage} \\
&\approx 42.9\% && \text{Evaluate}
\end{aligned}

From the above table we can see that there were 35 full-time men, and the full-time column total is 65.

\begin{aligned}
\text{Full-time men percent} &= \dfrac{35}{65} \times 100\% && \text{Find the column percentage} \\
&\approx 53.8\% && \text{Evaluate}
\end{aligned}

From the above table we can see that there were 25 casual women employees, and the casual column total is 35.

\begin{aligned}
\text{Casual women percent} &= \dfrac{25}{35} \times 100\% && \text{Find the column percentage} \\
&\approx 71.4\% && \text{Evaluate}
\end{aligned}

From the above table we can see that there were 30 full-time women employees, and the full-time column total is 65.

\begin{aligned}
\text{Full-time women percent} &= \dfrac{30}{65} \times 100\% && \text{Find the column percentage} \\
&\approx 46.2\% && \text{Evaluate}
\end{aligned}

So the column percentage table is:

| | None | Casual | Part-time | Full-time |
|---|---|---|---|---|
| Men | 42.9\% | 28.6\% | 62.5\% | 53.8\% |
| Women | 57.1\% | 71.4\% | 37.5\% | 46.2\% |
| Total | 100\% | 100\% | 100\% | 100\% |
Idea summary

To find the row percentage for a particular value, divide the value in the table by the row total and multiply by 100\%.

To find the column percentage for a particular value, divide the value in the table by the column total and multiply by 100\%.
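The two rules can be combined into one small helper. The sketch below uses the counts from Example 1. The function name `percentage_table` and its parameters `by` and `decimals` are assumptions made for illustration.

```python
def percentage_table(rows, by="row", decimals=1):
    """Convert counts to row or column percentages.

    rows: list of equal-length lists of counts (totals not included).
    by: "row" divides by row totals, "column" divides by column totals.
    """
    if by == "row":
        totals = [sum(row) for row in rows]
        return [
            [round(100 * value / total, decimals) for value in row]
            for row, total in zip(rows, totals)
        ]
    if by == "column":
        totals = [sum(column) for column in zip(*rows)]
        return [
            [round(100 * value / total, decimals)
             for value, total in zip(row, totals)]
            for row in rows
        ]
    raise ValueError("by must be 'row' or 'column'")

# Counts from Example 1 in the order None, Casual, Part-time, Full-time.
men = [15, 10, 25, 35]
women = [20, 25, 15, 30]

row_pct = percentage_table([men, women], by="row")
col_pct = percentage_table([men, women], by="column")
print(row_pct)
print(col_pct)
```

The only difference between the two branches is which total each count is divided by.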

Associations between variables

To find if there is an association between the variables in the Titanic table we can ask the question "Is survival rate dependent on the class of the passenger?"

To answer this we must first identify the explanatory variable. In this problem it is the class of passenger. The explanatory variable forms the column headings, so the column percentage frequency table will best indicate any patterns.

| | First Class | Second Class | Third Class | Crew |
|---|---|---|---|---|
| Survived | 62\% | 41\% | 25\% | 23\% |
| Died | 38\% | 59\% | 75\% | 77\% |
| Total | 100\% | 100\% | 100\% | 100\% |

When we read across the first row in the column percentage frequency table and look at the numbers we can see a clear difference in the percentages of passengers that lived or died in each class. This suggests there is an association between the class of passenger and the rate of survival. It appears that the higher the class of passenger, the higher the rate of survival.

Which percentage frequency table to use?

If the explanatory variable forms the column headings then we use the column percentage frequency table to look for association.

If the explanatory variable forms the row headings then we use the row percentage frequency table to look for association.

How do we tell if there is an association?

If it's a column percentage table then look across the rows for differences in the values. If the values are similar then we say there is NO clear association.

If it's a row percentage table then look down the columns for differences in the values. If the values are similar then we say there is NO clear association.
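One way to make "look across the rows for differences" concrete is to compute the gap between the largest and smallest percentage in each row. The sketch below does this for the Titanic column percentage table. The cut-off `SIMILAR_WITHIN` is a hypothetical value chosen for this example, since the lesson does not define how close "similar" has to be.

```python
# Column percentages for the Titanic data (explanatory variable: class).
column_pct = {
    "Survived": [62, 41, 25, 23],
    "Died": [38, 59, 75, 77],
}

# Hypothetical cut-off in percentage points; not defined by the lesson.
SIMILAR_WITHIN = 5

def row_spreads(table):
    """Largest minus smallest percentage across each row."""
    return {label: max(row) - min(row) for label, row in table.items()}

spreads = row_spreads(column_pct)
# Any row whose values differ by more than the cut-off hints at an association.
suggests_association = any(gap > SIMILAR_WITHIN for gap in spreads.values())
print(spreads, suggests_association)
```

A large gap only suggests an association; judging its significance needs the further statistical methods mentioned later in this lesson.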

How to describe the association?

First state if there is or is not an association apparent. For example: There appears to be an association between the variables.

Next describe the association. For example: The class of passenger affects the survival rate of the passengers.

Finally give an example. For example: The higher the class of passenger, the more likely the passenger was to survive.

An overview of the percentages in two-way tables can bring to light clear associations. The presence of more subtle associations and an objective measure of the significance of such associations requires additional analysis and methods from further studies in statistics.

Note: The term 'association' is used to describe a relationship between variables. An association does not mean that one variable causes the other to change, only that the values of one variable tend to differ as the other variable changes.

Examples

Example 2

Glen surveyed all the students in Year 12 at his school and summarised the results in the following table:

| | Play netball | Do not play netball | Total |
|---|---|---|---|
| \text{Height} \geq 170 \text{ cm} | 33 | 72 | 105 |
| \text{Height} \lt 170 \text{ cm} | 13 | 30 | 43 |
| \text{Total} | 46 | 102 | 148 |
a

Which variable is the explanatory variable?

A
Play netball
B
Height
Worked Solution
Create a strategy

Choose the variable that, when changed, may affect the other variable.

Apply the idea

A student's decision to play netball may depend on their height, whereas their height will not change whether or not they play netball. So height is the explanatory variable, option B.

b

To examine if there is an association between height and playing netball, should Glen use a column or row percentage frequency table?

Worked Solution
Create a strategy

In a percentage frequency table, we want the percentages within each category of the explanatory variable to sum to 100\%.

Apply the idea

We found in part (a) that height is the explanatory variable, and it forms the row headings. For the percentages within each height category to sum to 100\%, we should use a row percentage frequency table, option B.

c

Complete the row percentage frequency table for this data. Round your answers to the nearest percentage.

| | Play netball | Do not play netball | Total |
|---|---|---|---|
| \text{Height} \geq 170 \text{ cm} | ⬚\% | ⬚\% | ⬚\% |
| \text{Height} \lt 170 \text{ cm} | ⬚\% | ⬚\% | ⬚\% |
Worked Solution
Create a strategy

Divide each number of students by the total number of students in that row and multiply the answer by 100\%.

Apply the idea

Here is the original table with the row totals:

| | Play netball | Do not play netball | Total |
|---|---|---|---|
| \text{Height} \geq 170 \text{ cm} | 33 | 72 | 105 |
| \text{Height} \lt 170 \text{ cm} | 13 | 30 | 43 |

Here is the table with the calculations for the row percentages:

| | Play netball | Do not play netball | Total |
|---|---|---|---|
| \text{Height} \geq 170 \text{ cm} | \dfrac{33}{105} \times 100\% \approx 31\% | \dfrac{72}{105} \times 100\% \approx 69\% | 100\% |
| \text{Height} \lt 170 \text{ cm} | \dfrac{13}{43} \times 100\% \approx 30\% | \dfrac{30}{43} \times 100\% \approx 70\% | 100\% |
d

Looking at the columns of the completed table, does there appear to be an association between height and playing netball?

A
No, there does not appear to be any association as the numbers are similar.
B
Yes, there appears to be an association as the numbers are quite different. It seems that taller people like to play netball.
Worked Solution
Create a strategy

In a row percentage table, if the values are similar then we say there is NO clear association.

Apply the idea

The completed row percentage table is shown:

| | Play netball | Do not play netball | Total |
|---|---|---|---|
| \text{Height} \geq 170 \text{ cm} | 31\% | 69\% | 100\% |
| \text{Height} \lt 170 \text{ cm} | 30\% | 70\% | 100\% |

The numbers in each column are quite similar, with just a 1\% difference in values. So we can say that there does not appear to be any association between height and playing netball, option A.
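The comparison in this example can be reproduced from the raw counts. This is a minimal sketch, and the names `netball`, `row_pct` and `column_gaps` are chosen here for illustration.

```python
# Counts from Glen's survey: [play netball, do not play netball].
netball = {
    "Height >= 170 cm": [33, 72],
    "Height < 170 cm": [13, 30],
}

# Row percentages, rounded to the nearest percent as the question asks.
row_pct = {
    label: [round(100 * value / sum(row)) for value in row]
    for label, row in netball.items()
}

# Look down each column: difference between the two height groups.
column_gaps = [abs(tall - short) for tall, short in zip(*row_pct.values())]
print(row_pct)
print(column_gaps)
```

Both columns differ by only one percentage point between the two height groups, which matches the conclusion that there is no clear association.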

Idea summary
  • If the explanatory variable forms the column headings then we use the column percentage frequency table to look for association.

    • Then look across the rows for differences in the values. If the values are similar then we say there is NO clear association.

  • If the explanatory variable forms the row headings then we use the row percentage frequency table to look for association.

    • Then look down the columns for differences in the values. If the values are similar then we say there is NO clear association.

To describe the association:

  1. State whether there is or is not an association apparent.

  2. Describe the association. e.g. The class of passenger affects the survival rate of the passengers.

  3. Give an example. e.g. The higher the class of passenger, the more likely the passenger was to survive.

100\% stacked column graphs

Association between variables can often be seen more clearly in a stacked column graph. Below is a stacked column graph (also called segmented column graph) for the data from the Titanic table earlier. When we look at each column we can see the proportion of death to survival in each column is different. This indicates there is an association between the variables.

A stacked column graph showing death and survival rates in each class on the Titanic. Ask your teacher for more information.

If there is no association, then the proportions of the sections are the same in every column. When we look at the graph below we can see that each column is divided into sections of similar size. This indicates there is NO clear association between household composition and distribution of money.

A stacked column graph which shows the distribution of money in each household. Ask your teacher for more information.

How to draw a stacked column graph?

  • Label the horizontal axis with the categories of the explanatory variable.

  • Label the vertical axis with percentages from 0\% to 100\%.

  • Draw a column for each category of the explanatory variable that reaches the height of 100\% on the vertical axis.

  • To divide each column into the percentages shown in the frequency table, start from the bottom of the column, count up to the first percentage and draw a horizontal line to mark it off. Then count up by the second percentage from that line and mark off again, continuing until all sections are complete.

  • Write the label and percentage in each section of the columns to indicate which category of the response variable it displays, or provide a key.
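The "count up from the bottom" step amounts to keeping a running total of the percentages in a column. The sketch below computes where each section starts and ends, using the First Class column of the Titanic table. The function name `segment_boundaries` is an assumption made for this example, and no graph is actually drawn.

```python
def segment_boundaries(percentages):
    """Return (bottom, top) for each section, stacking upward from 0%."""
    boundaries = []
    bottom = 0
    for pct in percentages:
        top = bottom + pct          # count up by this section's percentage
        boundaries.append((bottom, top))
        bottom = top                # the next section starts at this line
    return boundaries

# First Class column percentages: Survived 62%, Died 38%.
first_class = segment_boundaries([62, 38])
print(first_class)
```

If the percentages in a column are correct, the last section ends at 100, which is a quick check before drawing.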

Examples

Example 3

A group of year 12 students surveyed their class and recorded the hair colour and eye colour for each student. The results are displayed in the 100\% stacked column chart shown below.

A stacked column graph for hair colour for various eye colours. Ask your teacher for more information.
a

What is the explanatory variable for this chart?

A
Eye colour
B
Hair colour
Worked Solution
Create a strategy

In a stacked column graph, the explanatory variable is placed along the horizontal axis.

Apply the idea

The stacked column graph has the eye colour placed along the horizontal axis. So it is the explanatory variable, option A.

b

Does the chart suggest an association between eye colour and hair colour?

A
Yes, as the corresponding segments are similar in size.
B
Yes, as the corresponding segments are of different sizes.
C
No, as the corresponding segments are similar in size.
D
No, as the corresponding segments are of different sizes.
Worked Solution
Create a strategy

Check whether the proportion of the segments in each column are similar sizes.

Apply the idea

The corresponding segments of the stacked column graph differ in size, which means that it does suggest an association between eye colour and hair colour, option B.

c

Can we say that having blue eyes causes a high chance of having blonde hair?

A
Yes. The data shows that students with blue eyes are more likely to have blonde hair.
B
No. There appears to be an association, but we cannot say that one causes the other.
Worked Solution
Create a strategy

Consider the difference between association and causation.

Apply the idea

An association does not mean that one variable causes the other to change, only that the values of one variable tend to differ as the other variable changes.

The chance of having blonde hair can depend on many factors. Blue eyes and blonde hair may tend to occur together, but we cannot say that having blue eyes causes blonde hair. So the correct answer is option B.

Idea summary

How to draw a stacked column graph?

  • Label the horizontal axis with the categories of the explanatory variable.

  • Label the vertical axis with percentages from 0\% to 100\%.

  • Draw a column for each category of the explanatory variable that reaches the height of 100\% on the vertical axis.

  • To divide each column into the percentages, start from the bottom of the column, count up to the first percentage and draw a horizontal line to mark it off. Then count up by the second percentage from that line and mark off again, continuing until all sections are complete.

  • Write the label and percentage in each section of the columns to indicate which category of the response variable it displays, or provide a key.

Outcomes

ACMGM049

construct two-way frequency tables and determine the associated row and column sums and percentages

ACMGM050

use an appropriately percentaged two-way frequency table to identify patterns that suggest the presence of an association

ACMGM051

describe an association in terms of differences observed in percentages across categories in a systematic and concise manner, and interpret this in the context of the data

ACMGM055

identify the response variable and the explanatory variable
