
1.02 Associations between categorical variables

Lesson

Analysing two-way frequency tables: row and column percentages

Two-way tables allow us to display and examine the relationship between two sets of categorical data. The categories are labelled along the top and down the left side of the table, and the frequencies of the different combinations of categories appear in the interior of the table. Often the totals of each row and column are also shown.

The following are the statistics of the passengers and crew who sailed on the Titanic on its fateful maiden voyage in 1912.

| | First Class | Second Class | Third Class | Crew | Total |
|---|---|---|---|---|---|
| Survived | $202$ | $118$ | $178$ | $212$ | $710$ |
| Died | $123$ | $167$ | $528$ | $696$ | $1514$ |
| Total | $325$ | $285$ | $706$ | $908$ | $2224$ |
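The row and column totals in the table above can be checked with a short script. This is a minimal sketch only: storing the table as a nested dictionary, and the names `counts`, `row_totals` and `column_totals`, are choices made for illustration.

```python
# Titanic counts from the table above: rows are outcomes, columns are classes.
counts = {
    "Survived": {"First Class": 202, "Second Class": 118, "Third Class": 178, "Crew": 212},
    "Died": {"First Class": 123, "Second Class": 167, "Third Class": 528, "Crew": 696},
}

# Row totals: add across each row.
row_totals = {outcome: sum(row.values()) for outcome, row in counts.items()}

# Column totals: add down each column.
classes = list(counts["Survived"])
column_totals = {c: sum(counts[outcome][c] for outcome in counts) for c in classes}

# The grand total can be reached from the rows or the columns, which is a handy check.
grand_total = sum(row_totals.values())

print(row_totals)     # {'Survived': 710, 'Died': 1514}
print(column_totals)  # {'First Class': 325, 'Second Class': 285, 'Third Class': 706, 'Crew': 908}
print(grand_total)    # 2224
```

Adding the row totals and adding the column totals should give the same grand total. If they differ, a value has been copied incorrectly.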

 

 

Row percentages table

Although it is interesting to know that $202$ First Class passengers survived, it is far more useful to know the percentage breakdown of survivors by class. To do this we need to calculate row percentages. To find each percentage, divide the value in the table by the row total and multiply by $100\%$. The calculations for the first row in the table are shown below.

Think: We need the fraction of first class survivors out of total number of survivors (row total) rounded to the nearest percentage.

Percentage of survivors that were first class $= \frac{\text{first class survivors}}{\text{total number of survivors}} \times 100\%$

$= \frac{202}{710} \times 100\%$

$\approx 28\%$
 
| | First Class | Second Class | Third Class | Crew | Total |
|---|---|---|---|---|---|
| Survived | $\frac{202}{710}\times100\%\approx28\%$ | $\frac{118}{710}\times100\%\approx17\%$ | $\frac{178}{710}\times100\%\approx25\%$ | $\frac{212}{710}\times100\%\approx30\%$ | $100\%$ |
| Died | $\frac{123}{1514}\times100\%\approx8\%$ | $11\%$ | $35\%$ | $46\%$ | $100\%$ |

Reflect: This percentage frequency table now gives us more useful information than the raw data. The first row gives us the percentage breakdown of survivors by class. We can now easily read that $46\%$ of the people who died were crew members, whereas only $8\%$ of the people who died were in first class.
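The row percentage calculation can be sketched in code as follows. The function name `row_percentages` and the nested-dictionary layout are illustrative assumptions, not part of the lesson. Each cell is divided by its own row total, multiplied by 100 and rounded to the nearest per cent.

```python
# Titanic counts: rows are outcomes, columns are classes.
counts = {
    "Survived": {"First Class": 202, "Second Class": 118, "Third Class": 178, "Crew": 212},
    "Died": {"First Class": 123, "Second Class": 167, "Third Class": 528, "Crew": 696},
}


def row_percentages(table):
    """Divide each cell by its row total, multiply by 100 and round to the nearest per cent."""
    result = {}
    for label, row in table.items():
        total = sum(row.values())  # the row total, e.g. 710 for "Survived"
        result[label] = {key: round(100 * value / total) for key, value in row.items()}
    return result


row_pct = row_percentages(counts)
print(row_pct["Survived"])  # {'First Class': 28, 'Second Class': 17, 'Third Class': 25, 'Crew': 30}
print(row_pct["Died"])      # {'First Class': 8, 'Second Class': 11, 'Third Class': 35, 'Crew': 46}
```

Each row of rounded percentages happens to add to exactly 100 here. In other data sets rounding can make a row add to 99 or 101.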

 

Column percentages table

We can also calculate the percentage of each class that survived or died. To do this we calculate column percentages. To find each percentage, divide the value in the table by the column total and multiply by $100\%$. The calculations are shown in the table below:

| | First Class | Second Class | Third Class | Crew |
|---|---|---|---|---|
| Survived | $\frac{202}{325}\times100\%\approx62\%$ | $\frac{118}{285}\times100\%\approx41\%$ | $\frac{178}{706}\times100\%\approx25\%$ | $\frac{212}{908}\times100\%\approx23\%$ |
| Died | $\frac{123}{325}\times100\%\approx38\%$ | $59\%$ | $75\%$ | $77\%$ |
| Total | $100\%$ | $100\%$ | $100\%$ | $100\%$ |

Reflect: This percentage frequency table now gives us more useful information. The first column gives us the percentage breakdown of survivals and deaths in first class. From the raw data we can see that a similar number of first class ($202$) and third class ($178$) passengers survived. However, this can be misleading. The percentage frequency table shows us that $62\%$ of first class passengers survived, whereas only $25\%$ of third class passengers survived.
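The column percentage calculation follows the same pattern, except that each cell is divided by its column total. The sketch below is illustrative only, and the function name `column_percentages` is an assumption.

```python
# Titanic counts: rows are outcomes, columns are classes.
counts = {
    "Survived": {"First Class": 202, "Second Class": 118, "Third Class": 178, "Crew": 212},
    "Died": {"First Class": 123, "Second Class": 167, "Third Class": 528, "Crew": 696},
}


def column_percentages(table):
    """Divide each cell by its column total, multiply by 100 and round to the nearest per cent."""
    # First add down each column to get the column totals.
    column_totals = {}
    for row in table.values():
        for key, value in row.items():
            column_totals[key] = column_totals.get(key, 0) + value
    # Then express every cell as a percentage of its own column total.
    return {
        label: {key: round(100 * value / column_totals[key]) for key, value in row.items()}
        for label, row in table.items()
    }


col_pct = column_percentages(counts)
print(col_pct["Survived"])  # {'First Class': 62, 'Second Class': 41, 'Third Class': 25, 'Crew': 23}
print(col_pct["Died"])      # {'First Class': 38, 'Second Class': 59, 'Third Class': 75, 'Crew': 77}
```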

 

Practice question

Question 1

Maria surveyed a group of people about the type of job they had. She recorded the data in the following graph.

  1. Complete the following two-way table displaying the row percentage.

    Give your answers correct to one decimal place.

    | | None | Casual | Part-time | Full-time | Total |
    |---|---|---|---|---|---|
    | Men | $17.6\%$ | ____% | $29.4\%$ | $41.2\%$ | $100\%$ |
    | Women | $22.2\%$ | $27.8\%$ | ____% | ____% | $100\%$ |
  2. Complete the following two-way table displaying the column percentage.

    Give your answers correct to one decimal place.

    | | None | Casual | Part-time | Full-time |
    |---|---|---|---|---|
    | Men | ____% | $28.6\%$ | $62.5\%$ | ____% |
    | Women | $57.1\%$ | ____% | $37.5\%$ | ____% |
    | Total | $100\%$ | $100\%$ | $100\%$ | $100\%$ |

 

Finding the association between variables

To find if there is an association between the variables in the Titanic table we can ask the question "Is survival rate dependent on the class of the passenger?"

In order to find this we must first identify the explanatory variable. In this problem it is the class of passenger. The explanatory variable forms the heading of each column, therefore the column percentage frequency table will best indicate any patterns.

| | First Class | Second Class | Third Class | Crew |
|---|---|---|---|---|
| Survived | $62\%$ | $41\%$ | $25\%$ | $23\%$ |
| Died | $38\%$ | $59\%$ | $75\%$ | $77\%$ |
| Total | $100\%$ | $100\%$ | $100\%$ | $100\%$ |

When we read across the first row in the column percentage frequency table and look at the numbers we can see a clear difference in the percentages of passengers that lived or died in each class. This suggests there is an association between the class of passenger and the rate of survival. It appears that the higher the class of passenger, the higher the rate of survival.

Which percentage frequency table to use?

If the explanatory variable forms the column headings then we use the column percentage frequency table to look for association.

If the explanatory variable forms the row headings then we use the row percentage frequency table to look for association.
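The rule above can be sketched as a single function that chooses the denominator according to where the explanatory variable sits. The function `percentages` and its `explanatory_on` parameter are hypothetical names used only for this illustration. The example also shows that swapping the rows and columns of the Titanic table, so that class runs down the side, gives the same percentages once row percentages are used instead.

```python
def percentages(table, explanatory_on):
    """Percentage a two-way table of counts (dict of rows, each a dict of columns).

    explanatory_on: "columns" to use column totals, "rows" to use row totals.
    """
    if explanatory_on == "rows":
        totals = {r: sum(row.values()) for r, row in table.items()}
        return {r: {c: round(100 * v / totals[r]) for c, v in row.items()}
                for r, row in table.items()}
    if explanatory_on == "columns":
        totals = {}
        for row in table.values():
            for c, v in row.items():
                totals[c] = totals.get(c, 0) + v
        return {r: {c: round(100 * v / totals[c]) for c, v in row.items()}
                for r, row in table.items()}
    raise ValueError("explanatory_on must be 'rows' or 'columns'")


# Class (the explanatory variable) forms the column headings.
by_columns = {
    "Survived": {"First Class": 202, "Second Class": 118, "Third Class": 178, "Crew": 212},
    "Died": {"First Class": 123, "Second Class": 167, "Third Class": 528, "Crew": 696},
}

# The same data with class forming the row headings instead.
by_rows = {
    "First Class": {"Survived": 202, "Died": 123},
    "Second Class": {"Survived": 118, "Died": 167},
    "Third Class": {"Survived": 178, "Died": 528},
    "Crew": {"Survived": 212, "Died": 696},
}

print(percentages(by_columns, "columns")["Survived"]["First Class"])  # 62
print(percentages(by_rows, "rows")["First Class"]["Survived"])        # 62
```

Either way, each count is divided by the total for its category of the explanatory variable.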

 

How do we tell if there is an association?

If it's a column percentage table then look across the rows for differences in the values. If the values are similar then we say there is NO clear association.

If it's a row percentage table then look down the columns for differences in the values. If the values are similar then we say there is NO clear association.
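One way to make "look across the rows for differences" concrete is to compute, for each row of a column percentage table, the gap between the largest and smallest percentage. This is only an illustrative summary: the lesson gives no numeric cut-off for "similar", so the sketch reports the gap and leaves the judgement to the reader. The name `spread_across` is an assumption.

```python
# Column percentages for the Titanic data (class is the explanatory variable).
column_pct = {
    "Survived": {"First Class": 62, "Second Class": 41, "Third Class": 25, "Crew": 23},
    "Died": {"First Class": 38, "Second Class": 59, "Third Class": 75, "Crew": 77},
}


def spread_across(row):
    """Gap, in percentage points, between the largest and smallest value in a row."""
    return max(row.values()) - min(row.values())


spreads = {label: spread_across(row) for label, row in column_pct.items()}
print(spreads)  # {'Survived': 39, 'Died': 39}
```

A gap of 39 percentage points across the classes is large, which matches the conclusion that there appears to be an association. A gap of only a few percentage points would suggest no clear association.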

 

How to describe the association?

First state if there is or is not an association apparent. For example: There appears to be an association between the variables.

Next describe the association. For example: The class of passenger affects the survival rate of the passengers.

Finally give an example. For example: The higher the class of passenger, the more likely the passenger was to survive.

An overview of the percentages in two-way tables can bring clear associations to light. Detecting more subtle associations, or measuring objectively how significant an association is, requires additional analysis and methods from further studies in statistics.

Note: The term 'association' is used to describe a relationship between variables. An association does not mean that one variable causes the other variable to change, only that a change in one variable appears to go together with a change in the other.

 

Practice questions

Question 2

Glen surveyed all the students in Year $12$ at his school and summarised the results in the following table:

| | Play netball | Do not play netball | Total |
|---|---|---|---|
| Height $\ge 170$ cm | $33$ | $72$ | $105$ |
| Height $< 170$ cm | $13$ | $30$ | $43$ |
| Total | $46$ | $102$ | $148$ |

  1. Which variable is the explanatory variable?

    A. Play netball

    B. Height

  2. To examine if there is an association between height and playing netball, should Glen use a column or row percentage frequency table?

    A. Column

    B. Row

  3. Complete the row percentage frequency table for this data.

    Round your answers to the nearest percentage.

    | | Play netball | Do not play netball | Total |
    |---|---|---|---|
    | Height $\ge 170$ cm | ____% | ____% | ____% |
    | Height $< 170$ cm | ____% | ____% | ____% |

  4. Looking at the columns of the completed table, does there appear to be an association between height and playing netball?

    A. No, there does not appear to be any association as the numbers are similar.

    B. Yes, there appears to be an association as the numbers are quite different. It seems that taller people like to play netball.

Question 3

Members of a gym were asked what kind of training they do. Each responder only did one kind of training. The table shows the results.

| | Cardio | Weight |
|---|---|---|
| Male | $11$ | $26$ |
| Female | $46$ | $17$ |

  1. Which variable is the explanatory variable?

    A. Gender

    B. Type of training

  2. To examine if there is an association between the type of training and gender, should we use a column or row percentage frequency table?

    A. Row

    B. Column

  3. Complete the row percentage frequency table for this data.

    Round your answers to the nearest percentage.

    | | Cardio | Weight | Total |
    |---|---|---|---|
    | Male | ____% | ____% | ____% |
    | Female | ____% | ____% | ____% |

  4. Looking at the columns of the completed table, does there appear to be an association between the type of training and the gender of gym members?

    A. Yes, there appears to be an association as the numbers are quite different. Men seem to prefer weights, while women seem to prefer cardio.

    B. No, there does not appear to be any association as the numbers are different.

  5. Does a person’s gender cause them to choose a certain type of training?

    A. Yes. As we saw, women prefer to do cardio and men prefer to do weights.

    B. No, association is not causation. There appears to be an association but we cannot say whether one variable causes the other.

 

100% stacked column graphs

Association between variables can often be seen more clearly in a stacked column graph. Below is a stacked column graph (also called a segmented column graph) for the data from the Titanic table earlier. When we look at each column we can see that the proportion of deaths to survivals is different in each column. This indicates there is an association between the variables.

If there is no association then the proportions of the sections in each column are the same. When we look at the graph below we can see that each column is divided into sections of similar size. This indicates there is NO clear association between household composition and distribution of money.

How to draw a stacked column graph

Label the horizontal axis with the categories of the explanatory variable.

Label the vertical axis with percentages from $0\%$ to $100\%$.

Draw a column for each category of the explanatory variable that reaches the height of $100\%$ on the vertical axis.

Divide each column into the percentages shown in the percentage frequency table. Start from the bottom of the column, count up to the first percentage and draw a horizontal line to mark it off. Then count up by the second percentage from that line and mark off again. Continue until all sections are complete.

Write the label and percentage in each section of the columns to show which category of the response variable it represents, or provide a key.
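The "count up and mark off" step above amounts to a running total of the percentages in each column. The sketch below computes where each section starts and ends for the Titanic column percentages. It does not draw the graph, and the function name `segment_boundaries` is an assumption made for illustration.

```python
# Column percentages for the Titanic data, grouped by class (the explanatory variable).
column_pct = {
    "First Class": {"Survived": 62, "Died": 38},
    "Second Class": {"Survived": 41, "Died": 59},
    "Third Class": {"Survived": 25, "Died": 75},
    "Crew": {"Survived": 23, "Died": 77},
}


def segment_boundaries(percentages):
    """Return (label, bottom, top) for each section, stacking upward from 0%."""
    segments = []
    bottom = 0
    for label, pct in percentages.items():
        top = bottom + pct  # count up by this section's percentage
        segments.append((label, bottom, top))
        bottom = top        # the next section starts at this horizontal line
    return segments


for cls, pcts in column_pct.items():
    print(cls, segment_boundaries(pcts))
# First Class [('Survived', 0, 62), ('Died', 62, 100)]
# Second Class [('Survived', 0, 41), ('Died', 41, 100)]
# Third Class [('Survived', 0, 25), ('Died', 25, 100)]
# Crew [('Survived', 0, 23), ('Died', 23, 100)]
```

Each column's last section should finish at 100, which is a quick check that the percentages have been entered correctly.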

 

Practice questions

Question 4

$170$ people were surveyed about their music preference. The results have been recorded in the table below.

Musical Preferences by Gender

| Music Preference | Male | Female | Total |
|---|---|---|---|
| Rock & Roll | $24$ | $19$ | $43$ |
| Classical | $8$ | $15$ | $23$ |
| Pop | $17$ | $17$ | $34$ |
| Rap | $6$ | $2$ | $8$ |
| Country & Western | $17$ | $24$ | $41$ |
| R&B | $6$ | $9$ | $15$ |
| Punk | $4$ | $2$ | $6$ |
| Total | $82$ | $88$ | $170$ |

  1. What is the explanatory variable in this data set?

    A. Music preference

    B. Gender

  2. Which of the following $100\%$ stacked column charts should be used to look for an association between the variables?

    A. [chart not shown]

    B. [chart not shown]

  3. Does this stacked column chart suggest that there is an association between music preference and gender?

    A. No. The data does not suggest any association, as the corresponding segments are of similar sizes.

    B. Yes. The data suggests there is an association, as the corresponding segments are of similar sizes.

    C. No. The data does not suggest any association, as the corresponding segments are of different sizes.

    D. Yes. The data suggests there is an association, as the corresponding segments are of different sizes.

Question 5

A group of Year 12 students surveyed their class and recorded the hair colour and eye colour for each student. The results are displayed in the $100\%$ stacked column chart shown below.

  1. What is the explanatory variable for this chart?

    A. Eye colour

    B. Hair colour

  2. Does the chart suggest an association between eye colour and hair colour?

    A. Yes, as the corresponding segments are similar in size.

    B. Yes, as the corresponding segments are of different sizes.

    C. No, as the corresponding segments are similar in size.

    D. No, as the corresponding segments are of different sizes.

  3. Can we say that having blue eyes causes a high chance of having blonde hair?

    A. Yes. The data shows that students with blue eyes are more likely to have blonde hair.

    B. No. There appears to be an association, but we cannot say that one causes the other.

Outcomes

3.1.2

construct two-way frequency tables and determine the associated row and column sums and percentages

3.1.3

use an appropriately percentaged two-way frequency table to identify patterns that suggest the presence of an association

3.1.4

describe an association in terms of differences observed in percentages across categories in a systematic and concise manner, and interpret this in the context of the data

3.1.8

identify the response variable and the explanatory variable for primary and secondary data
