1.03 Associations between categorical variables

Lesson

Bivariate data: explanatory and response variables

The goal of bivariate data analysis is to see whether two variables are associated in some way. The two variables that we study in bivariate statistics are called the explanatory variable and the response variable.

Worked example

An experiment is to be conducted in which a researcher surveys men and women as to their preference for coffee or tea or neither. The researcher would then analyse the results to find whether gender is a predictor for hot beverage preference. Identify the explanatory and response variables in this experiment.

Think: To identify which is the explanatory variable and which is the response variable, we compare the questions "Does gender explain hot beverage preference?" and "Does hot beverage preference explain the gender of a person?". Only the first makes sense, and it matches the researcher's aim of using gender as a predictor.

Do: So we have that

Explanatory variable: Gender

Response variable: Hot beverage preference

Definitions

Explanatory Variable - the variable which we expect to explain or predict the value of the response variable.

Response Variable - the variable which we expect to respond to the value of the explanatory variable.

When displaying bivariate data graphically, the explanatory variable is plotted on the horizontal axis (the $x$-axis), and the response variable on the vertical axis (the $y$-axis).

A single coordinate point in a bivariate data set might be written in the form $\left(x,y\right)$, and it would be understood that $x$ is the explanatory variable and $y$ is the response variable.

Plotting explanatory and response variables

Explanatory variable - plotted on the horizontal $x$-axis

Response variable - plotted on the vertical $y$-axis

Practice questions

Question 1

Consider the following variables:

Temperature ($^\circ$C)

Number of ice cream cones sold

1. Which of the following statements makes sense?

A change in temperature affects the number of ice cream cones sold.

A

A change in the number of ice cream cones sold affects the temperature.

B

2. Which is the explanatory variable and which is the response variable?

EV: number of ice cream cones sold

RV: temperature

A

EV: temperature

RV: number of ice cream cones sold

B

Question 2

Which of the following images has the variables placed in the correct positions?

1. A

B

C

D

E

Question 3

The scatter plot shows the relationship between sea temperature and the amount of healthy coral.

1. Which variable is the response variable?

Sea temperature

A

Level of healthy coral

B

2. Which variable is the explanatory variable?

Sea temperature

A

Level of healthy coral

B

Analysing two-way frequency tables: row and column percentages

Two-way tables allow us to display and examine the relationship between two sets of categorical data. The categories are labelled at the top and the left side of the table, and the frequencies of the different characteristics appear in the interior of the table. Often the totals of each row and column are also shown.

The following are the statistics of the passengers and crew who sailed on the Titanic on its fateful maiden voyage in 1912.

| | First Class | Second Class | Third Class | Crew | Total |
|---|---|---|---|---|---|
| Survived | $202$ | $118$ | $178$ | $212$ | $710$ |
| Died | $123$ | $167$ | $528$ | $696$ | $1514$ |
| Total | $325$ | $285$ | $706$ | $908$ | $2224$ |

Row percentages table

Although it is interesting to know that $202$ First Class passengers survived, it is far more useful to know the percentage breakdown of survivors by class. To do this we need to calculate row percentages. To find each percentage, divide the value in the table by the row total and multiply by $100\%$. The calculations for the first row in the table are shown below.

Think: We need the fraction of first class survivors out of the total number of survivors (the row total), rounded to the nearest percent.

Percentage of survivors that were first class $= \frac{\text{first class survivors}}{\text{total number of survivors}} \times 100\% = \frac{202}{710} \times 100\% \approx 28\%$

| | First Class | Second Class | Third Class | Crew | Total |
|---|---|---|---|---|---|
| Survived | $\frac{202}{710}\times100\%\approx28\%$ | $\frac{118}{710}\times100\%\approx17\%$ | $\frac{178}{710}\times100\%\approx25\%$ | $\frac{212}{710}\times100\%\approx30\%$ | $100\%$ |
| Died | $\frac{123}{1514}\times100\%\approx8\%$ | $11\%$ | $35\%$ | $46\%$ | $100\%$ |

Reflect: This percentage frequency table now gives us more useful information than the raw data. The first row gives us the percentage breakdown of survivors by class. We can now easily read that $46\%$ of the people who died were crew members, whereas only $8\%$ of the people who died were in first class.
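
The row percentage calculation can also be sketched in a few lines of Python. This is only an illustration using the counts from the Titanic table; the function name `row_percentages` and the dictionary layout are assumptions made for this sketch, not part of the lesson.

```python
# Counts from the Titanic table (rows: outcome, columns: class).
classes = ["First Class", "Second Class", "Third Class", "Crew"]
counts = {
    "Survived": [202, 118, 178, 212],
    "Died": [123, 167, 528, 696],
}


def row_percentages(table):
    """Divide each entry by its row total and multiply by 100."""
    result = {}
    for label, row in table.items():
        row_total = sum(row)
        # Round to the nearest whole percent, as in the lesson.
        result[label] = [round(100 * value / row_total) for value in row]
    return result


row_pct = row_percentages(counts)
for label, row in row_pct.items():
    print(label, dict(zip(classes, row)))
```

Each row of the result adds to $100\%$, which is a quick check that the row totals were used as the denominators.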

Column percentages table

We can also calculate the percentage in each class type that survived or died. To do this we calculate column percentages. To find each percentage, divide the value in the table by the column total and multiply by $100\%$. The calculations are shown in the table below:

| | First Class | Second Class | Third Class | Crew |
|---|---|---|---|---|
| Survived | $\frac{202}{325}\times100\%\approx62\%$ | $\frac{118}{285}\times100\%\approx41\%$ | $\frac{178}{706}\times100\%\approx25\%$ | $\frac{212}{908}\times100\%\approx23\%$ |
| Died | $\frac{123}{325}\times100\%\approx38\%$ | $59\%$ | $75\%$ | $77\%$ |
| Total | $100\%$ | $100\%$ | $100\%$ | $100\%$ |

Reflect: This percentage frequency table now gives us more useful information. The first column gives us the percentage breakdown of survivals and deaths in first class. From the raw data we can see that a similar number of first class ($202$) and third class ($178$) passengers survived. However, this can be misleading. The percentage frequency table shows us that $62\%$ of first class passengers survived, whereas only $25\%$ of third class passengers survived.
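
The column percentage calculation follows the same pattern, with the column totals as the denominators. The sketch below is an illustration only; the helper name `column_percentages` is hypothetical and the counts are those from the Titanic table.

```python
# Counts from the Titanic table, one list per row, in class order.
classes = ["First Class", "Second Class", "Third Class", "Crew"]
survived = [202, 118, 178, 212]
died = [123, 167, 528, 696]


def column_percentages(*rows):
    """Divide each entry by its column total and multiply by 100."""
    # zip(*rows) groups the entries of each column together.
    column_totals = [sum(column) for column in zip(*rows)]
    return [
        [round(100 * value / total) for value, total in zip(row, column_totals)]
        for row in rows
    ]


survived_pct, died_pct = column_percentages(survived, died)
for name, s, d in zip(classes, survived_pct, died_pct):
    print(f"{name}: survived {s}%, died {d}%")
```

Here each column adds to $100\%$, so the survival percentages can be compared directly across classes even though the classes have very different sizes.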

Practice questions

Question 4

Maria surveyed a group of people about the type of job they had. She recorded the data in the following graph.

1. Complete the following two-way table displaying the row percentage.

Give your answers correct to $1$ decimal place.

| | None | Casual | Part-time | Full-time | Total |
|---|---|---|---|---|---|
| Men | $17.6\%$ | ____% | $29.4\%$ | $41.2\%$ | $100\%$ |
| Women | $22.2\%$ | $27.8\%$ | ____% | ____% | $100\%$ |

2. Complete the following two-way table displaying the column percentage.

Give your answers correct to $1$ decimal place.

| | None | Casual | Part-time | Full-time |
|---|---|---|---|---|
| Men | ____% | $28.6\%$ | $62.5\%$ | ____% |
| Women | $57.1\%$ | ____% | $37.5\%$ | ____% |
| Total | $100\%$ | $100\%$ | $100\%$ | $100\%$ |

Finding the association between variables

To find if there is an association between the variables in the Titanic table we can ask the question "Is survival rate dependent on the class of the passenger?" To answer this we must first identify the explanatory variable. In this problem it is the class of passenger. The explanatory variable forms the heading of each column, therefore the column percentage frequency table will best indicate any patterns.

| | First Class | Second Class | Third Class | Crew |
|---|---|---|---|---|
| Survived | $62\%$ | $41\%$ | $25\%$ | $23\%$ |
| Died | $38\%$ | $59\%$ | $75\%$ | $77\%$ |
| Total | $100\%$ | $100\%$ | $100\%$ | $100\%$ |

When we read across the first row in the column percentage frequency table and look at the numbers, we can see a clear difference in the percentages of passengers that lived or died in each class. This suggests there is an association between the class of passenger and the rate of survival. It appears that the higher the class of passenger, the higher the rate of survival.

Which percentage frequency table to use?

If the explanatory variable forms the column headings then we use the column percentage frequency table to look for association.

If the explanatory variable forms the row headings then we use the row percentage frequency table to look for association.

How do we tell if there is an association?

If it's a column percentage table then look across the rows for differences in the values. If the values are similar then we say there is NO clear association.

If it's a row percentage table then look down the columns for differences in the values. If the values are similar then we say there is NO clear association.

How to describe the association?

First state if there is or is not an association apparent. For example: There appears to be an association between the variables.

Next describe the association. For example: The class of passenger affects the survival rate of the passengers.

Finally give an example. For example: The higher the class of passenger, the more likely the passenger was to survive.

An overview of the percentages in two-way tables can bring to light clear associations. Detecting more subtle associations, and measuring objectively how significant an association is, requires additional analysis and methods from further studies in statistics.

Note: The term 'association' is used to describe a relationship between variables. An association does not mean that one variable causes the other variable to change, only that a change in one variable appears to go together with a change in the other.

Practice questions

Question 5

Glen surveyed all the students in Year $12$ at his school and summarised the results in the following table:

| | Play netball | Do not play netball | Total |
|---|---|---|---|
| Height $\ge 170$ cm | $33$ | $72$ | $105$ |
| Height $< 170$ cm | $13$ | $30$ | $43$ |
| Total | $46$ | $102$ | $148$ |

1. Which variable is the explanatory variable?

Play netball

A

Height

B

2. To examine if there is an association between height and playing netball, should Glen use a column or row percentage frequency table?

Column

A

Row

B

3. Complete the row percentage frequency table for this data.

| | Play netball | Do not play netball | Total |
|---|---|---|---|
| Height $\ge 170$ cm | ____% | ____% | ____% |
| Height $< 170$ cm | ____% | ____% | ____% |

4. Looking at the columns of the completed table, does there appear to be an association between height and playing netball?

No, there does not appear to be any association as the numbers are similar.

A

Yes, there appears to be an association as the numbers are quite different. It seems that taller people like to play netball.

B

Question 6

Members of a gym were asked what kind of training they do. Each responder only did one kind of training. The table shows the results.

| | Cardio | Weight |
|---|---|---|
| Male | $11$ | $26$ |
| Female | $46$ | $17$ |

1. Which variable is the explanatory variable?

Gender

A

Type of training

B

2. To examine if there is an association between the type of training and gender, should we use a column or row percentage frequency table?

Row

A

Column

B

3. Complete the row percentage frequency table for this data.

Round your answers to the nearest percentage.

| | Cardio | Weight | Total |
|---|---|---|---|
| Male | ____% | ____% | ____% |
| Female | ____% | ____% | ____% |

4. Looking at the columns of the completed table, does there appear to be an association between the type of training and the gender of gym members?

Yes, there appears to be an association as the numbers are quite different. Men seem to prefer weights, while women seem to prefer cardio.

A

No, there does not appear to be any association as the numbers are different.

B

5. Does a person’s gender cause them to choose a certain type of training?

Yes. As we saw, women prefer to do cardio and men prefer to do weights.

A

No, association is not causation. There appears to be an association but we cannot say whether one variable causes the other.

B

100% stacked column graphs

Association between variables can often be seen more clearly in a stacked column graph. Below is a stacked column graph (also called segmented column graph) for the data from the Titanic table earlier. When we look at each column we can see the proportion of blue in each column is different. This indicates there is an association between the variables.

If there is no association then the proportions of the sections in each column are the same. When we look at the graph below we can see that each column is divided into similar-sized sections. This indicates there is NO clear association between household composition and distribution of money.

How to draw a stacked column graph

Label the horizontal axis with the categories of the explanatory variable.

Label the vertical axis with percentages from $0\%$ to $100\%$.

Draw a column for each category of the explanatory variable that reaches a height of $100\%$ on the vertical axis.

Divide each column into the percentages shown in the percentage frequency table: start from the bottom of the column, count up to the first percentage and draw a horizontal line to mark it off, then count up by the second percentage from that line and mark off again, continuing until all sections are complete.

Write the label and percentage in each section of the columns to indicate the category of the response variable it represents, or provide a key.
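
The "count up and mark off" step amounts to keeping a running total of the percentages in each column. The following Python sketch works out where each section starts and ends, using the column percentages from the Titanic table; the function name `segment_bounds` is an assumption made for this illustration, and the drawing itself is left out.

```python
def segment_bounds(percentages):
    """Return (bottom, top) heights for each section of one 100% column."""
    bounds = []
    bottom = 0
    for pct in percentages:
        top = bottom + pct  # count up from the previous horizontal line
        bounds.append((bottom, top))
        bottom = top
    return bounds


# Column percentages from the Titanic table (Survived first, then Died).
column_pct = {
    "First Class": [62, 38],
    "Second Class": [41, 59],
    "Third Class": [25, 75],
    "Crew": [23, 77],
}

bounds = {name: segment_bounds(pcts) for name, pcts in column_pct.items()}
for name, sections in bounds.items():
    print(name, sections)
```

The last section of every column should finish at $100$, which confirms the column has been divided completely.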

Practice questions

Question 7

$170$ people were surveyed about their music preference. The results have been recorded in the table below.

Musical Preferences by Gender

| Music Preference | Male | Female | Total |
|---|---|---|---|
| Rock & Roll | $24$ | $19$ | $43$ |
| Classical | $8$ | $15$ | $23$ |
| Pop | $17$ | $17$ | $34$ |
| Rap | $6$ | $2$ | $8$ |
| Country & Western | $17$ | $24$ | $41$ |
| R&B | $6$ | $9$ | $15$ |
| Punk | $4$ | $2$ | $6$ |
| Total | $82$ | $88$ | $170$ |

1. What is the explanatory variable in this data set?

Music preference

A

Gender

B

2. Which of the following $100\%$ stacked column charts should be used to look for an association between the variables?

A

B

3. Does this stacked column chart suggest that there is an association between music preference and gender?

No. The data does not suggest any association, as the corresponding segments are of similar sizes.

A

Yes. The data suggests there is an association, as the corresponding segments are of similar sizes.

B

No. The data does not suggest any association, as the corresponding segments are of different sizes.

C

Yes. The data suggests there is an association, as the corresponding segments are of different sizes.

D

Question 8

A group of Year 12 students surveyed their class and recorded the hair colour and eye colour for each student. The results are displayed in the $100\%$ stacked column chart shown below.

1. What is the explanatory variable for this chart?

Eye colour

A

Hair colour

B

2. Does the chart suggest an association between eye colour and hair colour?

Yes, as the corresponding segments are similar in size.

A

Yes, as the corresponding segments are of different sizes.

B

No, as the corresponding segments are similar in size.

C

No, as the corresponding segments are of different sizes.

D

3. Can we say that having blue eyes causes a high chance of having blonde hair?

Yes. The data shows that students with blue eyes are more likely to have blonde hair.

A

No. There appears to be an association, but we cannot say that one causes the other.

B

Outcomes

ACMGM049

construct two-way frequency tables and determine the associated row and column sums and percentages

ACMGM050

use an appropriately percentaged two-way frequency table to identify patterns that suggest the presence of an association

ACMGM051

describe an association in terms of differences observed in percentages across categories in a systematic and concise manner, and interpret this in the context of the data

ACMGM055

identify the response variable and the explanatory variable