Lesson

The goal of bivariate data analysis is to see whether two variables are associated in some way. The two variables that we study in bivariate statistics are called the explanatory variable and the response variable.

An experiment is to be conducted in which a researcher surveys men and women as to their preference for coffee or tea or neither. The researcher would then analyse the results to find whether gender is a predictor for hot beverage preference. Identify the explanatory and response variables in this experiment.

**Think:** To identify which is the explanatory variable and which is the response variable we consider the questions "Does gender explain hot beverage preference?" or "Does hot beverage preference explain the gender of a person?".

**Do:** Gender is the variable we expect to predict hot beverage preference, not the other way around, so we have:

Explanatory variable: Gender

Response variable: Hot beverage preference

Definitions

Explanatory Variable - the variable which we expect to **explain** or **predict** the value of the response variable.

Response Variable - the variable which we expect to **respond** to the value of the explanatory variable.

When displaying bivariate data graphically the explanatory variable is plotted on the horizontal axis (the $x$-axis), and the response variable on the vertical axis (the $y$-axis).

A single coordinate point in a bivariate data set might be written in the form $\left(x,y\right)$, and it would be understood that $x$ is the explanatory variable and $y$ is the response variable.

Plotting explanatory and response variables

Explanatory variable - plotted on the horizontal $x$-axis

Response variable - plotted on the vertical $y$-axis

Consider the following variables:

Temperature ($^\circ$C)

Number of ice cream cones sold

Which of the following statements makes sense?

A. A change in temperature affects the number of ice cream cones sold.

B. A change in the number of ice cream cones sold affects the temperature.

Which is the explanatory variable and which is the response variable?

A. EV: number of ice cream cones sold; RV: temperature

B. EV: temperature; RV: number of ice cream cones sold

Which of the following images has the variables placed in the correct positions?

[Images: answer options A to E]

The scatter plot shows the relationship between sea temperature and the amount of healthy coral.

Which variable is the response variable?

A. Sea temperature

B. Level of healthy coral

Which variable is the explanatory variable?

A. Sea temperature

B. Level of healthy coral

Two-way tables allow us to display and examine the relationship between two sets of **categorical** data. The categories are labelled along the top and down the left side of the table, and the frequencies of the different combinations of categories appear in the interior of the table. Often the totals of each row and column are also shown.

The following are the statistics of the passengers and crew who sailed on the Titanic on its fateful maiden voyage in 1912.

| | First Class | Second Class | Third Class | Crew | Total |
|---|---|---|---|---|---|
| Survived | $202$ | $118$ | $178$ | $212$ | $710$ |
| Died | $123$ | $167$ | $528$ | $696$ | $1514$ |
| Total | $325$ | $285$ | $706$ | $908$ | $2224$ |

Although it is interesting to know that $202$ First Class passengers survived, it is far more useful to know the percentage breakdown of survivors by class. To do this we need to calculate **row percentages**. To find the percentage, divide the value in the table by the row total and multiply by $100\%$. The calculations for the first row of the table are shown below.

**Think**: We need the fraction of first class survivors out of the total number of survivors (the row total), written as a percentage and rounded to the nearest whole percent.

$$\begin{aligned}
\text{Percentage of survivors that were first class} &= \frac{\text{first class survivors}}{\text{total number of survivors}}\times100\% \\
&= \frac{202}{710}\times100\% \\
&\approx 28\%
\end{aligned}$$

| | First Class | Second Class | Third Class | Crew | Total |
|---|---|---|---|---|---|
| Survived | $\frac{202}{710}\times100\%\approx28\%$ | $\frac{118}{710}\times100\%\approx17\%$ | $\frac{178}{710}\times100\%\approx25\%$ | $\frac{212}{710}\times100\%\approx30\%$ | $100\%$ |
| Died | $\frac{123}{1514}\times100\%\approx8\%$ | $11\%$ | $35\%$ | $46\%$ | $100\%$ |

**Reflect:** This percentage frequency table now gives us more useful information than the raw data. The first row gives us the percentage breakdown of survivors by class. We can now easily read that $46\%$ of the people who died were crew members whereas only $8\%$ of the people who died were in first class.
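The row percentage calculation can also be checked with a few lines of code. The following is a minimal Python sketch, not part of the original lesson; the names `counts` and `row_percentages` are chosen here for illustration. It divides every cell by its row total and rounds to the nearest whole percent.

```python
# Titanic counts from the two-way table above.
# Each list is one row; the columns are First Class, Second Class, Third Class, Crew.
counts = {
    "Survived": [202, 118, 178, 212],
    "Died": [123, 167, 528, 696],
}


def row_percentages(table):
    """Divide each cell by its row total and round to the nearest whole percent."""
    result = {}
    for label, row in table.items():
        total = sum(row)  # the row total, e.g. 710 survivors
        result[label] = [round(100 * value / total) for value in row]
    return result


row_pct = row_percentages(counts)
print(row_pct["Survived"])  # [28, 17, 25, 30]
print(row_pct["Died"])      # [8, 11, 35, 46]
```

Each row of the output adds to $100$, which is a quick check that the row total was used as the divisor.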

We can also calculate the percentage in each class type that survived or died. To do this we calculate **column percentages**. To find the percentage, divide the value in the table by the column total and multiply by $100\%$. The calculations are shown in the table below:

| | First Class | Second Class | Third Class | Crew |
|---|---|---|---|---|
| Survived | $\frac{202}{325}\times100\%\approx62\%$ | $\frac{118}{285}\times100\%\approx41\%$ | $\frac{178}{706}\times100\%\approx25\%$ | $\frac{212}{908}\times100\%\approx23\%$ |
| Died | $\frac{123}{325}\times100\%\approx38\%$ | $59\%$ | $75\%$ | $77\%$ |
| Total | $100\%$ | $100\%$ | $100\%$ | $100\%$ |

**Reflect:** This percentage frequency table now gives us more useful information. The first column gives us the percentage breakdown of survivals and deaths in first class. From the raw data we can see that a similar number of first class ($202$) and third class ($178$) passengers survived. However, this can be misleading. The percentage frequency table shows us that $62\%$ of first class passengers survived whereas only $25\%$ of third class passengers survived.
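The column percentages can be reproduced in the same way. This is a short Python sketch under the assumption that the table has exactly two rows (Survived and Died); the function name `column_percentages` is made up for this example.

```python
# Titanic counts by column: First Class, Second Class, Third Class, Crew.
classes = ["First Class", "Second Class", "Third Class", "Crew"]
survived = [202, 118, 178, 212]
died = [123, 167, 528, 696]


def column_percentages(top_row, bottom_row):
    """Divide each cell by its column total and round to the nearest whole percent."""
    pct_top, pct_bottom = [], []
    for a, b in zip(top_row, bottom_row):
        total = a + b  # the column total, e.g. 325 for First Class
        pct_top.append(round(100 * a / total))
        pct_bottom.append(round(100 * b / total))
    return pct_top, pct_bottom


survived_pct, died_pct = column_percentages(survived, died)
for name, s, d in zip(classes, survived_pct, died_pct):
    print(f"{name}: survived {s}%, died {d}%")
```

The only change from a row percentage calculation is the divisor: here each cell is compared with the total for its class rather than with the total number of survivors or deaths.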

Maria surveyed a group of people about the type of job they had. She recorded the data in the following graph.

Complete the following two-way table displaying the row percentage.

Give your answers correct to $1$ decimal place.

| | None | Casual | Part-time | Full-time | Total |
|---|---|---|---|---|---|
| Men | $17.6\%$ | $\editable{}$% | $29.4\%$ | $41.2\%$ | $100\%$ |
| Women | $22.2\%$ | $27.8\%$ | $\editable{}$% | $\editable{}$% | $100\%$ |

Complete the following two-way table displaying the column percentage.

Give your answers correct to $1$ decimal place.

| | None | Casual | Part-time | Full-time |
|---|---|---|---|---|
| Men | $\editable{}$% | $28.6\%$ | $62.5\%$ | $\editable{}$% |
| Women | $57.1\%$ | $\editable{}$% | $37.5\%$ | $\editable{}$% |
| Total | $100\%$ | $100\%$ | $100\%$ | $100\%$ |

To find if there is an association between the variables in the Titanic table we can ask the question "Is survival rate dependent on the class of the passenger?"

In order to find this we must first identify the **explanatory variable**. In this problem it is the class of passenger. The explanatory variable forms the heading of each **column**, therefore the **column** percentage frequency table will best indicate any patterns.

| | First Class | Second Class | Third Class | Crew |
|---|---|---|---|---|
| Survived | $62\%$ | $41\%$ | $25\%$ | $23\%$ |
| Died | $38\%$ | $59\%$ | $75\%$ | $77\%$ |
| Total | $100\%$ | $100\%$ | $100\%$ | $100\%$ |

When we read across the first row in the **column** percentage frequency table and look at the numbers we can see a clear difference in the percentages of passengers that lived or died in each class. This suggests there **is an association** between the class of passenger and the rate of survival. It appears that the higher the class of passenger, the higher the rate of survival.

Which percentage frequency table to use?

If the explanatory variable forms the **column** headings then we use the **column** percentage frequency table to look for association.

If the explanatory variable forms the **row** headings then we use the **row** percentage frequency table to look for association.

How do we tell if there is an association?

If it's a **column** percentage table then look across the **rows** for differences in the values. If the values are similar then we say there is NO clear association.

If it's a **row** percentage table then look down the **columns** for differences in the values. If the values are similar then we say there is NO clear association.
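Looking across a row for differences can be summarised with a single number. The sketch below is one informal way to do this and is an assumption rather than a method from the lesson: it reports the gap, in percentage points, between the largest and smallest value in a row. The lesson itself judges the differences by eye and sets no cut-off, so none is set here either. The second table uses made-up values purely to show what similar percentages look like.

```python
def spread(percentages):
    """Largest minus smallest percentage in a row, in percentage points."""
    values = list(percentages.values())
    return max(values) - min(values)


# "Survived" row of the Titanic column percentage table.
survived_pct = {"First Class": 62, "Second Class": 41, "Third Class": 25, "Crew": 23}

# Hypothetical row with similar values (illustration only, not real data).
similar_pct = {"Group A": 31, "Group B": 30}

titanic_spread = spread(survived_pct)
print(titanic_spread)       # 39
print(spread(similar_pct))  # 1
```

A gap of $39$ percentage points matches the clear difference seen in the Titanic table, while a gap of $1$ point is the kind of result described as NO clear association.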

How to describe the association?

First state if there is or is not an association apparent. For example: There appears to be an association between the variables.

Next describe the association. For example: The class of passenger affects the survival rate of the passengers.

Finally give an example. For example: The higher the class of passenger, the more likely the passenger was to survive.

An overview of the percentages in two-way tables can bring clear associations to light. Detecting more subtle associations, or measuring objectively how significant an association is, requires additional analysis and methods from further studies in statistics.

**Note:** The term 'association' is used to describe a relationship between variables. An association does **not** mean that one variable **causes** the other to change, only that the values of the two variables tend to change together.

Glen surveyed all the students in Year $12$ at his school and summarised the results in the following table:

| | Play netball | Do not play netball | Total |
|---|---|---|---|
| Height $\ge 170$ cm | $33$ | $72$ | $105$ |
| Height $< 170$ cm | $13$ | $30$ | $43$ |
| Total | $46$ | $102$ | $148$ |

Which variable is the explanatory variable?

A. Play netball

B. Height

To examine if there is an association between height and playing netball, should Glen use a column or row percentage frequency table?

A. Column

B. Row

Complete the row percentage frequency table for this data.

Round your answers to the nearest percentage.

| | Play netball | Do not play netball | Total |
|---|---|---|---|
| Height $\ge 170$ cm | $\editable{}$% | $\editable{}$% | $\editable{}$% |
| Height $< 170$ cm | $\editable{}$% | $\editable{}$% | $\editable{}$% |

Looking at the columns of the completed table, does there appear to be an association between height and playing netball?

A. No, there does not appear to be any association as the numbers are similar.

B. Yes, there appears to be an association as the numbers are quite different. It seems that taller people like to play netball.

Members of a gym were asked what kind of training they do. Each respondent did only one kind of training. The table shows the results.

| | Cardio | Weight |
|---|---|---|
| Male | $11$ | $26$ |
| Female | $46$ | $17$ |

Which variable is the explanatory variable?

A. Gender

B. Type of training

To examine if there is an association between the type of training and gender, should we use a column or row percentage frequency table?

A. Row

B. Column

Complete the row percentage frequency table for this data.

Round your answers to the nearest percentage.

| | Cardio | Weight | Total |
|---|---|---|---|
| Male | $\editable{}$% | $\editable{}$% | $\editable{}$% |
| Female | $\editable{}$% | $\editable{}$% | $\editable{}$% |

Looking at the columns of the completed table, does there appear to be an association between the type of training and the gender of gym members?

A. Yes, there appears to be an association as the numbers are quite different. Men seem to prefer weights, while women seem to prefer cardio.

B. No, there does not appear to be any association as the numbers are different.

Does a person’s gender cause them to choose a certain type of training?

A. Yes. As we saw, women prefer to do cardio and men prefer to do weights.

B. No, association is not causation. There appears to be an association but we cannot say whether one variable causes the other.

Association between variables can often be seen more clearly in a stacked column graph. Below is a stacked column graph (also called a segmented column graph) for the data from the Titanic table earlier. When we look at each column we can see that the proportion of blue is different from column to column. This indicates there **is** an association between the variables.

If there is no association then the sections take up the same proportion of every column. When we look at the graph below we can see that each column is divided into sections of similar size. This indicates there is NO clear association between household composition and distribution of money.

How to draw a stacked column graph

Label the **horizontal axis** with the categories of the explanatory variable.

Label the **vertical axis** with percentages from $0\%$ to $100\%$.

Draw a column for each category of the explanatory variable that reaches the height of $100\%$ on the vertical axis.

To divide each column into the percentages shown in the frequency table, start from the bottom of the column, count up to the first percentage and draw a horizontal line to mark it off. Then count up by the second percentage from that line and mark off again, continuing until all sections are complete.

Write the **label** and **percentage** in each section of the columns to indicate which category of the response variable it displays, or provide a key.
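The marking-off step above amounts to keeping a running total up the column. The following Python sketch, using the Titanic column percentages, works out where each section starts and ends; the function name `stack_segments` is an illustrative choice, and the code only computes the heights rather than drawing the graph.

```python
def stack_segments(percentages):
    """Return (label, bottom, top) for each section, stacking from 0% upward."""
    segments = []
    bottom = 0
    for label, pct in percentages:
        top = bottom + pct  # count up by this section's percentage
        segments.append((label, bottom, top))
        bottom = top        # the next section starts at this line
    return segments


# One column per category of the explanatory variable (class of passenger).
columns = {
    "First Class": [("Survived", 62), ("Died", 38)],
    "Second Class": [("Survived", 41), ("Died", 59)],
    "Third Class": [("Survived", 25), ("Died", 75)],
    "Crew": [("Survived", 23), ("Died", 77)],
}

for name, parts in columns.items():
    print(name, stack_segments(parts))
```

For First Class this gives a "Survived" section from $0\%$ to $62\%$ and a "Died" section from $62\%$ to $100\%$, so the horizontal dividing line is drawn at $62\%$.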

$170$ people were surveyed about their music preference. The results have been recorded in the table below.

Musical Preferences by Gender

| Music Preference | Male | Female | Total |
|---|---|---|---|
| Rock & Roll | $24$ | $19$ | $43$ |
| Classical | $8$ | $15$ | $23$ |
| Pop | $17$ | $17$ | $34$ |
| Rap | $6$ | $2$ | $8$ |
| Country & Western | $17$ | $24$ | $41$ |
| R&B | $6$ | $9$ | $15$ |
| Punk | $4$ | $2$ | $6$ |
| Total | $82$ | $88$ | $170$ |

What is the explanatory variable in this data set?

A. Music preference

B. Gender

Which of the following $100\%$ stacked column charts should be used to look for an association between the variables?

[Images: answer options A and B]

Does this stacked column chart suggest that there is an association between music preference and gender?

A. No. The data does not suggest any association, as the corresponding segments are of similar sizes.

B. Yes. The data suggests there is an association, as the corresponding segments are of similar sizes.

C. No. The data does not suggest any association, as the corresponding segments are of different sizes.

D. Yes. The data suggests there is an association, as the corresponding segments are of different sizes.

A group of Year $12$ students surveyed their class and recorded the hair colour and eye colour for each student. The results are displayed in the $100\%$ stacked column chart shown below.

What is the explanatory variable for this chart?

A. Eye colour

B. Hair colour

Does the chart suggest an association between eye colour and hair colour?

A. Yes, as the corresponding segments are similar in size.

B. Yes, as the corresponding segments are of different sizes.

C. No, as the corresponding segments are similar in size.

D. No, as the corresponding segments are of different sizes.

Can we say that having blue eyes causes a high chance of having blonde hair?

A. Yes. The data shows that students with blue eyes are more likely to have blonde hair.

B. No. There appears to be an association, but we cannot say that one causes the other.

construct two-way frequency tables and determine the associated row and column sums and percentages

use an appropriately percentaged two-way frequency table to identify patterns that suggest the presence of an association

describe an association in terms of differences observed in percentages across categories in a systematic and concise manner, and interpret this in the context of the data

identify the response variable and the explanatory variable