Lesson

Your task is to follow the statistical investigation process in order to investigate and prepare a report from data collected about passengers on the Titanic. The data can be found in the attached table.

Below is a summary of the statistical investigation process. The corresponding exercise contains questions that will help you along the way.

The statistical investigation process is a process that begins with the need to solve a real-world problem and aims to reflect the way statisticians work.

It is a cyclic process that involves several stages:

Statistical investigation process

**Identify a problem**

- clarify the problem and formulate questions that can be answered with data

**Collect data**

- design and implement a plan to collect or obtain appropriate data

**Analyse data**

- select and apply appropriate techniques to analyse the data

**Interpret and communicate the results**

- interpret the results of analysis in a way that relates to the original question

For example, if a scientist is using the statistical investigation process to investigate a possible relationship between litres of soft drink consumed per week and BMI for a set of people (bivariate data), then the steps may look something like this:

**Identify a problem**

- does there appear to be an association? Do people who drink more soft drink tend to have a higher BMI?

**Collect data**

- collect bivariate data (litres consumed and BMI) for a sufficient number of people.
- the source of the data should be recorded so that someone reading the final report could verify the data themselves.

**Analyse data**

- this is where we "do the maths" by applying graphical or numerical techniques to analyse the data.
- the scientist might construct a percentage two-way frequency table or a scattergraph, measure the strength of the relationship by calculating the correlation coefficient, or find the equation of a line or curve that best describes the apparent relationship.

**Interpret and communicate the results**

- comment on whether the analysis indicates that there is an association between the variables.
- the interpretation of the results should be related to the original question and communicated in a systematic and concise manner.
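The "Analyse data" step in this example can be sketched in a few lines of Python. This is only an illustration: the litres and BMI values are invented, and the helper names `pearson_r` and `least_squares_line` are our own, not part of any library. The sketch calculates the correlation coefficient and the least-squares line of best fit.

```python
from math import sqrt

# Hypothetical values, invented purely for illustration (not real measurements).
litres = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]    # soft drink consumed per week
bmi = [21.0, 22.5, 22.0, 24.5, 25.0, 26.0]  # BMI of the same six people


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient between paired lists x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def least_squares_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


r = pearson_r(litres, bmi)
slope, intercept = least_squares_line(litres, bmi)
print(f"r = {r:.3f}")
print(f"BMI = {slope:.3f} x litres + {intercept:.3f}")
```

A value of r close to 1 or -1 suggests a strong linear association, while a value close to 0 suggests a weak one. Interpreting that value in context belongs to the final step of the process.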

Note: a statistician must consider whether they will survey an entire population of interest (**census**) or a representative group from within the population (**sample**). The process of selecting a sample must be as unbiased as possible to keep the data as representative as possible.
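One common way to reduce selection bias is a simple random sample, in which every member of the population has the same chance of being chosen. The sketch below assumes a hypothetical population of 500 numbered people. The function name `simple_random_sample` is our own, and the seed is fixed only so that the selection can be reproduced and checked.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Select sample_size distinct members of population, each equally likely."""
    rng = random.Random(seed)  # a seeded generator makes the selection repeatable
    return rng.sample(population, sample_size)


# Hypothetical population: ID numbers 1 to 500.
population = list(range(1, 501))
sample = simple_random_sample(population, 50, seed=1)
print(sorted(sample))
```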

A typical structure for a statistical investigation report would have these sections:

- Introduction
- Data
- Analysis
- Conclusions

The format described above will not suit every investigation, so you may choose to add sections or break these sections up.

The introduction presents an outline of the investigation. In it, you should:

- clarify the task, clearly stating the problem that you are addressing as a statistical question
- describe the applicable circumstances
- identify the mathematical and statistical content
- state relevant and important assumptions

In the data section, you should describe, explain and justify the methods that you used to obtain data. Data can be presented in tables, graphs or lists, preferably using familiar mathematical formats.

- if you are using primary data, you should explain:
  - the choice of census or sample
  - how you selected your sample
  - the methods you used to collect the data (e.g. questionnaire, measurement, counting)
  - any problems that you encountered in collecting the data
- if you have used secondary data, you should state the source of the data and provide relevant information on how it was collected
- organise and display data, using lists, tables or graphs. When you have a large amount of data, it would be best to summarise the data in this section and refer to an appendix for the full data.

The analysis section contains the mathematical calculations along with an explanation and justification of the interpretations leading to the conclusions that we draw.

- describe, explain and justify the mathematical process used
- perform the mathematical analysis
- define the required variables and constant parameters
- present calculations systematically and explain them using mathematical language
- clearly state the final results of your analysis

- discuss strengths and weaknesses
- if appropriate, propose refinements to the investigation that would lead to stronger, or more useful, conclusions.

If the analysis is extensive, this section could just contain a summary of the mathematical analysis with references to further details in an appendix.

The conclusion should be an interpretation of the mathematical and statistical results in the context of the investigation. It should be a concise statement of the most important information and must not introduce any new information.

A good conclusion should concisely:

- restate the question
- state how data was obtained
- summarise the mathematical processes used to analyse the data and the results of analysis
- state your conclusions in the context of the original question
- describe important limitations

Task description: to analyse data on Titanic passengers in order to evaluate which type of passenger was most likely to survive.

The screenshot below shows the first few rows of the data table from the file titanic-data.csv containing information about passengers on the Titanic. (Data originally provided at https://www.kaggle.com/c/titanic-survival/data).

Notes about the data:

| Data name | Explanation |
|---|---|
| Survived | $0$ = No, $1$ = Yes |
| Pclass | Passenger class; also indicates socio-economic status (SES): 1st = upper class, 2nd = middle class, 3rd = lower class |
| Age | In years. If estimated, it is in the form xx.5. If the passenger is less than one year old, the age is given as a fraction |
| Sibsp | Number of siblings or spouses travelling with this passenger |
| Parch | Number of parents or children aboard travelling with this passenger |
| Fare | The passenger fare measured in pounds sterling |
| Embarked | C = Cherbourg, Q = Queenstown, S = Southampton |
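The coding described in these notes can be applied when reading the file. The sketch below uses Python's standard `csv` module on four invented rows held in a string, so it runs without the real file. It assumes the column headers match the data names listed above, and the helper names `describe_age` and `decode_row` are our own. Treating a blank Age as "missing" is also an assumption, since the notes do not say how missing values are recorded.

```python
import csv
import io

# Four invented rows, for illustration only; the real data is in titanic-data.csv.
SAMPLE_CSV = """Survived,Pclass,Age,Sibsp,Parch,Fare,Embarked
0,3,22,1,0,7.25,S
1,1,38,1,0,71.28,C
1,3,0.75,2,1,19.26,C
0,3,28.5,0,0,7.23,C
"""

EMBARKED = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}


def describe_age(text):
    """Classify an Age entry using the rules in the notes about the data."""
    if text == "":
        return "missing"          # assumption: blank cell means age not recorded
    age = float(text)
    if age < 1:
        return "under one year"   # ages below one year are given as a fraction
    if age % 1 == 0.5:
        return "estimated"        # estimated ages have the form xx.5
    return "recorded"


def decode_row(row):
    """Turn one raw CSV row into readable values."""
    return {
        "survived": row["Survived"] == "1",
        "pclass": int(row["Pclass"]),
        "age_note": describe_age(row["Age"]),
        "port": EMBARKED.get(row["Embarked"], "unknown"),
    }


rows = [decode_row(raw) for raw in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for row in rows:
    print(row)
```

To read the real file instead, `io.StringIO(SAMPLE_CSV)` could be replaced with an opened file, for example `open("titanic-data.csv", newline="")`, provided the headers match.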

- Select or create a suitable statistical question that captures the requirements of this investigation.

For example:

Is there an association between port of embarkation and survival?

Is there an association between gender and survival?

Is there an association between class of passenger and survival?

Is there an association between travelling with spouse or siblings and survival?

Is there an association between travelling with parents or children and survival?

- Determine what data you will need to collect from the table.
- Decide which is the explanatory variable and the response variable.
- List the assumptions we are making when we select data from this document.
- Decide how you will present the data (% row frequency table or % column frequency table?)
- Collect the necessary data and record it in an appropriate table. Hint: it may be easier to collect the data by using the spreadsheet file and filtering or sorting the data.
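If you prefer to tabulate the data with code rather than a spreadsheet, a percentage row frequency table can be built as sketched below. The ten (Pclass, Survived) pairs are invented for illustration, and `row_percentages` is our own helper name. Here passenger class is treated as the explanatory variable (rows) and survival as the response variable (columns), so each row adds to 100%.

```python
from collections import Counter

# Invented (Pclass, Survived) pairs for illustration; not taken from the real file.
records = [
    (1, 1), (1, 1), (1, 0),
    (2, 1), (2, 0), (2, 0),
    (3, 0), (3, 0), (3, 0), (3, 1),
]


def row_percentages(pairs):
    """Return {row: {column: percentage of that row's total}}."""
    counts = Counter(pairs)                         # frequency of each (row, column) cell
    row_totals = Counter(row for row, _ in pairs)   # total count for each row
    table = {}
    for (row, col), n in counts.items():
        table.setdefault(row, {})[col] = 100 * n / row_totals[row]
    return table


table = row_percentages(records)
for pclass in sorted(table):
    died = table[pclass].get(0, 0.0)
    survived = table[pclass].get(1, 0.0)
    print(f"Class {pclass}: did not survive {died:.1f}%, survived {survived:.1f}%")
```

Swapping the order of each pair gives a table whose rows are the survival outcomes instead, which corresponds to choosing column percentages in the layout above.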

- Do the mathematics needed. For example: calculate the % frequencies and find the appropriate measures of centre or measures of spread.
- Determine what type of graph would be best suited to displaying the results of your analysis. For example: a side-by-side column graph or a $100\%$ stacked column graph?
- Use technology or paper to construct the graph(s) needed to show the association between the variables.
- Using mathematical terminology, describe the association resulting from your calculations.
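As a rough stand-in for a 100% stacked column graph, the sketch below draws a text bar for each category, with `#` for the percentage that survived and `.` for the remainder. The percentages are invented placeholders, and `stacked_bar` is our own helper name. A spreadsheet or graphing package would normally be used for the final report.

```python
def stacked_bar(percent_survived, width=20):
    """Return a text bar of fixed width: '#' for survived, '.' for did not survive."""
    filled = round(width * percent_survived / 100)
    return "#" * filled + "." * (width - filled)


# Invented placeholder percentages; replace with the values from your own table.
survival_pct = {"1st": 60.0, "2nd": 45.0, "3rd": 25.0}

for label, pct in survival_pct.items():
    print(f"{label} |{stacked_bar(pct)}| {pct:.0f}% survived")
```

Because every bar has the same total width, the categories can be compared directly by the length of the `#` section, which is the idea behind a 100% stacked graph.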

- Do the results of our calculations enable us to answer the statistical question that we posed?
- What is the answer to the question we posed?
- Would it have been better to use other data variables?
- Write a complete statistical investigation report for this investigation, following the guidelines provided in the lesson.

Review the statistical investigation process; for example, identifying a problem and posing a statistical question, collecting or obtaining data, analysing the data, interpreting and communicating the results.

Implement the statistical investigation process to answer questions that involve identifying, analysing and describing associations between two categorical variables or between two numerical variables; for example, is there an association between attitude to capital punishment (agree with, no opinion, disagree with) and sex (male, female)? Is there an association between height and foot length?