3.01 Classifying data

Lesson

Conducting a statistical investigation

A statistical investigation is the process of transforming raw data into useful information that tells us more about a subject, supports recommendations and, possibly, allows predictions of future outcomes. It consists of six stages:

  1. Posing questions
  2. Collecting data
  3. Organising data
  4. Summarising and displaying data
  5. Analysing data and drawing conclusions
  6. Writing a report

A statistician seeks to be as accurate and fair as possible in each step of the process, especially in the collecting of the data (who they survey and the method of data collection), so that the data is as unbiased as possible. Issues of privacy, ethics and cultural sensitivities need to be considered in the collection process, for example, when considering the wording of questions to ask in a survey.

A statistician also needs to consider whether they will survey an entire population of interest (census) or a representative group from within the population (sample). The process of selecting a sample must be as unbiased as possible to keep the data as representative as possible. 

Types of data

In statistics, a 'variable' characterises data that is measurable or observable. A variable could be something like temperature, mass, height, make of car, type of animal or goals scored. We often collect data to observe and analyse changes in a variable.

Data variables can be defined as either numerical or categorical.

  • Numerical data is where each data point is represented by a number. Examples include: number of items sold each month, daily temperatures, heights of people, and ages of a population. The data can be further defined as either discrete (associated with counting) or continuous (associated with measuring). Numerical data is also known as quantitative data.
     
  • Categorical data is where each data point is represented by a word or label. Examples include: brand names, types of animals, favourite colours, and names of countries. The data can be further defined as either ordinal (it can be ordered) or nominal (un-ordered). Categorical data is also known as qualitative data.
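
As a rough illustration of the two-level classification above, the following Python sketch sorts a variable into one of the four types. The function name `classify_variable` and its two flags are hypothetical. Whether numerical data is counted or measured, and whether labels have a meaningful order, cannot be inferred from the values alone, so the sketch assumes the caller supplies that information.

```python
from numbers import Number


def classify_variable(values, counted=False, ordered=False):
    """Return the data type of a variable as described in this lesson.

    Hypothetical helper: `counted` and `ordered` are assumptions supplied
    by the caller, because they depend on how the data was collected.
    """
    # Numerical data: every data point is a number (booleans excluded).
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
        return "numerical discrete" if counted else "numerical continuous"
    # Categorical data: every data point is a word or label.
    if all(isinstance(v, str) for v in values):
        return "categorical ordinal" if ordered else "categorical nominal"
    raise ValueError("values mix numbers and labels")


print(classify_variable([2, 3, 1, 0], counted=True))      # children per family
print(classify_variable([12.359, 14.022, 14.951]))        # tree heights
print(classify_variable(["A", "C", "B"], ordered=True))   # exam grades
print(classify_variable(["Toyota", "Holden", "Mazda"]))   # make of car
```

The flags make the point that the same list of numbers could be discrete or continuous depending on whether it came from counting or measuring.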

Discrete numerical data

Discrete numerical data involves data points that are distinct and separate from each other. There is a definite 'gap' separating one data point from the next. Discrete data usually, but not always, consists of whole numbers, and is often collected by some form of counting.

Examples of discrete data:

Number of goals scored per match: $1$, $3$, $0$, $1$, $2$, $0$, $2$, $4$, $2$, $0$, $1$, $1$, $2$, ...
Number of children per family: $2$, $3$, $1$, $0$, $1$, $4$, $2$, $2$, $0$, $1$, $1$, $5$, $3$, ...
Number of products sold each day: $437$, $410$, $386$, $411$, $401$, $397$, $422$, ...

In each of these cases, there are no in-between values. We cannot have $2.5$ goals or $1.2$ people, for example.

This doesn't mean that discrete data always consists of whole numbers. Shoe sizes, an example of discrete data, are often separated by half-sizes, for example $8$, $8.5$, $9$, $9.5$. Even so, there is a definite gap between the sizes. A shoe won't ever come in size $8.145$.
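
Because discrete data takes only distinct, separated values, it can be organised by counting how often each value occurs. The sketch below tallies the goals-per-match values listed earlier, using only the values shown (the original list is open-ended). The variable names are illustrative.

```python
from collections import Counter

# Goals scored per match, copied from the example list in this lesson.
# Only the listed values are used; the source list continues beyond them.
goals = [1, 3, 0, 1, 2, 0, 2, 4, 2, 0, 1, 1, 2]

# Count how many matches produced each distinct number of goals.
tally = Counter(goals)

for value in sorted(tally):
    print(f"{value} goals: {tally[value]} matches")
```

Every key in the tally is one of a small set of separate values, which is what distinguishes this data from a continuous measurement.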

Continuous numerical data

Continuous numerical data involves data points that can occur anywhere along a continuum. Any value is possible within a range of values. Continuous data often involves the use of decimal numbers, and is often collected using some form of measurement.

Examples of continuous data:

Height of trees in a forest (in metres): $12.359$, $14.022$, $14.951$, $18.276$, $11.032$, ...
Times taken to run a $10$ km race (minutes): $55.34$, $58.03$, $57.25$, $61.49$, $66.11$, $59.87$, ...
Daily temperature (degrees C): $24.4$, $23.0$, $22.5$, $21.6$, $20.7$, $20.2$, $19.7$, ...

In practice, continuous data is always subject to the accuracy of the measuring device being used, so it is generally rounded. However, given a height of $165$ cm measured to the nearest centimetre, we know that the true height lies in the interval $\left[164.5,165.5\right)$. So, unlike discrete data, such measurements lie on a continuous interval with no gaps between neighbouring measurements.
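
The interval implied by a rounded measurement can be computed directly. This is a minimal sketch, assuming the reading was rounded to the nearest multiple of `precision`; the function name `measurement_interval` is hypothetical.

```python
def measurement_interval(reading, precision=1.0):
    """Return (lower, upper) for the half-open interval [lower, upper).

    Assumes `reading` was rounded to the nearest multiple of `precision`,
    so the true value is at most half a unit of precision away.
    """
    half = precision / 2
    return (reading - half, reading + half)


# A height of 165 cm, measured to the nearest centimetre.
lower, upper = measurement_interval(165, precision=1)
print(f"true height is in [{lower}, {upper})")
```

Neighbouring readings such as $164$ cm and $166$ cm give intervals that meet this one at its endpoints, which is why no gaps are left between measurements.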

Ordinal categorical data

The word 'ordinal' basically means 'ordered'. Ordinal categorical data involves data points, consisting of words or labels, that can be ordered or ranked in some way.

Examples of ordinal data:

Product rating on a survey: good, satisfactory, good, excellent, excellent, good, good, ...
Exam grades: A, C, A, B, B, C, A, B, A, A, C, B, A, B, B, B, C, A, C, ...
Size of fish in a lake: medium, small, small, medium, small, large, medium, large, ...

Ordinal data is often used in surveys, for example a service rating (poor, average, good, excellent). The results can then be analysed further by converting the ordered ratings to numerical data.
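
One way to convert ordered ratings to numbers is to assign each label a code that preserves the order. In the sketch below, the codes $1$ to $4$ are an assumed, illustrative choice and the survey responses are made up for the example.

```python
from statistics import median

# Assumed coding: the numbers 1-4 are an illustrative choice that
# preserves the order poor < average < good < excellent.
RATING_CODES = {"poor": 1, "average": 2, "good": 3, "excellent": 4}

# Hypothetical survey responses, invented for this example.
responses = ["good", "average", "good", "excellent", "poor", "good", "excellent"]

# Replace each label with its numerical code.
codes = [RATING_CODES[r] for r in responses]

# With numerical codes, an order-based summary such as the median is possible.
middle = median(codes)
print("codes:", codes)
print("median code:", middle)
```

The codes record only the order of the ratings, so summaries that rely on order, such as the median, are the safest to interpret.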

Nominal categorical data

The word 'nominal' basically means 'name'. Nominal categorical data consists of words or labels that name individual data points.

Examples of nominal data:

Nationalities in a sporting team: German, Austrian, Italian, Spanish, Dutch, Italian, ...
Make of car driving through an intersection: Toyota, Holden, Mazda, Toyota, Ford, Toyota, Mazda, ...
Hair colour of students in a class: blonde, red, brown, blonde, black, brown, black, red, ...

Nominal data is often described as 'un-ordered' because it can't be ordered in a way that is statistically meaningful.

Practice questions

Question 1

Which two of the following are examples of numerical data?

  A. favourite flavours
  B. maximum temperature
  C. daily temperature
  D. types of horses

Question 2

Which one of the following data types is discrete?

  A. The number of classrooms in your school
  B. Daily humidity
  C. The ages of a group of people
  D. The time taken to run $200$ metres

Question 3

Classify this data into its correct category:

Weights of dogs

  A. Categorical Nominal
  B. Categorical Ordinal
  C. Numerical Discrete
  D. Numerical Continuous

Outliers

An outlier is an event that is very different from the norm and results in a score that is far away from the rest of the scores. For example, suppose there are $10$ people in a long jump contest. Nine of those people managed a jump between $4$ and $5$ metres, while one person only jumped $1$ metre. That one person is an outlier, as their jump is so much shorter than everyone else's. We will look more at outliers in a later lesson.

Outcomes

MA12-8

solves problems using appropriate statistical processes
