
7.02 Classify and organise data

Lesson

In statistics, a 'variable' refers to a characteristic of data that is measurable or observable. A variable could be something like temperature, mass, height, make of car, type of animal or goals scored. We often collect data to observe and analyse changes in a variable.

Types of data

Data variables can be defined as either numerical or categorical.

  • Numerical data is where each data point is represented by a number. Examples include: number of items sold each month, daily temperatures, heights of people, and ages of a population. The data can be further defined as either discrete (associated with counting) or continuous (associated with measuring). Numerical data is also known as quantitative data.
     
  • Categorical data is where each data point is represented by a word or label. Examples include: brand names, types of animals, favourite colours, and names of countries. The data can be further defined as either ordinal (it can be ordered) or nominal (un-ordered). Categorical data is also known as qualitative data.

 

Discrete numerical data

Discrete numerical data involves data points that are distinct and separate from each other. There is a definite 'gap' separating one data point from the next. Discrete data usually, but not always, consists of whole numbers, and is often collected by some form of counting.

Examples of discrete data:

Number of goals scored per match: $1$, $3$, $0$, $1$, $2$, $0$, $2$, $4$, $2$, $0$, $1$, $1$, $2$, ...
Number of children per family: $2$, $3$, $1$, $0$, $1$, $4$, $2$, $2$, $0$, $1$, $1$, $5$, $3$, ...
Number of products sold each day: $437$, $410$, $386$, $411$, $401$, $397$, $422$, ...

In each of these cases, there are no in-between values. We cannot have $2.5$ goals or $1.2$ people, for example.

This doesn't mean that discrete data always consists of whole numbers. Shoe sizes, an example of discrete data, are often separated by half-sizes, for example $8$, $8.5$, $9$, $9.5$. Even so, there is a definite gap between the sizes. A shoe won't ever come in size $8.145$.

 

Continuous numerical data

Continuous numerical data involves data points that can occur anywhere along a continuum. Any value is possible within a range of values. Continuous data often involves the use of decimal numbers, and is often collected using some form of measurement.

Examples of continuous data:

Height of trees in a forest (metres): $12.359$, $14.022$, $14.951$, $18.276$, $11.032$, ...
Times taken to run a $10$ km race (minutes): $55.34$, $58.03$, $57.25$, $61.49$, $66.11$, $59.87$, ...
Daily temperature (°C): $24.4$, $23.0$, $22.5$, $21.6$, $20.7$, $20.2$, $19.7$, ...

In practice, continuous data is always limited by the accuracy of the measuring device being used, so it is generally rounded. However, given a height of $165$ cm measured to the nearest centimetre, we know that the true height lies in the interval $\left[164.5,165.5\right)$. So, unlike discrete data, such measurements come from a continuous interval with no gaps between neighbouring values.
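
The rounding interval described above can be sketched in a few lines of Python. This is only an illustration, assuming readings are rounded to the nearest multiple of a given precision; the function name `measurement_interval` is a hypothetical helper, not something defined in this lesson.

```python
def measurement_interval(reading, precision):
    """Return (lower, upper) bounds for a rounded reading.

    The true value lies in [lower, upper): the lower bound is included
    and the upper bound is excluded.
    """
    half = precision / 2
    return (reading - half, reading + half)


# A height recorded as 165 cm, measured to the nearest centimetre.
lower, upper = measurement_interval(165, 1)
print(f"{lower} <= height < {upper}")  # 164.5 <= height < 165.5
```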

 

Ordinal categorical data

The word 'ordinal' basically means 'ordered'. Ordinal categorical data involves data points, consisting of words or labels, that can be ordered or ranked in some way.

Examples of ordinal data:

Product rating on a survey: good, satisfactory, good, excellent, excellent, good, good, ...
Exam grades: A, C, A, B, B, C, A, B, A, A, C, B, A, B, B, B, C, A, C, ...
Size of fish in a lake: medium, small, small, medium, small, large, medium, large, ...

Ordinal data is often used in surveys, such as a service rating (poor, average, good, excellent). Results can then be further analysed by converting the ordered ratings to numerical data.
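
As a rough sketch of converting ordered ratings to numbers, the Python below maps each rating to a score. The scores $1$ to $4$ and the sample responses are assumptions chosen for illustration; the lesson only requires that the scores respect the order of the ratings.

```python
# Assumed scoring: any increasing set of numbers would preserve the order.
RATING_SCORES = {"poor": 1, "average": 2, "good": 3, "excellent": 4}


def to_scores(ratings):
    """Convert a list of ordinal ratings to their numerical scores."""
    return [RATING_SCORES[rating] for rating in ratings]


# Hypothetical survey responses.
responses = ["good", "average", "good", "excellent", "poor"]

scores = to_scores(responses)
ranked = sorted(responses, key=RATING_SCORES.get)

print(scores)  # [3, 2, 3, 4, 1]
print(ranked)  # ['poor', 'average', 'good', 'good', 'excellent']
```

Nominal data has no such mapping, because no ordering of its labels is statistically meaningful.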


Nominal categorical data

The word 'nominal' basically means 'name'. Nominal categorical data consists of words or labels that name individual data points.

Examples of nominal data:

Nationalities in a sporting team: German, Austrian, Italian, Spanish, Dutch, Italian, ...
Make of car driving through an intersection: Toyota, Holden, Mazda, Toyota, Ford, Toyota, Mazda, ...
Hair colour of students in a class: blonde, red, brown, blonde, black, brown, black, red, ...

Nominal data is often described as 'un-ordered' because it can't be ordered in a way that is statistically meaningful.

 

Types of data
  • Categorical - represented by words
    • Ordinal - has an implicit order (such as subject grades A, B, C, D)
    • Nominal - identified by name (such as breeds of dog)
  • Numerical - associated with a number value.
    • Discrete - can only take distinct values (such as the number of goals). Usually obtained by counting.
    • Continuous - can take on any value (such as temperature). Usually obtained by measuring.

Practice questions

Question 1

Which of the following are examples of numerical data? (Select all that apply)

  A. favourite flavours
  B. maximum temperature
  C. daily temperature
  D. types of horses

Question 2

Classify this data into its correct category:

Weights of dogs

  A. Categorical Nominal
  B. Categorical Ordinal
  C. Numerical Discrete
  D. Numerical Continuous

 

Organising data

Data is usually organised in tables, including frequency tables. Continuous numerical data is usually best organised in grouped frequency tables.

Frequency and grouped frequency tables

Frequency tables are the best choice for organising categorical data, and discrete numerical data that has a small number of possible values.

Grouped frequency tables are best for continuous numerical data, and for discrete numerical data that can take a large number of possible values. The frequency recorded for a group is the sum of the frequencies for all data values contained in the group.

The tables below show examples of a frequency table used for categorical data, and a grouped frequency table used for continuous numerical data.

Frequency table

Colour of cars in the school carpark

Colour    Frequency
white     $14$
red       $2$
blue      $3$
black     $1$
yellow    $1$
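
A frequency table like this one can be built by tallying raw observations. The sketch below uses Python's standard `collections.Counter`; the list of colours is a short hypothetical sample, not the data behind the table above.

```python
from collections import Counter

# Hypothetical observations, one entry per car.
colours = ["white", "red", "white", "blue", "white", "black", "blue", "white"]

# Counter tallies how many times each label occurs.
frequency = Counter(colours)

print(f"{'Colour':<10}Frequency")
for colour, count in frequency.most_common():
    print(f"{colour:<10}{count}")
```

The frequencies always add up to the number of observations, which is a useful check on any frequency table.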

Grouped frequency table

Height of year 9 students

Height (cm)    Frequency
$145-<150$     $3$
$150-<155$     $10$
$155-<160$     $8$
$160-<165$     $13$
$165-<170$     $1$
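
A grouped frequency table can be built the same way, by working out which class each value falls into. The following is a minimal sketch, assuming equal-width classes that include the lower endpoint and exclude the upper endpoint; the function `grouped_frequency` and the sample heights are hypothetical.

```python
def grouped_frequency(values, start, width, n_classes):
    """Count how many values fall in each class [lower, upper)."""
    counts = [0] * n_classes
    for value in values:
        # Floor division gives the position of the class containing the value.
        index = int((value - start) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"{value} is outside the class intervals")
        counts[index] += 1
    return [
        (start + i * width, start + (i + 1) * width, counts[i])
        for i in range(n_classes)
    ]


# Hypothetical heights in centimetres.
heights = [147.2, 150.0, 152.5, 154.9, 155.0, 158.3, 161.7, 164.99, 166.1]

table = grouped_frequency(heights, start=145, width=5, n_classes=5)
for lower, upper, count in table:
    print(f"{lower}-<{upper}: {count}")
```

Note that $150.0$ is counted in $150-<155$ and $155.0$ in $155-<160$, so every height lands in exactly one class.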

 

Grouped data

When we group data, we create class intervals, which tell us the range of scores in a particular group. 

Important!

Every data value must go into exactly one class interval.

 

There are several different ways that class intervals can be defined. Here are some examples with two adjacent class intervals:

Class interval formats            Description
$45<x\le 50$, $50<x\le 55$        Upper endpoint included, lower endpoint excluded.
$45\le x<50$, $50\le x<55$        Lower endpoint included, upper endpoint excluded.
$45$ to $<50$, $50$ to $<55$      Lower endpoint included, upper endpoint excluded.
$45-49$, $50-54$                  Suitable for discrete data, both endpoints included.
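
The choice of endpoint convention only matters for values that sit exactly on a boundary. As an illustrative sketch (the function `find_class` and its `closed` parameter are hypothetical names), the Python below shows where the value $50$ lands under each convention:

```python
def find_class(value, boundaries, closed="left"):
    """Return the (lower, upper) class containing value, or None.

    closed="left":  lower <= value < upper  (lower endpoint included)
    closed="right": lower < value <= upper  (upper endpoint included)
    """
    for lower, upper in zip(boundaries, boundaries[1:]):
        if closed == "left" and lower <= value < upper:
            return (lower, upper)
        if closed == "right" and lower < value <= upper:
            return (lower, upper)
    return None


boundaries = [45, 50, 55]  # two adjacent classes: 45 to 50 and 50 to 55

print(find_class(50, boundaries, closed="left"))   # (50, 55)
print(find_class(50, boundaries, closed="right"))  # (45, 50)
```

Under either convention the boundary value belongs to one class only, which is what the rule above requires.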

 

To make the data easier to work with, we usually find the class centre, which is taken as the representative value of the class interval when we analyse the data. The class centre is the middle score of each class interval. For the interval $1-5$, the class centre would be $\frac{1+5}{2}=3$.
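
The class centre calculation is simply the average of the two endpoints. A small sketch, with `class_centre` as a hypothetical helper name:

```python
def class_centre(lower, upper):
    """Return the midpoint of a class interval."""
    return (lower + upper) / 2


print(class_centre(1, 5))    # 3.0
print(class_centre(45, 50))  # 47.5
```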

Selecting the interval width is important. If the intervals are too narrow, there will be many gaps and the shape of the distribution will not be clearly visible. If the intervals are too wide, too much detail is lost and the shape of the distribution will not be apparent. As a guide, $6$ to $12$ intervals will typically be most useful for moderate-size data sets.

 

Practice questions

Question 3

Find the class centre for the class interval $17-22$.

Question 4

What would be the most appropriate way of representing data from:

  1. A survey of $1000$ people, asking them how many languages they speak?

    A. Leaving the data ungrouped and constructing a frequency table
    B. Grouping the responses and constructing a frequency table

  2. A survey of $1000$ people, asking them how many different countries they know the names of?

    A. Grouping the responses and constructing a frequency table
    B. Leaving the data ungrouped and constructing a frequency table

Question 5

The time spent by patients waiting at a doctor’s office was recorded over one week. The results are shown in the frequency table below.

Wait time (mins)    Frequency
$0\le t<10$         $37$
$10\le t<20$        $50$
$20\le t<30$        $25$
$30\le t<40$        $21$
$40\le t<50$        $12$
$50\le t<60$        $2$
$60\le t<70$        $3$

  1. How many patients visited the doctor this week?

  2. What percentage of patients had to wait half an hour or more?

    Round your answer to two decimal places.

  3. What percentage of patients were seen within $20$ minutes?

Outcomes

2.3.1.2

classify statistical variables as categorical or numerical

2.3.1.3

classify a categorical variable as ordinal or nominal and use tables and pie, bar and column charts to organise and display the data, e.g. ordinal: income level (high, medium, low); or nominal: place of birth (Australia, overseas)

2.3.1.4

classify a numerical variable as discrete or continuous, e.g. discrete: the number of rooms in a house; or continuous: the temperature in degrees Celsius

2.3.1.5

select and justify an appropriate graphical display to describe the distribution of a numerical dataset, including dot plot, stem-and-leaf plot, column chart or histogram
