In statistics, a 'variable' refers to a characteristic of data that is measurable or observable. A variable could be something like temperature, mass, height, make of car, type of animal or goals scored. We often collect data to observe and analyse changes in a variable.
Data variables can be defined as either numerical or categorical.
Discrete numerical data involves data points that are distinct and separate from each other. There is a definite 'gap' separating one data point from the next. Discrete data usually, but not always, consists of whole numbers, and is often collected by some form of counting.
Examples of discrete data:
Number of goals scored per match | $1$, $3$, $0$, $1$, $2$, $0$, $2$, $4$, $2$, $0$, $1$, $1$, $2$, ... |
---|---|
Number of children per family | $2$, $3$, $1$, $0$, $1$, $4$, $2$, $2$, $0$, $1$, $1$, $5$, $3$, ... |
Number of products sold each day | $437$, $410$, $386$, $411$, $401$, $397$, $422$, ... |
In each of these cases, there are no in-between values. We cannot have $2.5$ goals or $1.2$ people, for example.
This doesn't mean that discrete data always consists of whole numbers. Shoe sizes, an example of discrete data, are often separated by half-sizes: for example, $8$, $8.5$, $9$, $9.5$. Even so, there is a definite gap between the sizes. A shoe won't ever come in size $8.145$.
Continuous numerical data involves data points that can occur anywhere along a continuum. Any value is possible within a range of values. Continuous data often involves the use of decimal numbers, and is often collected using some form of measurement.
Examples of continuous data:
Height of trees in a forest (in metres) | $12.359$, $14.022$, $14.951$, $18.276$, $11.032$, ... |
---|---|
Times taken to run a $10$ km race (minutes) | $55.34$, $58.03$, $57.25$, $61.49$, $66.11$, $59.87$, ... |
Daily temperature (°C) | $24.4$, $23.0$, $22.5$, $21.6$, $20.7$, $20.2$, $19.7$, ... |
In practice, continuous data is always limited by the accuracy of the measuring device being used, so it is generally rounded. However, given a height of $165\ \text{cm}$ measured to the nearest centimetre, we know that the true height lies in the interval $\left[164.5,165.5\right)$. So, unlike discrete values, such measurements lie on a continuous interval with no gaps between neighbouring measurements.
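As a rough illustration of the rounding interval described above, the Python sketch below computes the range of true values consistent with a rounded measurement. The function name `rounding_interval` and its parameters are hypothetical, chosen only for this example.

```python
def rounding_interval(recorded, precision):
    """Return (lower, upper) such that lower <= true value < upper.

    recorded  -- the rounded measurement, e.g. 165 (cm)
    precision -- the rounding unit, e.g. 1 for "nearest centimetre"
    """
    half = precision / 2
    return (recorded - half, recorded + half)


# A height recorded as 165 cm to the nearest centimetre
lower, upper = rounding_interval(165, 1)
print(f"{lower} <= height < {upper}")  # prints "164.5 <= height < 165.5"
```

Neighbouring recorded values such as $164$ and $165$ give intervals that meet at $164.5$, which is why there are no gaps between them.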
The word 'ordinal' basically means 'ordered'. Ordinal categorical data involves data points, consisting of words or labels, that can be ordered or ranked in some way.
Examples of ordinal data:
Product rating on a survey | good, satisfactory, good, excellent, excellent, good, good, ... |
---|---|
Exam grades | A, C, A, B, B, C, A, B, A, A, C, B, A, B, B, B, C, A, C, ... |
Size of fish in a lake | medium, small, small, medium, small, large, medium, large, ... |
Ordinal data is often used in surveys, such as a service rating (poor, average, good, excellent). The results can then be analysed further by converting the ordered ratings to numerical data.
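Converting ordered ratings to numbers might look something like the following Python sketch. The coding (poor = 1 up to excellent = 4) and the list of responses are assumptions made up for illustration; a real survey would define its own coding.

```python
from statistics import median

# Assumed coding of the ordered rating scale (lowest to highest)
RATING_CODES = {"poor": 1, "average": 2, "good": 3, "excellent": 4}

# Hypothetical survey responses, invented for this example
responses = ["good", "average", "good", "excellent", "excellent", "good", "poor"]

# Replace each label with its numerical code
codes = [RATING_CODES[r] for r in responses]
print(codes)          # [3, 2, 3, 4, 4, 3, 1]

# The median depends only on the order of the codes
print(median(codes))  # 3
```

The median is used here because it relies only on the ranking of the ratings, not on the particular numbers chosen to represent them.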
The word 'nominal' basically means 'name'. Nominal categorical data consists of words or labels that name individual data points.
Examples of nominal data:
Nationalities in a sporting team | German, Austrian, Italian, Spanish, Dutch, Italian, ... |
---|---|
Make of car driving through an intersection | Toyota, Holden, Mazda, Toyota, Ford, Toyota, Mazda, ... |
Hair colour of students in a class | blonde, red, brown, blonde, black, brown, black, red, ... |
Nominal data is often described as 'un-ordered' because it can't be ordered in a way that is statistically meaningful.
Which two of the following are examples of numerical data?
favourite flavours
maximum temperature
daily temperature
types of horses
Classify this data into its correct category:
Weights of dogs
Categorical Nominal
Categorical Ordinal
Numerical Discrete
Numerical Continuous
Organising data is usually accomplished by placing the data in tables, including frequency tables. Continuous numerical data is usually best organised in grouped frequency tables.
Frequency tables are the best choice to organise categorical data and discrete numerical data when there is a small number of possible values.
Grouped frequency tables are best for continuous numerical data and discrete numerical data when the data can take a large number of possible values. The frequency recorded for a group is the sum of the frequencies for all data values contained in the group.
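The rule that a group's frequency is the sum of the frequencies of the values it contains can be sketched in Python as follows. The data reuses the daily sales figures listed earlier; the group width of $20$ is an arbitrary assumption for this example.

```python
from collections import Counter

# Daily sales figures from the discrete data examples
sales = [437, 410, 386, 411, 401, 397, 422]

# Ungrouped frequency table: one entry per distinct value
frequency = Counter(sales)

# Assumed group width, chosen only for illustration
width = 20

# Grouped frequency table: add each value's frequency to its group
grouped = Counter()
for value, count in frequency.items():
    lower = (value // width) * width
    grouped[(lower, lower + width - 1)] += count

for (lower, upper), count in sorted(grouped.items()):
    print(f"{lower}-{upper}: {count}")
# 380-399: 2
# 400-419: 3
# 420-439: 2
```

Each value here occurs only once, so the ungrouped table would have seven rows with a frequency of $1$, while the grouped table summarises the same data in three rows.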
The tables below show examples of a frequency table used for categorical data, and a grouped frequency table used for continuous numerical data.
Frequency table
Colour | Frequency |
---|---|
white | $14$ |
red | $2$ |
blue | $3$ |
black | $1$ |
yellow | $1$ |
Grouped frequency table
Height (cm) | Frequency |
---|---|
$145-<150$ | $3$ |
$150-<155$ | $10$ |
$155-<160$ | $8$ |
$160-<165$ | $13$ |
$165-<170$ | $1$ |
When we group data, we create class intervals, which tell us the range of scores in a particular group.
Every data value must go into exactly one class interval.
There are several different ways that class intervals can be defined. Here are some examples with two adjacent class intervals:
First interval | Second interval | Description |
---|---|---|
$45<x\le50$ | $50<x\le55$ | Upper endpoint included, lower endpoint excluded. |
$45\le x<50$ | $50\le x<55$ | Lower endpoint included, upper endpoint excluded. |
$45$ to $<50$ | $50$ to $<55$ | Lower endpoint included, upper endpoint excluded. |
$45-49$ | $50-54$ | Suitable for discrete data, both endpoints included. |
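The endpoint convention matters most for values that fall exactly on a boundary. The Python sketch below, which uses a hypothetical helper `class_interval`, shows how a value of $50$ is placed under the two conventions for continuous data, assuming intervals of width $5$ starting at $45$.

```python
import math


def class_interval(x, start, width, closed="left"):
    """Return the (lower, upper) endpoints of the class interval containing x.

    closed="left"  -- lower endpoint included, upper endpoint excluded
    closed="right" -- upper endpoint included, lower endpoint excluded
    """
    if closed == "left":
        lower = start + math.floor((x - start) / width) * width
    elif closed == "right":
        upper = start + math.ceil((x - start) / width) * width
        lower = upper - width
    else:
        raise ValueError("closed must be 'left' or 'right'")
    return (lower, lower + width)


# A boundary value is placed differently under the two conventions
print(class_interval(50, 45, 5, closed="left"))   # (50, 55)
print(class_interval(50, 45, 5, closed="right"))  # (45, 50)

# A value strictly inside an interval is placed the same way by both
print(class_interval(47.2, 45, 5, closed="left"))   # (45, 50)
print(class_interval(47.2, 45, 5, closed="right"))  # (45, 50)
```

Under either convention the value lands in exactly one interval, which is the requirement stated earlier.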
To make the data easier to work with, we usually find the class centre, which is taken as the representative value of the class interval when we analyse the data. The class centre is the middle score of each class interval. For the interval $1-5$, the class centre would be $\frac{1+5}{2}=3$.
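The class centre calculation can be written as a one-line function. The Python sketch below applies it to the interval $1-5$ and to the height classes from the grouped frequency table shown earlier; the function name `class_centre` is an illustrative choice.

```python
def class_centre(lower, upper):
    """Return the midpoint of a class interval."""
    return (lower + upper) / 2


print(class_centre(1, 5))  # 3.0

# Endpoints of the height classes (cm) from the grouped frequency table
height_classes = [(145, 150), (150, 155), (155, 160), (160, 165), (165, 170)]
centres = [class_centre(lower, upper) for lower, upper in height_classes]
print(centres)  # [147.5, 152.5, 157.5, 162.5, 167.5]
```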
Selecting the interval width is important. If the intervals are too narrow, there will be many gaps, so the shape of the distribution will not be clearly visible. If the intervals are too wide, the shape of the distribution will not be apparent. As a guide, $6$ to $12$ intervals will typically be most useful for moderate-sized data sets.
Find the class centre for the class interval $17$-$22$.
What would be the most appropriate way of representing data from:
A survey conducted of $1000$ people, asking them how many languages they speak?
Leaving the data ungrouped and constructing a frequency table
Grouping the responses and constructing a frequency table
A survey conducted of $1000$ people, asking them how many different countries they know the names of?
Grouping the responses and constructing a frequency table
Leaving the data ungrouped and constructing a frequency table
The time spent by patients waiting at a doctor’s office was recorded over one week. The results are shown in the frequency table below.
Wait time (mins) | Frequency |
---|---|
$0\le t<10$ | $37$ |
$10\le t<20$ | $50$ |
$20\le t<30$ | $25$ |
$30\le t<40$ | $21$ |
$40\le t<50$ | $12$ |
$50\le t<60$ | $2$ |
$60\le t<70$ | $3$ |
How many patients visited the doctor this week?
What percentage of patients had to wait half an hour or more?
Round your answer to two decimal places.
What percentage of patients were seen within $20$ minutes?
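One possible way to check answers to questions like these is to store the grouped frequency table and sum the relevant rows. The Python sketch below uses the wait-time table above; the variable names are illustrative, and "within $20$ minutes" is read as $t<20$.

```python
# Wait-time table: ((lower, upper), frequency), with lower <= t < upper
wait_times = [
    ((0, 10), 37),
    ((10, 20), 50),
    ((20, 30), 25),
    ((30, 40), 21),
    ((40, 50), 12),
    ((50, 60), 2),
    ((60, 70), 3),
]

# Total number of patients is the sum of all frequencies
total = sum(freq for _, freq in wait_times)

# Patients who waited half an hour or more: classes starting at 30 or later
long_wait = sum(freq for (lower, _), freq in wait_times if lower >= 30)

# Patients seen within 20 minutes: classes ending at 20 or earlier
short_wait = sum(freq for (_, upper), freq in wait_times if upper <= 20)

pct_long = round(100 * long_wait / total, 2)
pct_short = round(100 * short_wait / total, 2)

print(total, pct_long, pct_short)
```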