Australia, VIC
VCE 11 General 2023

1.03 Numerical data

Lesson

Introduction

Recall that data can be either numerical or categorical. While column (or bar) graphs are preferred for displaying categorical data, histograms are the preferred option for data that is numerical.

Histograms and grouped data

Continuous numerical data, such as times, heights, weights or temperatures, are based on measurements, so any data value is possible within a large range of values. For displaying this type of data, a histogram is used.

Although a histogram looks similar to a bar chart, there are a number of important differences between them:

  • Histograms show the distribution of data values, whereas a bar chart is used to compare data values.

  • Histograms are used for numerical data, whereas bar charts are often used for categorical data.

  • A histogram has a numerical scale on both axes, while a bar chart only has a numerical scale on the vertical axis.

  • The columns in a bar chart could be re-ordered, without affecting the representation of the data. In a histogram, each column corresponds with a range of values on a continuous scale, so the columns cannot be re-ordered.

A histogram:

Figure: histogram showing the distribution of running times for a 10 kilometre race.

A bar chart:

Figure: bar chart showing the waste generated in Australia by industry.

Key features of a frequency histogram:

  • The horizontal axis is a continuous numerical scale (like a number line). It represents numerical data, such as time, height, mass or temperature, and may be divided into class intervals.

  • The vertical axis is the frequency of each data value or class interval.

  • There are no gaps between the columns because the horizontal axis is a continuous scale. It is possible for a class interval to have a frequency of zero, but this is not the same as having gaps between each column.

  • It is good practice, when creating a histogram, to leave a half-column-width gap between the vertical axis and the first column.

As an example, the following frequency distribution table represents the times taken for 72 runners to complete a ten kilometre race.

Class interval | Frequency
45\leq \text{time} \lt 50 | 9
50\leq \text{time} \lt 55 | 7
55\leq \text{time} \lt 60 | 20
60\leq \text{time} \lt 65 | 30
65\leq \text{time} \lt 70 | 6

Figure: histogram showing the distribution of running times for a 10 kilometre race.

The histogram represents the distribution of the data. It allows us to see clearly where all of the recorded times fall along a continuous scale.

What may surprise us at first is that the histogram has only five columns, even though it represents 72 different data values.

To produce the histogram, the data is first grouped into class intervals (which are also called bins), using the frequency distribution table.

In the table above,

  • The first class interval includes the running times for 9 different runners. Each of their times falls within a range that is greater than or equal to 45 minutes, but less than 50 minutes. This class interval is represented by the first column in the histogram.

  • The second class interval includes the running times for 7 different runners, each with a time falling within a range greater than or equal to 50 minutes, but less than 55 minutes. This class interval is represented by the second column in the histogram, and so on.

Every data value must go into exactly one class interval. Class intervals should be of equal width.
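The grouping step described above can be sketched in Python. This is a minimal illustration only: the function name `build_frequency_table` and the ten running times are made up for the example (they are not the 72 times from the lesson), and the class intervals are assumed to include the lower endpoint and exclude the upper endpoint.

```python
def build_frequency_table(values, start, width, n_classes):
    """Count how many values fall in each class interval lower <= value < upper."""
    counts = [0] * n_classes
    for value in values:
        # Equal-width intervals let us compute the interval index directly.
        index = int((value - start) // width)
        if 0 <= index < n_classes:
            counts[index] += 1
        else:
            raise ValueError(f"{value} is outside the class intervals")
    return [
        (start + i * width, start + (i + 1) * width, counts[i])
        for i in range(n_classes)
    ]


# Hypothetical running times in minutes (assumed data, for illustration only).
times = [45.0, 47.2, 49.9, 50.0, 52.5, 58.1, 59.9, 60.0, 63.4, 69.5]

table = build_frequency_table(times, start=45, width=5, n_classes=5)
for lower, upper, frequency in table:
    print(f"{lower} <= time < {upper}: {frequency}")
```

Because each value is assigned to a single index, it lands in exactly one class interval. A time of exactly 50 minutes is counted in the second interval, not the first.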

There are several different ways that class intervals are defined. Here are some examples with two adjacent class intervals:

Class interval formats | Description
45\lt \text{time} \leq 50, \quad 50\lt \text{time} \leq 55 | Upper endpoint included, lower endpoint excluded.
45\leq \text{time} \lt 50, \quad 50\leq \text{time} \lt 55 | Lower endpoint included, upper endpoint excluded.
45\text{ to} \lt 50, \quad 50\text{ to} \lt 55 | Lower endpoint included, upper endpoint excluded.
45 - 49, \quad 50 - 54 | Suitable for data rounded to the nearest whole number, or discrete data.
45 \to 50, \quad 50 \to 55 | Not clear which endpoints are included or excluded. Assume the upper endpoint is included.

Regardless of the format used, the same format should be applied consistently to every class interval in a given set of data.

Note: In this course, class intervals for any particular set of data will be the same width. There are situations in data representation when class intervals are different widths, but this is beyond the scope of this course.

The class centre is the average of the endpoints of each interval.

For example, if the class interval is 45\leq \text{time} \lt 50, or 45-50, the class centre is calculated as follows: \begin{aligned} \text{Class centre}&=\dfrac{45+50}{2} \\ &=47.5 \end{aligned}

Since the class centre is an average of the endpoints, it is often used as a single value to represent the class interval. In some histograms, it may be used for the scale on the horizontal axis, with the class centre displayed directly below the middle of each vertical column.
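As a quick check of this calculation, here is a minimal Python sketch. The helper name `class_centre` is chosen for this example; the intervals are the five classes from the running times table.

```python
def class_centre(lower, upper):
    """Return the average of the two endpoints of a class interval."""
    return (lower + upper) / 2


# Endpoints of the class intervals from the running times table.
intervals = [(45, 50), (50, 55), (55, 60), (60, 65), (65, 70)]
centres = [class_centre(lower, upper) for lower, upper in intervals]

print(centres)
```

Since the intervals are all 5 minutes wide, the class centres are also 5 minutes apart.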

Examples

Example 1

Find the class centre for the class interval 19\leq t<23 where t represents time.

Worked Solution
Create a strategy

Calculate the average of the upper and lower limits of the class interval.

Apply the idea
\begin{aligned} \text{Class centre}&=\dfrac{19+23}{2} && \text{Get the average of 19 and 23} \\ &=21 && \text{Evaluate} \end{aligned}

Example 2

In product testing, the number of faults detected in producing a certain type of machinery is recorded each day for several days. The frequency table shows the results.

Number of faults | Frequency
0 - 3 | 10
4 - 7 | 14
8 - 11 | 20
12 - 15 | 16
a

Construct a histogram to represent the data.

Worked Solution
Create a strategy

Find the class centres and then draw the histogram.

Apply the idea

We can add the class centres of each class to the table, by finding the average of the end points of each class.

Number of faults | Class centre | Frequency
0 - 3 | 1.5 | 10
4 - 7 | 5.5 | 14
8 - 11 | 9.5 | 20
12 - 15 | 13.5 | 16

Now we are ready to draw a histogram, with the class centres on the horizontal axis and the frequencies on the vertical axis.

Figure: histogram of the number of faults detected each day.
b

What is the lowest possible number of faults that could have been recorded on any particular day?

Worked Solution
Create a strategy

Look for the minimum value in the smallest class interval.

Apply the idea

Based on the table, the lowest possible number of faults that could have been recorded is 0, as this is the minimum value in the lowest class interval.

Example 3

As part of a fuel watch initiative, the price of petrol at a service station was recorded each day for 21 days. The frequency table shows the findings.

Price (cents per litre) | Class centre | Frequency
130.9 - 135.9 | 133.4 | 6
135.9 - 140.9 | 138.4 | 5
140.9 - 145.9 | 143.4 | 5
145.9 - 150.9 | 148.4 | 5
a

What was the highest price that could have been recorded?

Worked Solution
Create a strategy

Look for the maximum value in the highest class interval.

Apply the idea

Based on the table, the highest price that could have been recorded is 150.9 cents, as this is the maximum value in the highest class interval.

b

How many days was the price above 140.9 cents?

Worked Solution
Create a strategy

Find the sum of the frequencies from the appropriate classes.

Apply the idea

The prices were above 140.9 in the last two classes: 140.9-145.9 and 145.9-150.9. So we should add the frequencies of those two classes.

\begin{aligned} \text{Days}&=5+5 && \text{Add the frequencies} \\ &=10 && \text{Evaluate} \end{aligned}
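The same count can be reproduced with a short Python sketch. The variable names are chosen for this example, and the rows are copied from the table in Example 3. It follows the worked solution in treating every class that starts at 140.9 or higher as "above 140.9".

```python
# (lower endpoint, upper endpoint, frequency) rows from the Example 3 table.
price_table = [
    (130.9, 135.9, 6),
    (135.9, 140.9, 5),
    (140.9, 145.9, 5),
    (145.9, 150.9, 5),
]

threshold = 140.9

# Add the frequencies of the classes that start at or above the threshold.
days_above = sum(freq for lower, upper, freq in price_table if lower >= threshold)

# Adding every frequency should give back the 21 days of recording.
total_days = sum(freq for lower, upper, freq in price_table)

print(days_above, total_days)
```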
Idea summary

Key features of a frequency histogram:

  • The horizontal axis is a continuous numerical scale (like a number line). It represents numerical data, such as time, height, mass or temperature, and may be divided into class intervals.

  • The vertical axis is the frequency of each data value or class interval.

  • There are no gaps between the columns because the horizontal axis is a continuous scale. It is possible for a class interval to have a frequency of zero, but this is not the same as having gaps between each column.

  • It is good practice, when creating a histogram, to leave a half-column-width gap between the vertical axis and the first column.

For grouped data:

  • Every data value must go into exactly one class interval.

  • Each class interval must be the same width, e.g. 0-5, 5-10, 10-15, \ldots or 10-20, 20-30, 30-40, \ldots

  • The class centre is the average of the end points of the class interval.

Dot plots

Dot plots are a graphical way of displaying the distribution of numerical or categorical data on a simple scale with dots representing the frequency of data values. They are best used for small to medium size sets of data and are good for visually highlighting how the data is spread and whether there are any gaps in the data or outliers. We will look at identifying outliers in more detail in our next lesson.

In a dot plot, each individual value is represented by a single dot, displayed above a horizontal line. When data values are identical, the dots are stacked vertically. The graph appears similar to a pictograph or column graph with the number of dots representing the total count.

  • To correctly display the distribution of the data, the dots must be evenly spaced in columns above the line

  • The scale or categories on the horizontal line should be evenly spaced

  • A dot plot does not have a vertical axis

  • The dot plot should be appropriately labelled
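The features listed above can be illustrated with a rough text-based dot plot in Python. This is only a sketch: the function name `text_dot_plot` and the data are invented for the example, and it assumes small whole-number values (single digits) so that the columns line up.

```python
from collections import Counter


def text_dot_plot(values):
    """Return the lines of a simple dot plot, one '*' per data value."""
    counts = Counter(values)
    lowest, highest = min(values), max(values)
    tallest = max(counts.values())
    lines = []
    # Build the rows from the top of the tallest column down to the axis.
    for level in range(tallest, 0, -1):
        row = " ".join(
            "*" if counts[x] >= level else " " for x in range(lowest, highest + 1)
        )
        lines.append(row.rstrip())
    # Evenly spaced horizontal scale, including values with no dots.
    lines.append(" ".join(str(x) for x in range(lowest, highest + 1)))
    return lines


# Hypothetical data: goals scored in seven games (assumed for illustration).
goals = [0, 0, 1, 2, 2, 2, 4]
plot_lines = text_dot_plot(goals)
print("\n".join(plot_lines))
```

The value 3 still appears on the scale even though no game had 3 goals, which is how a gap in the data becomes visible.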

Examples

Example 4

Here is a dot plot of the number of goals scored in each of Bob’s soccer games.

Figure: dot plot of the number of goals scored in Bob's soccer games.
a

How many times was one goal scored?

Worked Solution
Create a strategy

Count the number of dots above the number 1.

Apply the idea

There are 7 dots above the number 1, so 1 goal was scored 7 times.

b

Which two numbers of goals were scored equally often, and more often than any other?

A. 4
B. 3
C. 5
D. 2
E. 0
F. 1
Worked Solution
Create a strategy

Look for the scores that have the same number of dots, and the most dots.

Apply the idea

On the dot plot, there are two scores that have the same number of dots and also the most dots. These are 0 and 3.

So, the correct answers are options B and E.

c

How many games were played in total?

Worked Solution
Create a strategy

Count all the dots in the graph.

Apply the idea
\begin{aligned} \text{Number of games}&=10+10+5+7+6+3 && \text{Add the number of dots in each column} \\ &=41 && \text{Evaluate} \end{aligned}
Idea summary
Figure: dot plot of days spent exercising.

In a dot plot, each individual value is represented by a single dot, displayed above a horizontal line. The number of dots represents the total number of data values.

They are best used for small to medium size sets of data and are good for visually highlighting how the data is spread and whether there are any gaps or outliers.

Stem and leaf plots

A stem plot, or stem and leaf plot, is used for organising and displaying numerical data. It is appropriate for small to moderately sized data sets. The graph is similar to a column graph on its side. An advantage of a stem and leaf plot over a column graph is that the individual scores are retained, so further calculations can be made accurately.

In a stem and leaf plot, the right-most digit in each data value is split from the other digits, to become the leaf. The remaining digits become the stem.

Stem | Leaf
1 | 0 3 6
2 | 1 6 7 8
3 | 5 5 6
4 | 1 1 5 6 9
5 | 0 3 6 8
Key: 2 | 1 = 21

The data values 10, 13, 16, 21, 26, 27, 28, 35, 35, 36, 41, 41, 45, 46, 49, 50, 53, 56, 58 are displayed in the stem and leaf plot shown.

  • The stems are arranged in ascending order, to form a column, with the lowest value at the top

  • The leaf values are arranged in ascending order from the stem out, in rows, next to their corresponding stem

  • A single vertical line separates the stem and leaf values

  • There are no commas or other symbols between the leaves, only a space between them

  • In order to correctly display the distribution of the data, the leaves must line up in imaginary columns, with each data value directly below the one above

  • A stem and leaf plot includes a key that describes the way in which the stem and the leaf combine to form the data value
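The construction rules above can be sketched in Python using the data values listed earlier in this section. The function name `stem_and_leaf` is chosen for this example, and the sketch assumes non-negative whole numbers where the last digit is the leaf.

```python
def stem_and_leaf(values):
    """Group whole numbers into {stem: [leaves]} with leaves in ascending order."""
    plot = {}
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # split off the right-most digit
        plot.setdefault(stem, []).append(leaf)
    # Keep stems that have no leaves so that gaps in the data stay visible.
    for stem in range(min(plot), max(plot) + 1):
        plot.setdefault(stem, [])
    return dict(sorted(plot.items()))


data = [10, 13, 16, 21, 26, 27, 28, 35, 35, 36,
        41, 41, 45, 46, 49, 50, 53, 56, 58]

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Key: 2 | 1 = 21")
```

Sorting the values first means each row of leaves is already in ascending order from the stem outwards.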

Examples

Example 5

The stem-and-leaf plot below shows the ages of the people who entered through the gates of a concert in the first 5 seconds.

Stem | Leaf
1 | 1 2 6 8 9 9
2 | 0 2 3 4 5 7 7 8
3 | 2 2 4 7
4 |
5 | 9
Key: 1 | 2 = 12
a

How many people passed through the gates in the first 5 seconds?

Worked Solution
Create a strategy

Count the number of leaves.

Apply the idea

To find the number of people, we can count the number of leaves in the stem and leaf plot, since each data value has exactly one leaf.

There are 6 people with an age between 10 and 19, 8 people in their 20s, 4 people in their 30s, and 1 person in their 50s. So we can add all these numbers to find the total number of people.

\begin{aligned} \text{Total number of people}&=6+8+4+1 \\ &=19 \end{aligned}
b

What was the age of the youngest person?

Worked Solution
Create a strategy

Find the smallest number recorded.

Apply the idea

The smallest number will be in the smallest stem which is 1, and have the smallest leaf which is also 1. This stem and leaf make the number 11.

The youngest person is 11 years old.

c

What was the age of the oldest person?

Worked Solution
Create a strategy

Find the largest number recorded.

Apply the idea

The largest number will be in the largest stem which is 5, and have the largest leaf which is 9. This stem and leaf make the number 59.

The oldest person is 59 years old.

d

What proportion of the concert-goers were under 24 years old?

Worked Solution
Create a strategy

Divide the number of people whose ages are less than 24 by the total number of people.

Apply the idea

There are 9 people who are less than 24 years old.

Since the total number of people is 19, the proportion of concert-goers under 24 years old is \dfrac{9}{19}.
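The answers to Example 5 can be checked with a short Python sketch. The dictionary below is a transcription of the stem-and-leaf plot from this example, and the variable names are chosen for illustration.

```python
from fractions import Fraction

# Stem-and-leaf plot from Example 5, written as {stem: [leaves]}.
plot = {
    1: [1, 2, 6, 8, 9, 9],
    2: [0, 2, 3, 4, 5, 7, 7, 8],
    3: [2, 2, 4, 7],
    4: [],
    5: [9],
}

# Rebuild each age from its stem (tens digit) and leaf (units digit).
ages = [10 * stem + leaf for stem, leaves in plot.items() for leaf in leaves]

under_24 = [age for age in ages if age < 24]
proportion = Fraction(len(under_24), len(ages))

print(len(ages), min(ages), max(ages), proportion)
```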

Idea summary
Stem | Leaf
1 | 0 3 6
2 | 1 6 7 8
3 | 5 5 6
4 | 1 1 5 6 9
5 | 0 3 6 8
Key: 2 | 1 = 21

A stem plot, or stem and leaf plot, is used for organising and displaying numerical data. It is appropriate for small to moderately sized data sets. An advantage of a stem and leaf plot is that the individual scores can be seen.

The right-most digit in each data value is split from the other digits, to become the leaf. The remaining digits become the stem.

Back-to-back stem plots

Back-to-back stem and leaf plots allow for the display of two data sets at the same time. These types of plots are a great way to make comparisons between data sets.

Reading a back-to-back stem and leaf plot is very similar to a regular stem and leaf plot. The "stem" is used to group the scores and each "leaf" indicates the individual scores within each group. The "stem" is a column and the stem values are written downwards in that column. The "leaf" values are written across in the rows corresponding to the "stem" value. In a back-to-back stem-and-leaf plot, however, two sets of data are displayed simultaneously. One set of data is displayed with its leaves on the left, and the other with its leaves on the right. The "leaf" values are still written in ascending order from the stem outwards.
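The layout described above can be sketched in Python. The function name `back_to_back` and the two small data sets are invented for this example, and the sketch assumes non-negative whole numbers with the last digit as the leaf.

```python
def back_to_back(left_values, right_values):
    """Return the rows of a back-to-back stem plot as text lines."""
    everything = left_values + right_values
    rows = []
    for stem in range(min(everything) // 10, max(everything) // 10 + 1):
        # Left leaves are written in descending order so that, reading from
        # the stem outwards, they are still ascending.
        left = sorted((v % 10 for v in left_values if v // 10 == stem), reverse=True)
        right = sorted(v % 10 for v in right_values if v // 10 == stem)
        rows.append((" ".join(map(str, left)), stem, " ".join(map(str, right))))
    width = max(len(left) for left, _, _ in rows)
    # Right-align the left leaves so they sit next to the stem column.
    return [f"{left:>{width}} | {stem} | {right}".rstrip() for left, stem, right in rows]


# Hypothetical data sets (assumed for illustration only).
group_a = [12, 15, 21, 34]
group_b = [10, 23, 27, 31]

lines = back_to_back(group_a, group_b)
print("\n".join(lines))
```

Both data sets share a single stem column, which is what makes the side-by-side comparison possible.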

Examples

Example 6

The back-to-back stem plots show the number of pieces of paper used over several days by Maximillian’s and Charlie’s students.

Maximillian's students | Stem | Charlie's students
7 | 0 | 7
3 | 1 | 1 2 3
8 | 2 | 8
4 3 | 3 | 2 3 4
7 6 5 | 4 | 9
3 2 | 5 | 2

Key: 6 | 1 | 2 = 16 and 12

Which of the following statements are true?

I. Maximillian's students did not use 7 pieces of paper on any day.

II. Charlie's median is higher than Maximillian’s median.

III. The median is greater than the mean in both groups.

A. I and II
B. II and III
C. None of the statements are correct
D. III only
E. II only
F. I only
Worked Solution
Create a strategy

Examine both stem and leaf plots and assess the validity of each statement.

Apply the idea

Statement I: The stem and leaf plot for Maximillian's students has a leaf of 7 on the stem 0, so they did use 7 pieces of paper on one of the days. This means that statement I is incorrect.

Statement II: Both groups have 10 data points. For Charlie's students, the median lies between 28 and 32 as these are the 5th and 6th data points, respectively. For Maximillian's students, the median lies between 34 and 45 as these are the 5th and 6th data points, respectively. Calculating the median for each group, we have:

\begin{aligned} \text{Maximillian's median}&=\dfrac{34+45}{2} \\ &=39.5 \\ \text{Charlie's median}&=\dfrac{28+32}{2} \\ &=30 \end{aligned}

This means that Maximillian's median is higher than the median of Charlie's, so statement II is incorrect.

Statement III: Calculating the mean for each group, we have

\begin{aligned} \text{Maximillian's mean}&=\dfrac{7+13+28+33+34+45+46+47+52+53}{10} \\ &=35.8 \\ \text{Charlie's mean}&=\dfrac{7+11+12+13+28+32+33+34+49+52}{10} \\ &=27.1 \end{aligned}

Comparing the calculated mean of each group with its median, the median is greater than the mean in both groups. This means that statement III is correct.

So, the correct answer is option D.
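The three statements can also be checked with Python's built-in `statistics` module. The two lists are the data values read from the back-to-back stem plot in this example; the variable names are chosen for illustration.

```python
from statistics import mean, median

# Data values read from the back-to-back stem plot in Example 6.
maximillian = [7, 13, 28, 33, 34, 45, 46, 47, 52, 53]
charlie = [7, 11, 12, 13, 28, 32, 33, 34, 49, 52]

# Statement I: Maximillian's students never used 7 pieces of paper.
statement_1 = 7 not in maximillian

# Statement II: Charlie's median is higher than Maximillian's median.
statement_2 = median(charlie) > median(maximillian)

# Statement III: the median is greater than the mean in both groups.
statement_3 = all(median(group) > mean(group) for group in (maximillian, charlie))

print(statement_1, statement_2, statement_3)
```

Only the third check comes out true, which agrees with option D.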

Idea summary

A back-to-back stem plot is very similar to a regular stem plot, in that the "stem" is used to group the scores and each "leaf" indicates the individual scores within each group.

If you have to create your own stem-and-leaf plot, it's easier to write all your scores in ascending order before you start putting them into a stem and leaf plot.

Outcomes

U1.AoS1.3

the five-number summary and possible outliers
