Recall that data can be either numerical or categorical. While column (or bar) graphs are preferred for displaying categorical data, histograms are the preferred option for data that is numerical.
Continuous numerical data, such as times, heights, weights or temperatures, are based on measurements, so any data value is possible within a large range of values. For displaying this type of data, a histogram is used.
As an example, the following frequency distribution table and histogram represent the times taken for $72$ runners to complete a ten kilometre race.
Class interval | Frequency |
---|---|
$45\le\text{time }<50$ | $9$ |
$50\le\text{time }<55$ | $7$ |
$55\le\text{time }<60$ | $20$ |
$60\le\text{time }<65$ | $30$ |
$65\le\text{time }<70$ | $6$ |
The histogram represents the distribution of the data. It allows us to see clearly where all of the recorded times fall along a continuous scale.
What may surprise us at first is that the histogram above has only five columns, even though it represents $72$ different data values. To produce the histogram, the data is first grouped into class intervals (also known as classes or bins), using the frequency distribution table.
In the table above,
Every data value must go into exactly one class interval.
Class intervals should be of equal width.
There are several different ways that class intervals are defined. Here are some examples with two adjacent class intervals:
Class interval formats | | Description |
---|---|---|
$45<\text{time }\le50$ | $50<\text{time }\le55$ | Upper endpoint included, lower endpoint excluded. |
$45\le\text{time }<50$ | $50\le\text{time }<55$ | Lower endpoint included, upper endpoint excluded. |
$45$ to $<50$ | $50$ to $<55$ | Lower endpoint included, upper endpoint excluded. |
$45-49$ | $50-54$ | Suitable for data rounded to the nearest whole number, or discrete data. |
$45$ → $50$ | $50$ → $55$ | Not clear which endpoints are included or excluded. Assume the upper endpoint is included. |
Regardless of the format used, it should be applied consistently to every class interval for a given set of data.
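To see how the choice of format affects a value that falls exactly on a boundary, here is a minimal Python sketch. The function names `class_lower_included` and `class_upper_included`, and the parameters `start` and `width`, are illustrative assumptions rather than part of the lesson. The sketch also assumes equal-width classes that begin at `start`.

```python
import math

def class_lower_included(x, start, width):
    """Class containing x when the lower endpoint is included (lower <= x < upper)."""
    k = math.floor((x - start) / width)
    return (start + k * width, start + (k + 1) * width)

def class_upper_included(x, start, width):
    """Class containing x when the upper endpoint is included (lower < x <= upper)."""
    k = math.ceil((x - start) / width) - 1
    return (start + k * width, start + (k + 1) * width)

# A time of exactly 50 lands in different classes under the two conventions.
print(class_lower_included(50, 45, 5))  # (50, 55)
print(class_upper_included(50, 45, 5))  # (45, 50)

# A time strictly inside a class lands in the same class either way.
print(class_lower_included(52, 45, 5))  # (50, 55)
print(class_upper_included(52, 45, 5))  # (50, 55)
```

Only boundary values are affected by the convention, which is why the same convention must be used for every class in a table.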
Note: In this course, class intervals for any particular set of data will be the same width. There are situations in data representation when class intervals are different widths, but this is beyond the scope of this course.
The class centre is the average of the endpoints of each interval.
For example, if the class interval is $45\le\text{time }<50$, or $45-50$, the class centre is calculated as follows:
$\text{Class centre}=\frac{45+50}{2}=47.5$
Since the class centre is an average of the endpoints, it is often used as a single value to represent the class interval. In some histograms, it may be used for the scale on the horizontal axis, with the class centre displayed directly below the middle of each vertical column.
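The same calculation can be sketched in a few lines of Python. The helper name `class_centre` is an assumption made for illustration. The class intervals are the ones from the race-time table earlier in this lesson.

```python
def class_centre(lower, upper):
    """Average of the two endpoints of a class interval."""
    return (lower + upper) / 2

# Class intervals from the race-time table: 45-50, 50-55, ..., 65-70.
race_classes = [(45, 50), (50, 55), (55, 60), (60, 65), (65, 70)]
centres = [class_centre(lo, hi) for lo, hi in race_classes]
print(centres)  # [47.5, 52.5, 57.5, 62.5, 67.5]
```

Because the classes have equal width, consecutive class centres are also one class width apart.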
Find the class centre for the class interval $19\le t<23$ where $t$ represents time.
Although a histogram looks similar to a bar graph, there are a number of important differences between them:
Histogram | Bar graph |
To better understand histograms, we will look at an example of how a histogram is created from a set of raw data.
In 2016, the World Health Organisation (WHO) collected data on the average life expectancy at birth for $183$ countries around the world.
To appreciate what the raw data looks like, here is a reduced version of the data set.
$62.7$ | $76.4$ | $76.4$ | $62.6$ | $75.0$ | $76.9$ | $74.8$ |
$82.9$ | $81.9$ | $73.1$ | $75.7$ | $79.1$ | $72.7$ | $75.6$ |
$\vdots$ | $\vdots$ | $\vdots$ | ||||
$62.5$ | $72.5$ | $77.2$ | $81.4$ | $63.9$ | $78.5$ | $77.1$ |
$72.3$ | $72.0$ | $74.1$ | $76.3$ | $65.3$ | $62.3$ | $61.4$ |
Each value represents the average life expectancy at birth (in years) for a single country.
Before organising the data into a frequency distribution table, we need to decide on the number of class intervals. Although there is no fixed rule, using between $5$ and $10$ class intervals usually produces good results for most data sets.
We know that the lowest life expectancy in the data is $52.9$ years (Lesotho in southern Africa), while the highest is $84.2$ (Japan). These values indicate that the scale on the horizontal axis of our histogram should be from at least $50$ to $85$ years. It seems appropriate to have class intervals of width $5$ years, which means we will have $7$ class intervals in total. We'll use the variable $t$ to represent average life expectancy in the table below:
Class interval | Frequency |
---|---|
$50\le t<55$ | $5$ |
$55\le t<60$ | $10$ |
$60\le t<65$ | $25$ |
$65\le t<70$ | $26$ |
$70\le t<75$ | $40$ |
$75\le t<80$ | $49$ |
$80\le t<85$ | $28$ |
This compact table now represents all $183$ life expectancy values. With our frequency distribution table complete, we are ready to create a histogram.
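The grouping step can be sketched in Python. The sketch uses only the $28$ sample values shown in the reduced data set, so its counts are for that sample and will not match the full table of $183$ countries. The function name `frequency_table` and its parameters are assumptions made for illustration. Each class includes its lower endpoint and excludes its upper endpoint, as in the table.

```python
import math

# The 28 sample values from the reduced data set (not the full 183 values).
sample = [
    62.7, 76.4, 76.4, 62.6, 75.0, 76.9, 74.8,
    82.9, 81.9, 73.1, 75.7, 79.1, 72.7, 75.6,
    62.5, 72.5, 77.2, 81.4, 63.9, 78.5, 77.1,
    72.3, 72.0, 74.1, 76.3, 65.3, 62.3, 61.4,
]

def frequency_table(values, start, width, n_classes):
    """Count values in equal-width classes of the form lower <= t < upper."""
    counts = [0] * n_classes
    for x in values:
        k = math.floor((x - start) / width)  # index of the class containing x
        if not 0 <= k < n_classes:
            raise ValueError(f"{x} falls outside every class interval")
        counts[k] += 1
    return [(start + k * width, start + (k + 1) * width, c)
            for k, c in enumerate(counts)]

table = frequency_table(sample, start=50, width=5, n_classes=7)
for lower, upper, count in table:
    print(f"{lower} <= t < {upper}: {count}")
```

Every value is counted exactly once, so the frequencies always add up to the number of data values. That gives a quick check on any frequency table.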
Use the histogram to answer the following questions:
In product testing, the number of faults detected in producing a certain piece of machinery is recorded each day for several days. The frequency table shows the results.
Number of faults | Frequency |
---|---|
$0-3$ | $10$ |
$4-7$ | $14$ |
$8-11$ | $20$ |
$12-15$ | $16$ |
Construct a histogram to represent the data.
What is the lowest possible number of faults that could have been recorded on any particular day?
As part of a fuel watch initiative, the price of petrol, $p$, at a service station was recorded each day for $21$ days. The frequency table shows the findings.
Price (in cents per litre) | Class Centre | Frequency |
---|---|---|
$120.9<p\le125.9$ | $123.4$ | $4$ |
$125.9<p\le130.9$ | $128.4$ | $6$ |
$130.9<p\le135.9$ | $133.4$ | $5$ |
$135.9<p\le140.9$ | $138.4$ | $6$ |
What was the highest price that could have been recorded?
How many days was the price above $130.9$ cents?
Dot plots are a graphical way of displaying the distribution of numerical or categorical data on a simple scale, with dots representing the frequency of outcomes. They are best used for small to medium-sized data sets and are good for visually highlighting how the data is spread and whether there are any gaps or outliers in the data.
In a dot plot, each individual value is represented by a single dot, displayed above a horizontal line. When data values are identical, the dots are stacked vertically. The graph appears similar to a pictograph or column graph with the number of dots representing the total count.
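The stacking idea can be sketched as a text-based dot plot in Python. The values below are hypothetical, chosen only to illustrate the layout, and the function name `dot_plot` is an assumption. The sketch assumes single-digit whole-number values so that the columns line up.

```python
from collections import Counter

def dot_plot(values):
    """Return the rows of a text dot plot: stacked dots above a number line."""
    counts = Counter(values)
    lo, hi = min(values), max(values)
    height = max(counts.values())
    rows = []
    # Build the plot from the top row down; a dot appears wherever the
    # frequency of that value reaches the current level.
    for level in range(height, 0, -1):
        row = " ".join("o" if counts[v] >= level else " "
                       for v in range(lo, hi + 1))
        rows.append(row.rstrip())
    rows.append(" ".join(str(v) for v in range(lo, hi + 1)))
    return rows

# Hypothetical data, for illustration only.
goals = [0, 1, 1, 2, 2, 2, 3, 5]
for row in dot_plot(goals):
    print(row)
```

The empty column above $4$ shows a gap in the data, and the total number of dots equals the number of data values.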
Here is a dot plot of the number of goals scored in each of Bob’s soccer games.
How many times were five goals scored?
Which number of goals were scored equally and most often?
$1$
$0$
$4$
$3$
$2$
$5$
How many games were played in total?
A stem plot, or stem and leaf plot, is used for organising and displaying numerical data. It is appropriate for small to moderately sized data sets. The graph is similar to a column graph on its side. An advantage of a stem plot over a column graph is that the individual scores are retained, so further calculations can be made accurately.
In a stem plot, the right-most digit in each data value is split from the other digits, to become the leaf. The remaining digits become the stem.
The values in a stem plot are generally arranged in ascending order (from lowest to highest) from the centre out. To emphasise this, it is often called an ordered stem plot.
The data values $10,13,16,21,26,27,28,35,35,36,41,41,45,46,49,50,53,56,58$ are displayed in the stem plot below.
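As a minimal sketch, the following Python code builds an ordered stem plot from these data values by splitting off the right-most digit as the leaf. The function name `stem_plot` is an assumption made for illustration, and the sketch assumes non-negative whole numbers.

```python
from collections import defaultdict

def stem_plot(values):
    """Return the rows of an ordered stem plot for non-negative whole numbers."""
    stems = defaultdict(list)
    for v in sorted(values):           # sorting keeps the leaves in ascending order
        stem, leaf = divmod(v, 10)     # right-most digit is the leaf
        stems[stem].append(leaf)
    rows = []
    # Include every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows

data = [10, 13, 16, 21, 26, 27, 28, 35, 35, 36,
        41, 41, 45, 46, 49, 50, 53, 56, 58]
for row in stem_plot(data):
    print(row)
# 1 | 0 3 6
# 2 | 1 6 7 8
# 3 | 5 5 6
# 4 | 1 1 5 6 9
# 5 | 0 3 6 8
```

Repeated values such as $35$ and $41$ appear as repeated leaves, so no data values are lost.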
The stem-and-leaf plot below shows the ages of the people who entered through the gates of a concert in the first $5$ seconds.
Stem | Leaf |
---|---|
$1$ | $1$ $2$ $4$ $5$ $6$ $6$ $7$ $9$ $9$ |
$2$ | $2$ $3$ $5$ $5$ $7$ |
$3$ | $1$ $3$ $8$ $9$ |
$4$ | |
$5$ | $8$ |
How many people passed through the gates in the first $5$ seconds?
What was the age of the youngest person?
What was the age of the oldest person?
What proportion of the concert-goers were under $20$ years old?
Back-to-back stem and leaf plots allow for the display of two data sets at the same time. These types of plots are a great way to make comparisons between data sets.
Reading a back-to-back stem and leaf plot is very similar to a regular stem and leaf plot. The "stem" is used to group the scores and each "leaf" indicates the individual scores within each group. The "stem" is a column and the stem values are written downwards in that column. The "leaf" values are written across in the rows corresponding to the "stem" value. In a back-to-back stem and leaf plot, one set of data is displayed on the left and one set of data is written on the right. The "leaf" values are still written in ascending order from the stem outwards.
If you have to create your own stem-and-leaf plot, it's easier to write all your scores in ascending order before you start putting them into a stem and leaf plot.
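The layout described above can be sketched in Python. The two data sets are hypothetical and the function name `back_to_back` is an assumption made for illustration. The sketch assumes non-negative whole numbers. Leaves on the left are written in descending order from left to right so that they ascend from the stem outwards.

```python
def back_to_back(left, right):
    """Return the rows of a back-to-back stem plot for two data sets."""
    everything = left + right
    rows = []
    for stem in range(min(everything) // 10, max(everything) // 10 + 1):
        # Left leaves are reversed so they ascend moving away from the stem.
        left_leaves = sorted((v % 10 for v in left if v // 10 == stem), reverse=True)
        right_leaves = sorted(v % 10 for v in right if v // 10 == stem)
        rows.append((" ".join(map(str, left_leaves)), stem,
                     " ".join(map(str, right_leaves))))
    # Right-align the left-hand leaves so they sit against the stem column.
    width = max(len(row[0]) for row in rows)
    return [f"{l:>{width}} | {s} | {r}".rstrip() for l, s, r in rows]

# Hypothetical data, for illustration only.
group_a = [12, 16, 23]
group_b = [11, 25, 27]
for row in back_to_back(group_a, group_b):
    print(row)
# 6 2 | 1 | 1
#   3 | 2 | 5 7
```

In the first row, the left side reads as $12$ and $16$ and the right side reads as $11$, all sharing the stem $1$.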
The stem-and-leaf plot shows the number of pieces of paper used over several days by Maximilian’s and Charlie’s students.
Maximilian | Stem | Charlie |
---|---|---|
$7$ | $0$ | $7$ |
$3$ | $1$ | $1$ $2$ $3$ |
$8$ | $2$ | $8$ |
$4$ $3$ | $3$ | $2$ $3$ $4$ |
$7$ $6$ $5$ | $4$ | $9$ |
$3$ $2$ | $5$ | $2$ |
Key: | $6\mid1\mid2$ | $=16$ and $12$ |
Which of the following statements are true?
I. Maximilian's students did not use $7$ pieces of paper on any day.
II. Charlie's median is higher than Maximilian’s median.
III. The median is greater than the mean in both groups.
I and II
II and III
None of the statements are correct.
III only
II only
I only