When we collect information for statistical purposes we refer to that information as data. Data can be classified as numerical data or categorical data.
Numerical data can be counted, ordered and measured. It can be either continuous or discrete. Numerical data is also called quantitative data.
A data set is continuous if the values can take on any value within a finite or infinite interval.
Examples of continuous data are height, weight, temperature and the time taken to run $100$ meters.
Data for all of these examples could be anywhere on a scale interval and could even be fractions. For example, the temperature might be $25.3$ degrees or a man might be $182.13$ cm tall.
Notice that each of these examples is measured with some sort of instrument: a ruler, a set of scales, a thermometer, a stopwatch. Continuous data is almost always measured.
A data set is discrete if the numerical values can be counted but are distinct and separate from each other. They are often (but not always) whole number values.
Examples of discrete data are the number of pets people have, the number of goals scored in a game, and amounts of money.
Data for these examples will always have distinct values. We couldn't own $\frac{1}{4}$ of a dog or score $2.5$ goals in a game of soccer, so there is no continuity between the values.
In some soccer tournaments, half a point is awarded for a draw. In this case, there could be a score of $2.5$, but there still could not be a score of $2.25$ or $2.75$, so the data is still discrete.
Other examples of discrete numerical data include marks on a test, views on a video, votes for a candidate in an election, and how much money you have. Notice that each of these examples of discrete data is counted, not measured. Discrete numerical data is always countable.
Categorical data is non-numeric and is represented by words. It describes the qualities or characteristics of a data set. Categorical data is also known as qualitative data.
Examples include blood groups (A, B, AB or O) or hotel star ratings.
Categories may have numeric labels, such as the numbers worn by players in a sporting team, but these labels have no numerical significance; they merely serve as labels.
Categorical data can be either ordinal or nominal.
A set of data is ordinal if the values can be counted and ordered but not measured.
Rating scales are examples of ordinal data. The finishing places in a race are another example of ordinal data. Finishing first means you were faster than the person who came second and the person who finished eighth was slower than the person who finished sixth. So the finishing places can be ordered but the differences between the finishing times may not be the same between all competitors.
For nominal data, the data is split up based on different names or characteristics. Nominal data may be the names of countries you have visited or your favorite colors. We could assign these different characteristics a number where the numbers are labels. In other words, you are giving categorical data numerical labels. You can count but not order or measure nominal data.
Which two of the following are examples of numerical data?
favorite flavors
maximum temperature
daily temperature
types of horses
Classify this data into its correct category:
Weights of dogs
Categorical Nominal
Categorical Ordinal
Numerical Discrete
Numerical Continuous
Once data has been collected, it needs to be organized so that it can be read and analyzed. This is usually accomplished by organizing the data in tables, including frequency tables. Continuous numerical data is usually best organized in grouped frequency tables.
Frequency tables are the best choice to organize categorical data and discrete numerical data when there is a small number of possible values.
Grouped frequency tables are best for continuous numerical data and discrete numerical data when the data can take a large number of possible values. The frequency recorded for a group is the sum of the frequencies for all data values contained in the group.
The tables below show examples of a frequency table used for categorical data, and a grouped frequency table used for continuous numerical data.
Frequency table
Color | Frequency |
---|---|
white | $14$ |
red | $2$ |
blue | $3$ |
black | $1$ |
yellow | $1$ |
Grouped frequency table
Height (cm) | Frequency |
---|---|
$145-150$ | $3$ |
$150-155$ | $10$ |
$155-160$ | $8$ |
$160-165$ | $13$ |
$165-170$ | $1$ |
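A frequency table like the color table above can be tallied directly from raw observations. The following is a minimal sketch using Python's `collections.Counter`. The list `observations` is hypothetical raw data, built only so that its tallies match the color table, and `frequency_table` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical raw observations, built so the tallies match the color table.
observations = (
    ["white"] * 14 + ["red"] * 2 + ["blue"] * 3 + ["black"] + ["yellow"]
)


def frequency_table(values):
    """Return a dict mapping each distinct value to how often it occurs."""
    return dict(Counter(values))


table = frequency_table(observations)
for color, frequency in table.items():
    print(f"{color:<8}| {frequency}")
```

The same helper works for discrete numerical data with a small number of possible values, since it only counts how often each distinct value appears.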
When we group data, we create class intervals, which tell us the range of scores in a particular group. Class intervals should all be the same size, and there should not be gaps between intervals.
For example, if our class interval is $1-5$, we know that this class contains any values from $1$ to $5$, inclusive. If the class interval is expressed as $1-<5$, it includes any score that is greater than or equal to $1$ and less than $5$.
To help make it easier to work with our data, we usually find the class center, which is taken as the representative value of the class interval when we analyze the data. The class center is the middle score of each class interval. For the interval $1-5$, the class center would be $\frac{1+5}{2}=3$.
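The class center calculation can be written as a one-line function. This is a small sketch; the name `class_center` and its parameters `lower` and `upper` are illustrative choices, not part of the lesson.

```python
def class_center(lower, upper):
    """Return the midpoint of the class interval from lower to upper."""
    return (lower + upper) / 2


print(class_center(1, 5))  # 3.0
```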
Selecting the interval width is important. If the intervals are too narrow, many classes will be empty or nearly empty and the shape of the distribution will be hard to see. If the intervals are too wide, too much detail is lost and the shape of the distribution will not be apparent either. As a guide, $6$ to $12$ intervals will typically be most useful for moderate-size data sets.
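Grouping values into equal-width class intervals with no gaps can be sketched as follows. The sketch assumes intervals of the form $1-<5$ (lower bound included, upper bound excluded). The function name `grouped_frequency`, its parameters `start`, `width` and `num_classes`, and the `heights` list are all hypothetical, invented for illustration.

```python
def grouped_frequency(values, start, width, num_classes):
    """Count values in half-open class intervals [lower, upper).

    Intervals have equal width and no gaps, beginning at `start`.
    Values outside the covered range are ignored.
    """
    counts = [0] * num_classes
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < num_classes:
            counts[index] += 1
    return [
        (start + i * width, start + (i + 1) * width, counts[i])
        for i in range(num_classes)
    ]


# Hypothetical heights in cm, invented for illustration.
heights = [146.2, 151.0, 153.5, 154.9, 155.0, 158.3, 161.7, 164.9, 166.0]
for lower, upper, frequency in grouped_frequency(heights, 145, 5, 5):
    print(f"{lower}-<{upper} | {frequency}")
```

Trying a few different values of `width` on the same data is a quick way to see how intervals that are too narrow or too wide hide the shape of the distribution.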
Find the class center for the class interval $17-22$.
What would be the most appropriate way of representing data from:
A survey conducted of $1000$ people, asking them how many languages they speak?
Leaving the data ungrouped and constructing a frequency table
Grouping the responses and constructing a frequency table
A survey conducted of $1000$ people, asking them how many different countries they know the names of?
Grouping the responses and constructing a frequency table
Leaving the data ungrouped and constructing a frequency table
As part of a fuel watch initiative, the price of gasoline at a service station was recorded each day for $21$ days. The frequency table shows the findings.
Price (in cents per liter) | Class Center | Frequency |
---|---|---|
$130.9-135.9$ | $133.4$ | $6$ |
$135.9-140.9$ | $138.4$ | $5$ |
$140.9-145.9$ | $143.4$ | $5$ |
$145.9-150.9$ | $148.4$ | $5$ |
What was the greatest price that could have been recorded?
How many days was the price above $140.9$ cents?
Once we have organized the data, we need to present the data in a form that will be easy to read, understand and analyze.
Some common ways of displaying statistical data are column graphs (bar graphs), histograms, dot plots and circle graphs (pie charts).
The best type of display to be used will depend on the type of data and purpose of the investigation.
Another type of statistical graph, the box and whisker plot, is used to display statistical summary data, and will be described in a later section.
Column graphs and histograms represent the frequency of data values as the length of horizontal bars or vertical columns.
Column graphs (also known as bar graphs) are usually used to display categorical data.
Histograms are similar to column graphs, with vertical columns used to display numerical data. The main difference between a column graph and histogram is that histograms do not have spaces between the columns.
The reason that histograms do not have gaps between columns is that the class intervals are not separate categories. Instead, the columns represent the frequency of values observed in the class intervals. The width of the columns indicates the range of values in the class intervals.
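As a rough illustration, a histogram can be sketched in plain text. The example below reuses the class intervals and frequencies from the grouped height table earlier in this lesson. It draws horizontal rows rather than vertical columns, so it is only an approximation of a real histogram, and `text_histogram` is a hypothetical helper name.

```python
# Class intervals and frequencies from the grouped height table.
height_classes = [
    (145, 150, 3),
    (150, 155, 10),
    (155, 160, 8),
    (160, 165, 13),
    (165, 170, 1),
]


def text_histogram(classes, symbol="#"):
    """Return one text row per class; the bar length equals the frequency."""
    return [
        f"{lower}-{upper} | {symbol * frequency}"
        for lower, upper, frequency in classes
    ]


for row in text_histogram(height_classes):
    print(row)
```

Because each row covers one class interval and consecutive rows share a boundary, the bars sit directly next to each other, mirroring the absence of gaps between histogram columns.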
Each student in a class was surveyed and asked about the color of their eyes. The data is categorical and the results are displayed in a column graph (left) and horizontal bar chart (right) below:
Each student in a class was surveyed and asked the size of their families. The data is numerical and the results are displayed in a histogram below:
The data collected in this survey is discrete because it can take only particular values (in this case, whole numbers). In histograms that display discrete data, the tick mark for each value is located at the center of its column on the horizontal axis. The height of each column represents the frequency of that value.
Each student in a class was surveyed and asked their heights. The data is numerical and the results are displayed in a histogram below:
The data collected in this survey is continuous because it can take any value within a range. In histograms that display continuous data, the column width represents the range of each interval, or bin. The height of each column represents the frequency of the data values within that interval.
Continuous data is represented in a histogram as shown:
Complete the following frequency table:
Score | Frequency |
---|---|
$21$ | |
$23$ | |
$25$ | |
$27$ | |
$29$ | |
$31$ | |
In product testing, the number of faults detected in producing a certain machine is recorded each day for several days. The frequency table shows the results.
Number of faults | Frequency |
---|---|
$0-3$ | $10$ |
$4-7$ | $14$ |
$8-11$ | $20$ |
$12-15$ | $16$ |
Construct a histogram to represent the data.
What is the least possible number of faults that could have been recorded on any particular day?
______ faults
Dot plots are a graphical way of displaying the distribution of numerical or categorical data on a simple scale with dots representing the frequency of data values. They are best used for small to medium size sets of data and are good for visually highlighting how the data is spread and whether there are any gaps in the data or outliers. We will look at identifying outliers in more detail in our next lesson.
In a dot plot, each individual value is represented by a single dot, displayed above a horizontal line. When data values are identical, the dots are stacked vertically. The graph appears similar to a pictograph or column graph with the number of dots representing the total count.
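The stacking of dots described above can be sketched as a small text renderer. This is an illustrative sketch only: the function name `dot_plot` and the `goals` list are hypothetical, and the layout assumes single-digit whole-number values so that the columns line up.

```python
from collections import Counter


def dot_plot(values):
    """Return the lines of a text dot plot for whole-number data."""
    counts = Counter(values)
    low, high = min(values), max(values)
    tallest = max(counts.values())
    lines = []
    # Build rows from the top of the tallest stack down to the axis.
    for level in range(tallest, 0, -1):
        row = ["*" if counts[v] >= level else " " for v in range(low, high + 1)]
        lines.append(" ".join(row).rstrip())
    # The last line is the horizontal axis with every value labeled.
    lines.append(" ".join(str(v) for v in range(low, high + 1)))
    return lines


# Hypothetical goals per game, invented for illustration.
goals = [0, 1, 1, 2, 2, 2, 4]
print("\n".join(dot_plot(goals)))
```

In the printed plot, the empty column above $3$ shows a gap in the data, and the tallest stack marks the most frequent value.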
Here is a dot plot of the number of goals scored in each of Bob’s soccer games.
How many times were five goals scored?
Which numbers of goals were tied for being scored most often?
$1$
$0$
$4$
$3$
$2$
$5$
How many games were played in total?
The goals scored by a football team in their matches are represented in the following dot plot.
Complete the following frequency distribution table.
Goals scored | Frequency |
---|---|
$0$ | |
$1$ | |
$2$ | |
$3$ | |
$4$ | |
$5$ | |
Circle graphs are, at first glance, completely different from bar graphs and histograms. The main similarity is that the mode of a circle graph is clearly visible, just as it is on a histogram.
What makes a circle graph so different is that it represents the data as parts of a whole. In a circle graph, all the data is combined to make a single whole with the different sectors representing different categories. The larger the sector, the larger the percentage of the data points that category represents.
Consider the circle graph below:
We can see from the circle graph (using the legend to check our categories) that the red sector takes up half the circle, while the blue sector takes up a quarter and the yellow and orange sectors both take up one eighth.
The fraction of the circle taken up by each sector indicates what fraction of the total fish are that color. So, in this case, half the fish are red since the red sector takes up half the circle. We can also write this as a percentage: $50\%$ of the fish are red.
If we consider how much of the circle each sector takes up, we can identify what percentage of the total fish are of each color.
Color of fish | Fraction of total | Percentage |
---|---|---|
Orange | $\frac{1}{8}$ | $12.5\%$ |
Red | $\frac{1}{2}$ | $50\%$ |
Blue | $\frac{1}{4}$ | $25\%$ |
Yellow | $\frac{1}{8}$ | $12.5\%$ |
Notice that the sum of our percentages is $100\%$. This is consistent with the fact that a circle graph represents $100\%$ of the data, one whole, split up into different category sectors.
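The conversion from sector fractions to percentages can be checked with exact arithmetic. The sketch below uses Python's `fractions.Fraction` with the sector fractions from the fish table; the names `sectors` and `to_percentages` are illustrative.

```python
from fractions import Fraction

# Sector fractions read from the fish circle graph.
sectors = {
    "Orange": Fraction(1, 8),
    "Red": Fraction(1, 2),
    "Blue": Fraction(1, 4),
    "Yellow": Fraction(1, 8),
}


def to_percentages(fractions_by_category):
    """Convert each sector's fraction of the whole into a percentage."""
    return {
        name: float(part * 100)
        for name, part in fractions_by_category.items()
    }


percentages = to_percentages(sectors)
print(percentages)
# The sectors together make up exactly one whole.
print(sum(sectors.values()) == 1)
```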
A notable drawback of the circle graph is that it doesn't necessarily tell us how many data points belong to each category. This means that, without any additional information, the circle graph can only show us which categories are more or less popular and roughly by how much.
It is for this reason that we will often add some additional information to our circle graphs so that we can show (or at least calculate) the number of data points in each category. There are two main ways to add information to a circle graph:
By revealing the total number of data points, we can use the percentages represented by the sector sizes to calculate how many data points each sector represents.
Consider the circle graph below:
If there are $48$ fish in total, how many of them are either blue or yellow?
Think: We found in the exploration above that $25\%$ of the fish are blue and $12.5\%$ are yellow. Together this represents $37.5\%$ of the $48$ fish.
Do: We can find the number of blue or yellow fish by multiplying the total number of fish by the percentage taken up by these two colors.
Blue or yellow fish | $=$ | $48\times37.5\%$ |
 | $=$ | $48\times\frac{3}{8}$ |
 | $=$ | $18$ |
As shown, $18$ fish are either blue or yellow.
Reflect: By relating the sizes of sectors to fractions or percentages, we can calculate the number of data points belonging to a category by multiplying that fraction (or percentage) by the total number of data points.
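The same calculation can be sketched in code. The example assumes the sector sizes are known exactly as fractions; `count_in_sectors` is a hypothetical helper name.

```python
from fractions import Fraction


def count_in_sectors(total, *sector_fractions):
    """Number of data points covered by the given sectors combined."""
    return total * sum(sector_fractions)


blue = Fraction(1, 4)    # 25% of the circle
yellow = Fraction(1, 8)  # 12.5% of the circle
print(count_in_sectors(48, blue, yellow))  # 18
```

Using exact fractions avoids rounding issues that can arise when percentages such as $12.5\%$ are stored as floating-point decimals.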
Revealing the total number of data points is useful for calculating the value represented by each sector, but only if we can read the exact size of each sector from the circle graph.
In the case where it is not so obvious what percentage of the circle graph each sector represents, we can instead add information by explicitly stating how many data points each sector represents. This can be written either on the sectors or the legend, as shown below.
Consider the circle graph below:
Show that the sector representing basketball takes up $43\%$ of the circle graph.
Think: To show that the basketball sector takes up $43\%$ of the circle graph, we need to show that the number of basketball data points is equal to $43\%$ of the total data points.
Do: We can see from the circle graph that the basketball sector represents $86$ data points. By adding up the data points from all the different sectors, we find that the total number of data points is:
Total number of data points | $=$ | $86+27+53+30+4$ |
 | $=$ | $200$ |
So the percentage of the total number of data points represented by basketball is:
Percentage | $=$ | $\frac{86}{200}\times100\%$ |
 | $=$ | $43\%$ |
Since basketball represents $43\%$ of the data points, its sector must take up $43\%$ of the circle graph.
Reflect: We can calculate the exact percentage of the circle graph that different sectors take up by finding their number of data points as a percentage of the total.
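This percentage calculation can be sketched as a short function. The counts are the ones used in the working above; only the basketball count is named in the text, so the sectors are stored as a plain list. The name `percentage_of_total` is illustrative.

```python
def percentage_of_total(count, all_counts):
    """Return `count` as a percentage of the sum of `all_counts`."""
    return 100 * count / sum(all_counts)


# Data points per sector; basketball is the first entry.
sector_counts = [86, 27, 53, 30, 4]
print(percentage_of_total(86, sector_counts))  # 43.0
```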
Aside from these two ways to add extra information to a circle graph, there is also the case where the percentage taken up by each sector is shown on the circle graph.
This will often look something like this:
This is very useful as it does a lot of the calculations for us. However, it is important that we always check that the percentages on the graph add up to $100\%$ since a circle graph always represents the whole of the data points, no more and no less.
In this particular case, the percentages do in fact add up to $100\%$ so this circle graph is valid.
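The validity check can be sketched in a few lines. The first list uses the fish percentages from earlier in this lesson; the second list is a made-up invalid example. The name `is_valid_circle_graph` and the small tolerance for rounding are assumptions of this sketch.

```python
import math


def is_valid_circle_graph(percentages):
    """True when the sector percentages add up to 100 (within rounding)."""
    return math.isclose(sum(percentages), 100.0, abs_tol=1e-9)


print(is_valid_circle_graph([50, 25, 12.5, 12.5]))  # True
print(is_valid_circle_graph([40, 35, 30]))          # False
```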
Every student in year $8$ was surveyed on their favorite subject, and the results are displayed in this pie chart:
Which was the most popular subject?
Phys. Ed.
Math
History
Languages
Science
English
What percentage of the class selected History, Phys. Ed., or Languages?
$50\%$
$30\%$
$3\%$
$25\%$
You later find out that $32$ students selected Science. How many students are there in year $8$?