10.03 The shape of data

Lesson

Introduction

When we describe the shape of a data set, we focus on how the scores are distributed. Some questions that we might be interested in include:

  • Is the distribution symmetrical or not?

  • Are there any clusters or gaps in the data?

  • Are there any outliers?

  • Where is the centre of the data located approximately? (Recall our three measures of centre: mean, median and mode)

  • Is the data widely spread or very compact? (Recall our three measures of spread: range, interquartile range and standard deviation)

Symmetry and skew

Data may be described as symmetrical or asymmetrical.

In many cases the data sits around a central value with no bias to the left or right. In such a case, roughly 50\% of scores will be above the mean and roughly 50\% will be below it. In other words, the mean and median roughly coincide.

The normal distribution is a common example of a symmetrical distribution of data.

This image shows a bell-shaped curve.

The normal distribution looks like this bell-shaped curve.

The image shows a bell-shaped curve drawn over a histogram. Ask your teacher for more information.

This picture shows how a data set that has an approximate normal distribution may appear in a histogram.

The dark line shows the nice, symmetrical curve that can be drawn over the histogram that the data roughly follows.

In this distribution, the peak of the data represents the mean, the median and the mode (taken as the centre of the modal class). All of these measures of central tendency are equal for this symmetrical distribution.

A uniform distribution is a symmetrical distribution where each outcome is equally likely, so the frequency should be the same for each outcome. For example, when rolling a fair die, each outcome is equally likely. We might get an irregular column graph if only a small number of rolls were performed, but if we continued to roll the die the distribution would approach a uniform distribution like that shown below.
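To see this behaviour for ourselves, we can simulate rolling a fair die. The following is a minimal sketch rather than part of the lesson: the function name `roll_frequencies`, the seed and the two roll counts are assumptions chosen purely for illustration.

```python
import random
from collections import Counter


def roll_frequencies(num_rolls, seed=0):
    """Return the relative frequency of each face after num_rolls fair rolls."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    counts = Counter(rng.randint(1, 6) for _ in range(num_rolls))
    return {face: counts[face] / num_rolls for face in range(1, 7)}


# Compare a small number of rolls with a large number of rolls.
for n in (30, 60000):
    freqs = roll_frequencies(n)
    gap = max(freqs.values()) - min(freqs.values())
    print(n, "rolls: tallest column minus shortest column =", round(gap, 3))
```

With many rolls, every relative frequency should sit close to 1/6, which is what a column graph with equal-height columns represents.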

The image shows a column graph with all columns of the same height. Ask your teacher for more information.

If a data set is asymmetrical instead (i.e. it isn't symmetrical), it may be described as skewed.

A data set that has positive skew (sometimes called a 'right skew') has a longer tail of values to the right of the data set. The mass of the distribution is concentrated on the left of the figure.

The image shows a curve with its right side stretched out.

A positively skewed graph has this general shape with the right side stretched out.

The image shows a curve shown over a histogram of positively skewed data. Ask your teacher for more information.

General shape shown over a histogram of positively skewed data.

A data set that has negative skew (sometimes called a 'left skew') has a longer tail of values to the left of the data set. The mass of the distribution is concentrated on the right of the figure.

The image shows a curve of negatively skewed data with left side stretched out.

A negatively skewed graph has this general shape with the left side stretched out.

The image shows a curve over a histogram of negatively skewed data. Ask your teacher for more information.

General shape shown over a histogram of negatively skewed data.
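Since the mean and median roughly coincide for symmetrical data, comparing them gives a rough numerical check on shape: a long right tail usually pulls the mean above the median, and a long left tail usually pulls it below. The sketch below applies this rule of thumb, which can fail for unusual data sets. The function name `describe_skew`, the `tolerance` cut-off and the three data sets are hypothetical and are not taken from the lesson.

```python
from statistics import mean, median


def describe_skew(scores, tolerance=0.05):
    """Rough rule of thumb: compare the mean with the median.

    tolerance is a hypothetical cut-off, expressed as a fraction of the range,
    below which the mean and median are treated as roughly equal.
    """
    gap = mean(scores) - median(scores)
    data_range = max(scores) - min(scores)
    if data_range == 0 or abs(gap) <= tolerance * data_range:
        return "approximately symmetrical"
    return "positively skewed" if gap > 0 else "negatively skewed"


# Hypothetical data sets, made up for illustration.
right_tailed = [1, 1, 2, 2, 2, 3, 3, 4, 9, 15]
left_tailed = [1, 7, 12, 13, 13, 14, 14, 14, 15, 15]
balanced = [1, 2, 2, 3, 3, 3, 4, 4, 5]

for data in (right_tailed, left_tailed, balanced):
    print(describe_skew(data))
```

A histogram remains the more reliable guide; this comparison is only a quick supporting check.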

Examples

Example 1

The stem-and-leaf plot below shows the ages of the people who entered through the gates of a concert in the first 5 seconds.

Stem | Leaf
1 | 0 1 2 3 4 5 6 6 6
2 | 0 0 1 4 9
3 | 1 4 7 9
4 |
5 | 4
Key: 1 | 2 = 12 years old
a

What was the median age?

Worked Solution
Create a strategy

Find the middle score.

Apply the idea

We know that there are 19 values recorded, so the 10th value will be the median.

\text{Median}=20 \text{ years old}

b

What was the difference between the lowest age and the median?

Worked Solution
Create a strategy

Subtract the lowest value recorded from the median.

Apply the idea

The lowest age recorded was 10 years old.

\begin{aligned}
\text{Difference} &= 20-10 && \text{Subtract the values} \\
&= 10 \text{ years} && \text{Evaluate}
\end{aligned}
c

What was the difference between the highest age and the median?

Worked Solution
Create a strategy

Subtract the median from the highest value recorded.

Apply the idea

The highest age recorded was 54 years old.

\begin{aligned}
\text{Difference} &= 54-20 && \text{Subtract the values} \\
&= 34 \text{ years} && \text{Evaluate}
\end{aligned}
d

What was the mean age? Round your answer to two decimal places if needed.

Worked Solution
Create a strategy

Use the formula: \text{Mean}=\dfrac{\text{Sum of scores}}{\text{Number of scores}}

Apply the idea

Adding the scores up on a calculator gives a total of 432.

\begin{aligned}
\text{Mean} &= \dfrac{432}{19} && \text{Use the formula} \\
&\approx 22.74 \text{ years} && \text{Evaluate the division}
\end{aligned}
e

Is the data positively or negatively skewed?

Worked Solution
Create a strategy

If most of the scores are relatively low, the distribution is positively skewed whereas if most of the scores are relatively high, the distribution is negatively skewed.

Apply the idea

Most of the scores appear in stems 1 and 2, so they are mostly low, with a longer tail of values reaching up to 54. So the distribution is positively skewed.
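The calculations in Example 1 can be checked with a short script. This is only a sketch: the dictionary `plot` is our own encoding of the stem-and-leaf plot, and the variable names are chosen for illustration.

```python
from statistics import mean, median

# Stem-and-leaf plot from Example 1, written as {stem: [leaves]}.
plot = {
    1: [0, 1, 2, 3, 4, 5, 6, 6, 6],
    2: [0, 0, 1, 4, 9],
    3: [1, 4, 7, 9],
    4: [],
    5: [4],
}

# Rebuild each age as 10 * stem + leaf (key: 1 | 2 = 12 years old).
ages = sorted(10 * stem + leaf for stem, leaves in plot.items() for leaf in leaves)

median_age = median(ages)        # part (a)
below = median_age - min(ages)   # part (b): median minus lowest age
above = max(ages) - median_age   # part (c): highest age minus median
mean_age = mean(ages)            # part (d)

print(len(ages), median_age, below, above, round(mean_age, 2))
# prints: 19 20 10 34 22.74
```

The highest age is much further from the median than the lowest age is (34 years compared with 10 years), which agrees with the positive skew found in part (e).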

Example 2

State whether the scores in each histogram are positively skewed, negatively skewed or symmetrical (approximately).

a
The image shows a histogram with high-frequency columns in the middle. Ask your teacher for more information.
Worked Solution
Create a strategy

Check where the bulk of the data sits and look at the general shape of the distribution.

Apply the idea

The high-frequency columns are in the middle and the frequencies fall away roughly evenly towards the low and high ends, so the distribution is approximately symmetrical.

b
The image shows a histogram with high-frequency columns on the left. Ask your teacher for more information.
Worked Solution
Apply the idea

The scores have higher-frequency columns on the left (the lower scores), so the distribution is positively skewed.

c
The image shows a histogram with high-frequency columns on the right. Ask your teacher for more information.
Worked Solution
Apply the idea

The scores have higher-frequency columns on the right (the higher scores), so the distribution is negatively skewed.

Idea summary

A distribution is said to be symmetric if its left and right sides are mirror images of one another.

A uniform distribution is a symmetrical distribution where each outcome is equally likely, so the frequency should be the same for each outcome.

A data set that has positive skew (sometimes called a 'right skew') has a longer tail of values to the right of the data set. The mass of the distribution is concentrated on the left of the figure.

A data set that has negative skew (sometimes called a 'left skew') has a longer tail of values to the left of the data set. The mass of the distribution is concentrated on the right of the figure.

Clusters and outliers

In a set of data, a cluster occurs when a large number of the scores are grouped together within a small range. Clustering may occur at a single location or several locations. For example, annual wages for a factory may cluster around \$ 40\,000 for unskilled factory workers, \$ 55\,000 for tradespersons and \$ 70\,000 for management. The data may also have clear gaps where values are either very uncommon or not possible in the data set.

The image shows two curves drawn over data with two peaks. Ask your teacher for more information.

If the data has two clear peaks then the shape is called bimodal.

As we have seen previously, an outlier is a data point that varies significantly from the body of the data. An outlier will be a value that is either significantly larger or significantly smaller than the other observations. Outliers are important to identify because they point to unusual pieces of data that may require further investigation, and because they affect some calculations such as the mean, range and standard deviation.

A dot plot showing a set of scores. Ask your teacher for more information.

For the dot plot given above, the score of 9 would be considered an outlier as it is well above the body of the data.
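To illustrate how an outlier affects calculations, here is a minimal sketch. The scores are hypothetical (the values in the dot plot are not reproduced here), and the 1.5 × IQR rule used to flag outliers is a common convention that we are assuming, not one stated in this lesson.

```python
from statistics import mean, median, quantiles


def find_outliers(scores):
    """Flag scores more than 1.5 x IQR beyond the quartiles.

    This cut-off is a common convention, assumed here for illustration.
    """
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in scores if x < lower or x > upper]


# Hypothetical scores with one value (9) well above the body of the data.
scores = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
outliers = find_outliers(scores)
trimmed = [x for x in scores if x not in outliers]

print("outliers:", outliers)
print("mean:  ", mean(scores), "->", mean(trimmed))
print("median:", median(scores), "->", median(trimmed))
print("range: ", max(scores) - min(scores), "->", max(trimmed) - min(trimmed))
```

For this made-up data, removing the outlier changes the mean and the range but leaves the median unchanged, which is why outliers matter more for some measures than for others.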

Examples

Example 3

The percentage of faulty computer chips in each of 42 batches was recorded in the histogram below.

The image shows a histogram with two highest columns of frequency 12. Ask your teacher for more information.
a

Which of the following makes this statement true? The distribution is:

A
Uni-modal
B
Bi-modal
C
Multi-modal, but not bi-modal
Worked Solution
Create a strategy

Count the number of peaks in the distribution.

Apply the idea

Looking at the histogram, there are two peaks. So the distribution is bi-modal. The correct answer is option B.

b

Which of the following are the modal classes? Select all that apply.

A
0-1
B
1-2
C
2-3
D
3-4
E
4-5
F
5-6
Worked Solution
Create a strategy

The modal classes are those with the highest frequency.

Apply the idea

The highest frequency is 12, and the classes with this frequency are 1-2 and 3-4. So the correct answers are options B and D.
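Finding the modal classes amounts to picking out every class that shares the highest frequency. In the sketch below, only the two frequencies of 12 and the total of 42 batches come from Example 3. The other four frequencies are hypothetical, as the histogram is not reproduced here, and the function names are our own.

```python
def modal_classes(frequencies):
    """Return every class whose frequency equals the highest frequency."""
    highest = max(frequencies.values())
    return [label for label, count in frequencies.items() if count == highest]


def modality(number_of_peaks):
    """Name the modality for a given number of peaks."""
    if number_of_peaks == 1:
        return "uni-modal"
    if number_of_peaks == 2:
        return "bi-modal"
    return "multi-modal"


# Frequencies for 1-2 and 3-4 (both 12) match Example 3; the rest are
# hypothetical values chosen so that the total is 42 batches.
faulty_chips = {"0-1": 4, "1-2": 12, "2-3": 7, "3-4": 12, "4-5": 5, "5-6": 2}

modes = modal_classes(faulty_chips)
print(modes, modality(len(modes)))
```

Here the modal classes are separated by a lower column, so each one is a separate peak. If two equally tall columns were side by side, they would form a single peak and we would need to count peaks directly from the histogram instead.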

Idea summary

To determine the modality of a data distribution:

  • If there is a single peak (modal class), the data is uni-modal.

  • If there are two peaks, the data is bi-modal.

  • If there are more than two peaks, the data is multi-modal.

An outlier is a value that is either noticeably greater or smaller than other observations.

Histograms and box plots

Histograms and box plots both make it easy to identify key features of the shape of the data, as well as the range and, in the case of the box plot, the interquartile range and median.

We should expect then that the shape of the data would be the same whether it is represented in a curve, box plot or histogram. Remember that the shape of data can be symmetric, negatively skewed or positively skewed.

For a symmetric distribution:

  • The median is in the centre of the range and the tails (whiskers) of the data are of equal length.

  • The graph should be approximately a mirror image of itself about the centre of the data.

The image shows a symmetrical curve.

Common shape of a symmetrical distribution

The image shows a histogram with approximately symmetrical data. Ask your teacher for more information.

Histogram of approximately symmetrical data

The image shows a box plot for symmetrical data. The box plot is symmetrical about the median.

Box plot of symmetrical data

For a positively skewed (skewed right) data distribution:

  • The data is stretched out to the right, producing a longer tail (whisker) to the right of the graph.

  • The bulk of the data is to the left. Higher frequency columns and the box should appear to the left.

The image shows a curve that has a longer tail to the right.

General shape of positively skewed data

The image shows a histogram with high-frequency columns on the left. Ask your teacher for more information.

Histogram of positively skewed data

The image shows a box plot with a long right whisker and short left whisker.

Box plot of positively skewed data

For a negatively skewed (skewed left) data distribution:

  • The data is stretched out to the left, producing a longer tail (whisker) to the left of the graph.

  • The bulk of the data is to the right. Higher frequency columns and the box should appear to the right.

The image shows a curve that has a longer tail to the left.

General shape of negatively skewed data

The image shows a histogram with high frequency columns on the right. Ask your teacher for more information.

Histogram of negatively skewed data

The image shows a box plot with a long left whisker and short right whisker.

Box plot of negatively skewed data

Looking at the diagrams above, can you see the similarities in the representations?

In each pair we can see the same skewed tail, the same location for the bulk of the data and the same general shape. These are some of the features you can use to match histograms and box plots. We can also compare the range of the data.
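The whisker comparison can also be carried out numerically from a five-number summary. The following is a sketch under stated assumptions: the whiskers are taken to reach the minimum and maximum values, the `ratio` cut-off of 1.5 is a hypothetical threshold for calling one whisker "longer", and the three data sets are made up for illustration.

```python
from statistics import quantiles


def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
    return min(scores), q1, q2, q3, max(scores)


def box_plot_shape(scores, ratio=1.5):
    """Classify shape by comparing whisker lengths.

    Whiskers are assumed to run to the minimum and maximum; ratio is a
    hypothetical cut-off for deciding that one whisker is clearly longer.
    """
    low, q1, _, q3, high = five_number_summary(scores)
    left_whisker = q1 - low
    right_whisker = high - q3
    if right_whisker > ratio * left_whisker:
        return "positively skewed"
    if left_whisker > ratio * right_whisker:
        return "negatively skewed"
    return "approximately symmetric"


# Hypothetical data sets.
long_right_tail = [1, 1, 2, 2, 2, 3, 3, 4, 6, 10]
long_left_tail = [1, 5, 7, 8, 8, 9, 9, 9, 10, 10]
even_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]

for data in (long_right_tail, long_left_tail, even_spread):
    print(five_number_summary(data), box_plot_shape(data))
```

Different textbooks and calculators compute quartiles in slightly different ways, so the exact quartile values may differ from a hand calculation even though the overall shape is the same.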

Examples

Example 4

Match each histogram to its box plot.

The image shows 3 histograms and 3 box plots. Ask your teacher for more information.
Worked Solution
Create a strategy

Look for characteristics of skewed and symmetric distributions of data.

Apply the idea
  • Boxplot A and histogram 3 have long right tails, so they are both right skewed.

  • Boxplot C and histogram 2 have long left tails, so they are both left skewed.

  • Boxplot B and histogram 1 are both approximately symmetric.

Idea summary

Symmetric

  • The median is in the centre of the range and the tails (whiskers) of the data are of equal length

  • The graph should be approximately a mirror image of itself about the centre of the data

Positively skewed (skewed right)

  • The data is stretched out to the right, producing a longer tail (whisker) to the right of the graph

  • The bulk of the data is to the left: higher-frequency columns and the box should appear to the left

Negatively skewed (skewed left)

  • The data is stretched out to the left, producing a longer tail (whisker) to the left of the graph

  • The bulk of the data is to the right: higher-frequency columns and the box should appear to the right

Outcomes

VCMSP325

Construct back-to-back stem-and-leaf plots and histograms and describe data, using terms including ‘skewed’, ‘symmetric’ and ‘bi modal’.

VCMSP326

Compare data displays using mean, median and range to describe and interpret numerical data sets in terms of location (centre) and spread.
