6.04 Interpreting data distributions

Introduction

We described and compared sets of data using the mean and median in lesson 7.02 Measures of center, and using standard deviation and interquartile range in lesson 7.03 Measures of spread. In this lesson, we will continue to use data displays to describe the center and spread of data and determine the impact that outliers have on the shape of data.

Interpreting data distributions

When we describe the shape of data sets, we focus on how the data points are distributed and whether the shape is symmetric or not.

Symmetric

The data set is distributed around the center with a similar frequency on the left and right

A bell-shaped curve with the middle part taller than the left and right tails. The left side of the curve is a mirror image of the right side. Below the curve is a box and whisker plot where the left whisker has the same length as the right whisker.

In symmetric distributions the \text{mean}\approx \text{median}.

Left skew

The majority of the data points have higher values, with some data points at lower values

A curve with flat left tail and the right tail taller. Below the curve is a box and whisker plot with the left whisker longer than the right whisker.

In distributions that are skewed left, the \text{mean} < \text{median}.

Right skew

The majority of the data points have lower values, with some data points at higher values

A curve with flat right tail and the left tail taller. Below the curve is a box and whisker plot with the right whisker longer than the left whisker.

In distributions that are skewed right, the \text{mean} > \text{median}.

Uniform

The data set is evenly distributed across all values

A distribution that has the shape of a rectangle. Below is a box and whisker plot with the two whiskers having the same length and the box having a length equal to the total length of the whiskers. The box is divided into two equal parts by a line segment.

In uniform distributions, the \text{mean} \approx \text{median}.

Keep in mind:

  • when describing skewed distributions, it's better to use median and interquartile range as measures of center and spread because they are more resistant to outliers.
  • when describing symmetric or uniform distributions, it's better to use mean and standard deviation as measures of center and spread because they take the values of all data points into account.
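
The mean-versus-median comparison above can be turned into a rough check. The following is a minimal sketch, not a definitive test of shape: the function name `suggest_summary`, the `tolerance` cutoff, and the three small data sets are all assumptions made up for illustration. A mean close to the median does not prove symmetry, so a data display should still be used to confirm the shape.

```python
from statistics import mean, median, pstdev

def suggest_summary(values, tolerance=0.1):
    """Suggest a shape label and summary statistics from mean vs. median.

    `tolerance` is an assumed cutoff (a fraction of the standard deviation)
    for treating the mean and median as approximately equal.
    """
    gap = mean(values) - median(values)
    spread = pstdev(values)
    if spread == 0 or abs(gap) <= tolerance * spread:
        return "symmetric or uniform", "mean and standard deviation"
    if gap < 0:
        # Mean below median: low values pull the mean down.
        return "left skew", "median and interquartile range"
    # Mean above median: high values pull the mean up.
    return "right skew", "median and interquartile range"

# Made-up data sets, for illustration only.
examples = {
    "balanced": [1, 2, 3, 4, 5],
    "low_tail": [1, 17, 18, 19, 20],
    "high_tail": [1, 2, 3, 4, 20],
}

for name, data in examples.items():
    shape, stats = suggest_summary(data)
    print(f"{name}: {shape}; describe with {stats}")
```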

Examples

Example 1

The number of minutes spent exercising per day for 10 days is recorded for two people who have just signed up for a new gym membership. Compare the exercise data for each person. What does the shape, center, and spread of the data tell us about each person's exercise habits?

[Box plot: Person A time spent exercising (minutes), axis from 0 to 80]
[Box plot: Person B time spent exercising (minutes), axis from 0 to 80]

                      Person A    Person B
Mean                  57          54.5
Median                57.5        70
Standard deviation    6.34        17.55
Interquartile range   11.25       25

Worked Solution
Create a strategy

The shape of each data set will indicate which characteristics will describe the data better.

Apply the idea

The shape of Person A's data is roughly symmetric, while Person B has a set of data skewed to the left. Because Person B has a skewed graph, we'll rely on median and interquartile range to compare the data.

The median of Person A's workout times is 57.5 minutes, and the median of Person B's workout times is 70 minutes. Typically, Person B spends about 12.5 minutes more exercising than Person A.

From the size of the boxes in the box plots, we can see that the workout times for Person B are more spread out. This tells us that Person B's workout lengths have higher variation and that Person A is more consistent with the amount of time they spend exercising. When we compare the interquartile ranges, we see that the middle half of Person A's workouts varied by 11.25 minutes and the middle half of Person B's workouts varied by 25 minutes, which supports our analysis.

Reflect and check

Although it's recommended to use the median for center and the interquartile range for spread when a distribution is skewed, we can still analyze the mean and standard deviation.

Person B has a mean workout time of 54.5 minutes, which is even less than Person A's mean workout time of 57 minutes. This tells a very different story than the median times, which suggest Person B typically works out longer. This significant difference is likely caused by the large left skew in Person B's workout times. The shorter workouts bring down the average, when in reality 50\% of Person B's workouts (from the median to the max) are longer than all of Person A's workouts (which have a maximum of 65 minutes).

Similarly, Person A has a much smaller standard deviation at 6.34 minutes. If we calculate 57\pm 6.34, we find that the majority of Person A's workouts are between 50.66 and 63.34 minutes long. Person B's standard deviation is 17.55 minutes. By finding 54.5\pm17.55, we can see that the majority of Person B's workouts are between 36.95 and 72.05 minutes long. This interval spans about 35 minutes, compared with about 13 minutes for Person A.
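
The interval arithmetic above can be reproduced with a short sketch. The numbers are copied from the summary table in this example; the dictionary layout and the helper name `one_sd_interval` are assumptions introduced only for illustration.

```python
# Summary statistics copied from the table in Example 1 (minutes).
summary = {
    "Person A": {"mean": 57.0, "median": 57.5, "sd": 6.34},
    "Person B": {"mean": 54.5, "median": 70.0, "sd": 17.55},
}

def one_sd_interval(mean, sd):
    """Return (mean - sd, mean + sd), each rounded to 2 decimal places."""
    return round(mean - sd, 2), round(mean + sd, 2)

intervals = {}
for person, stats in summary.items():
    low, high = one_sd_interval(stats["mean"], stats["sd"])
    intervals[person] = (low, high)
    width = round(high - low, 2)
    # A negative gap means the mean sits below the median.
    gap = round(stats["mean"] - stats["median"], 2)
    print(f"{person}: {low} to {high} (width {width}), mean - median = {gap}")
```

The much larger gap between mean and median for Person B is another sign of the left skew discussed above.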

Example 2

Consider the list of ages of people in a field trip group:\{12, 12, 13, 13, 13, 13, 13, 14, 14, 24 \}

a

Interpret the data set using shape, center, and spread.

Worked Solution
Create a strategy

Create a data display to view the shape of the data, then calculate statistics for the center and spread.

Apply the idea
A dot plot titled Age, ranging from 11 to 25 in steps of 1. The number of dots is as follows: at 12, 2; at 13, 5; at 14, 2; at 24, 1.

The data set is skewed right, showing that one person in the group is much older than the others. The person is likely a chaperone, while the other people in the field trip group are students.

The mean will be higher than the median for a data set skewed right, so the best descriptor of the center of the ages in the group is the median. The median age of people in the group is 13 years old, so at least half of the group is 13 or younger and at least half is 13 or older. The mean age of the group is 14.1 years old, which is pulled upward by the one much older person.

The interquartile range is more resistant to outliers, so we can describe the spread of ages using the interquartile range rather than the standard deviation. The interquartile range is 1 year, meaning the middle half of the group is within a year in age.

b

Remove the outlier from the data set and describe how the shape, center, and spread change.

Worked Solution
Apply the idea

Without the outlier, the data set will have a symmetric shape. We can use the mean or median to describe the ages of people in the group. After removing 24 from the data set, the median is still 13 years old. The mean age of this group is 13 years old, meaning the average age is lower without the outlier.

We know that without the outlier, the standard deviation will be lower. The interquartile range, however, does not change: for the new data it is still 1 year, meaning that the middle half of the group is still within a year in age.

Reflect and check

The standard deviation for the group with the outlier was 3.36, and the standard deviation for the group without the outlier was 0.67, showing that the ages were less variable without an outlier.
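
The statistics quoted in this example can be checked with a short sketch. It rests on two assumptions that reproduce the values above: the standard deviation is the population standard deviation, and quartiles are the medians of the lower and upper halves of the sorted data. The helper names `iqr_median_of_halves` and `summarize` are introduced here only for illustration.

```python
from statistics import mean, median, pstdev

def iqr_median_of_halves(values):
    """IQR with quartiles taken as the medians of the lower and upper halves.

    Assumption: when the count is odd, the middle value is left out of both
    halves. Other quartile conventions can give slightly different results.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]
    return median(upper) - median(lower)

def summarize(values):
    """Collect center and spread statistics, rounded to 2 decimal places."""
    return {
        "mean": round(mean(values), 2),
        "median": median(values),
        "sd": round(pstdev(values), 2),  # population standard deviation
        "iqr": iqr_median_of_halves(values),
    }

ages = [12, 12, 13, 13, 13, 13, 13, 14, 14, 24]
without_outlier = [age for age in ages if age != 24]

with_stats = summarize(ages)
without_stats = summarize(without_outlier)
print("with outlier:   ", with_stats)
print("without outlier:", without_stats)
```

Removing the single value 24 changes the mean and standard deviation but leaves the median and interquartile range unchanged, which is what it means for those two statistics to be resistant to outliers.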

Example 3

Consider the following data distributions that show statistics for the WNBA and NBA:

\text{Highest WNBA Salaries (in millions): } \{0.23, 0.23, 0.23, 0.23, 0.23, 0.23, 0.23, 0.22, 0.22, 0.20\}

\text{Highest NBA Salaries (in millions): } \{46, 44, 44, 44, 42, 42, 39, 39, 39, 39\}

Compare the shape, center, and spread for the highest salaries in each basketball association.

Worked Solution
Create a strategy

Since the data sets are small, we can get an idea of the shape of the data directly from the values, then calculate the center and spread.

Apply the idea

The highest WNBA salaries fall between \$0.20 million and \$0.23 million, or \$200\,000 and \$230\,000. The data may be skewed left, with a tail toward \$200\,000, because most salaries are clustered at \$230\,000.

The highest NBA salaries fall within a \$7 million range from \$39 million to \$46 million. The shape of the data is more symmetric, but the highest amount, \$46 million, may slightly skew the data right.

We can describe the center of each data set using the median since the data could be slightly skewed. The median salary for the WNBA data is \$230\,000 and the median salary for the NBA data is \$42\,000\,000.

The interquartile range of the highest WNBA salaries is \$0.01 million, or \$10\,000, so the middle half of these salaries are within \$10\,000 of each other. The interquartile range of the highest NBA salaries is \$5 million, or \$5\,000\,000.

Even without a visual representation of the data, because both data sets use the same scale (in millions), we can see that the salaries of the highest-paid athletes in the NBA have a larger center and a larger spread than those of the highest-paid athletes in the WNBA.
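
The medians, interquartile ranges, and ranges used in this comparison can be checked with a short sketch. It assumes quartiles are the medians of the lower and upper halves of the sorted data, which reproduces the values in the worked solution. The helper name `center_and_spread` is introduced here only for illustration.

```python
from statistics import median

def center_and_spread(values):
    """Return (median, IQR, range), each rounded to 2 decimal places.

    Assumption: quartiles are the medians of the lower and upper halves.
    """
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[-half:])
    return (
        round(median(data), 2),
        round(q3 - q1, 2),
        round(data[-1] - data[0], 2),
    )

# Salaries in millions of dollars, copied from the example.
wnba = [0.23, 0.23, 0.23, 0.23, 0.23, 0.23, 0.23, 0.22, 0.22, 0.20]
nba = [46, 44, 44, 44, 42, 42, 39, 39, 39, 39]

wnba_stats = center_and_spread(wnba)
nba_stats = center_and_spread(nba)
print("WNBA (median, IQR, range):", wnba_stats)
print("NBA  (median, IQR, range):", nba_stats)
```

The rounding step only removes floating-point noise from the subtraction of decimal salaries; it does not change the reported statistics.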

Idea summary

Interpret sets of data by considering the shape and skew:

  • Symmetric or uniform distributions should be described using mean and standard deviation
  • Skewed distributions should be described using median and interquartile range

Outcomes

S.ID.A.1

Represent data with plots on the real number line (dot plots, histograms, and box plots).

S.ID.A.2

Use statistics appropriate to the shape of the data distribution to compare center (median, mean) and spread (interquartile range, standard deviation) of two or more different data sets.

S.ID.A.3

Interpret differences in shape, center, and spread in the context of the data sets, accounting for possible effects of extreme data points (outliers).
