
7.06 Normal distributions

Normal distributions

A data set that is symmetric and bell-shaped about the mean is said to have an approximately normal distribution.

A histogram with Percentage on the y-axis, with numbers 0 through 30, and Weight on the x-axis, with bars labeled at their endpoint 75 to 110 in steps of 5. The heights of the bars follow a bell shape with its peak at the 90 through 95 bar. A curve is also plotted with a mound on the middle and tails trailing off to the left and right.

This shows how a data set that has an approximately normal distribution may appear in a histogram. A smooth, symmetrical curve can be drawn over the histogram that the data roughly follows.

This curve is called a normal curve. The arithmetic mean \left(\mu\right) is located on the line of symmetry of the curve and is approximately equivalent to the median and mode of the data set.

If a data set is not symmetrical about the mean, we cannot use a normal distribution to interpret it.

Recall that the standard deviation, denoted by \sigma, describes the spread of the data.

A small standard deviation provides a tight cluster around the mean.
A larger standard deviation shows data that is more spread out.

Exploration

Consider the data sets represented by the histograms.

A histogram with numbers 7 to 31, in steps of 4, on the x axis. The heights of the bars follow a bell shape with its peak at the 19 mark. Speak to your teacher for more details.
Histogram 1
A histogram with numbers 8 to 22, in steps of 2, on the x axis. The heights of the bars follow a bell shape with its peak between the 14 to 16 mark. Speak to your teacher for more details.
Histogram 2
A histogram with numbers 12 to 24, in steps of 2, on the x axis. The heights of the bars follow a bell shape with its peak around the 18 mark. Speak to your teacher for more details.
Histogram 3
A histogram with numbers 5 to 40, in steps of 5, on the x axis. There are bars from around the 5 mark to around the 25 mark. The heights of the bars follow a bell shape with its peak just after the 15 mark. Speak to your teacher for more details.
Histogram 4
  1. Match each of the histograms to the correct mean and standard deviation.

    • \mu=16, \sigma=3

    • \mu=15,\sigma=2

    • \mu=19,\sigma=4

    • \mu=18,\sigma=2

  2. Justify your choices.

Consider this normally distributed data set with a mean of 92.5 pounds and a standard deviation of 5 pounds. The data is centered around the mean weight, and we can use the standard deviation to divide the curve into different sections.

A histogram showing weight on the horizontal axis and percentage on the vertical axis. The bins run from 75 to 110 and each has a width of 5. A symmetrical normal curve is drawn passing through 77.5, 82.5, 87.5, 92.5, 97.5, 102.5, and 107.5.

If the mean is 92.5, one standard deviation above the mean is 92.5+5=97.5. One standard deviation below the mean is 92.5-5=87.5. This means that data values between 87.5 and 97.5 pounds lie within one standard deviation of the mean.

Continuing this pattern, we can say that data values between 82.5 and 102.5 pounds lie within two standard deviations of the mean, and data values between 77.5 and 107.5 pounds lie within three standard deviations of the mean.
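The same pattern can be sketched in a few lines of Python. This is only an illustration using the mean and standard deviation given above; the function name `interval_within` is a hypothetical helper, not part of the lesson.

```python
# Weight data from the text: mean 92.5 pounds, standard deviation 5 pounds.
mean = 92.5
std_dev = 5.0


def interval_within(k, mean, std_dev):
    """Return the (low, high) bounds that lie within k standard deviations of the mean."""
    return (mean - k * std_dev, mean + k * std_dev)


for k in (1, 2, 3):
    low, high = interval_within(k, mean, std_dev)
    print(f"within {k} standard deviation(s): {low} to {high} pounds")
```

Each interval is found by stepping the same distance to the left and to the right of the mean, which is why the boundaries are symmetric about 92.5.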

The normal curve is a probability distribution, and the total area under the curve is 100\%, or 1. When data is approximately normally distributed, the percentage of data within 1, 2, and 3 standard deviations of the mean can be estimated using the Empirical Rule.

A normal distribution curve. Below the curve is a horizontal axis with the following evenly spaced marks from left to right: mu minus 3 sigma, mu minus 2 sigma, mu minus sigma, mu, mu plus sigma, mu plus 2 sigma, and mu plus 3 sigma. The peak of the curve is at mu. Vertical lines are drawn from the curve to each mark in the horizontal axis. The area under the curve between the mu minus 3 sigma and mu minus 2 sigma is labeled 2.35 percent, between mu minus 2 sigma and mu minus sigma labeled 13.5 percent, between mu minus sigma and mu labeled 34 percent, between mu and mu plus sigma labeled 34 percent, between mu plus sigma and mu plus 2 sigma labeled 13.5 percent, and between mu plus 2 sigma and mu plus 3 sigma labeled 2.35 percent. Below the horizontal axis, a set of three brackets labeled Empirical Rule are shown: a bracket connecting mu minus sigma and mu plus sigma is labeled 68 percent, a bracket connecting mu minus 2 sigma and mu plus 2 sigma is labeled 95 percent, and a bracket connecting mu minus 3 sigma and mu plus 3  sigma is labeled 99.7 percent.
Empirical Rule {\left(68-95-99.7\%\right)}

A statistical rule that provides an estimate for the distribution of approximately normal data.
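To see why the rule is described as an estimate, the rounded percentages can be compared with the areas under an exact normal curve. The sketch below assumes a perfectly normal distribution and uses the standard identity that the fraction within k standard deviations of the mean equals erf(k/\sqrt{2}); the function name `fraction_within` is a hypothetical helper.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    # For a normal distribution, P(|X - mu| < k * sigma) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k, rule in zip((1, 2, 3), (68, 95, 99.7)):
    exact = 100 * fraction_within(k)
    print(f"k = {k}: exact area {exact:.2f}%, Empirical Rule {rule}%")
```

The exact areas differ from 68, 95, and 99.7 only in the decimal places, so the rounded values are convenient for quick estimates.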

Examples

Example 1

Determine whether each distribution is normally distributed.

a
Stem | Leaf
1 | 6\ 7\ 7
2 | 2\ 2\ 2\ 2\ 3\ 3\ 3
3 | 3\ 3\ 3\ 6\ 6\ 6\ 7\ 7\ 7\ 7\ 7
4 | 4\ 4\ 4\ 4\ 4\ 4
5 | 7\ 7

Key: 2 \vert 3 = 23

Worked Solution
Create a strategy

The data listed on the right side of a stem and leaf plot indicate the shape of the distribution.

Apply the idea

Most of the data is in the middle row of the distribution, and the least amount of data is in the top and bottom rows. This shows that the data is roughly symmetric with a single central peak, so it is approximately normally distributed.

b
A histogram with Frequency on the y-axis, with numbers 0 to 15, and Scores on the x-axis, with the midpoint of the bars labeled 6 to 15 in steps of 1. The 6 bar goes to 5 on the y-axis; 7 goes to 5; 8 goes to 5; 9 goes to 4; 10 goes to 2; 11 goes to 6; 12 goes to 9; 13 goes to 7; 14 goes to 12; and 15 goes to 11.
Worked Solution
Apply the idea

Most of the data is on the right side of the histogram, so the data is skewed left. Since the data is not symmetric, it does not represent a normal distribution.
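One rough numerical check of skew, offered here only as a supporting sketch, is to compare the mean with the median: in a symmetric distribution they are approximately equal, while a mean noticeably below the median suggests a longer tail on the left. The frequencies are read from the histogram description; the variable names are hypothetical, and the comparison is a rule of thumb rather than a formal test.

```python
import statistics

# Frequencies read from the histogram description: score -> frequency.
frequencies = {6: 5, 7: 5, 8: 5, 9: 4, 10: 2, 11: 6, 12: 9, 13: 7, 14: 12, 15: 11}

# Expand the frequency table into a list of individual scores.
scores = [score for score, count in frequencies.items() for _ in range(count)]

mean = statistics.mean(scores)
median = statistics.median(scores)
print(f"mean = {mean}, median = {median}")

# Rule of thumb: the mean is pulled toward the longer tail.
if mean < median:
    skew_hint = "left"
elif mean > median:
    skew_hint = "right"
else:
    skew_hint = "none"
print(f"suggested skew: {skew_hint}")
```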

Reflect and check

Note that if the data were normally distributed, this would need to be converted to a relative frequency histogram before using the Empirical Rule to interpret it.

c
A dot plot titled Sample, ranging from 6 to 15 in steps of 1. The number of dots is as follows: at 6, 12; at 7, 12; at 8, 11; at 9, 12; at 10, 10; at 11, 6; at 12, 6; at 13, 5; at 14, 3; at 15, 4.
Worked Solution
Apply the idea

Most of the data is on the left side, so the data is skewed right. Because the data is not symmetric, it does not represent a normal distribution.

Example 2

The data on daily high temperatures for a certain town is approximately normally distributed. The mean high temperature for this town is 78\degree\text{F}, and the standard deviation is 6\degree\text{F}.

Histogram on daily temperatures. The horizontal axis shows the temperature from 60-94 (intervals of 4) and the vertical axis shows the frequency from 0-12. Ask your teacher for more information.
a

Identify the intervals on the histogram that have data points within 1 standard deviation of the mean.

Worked Solution
Create a strategy

We were given that the mean is 78\degree\text{F} and the standard deviation is 6\degree\text{F}. To find the interval of data values that are within 1 standard deviation of the mean, we will add and subtract 6 from the mean.

Apply the idea

One standard deviation above the mean is 78+6=84.

One standard deviation below the mean is 78-6=72.

The intervals on the histogram that have data values within 1 standard deviation of the mean are 72–76, 76–80, and 80–84.
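The selection of intervals can be sketched in Python. The bin edges used here (60, 64, ..., 96) are an assumption based on the description of the histogram's horizontal axis, and the variable names are hypothetical.

```python
mean, std_dev = 78, 6
low, high = mean - std_dev, mean + std_dev  # one standard deviation either side

# Assumed bin edges for the histogram: 60, 64, ..., 96 (width 4).
edges = list(range(60, 97, 4))
bins = list(zip(edges[:-1], edges[1:]))

# Keep only the bins that sit entirely between the two boundaries.
inside = [(start, end) for start, end in bins if start >= low and end <= high]
print(inside)
```

Because 72 and 84 both fall exactly on bin edges, no bin is only partly inside the interval.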

b

Select the normal curve that approximates the data.

A
A normal curve symmetric about 78 and the data lies between 68 and 88
B
A normal curve symmetric about 78, and all of the data lies between 54 and 102.
C
A normal curve symmetric about 80, and all the data lies between 58 and 104
D
A normal curve symmetric around 78, only a part of the curve is visible (around 54-102)
Worked Solution
Create a strategy

The normal curve should be symmetric about 78\degree\text{F}, and the data should be roughly between 60\degree\text{F} and 96\degree\text{F}.

Apply the idea

Option A is symmetric about 78, but all of the data lies between 68 and 88. This implies that the standard deviation of this set is less than 6, so option A is incorrect.

Option B is symmetric about 78, and all of the data lies between 54 and 102. This is consistent with the data shown in the histogram. In addition, 54 is 4 standard deviations below the mean, and 102 is 4 standard deviations above the mean, which is consistent with the Empirical Rule. Option B is correct.

Option C is symmetric about 80, so it is incorrect.

Option D is symmetric about 78, but the spread of the curve goes beyond what is represented in the histogram. Since the area under a normal curve represents all of the data, this implies that the data represented by this curve includes values lower than 54 and higher than 102. Option D is incorrect.

Option B shows the normal curve that approximates the data in the histogram.

Example 3

Consider the normally distributed data sets shown.

Normal curves for distributions A and B. Distribution A shows a tall curve centered at 12, and all the data lies between 10 and 14. Distribution B shows a much shorter (close to flat) curve centered at 14, and the data extends beyond 12 and 16.
a

Which data set has a higher mean?

Worked Solution
Create a strategy

A normal curve is symmetric about the mean. To compare the means of the distributions, we can identify the value that lies on the line of symmetry for each curve, then compare those values.

Apply the idea
A symmetrical curve with a mean of 12. The data lies between 10 and 14.

The mean of Distribution A is 12 since the curve is symmetric about that value.

A symmetrical curve with a mean of 14. The data extends beyond 12 and 16.

The mean of Distribution B is 14 since the curve is symmetric about that value.

Since 14\gt 12, Distribution B has a higher mean value.

Reflect and check

We do not need to consider the shape of the curve because it is not affected by the mean. The mean only affects the curve's line of symmetry.

b

Which data set has a smaller standard deviation?

Worked Solution
Create a strategy

The standard deviation affects the spread of a normal curve. To compare the standard deviations of the distributions, we can analyze and compare the spread of each curve.

Apply the idea

In distribution A, approximately 100\% of the data lies between 10 and 14.

In distribution B, the curve stretches beyond 12 and 16. The curve is wider and less peaked, showing it has a larger spread.

Distribution A has a smaller standard deviation.

Reflect and check

Because the normal curve has an area of 1, curves with a smaller standard deviation will be tall and thin. As the variation in the data increases (as the data values spread further from the center), the curve will become shorter and wider.
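This trade-off between height and width can be illustrated numerically. For a normal curve, the height at the mean is 1/(\sigma\sqrt{2\pi}), so doubling the standard deviation halves the peak. The values of \sigma below are hypothetical and are not read from the curves for Distributions A and B.

```python
import math


def peak_height(std_dev):
    """Height of a normal curve at its mean: 1 / (sigma * sqrt(2 * pi))."""
    return 1 / (std_dev * math.sqrt(2 * math.pi))


# Hypothetical standard deviations chosen only for illustration.
for sigma in (0.5, 1, 2):
    print(f"sigma = {sigma}: peak height = {peak_height(sigma):.4f}")
```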

Example 4

The grades on a recent exam are approximately normally distributed with a mean score of 72 and a standard deviation of 4.

a

Construct a normal curve and label the boundaries for the Empirical Rule.

Worked Solution
Create a strategy

A normal curve will have a symmetric bell-like appearance with the mean as the central value and divisions for:

  • Mean \pm 1 standard deviation

  • Mean \pm 2 standard deviations

  • Mean \pm 3 standard deviations

Apply the idea

Subtract the standard deviations from the mean to find the values on the left side of the curve:

  • 1 standard deviation below the mean: 72-4=68

  • 2 standard deviations below the mean: 72-2\left(4\right)=64

  • 3 standard deviations below the mean: 72-3\left(4\right)=60

Add the standard deviations to the mean to find the values on the right side of the curve:

  • 1 standard deviation above the mean: 72+4=76

  • 2 standard deviations above the mean: 72+2\left(4\right)=80

  • 3 standard deviations above the mean: 72+3\left(4\right)=84

A symmetrical curve with a mean of 72. 3 standard deviations above and below are labeled. The standard deviation is 4.
b

Find the percentage of students who scored between 64 and 68 on the exam.

Worked Solution
Create a strategy

To use the Empirical Rule, we must first determine how many standard deviations 64 and 68 are away from the mean score of 72.

Apply the idea

64 is two standard deviations below the mean and 68 is one standard deviation below the mean.

A symmetrical curve with a mean of 72. 3 standard deviations above and below are labeled. The standard deviation is 4. 64-68 is highlighted.

According to the Empirical Rule, the percentage of data between 1 and 2 standard deviations below the mean is 13.5\%.
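The lookup can be sketched as follows: convert each score to a z-score (the number of standard deviations from the mean), then add the Empirical Rule percentages for the bands between them. The names `z_score`, `bands`, and `percent_between` are hypothetical helpers, and the sketch only works when both scores fall exactly on a whole number of standard deviations from the mean.

```python
mean, std_dev = 72, 4


def z_score(x):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


# Empirical Rule percentages for each band, keyed by (lower z, upper z).
bands = {
    (-3, -2): 2.35, (-2, -1): 13.5, (-1, 0): 34,
    (0, 1): 34, (1, 2): 13.5, (2, 3): 2.35,
}


def percent_between(a, b):
    """Add up the bands that lie between scores a and b."""
    za, zb = z_score(a), z_score(b)
    return sum(p for (lo, hi), p in bands.items() if za <= lo and hi <= zb)


print(z_score(64), z_score(68))
print(percent_between(64, 68))
```

The same helper reproduces the bracket totals from the Empirical Rule diagram, for example `percent_between(68, 76)` adds the two 34\% bands.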

c

If 32 students took the exam, determine the number of students expected to score 80 or more on the exam.

Worked Solution
Create a strategy

We first need to determine the number of standard deviations 80 is away from the mean score of 72. Then, we can multiply the percentage found from the Empirical Rule by 32 students to determine the number of students who may have scored more than 80.

Apply the idea

80 is two standard deviations above the mean. According to the Empirical Rule, 95\% of the data is within 2 standard deviations of the mean. Since all of the data lies below the curve, we know that 1-0.95=0.05 or 5\% of the data lies more than 2 standard deviations away from the mean, either above or below it.

We can divide this in half to find the percentage of data that is more than 2 standard deviations above the mean: \dfrac{0.05}{2}=0.025 or 2.5\%.

2.5\% of the 32 students are expected to score 80 or more. This gives us 32\left(0.025\right)=0.8. If the data is approximately normal, fewer than one student is expected to score 80 or more in a class of 32.
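The arithmetic can be checked with a short Python sketch. The variable names are hypothetical, and the result is rounded when printed because of floating-point error.

```python
class_size = 32
within_two = 0.95  # Empirical Rule: within 2 standard deviations of the mean

# The remaining 5% is split evenly between the two tails of the curve.
upper_tail = (1 - within_two) / 2

expected_students = class_size * upper_tail
print(f"upper tail: {upper_tail:.3f}")
print(f"expected students scoring 80 or more: {expected_students:.1f}")
```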

Reflect and check

We can use the Empirical Rule to check the reasonableness of this solution. According to the rule, 95\% of the students will receive scores between 64 and 80 because these scores are 2 standard deviations from the mean, and 32\cdot 0.95=30.4. After rounding, about 2 students are still expected to score below 64 or above 80 on the test, so a conclusion that 1 student scored 80 or above on the test would still be valid.

Example 5

Farrah is a movie buff and dreams of becoming a director. She notices that a lot of movies have similar running times and formulates the question, "How long are the most popular movies today?" She decides to investigate this further using the data cycle.

a

Describe a method Farrah can use to collect data.

Worked Solution
Create a strategy

First, use Farrah's statistical question to determine the type of data that needs to be collected. Then, consider whether the data can be collected by research, a survey, an observation, or a scientific experiment.

Apply the idea

"The most popular movies" is a relative term, but in this context, "most popular" is usually measured by the amount of money made while the movie was showing in theaters. Websites such as Wikipedia or IMDb generally collect and report this data.

Farrah can collect data on the length of the movies from the same websites. In this case, Farrah would collect the data through research.

b

The data Farrah gathered on the running time, in minutes, of the top 30 movies is shown:

\begin{aligned} &119,\,126,\,120,\,115,\,120,\,133,\,114,\,120,\,110,\,105,\,130,\,128,\,124,\,129,\,130,\\&107,\,108,\,119,\,118,\,114,\,103,\,124,\,130,\,117,\,122,\,113,\,137,\,136,\,110,\,119 \end{aligned}

Use technology to create a smooth curve to model the distribution and describe the shape of the curve.

Worked Solution
Create a strategy

Using technology, we can follow these steps to create a smooth curve of the data:

  1. Enter the data into a single column using the GeoGebra Statistics calculator.

  2. Highlight the data and select One Variable Analysis.

  3. In the settings menu (represented by the gear icon), change the frequency type to Normalized. This will adjust the values on the y-axis to reflect a probability distribution.

  4. Check the box to show the normal curve. To see the smooth curve on its own, uncheck the histogram box.

Apply the idea

After entering the data and selecting One Variable Analysis, a histogram will be generated. Select the gear icon to open the settings menu.

A screenshot of the GeoGebra statistics tool showing how to generate the histogram of a given data and how to access the settings menu. Speak to your teacher for more details.

Change the frequency type to normalized. If the frequency type is not normalized, it is not possible to create the smooth curve.

A screenshot of the GeoGebra statistics tool showing how to change the frequency type in the settings menu. Speak to your teacher for more details.

Finally, check the box for normal curve.

A screenshot of the GeoGebra statistics tool showing how to find the normal curve option in the settings menu. Speak to your teacher for more details.

The curve is symmetric and bell-shaped, meaning the data is approximately normally distributed.

Reflect and check

To see the curve without the histogram, uncheck the histogram box.

A screenshot of the GeoGebra statistics tool showing how to display the normal curve of a given data set. Speak to your teacher for more details.
c

Answer the statistical question that Farrah formulated.

Worked Solution
Create a strategy

Farrah's statistical question was, "How long are the most popular movies today?" We can answer this using measures of center and spread.

Apply the idea

Because the data is approximately normally distributed, the mean, median, and mode are approximately equal. The data distribution is symmetric about 120 minutes, which shows that the most popular movies are typically about 2 hours long.

According to the sample data, movies range from 103 minutes to 137 minutes, showing that movie times vary by just over half an hour.

d

Formulate a new question that can be answered by the normal curve that approximates the data.

Worked Solution
Create a strategy

Since the data is normally distributed, the question can be related to the mean and standard deviation of the data or require the use of the Empirical Rule.

Apply the idea

One possible question may be, "95\% of movie times are between what two running times?"

Reflect and check

Other possible questions may be:

  • What percent of movies are longer than 2 hours?

  • How does a 2.5 hour movie compare to the lengths of other movies?

  • Out of 200 movies, how many are expected to be under an hour and a half long?

e

Use the data to answer the statistical question from part (d).

Worked Solution
Create a strategy

According to the Empirical Rule, 95\% of movies will fall within 2 standard deviations of the mean.

We can use technology to find the standard deviation of the data, then use the standard deviation and the mean to answer the question.

Apply the idea

Using technology, we can find the standard deviation \left(\sigma\right) in the summary statistics. Select the \Sigma x icon to show the summary statistics.

A screenshot of the GeoGebra statistics tool showing how to display related statistics of a given set of data. Speak to your teacher for more details.

The mean of the data is 120, and the standard deviation is about 9 minutes.

Two standard deviations above the mean is 120+2\left(9\right)=138, and two standard deviations below the mean is 120-2\left(9\right)=102.

95\% of movies are between 102 and 138 minutes long.
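As an alternative to GeoGebra, the same summary statistics can be sketched with Python's built-in `statistics` module. This sketch assumes the population standard deviation (`pstdev`) is the intended \sigma; the sample standard deviation (`stdev`) is slightly larger but also rounds to 9 minutes for this data.

```python
import statistics

# Running times, in minutes, of the top 30 movies from part (b).
running_times = [
    119, 126, 120, 115, 120, 133, 114, 120, 110, 105, 130, 128, 124, 129, 130,
    107, 108, 119, 118, 114, 103, 124, 130, 117, 122, 113, 137, 136, 110, 119,
]

mean = statistics.mean(running_times)
std_dev = statistics.pstdev(running_times)  # assumption: population standard deviation

# Empirical Rule: about 95% of the data lies within 2 standard deviations of the mean.
low = mean - 2 * std_dev
high = mean + 2 * std_dev

print(f"mean = {mean}, standard deviation = {std_dev:.2f}")
print(f"about 95% of running times: {low:.0f} to {high:.0f} minutes")
```

Using the unrounded standard deviation gives boundaries that round to the same 102 and 138 minutes found above.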

Reflect and check

To visualize this better, we could have constructed a normal curve with all the standard deviation divisions labeled.

A symmetrical curve with mean of 120 and standard deviation of 9. Four standard deviations above and below the mean are labeled.
Idea summary

A data set that is symmetric and bell-shaped is said to have an approximately normal distribution. The mean, median, and mode are approximately equal in a normal distribution.

The center of the normal distribution is at the arithmetic mean, \mu. The standard deviation, \sigma, describes the spread of the data.

The normal curve represents a probability distribution, and the area under the entire curve is equal to 100\%, or 1. The percentage of data within 1, 2, and 3 standard deviations of the mean can be estimated using the Empirical Rule.

A normal distribution curve. Below the curve is a horizontal axis with the following evenly spaced marks from left to right: mu minus 3 sigma, mu minus 2 sigma, mu minus sigma, mu, mu plus sigma, mu plus 2 sigma, and mu plus 3 sigma. The peak of the curve is at mu. Vertical lines are drawn from the curve to each mark in the horizontal axis. The area under the curve between the mu minus 3 sigma and mu minus 2 sigma is labeled 2.35 percent, between mu minus 2 sigma and mu minus sigma labeled 13.5 percent, between mu minus sigma and mu labeled 34 percent, between mu and mu plus sigma labeled 34 percent, between mu plus sigma and mu plus 2 sigma labeled 13.5 percent, and between mu plus 2 sigma and mu plus 3 sigma labeled 2.35 percent. Below the horizontal axis, a set of three brackets labeled Empirical Rule are shown: a bracket connecting mu minus sigma and mu plus sigma is labeled 68 percent, a bracket connecting mu minus 2 sigma and mu plus 2 sigma is labeled 95 percent, and a bracket connecting mu minus 3 sigma and mu plus 3 sigma is labeled 99.7 percent.

Outcomes

A2.ST.1

The student will apply the data cycle (formulate questions; collect or acquire data; organize and represent data; and analyze data and communicate results) with a focus on univariate quantitative data represented by a smooth curve, including a normal curve.

A2.ST.1a

Formulate investigative questions that require the collection or acquisition of a large set of univariate quantitative data or summary statistics of a large set of univariate quantitative data and investigate questions using a data cycle.

A2.ST.1b

Collect or acquire univariate data through research, or using surveys, observations, scientific experiments, polls, or questionnaires.

A2.ST.1d

Identify the properties of a normal distribution.

A2.ST.1e

Describe and interpret a data distribution represented by a smooth curve by analyzing measures of center, measures of spread, and shape of the curve.

A2.ST.1h

Determine the solution to problems involving the relationship of the mean, standard deviation, and z-score of a data set represented by a smooth or normal curve.

A2.ST.1i

Apply the Empirical Rule to answer investigative questions.

A2.ST.1j

Compare multiple data distributions using measures of center, measures of spread, and shape of the distributions.
