AustraliaVIC
VCE 12 Methods 2023

9.05 Normal distributions

Lesson

When we have a set of univariate data, it often happens that most of the measurements will be clustered close to the mean value with the density of the observations falling off as distance away from the mean increases. Results of this kind displayed in a histogram show a central peak with columns of decreasing height on each side of the mean.

The normal distribution is a special type of continuous probability distribution. It is often called the "bell curve" because of the shape of its graph. The normal distribution is important in statistics because it can be used to describe many natural variables. For example, across a large population, height, arm span, and IQ are each approximately normally distributed.

The bell curve is symmetrical about the mean (which is where the peak of the bell occurs), and the mean, mode and median are all equal.

Probability distributions that display the distinctive bell curve together with its properties are described as being "normally distributed". 

The shape of the normal distribution depends on its mean ($\mu$) and standard deviation ($\sigma$). The mean determines where the graph peaks, and the standard deviation determines how spread out the graph is. To draw a normal distribution graph we only need these two values.

Did you know?

For the normal distribution, the mean is exactly in the middle, so $50\%$ of the values are above the mean and $50\%$ of the values are below the mean.

The bell curve extends roughly $3$ standard deviations to the left and right of the mean, since almost all scores lie within that range.

 

The normal distribution formula

The probability density function of the normal distribution is written in terms of the mean $(\mu)$ and standard deviation $(\sigma)$:

$\phi\left(x\right)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{\left(x-\mu\right)^2}{2\sigma^2}}$

We won't use this formula to find probabilities directly. Instead, we will mainly use the technology available on our calculators, which we will demonstrate in the next lesson.
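Even though we won't calculate probabilities by hand, it can help to see the formula evaluated at a few points. The following is a minimal Python sketch; the function name `normal_pdf` is chosen here for illustration and is not part of the lesson.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x, for mean mu and standard deviation sigma."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Evaluate the curve at the mean and one standard deviation either side,
# using mu = 0 and sigma = 1 as an example.
peak = normal_pdf(0, 0, 1)
left = normal_pdf(-1, 0, 1)
right = normal_pdf(1, 0, 1)
print(round(peak, 4), round(left, 4), round(right, 4))
```

The value at the mean is the largest, and the two values one standard deviation either side of the mean are equal, which matches the symmetry of the bell curve.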

Exploration

Assume that a population is normally distributed with mean $\mu$ and standard deviation $\sigma$. Use the following applet to discover what happens to the graph as we change the mean and standard deviation.

  • What do you notice about the shape of the graph when we change the mean?
  • What happens to the shape of the graph as $\sigma$ gets closer to $0$?

For example, this distribution curve comes from a data set that has a very small standard deviation, $\sigma=0.2$, and hence is clustered tightly around the mean:

This normal distribution has a larger standard deviation, $\sigma=0.9$, and hence is much more spread out.
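One way to see the effect of $\sigma$ numerically is to look at the height of the peak. Substituting $x=\mu$ into the density formula gives a peak height of $\frac{1}{\sigma\sqrt{2\pi}}$, so a smaller standard deviation produces a taller, narrower curve. The sketch below compares the two values of $\sigma$ used above; the helper name `peak_height` is an assumption made for this illustration.

```python
import math

def peak_height(sigma):
    """Height of the normal curve at its mean: 1 / (sigma * sqrt(2 * pi))."""
    return 1 / (sigma * math.sqrt(2 * math.pi))

for sigma in (0.2, 0.9):
    print(f"sigma = {sigma}: peak height is about {peak_height(sigma):.3f}")
```

Because the total area under the curve is always $1$, a curve that is squeezed in horizontally has to rise higher at the mean.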

 

 

Practice questions

Question 1

Which of these histograms is approximately normally distributed?

  A
  B
  C

Question 2

Which two of the following statements are true for a normal distribution?

  A. The spread of the normal distribution changes depending on the mean.
  B. A higher mean will result in a skewed curve.
  C. The mean, median and mode all have the same value.
  D. The spread of the normal distribution changes depending on the standard deviation.

 

The empirical rule

The empirical rule, also known as the $68$-$95$-$99.7\%$ rule, describes how normally distributed data is spread about the mean. The three numbers are the approximate percentages of scores that lie within one, two, and three standard deviations of the mean.

  • Approximately $68\%$ of scores lie within $1$ standard deviation of the mean:

  • Approximately $95\%$ of scores lie within $2$ standard deviations of the mean:

  • Approximately $99.7\%$ of scores lie within $3$ standard deviations of the mean:

A normal distribution is symmetrical, so we can use these basic values to find approximations of other regions. For example, since $95\%$ of scores lie within $2$ standard deviations of the mean, $47.5\%$ (half of $95\%$) will lie between the mean and $2$ standard deviations above the mean:

We can use a similar trick to conclude that $34\%$ (half of $68\%$) of scores lie between $1$ standard deviation below the mean and the mean itself.

If we add these approximations together, we conclude that $81.5\%$ (which is $34\%+47.5\%$) of scores lie between $1$ standard deviation below and $2$ standard deviations above the mean.
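The halving-and-adding reasoning above can be sketched in a few lines of Python. This is only an illustration of the empirical rule, not a general probability calculator: the names `HALF_PERCENT` and `empirical_percent` are made up for this example, and the endpoints must be whole numbers of standard deviations from $-3$ to $3$.

```python
# Approximate percentage of scores between the mean and k standard
# deviations on one side of it (half of 68%, 95% and 99.7%).
HALF_PERCENT = {0: 0.0, 1: 34.0, 2: 47.5, 3: 49.85}

def empirical_percent(lower, upper):
    """Approximate % of scores between `lower` and `upper` standard deviations
    from the mean (whole numbers from -3 to 3, with lower <= upper)."""
    def signed_area(k):
        # Area from the mean out to k, counted as negative left of the mean.
        return HALF_PERCENT[abs(k)] if k >= 0 else -HALF_PERCENT[abs(k)]
    return signed_area(upper) - signed_area(lower)

print(empirical_percent(-1, 2))  # 34 + 47.5 = 81.5
print(empirical_percent(-2, 2))  # 47.5 + 47.5 = 95.0
print(empirical_percent(1, 2))   # 47.5 - 34 = 13.5
```

The last line shows that the same idea works with subtraction: the region between $1$ and $2$ standard deviations above the mean is what is left after removing $34\%$ from $47.5\%$.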

Play around with this applet by moving the endpoints of the shaded region. You will see the percentage of scores lying between the endpoints, and can reveal the percentages of each piece with the toggle:

 

Watch out!

The empirical rule is only an approximation, and better approximations exist. For example, a better approximation for the area between $1$ standard deviation below and above the mean is

$68.268949213\ldots\%$

Like many other important numbers in mathematics, the exact value cannot be written down as a decimal, so we need to approximate somewhere. For now, the empirical rule is a good place to start thinking about the distribution.
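For readers curious about where the more precise figure comes from, the sketch below computes the percentages with Python's standard library. It relies on the error function `math.erf`, which is beyond the scope of this lesson: for a normal distribution, the proportion within $k$ standard deviations of the mean equals $\operatorname{erf}\left(\frac{k}{\sqrt{2}}\right)$. The helper name `percent_within` is an assumption made for this example.

```python
import math

def percent_within(k):
    """Percentage of a normal distribution within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {percent_within(k):.4f}%")
```

Comparing these values with $68\%$, $95\%$ and $99.7\%$ shows how close the empirical rule is, and that it is slightly off in each case.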

Exploration

Standard deviation is a measure of spread that we can apply to everyday contexts. For example, let's say the mean score in a test was $67$ and the standard deviation was $7$ marks. This means that:

  • a person who was $1$ standard deviation above the mean would have received a mark of $74$ (as this is $67+7$).
  • a person who was $2$ standard deviations below the mean would have received a mark of $53$ (as this is $67-2\times7$).

If we're told that the scores were approximately normally distributed, we could go one step further and determine the percentage of students who scored between $53$ and $74$.

About $95\%$ of students scored within $2$ standard deviations of the mean. The normal distribution is symmetrical, so half of these students scored between the mean and $2$ standard deviations below it. In other words, $47.5\%$ of students scored between $53$ and $67$.

Using the same reasoning, about $68\%$ of students scored within $1$ standard deviation of the mean, and half of these scored between the mean and $1$ standard deviation above it. This means that $34\%$ of students scored between $67$ and $74$.

So putting the two percentages together, we can say that $\left(47.5+34\right)\%=81.5\%$ of students scored between $53$ and $74$.
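The test-score example can be checked with a short Python sketch. It first repeats the empirical-rule arithmetic, then compares it with the value given by the normal model itself using `statistics.NormalDist` from the standard library (Python 3.8 or later). The variable names are chosen for this illustration only.

```python
from statistics import NormalDist

mean, sd = 67, 7

# Marks at a given number of standard deviations from the mean.
one_above = mean + 1 * sd   # 67 + 7
two_below = mean - 2 * sd   # 67 - 2 * 7

# Empirical-rule estimate: half of 95% plus half of 68%.
empirical_estimate = 95 / 2 + 68 / 2

# Value from the normal model, as calculator technology would give it.
scores = NormalDist(mu=mean, sigma=sd)
model_percent = 100 * (scores.cdf(one_above) - scores.cdf(two_below))

print(one_above, two_below, empirical_estimate, round(model_percent, 1))
```

The two percentages differ slightly because the empirical rule uses rounded values, but the estimate of $81.5\%$ is close to the model's value.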

The empirical rule
  • $68\%$ of scores lie within $1$ standard deviation of the mean.
  • $95\%$ of scores lie within $2$ standard deviations of the mean.
  • $99.7\%$ of scores lie within $3$ standard deviations of the mean.

Remember, since the normal distribution is symmetrical, we can halve the interval at the mean to halve the percentage of scores.

Practice questions

Question 3

The image shows the distribution of females’ heights in a population. Use the image to help you complete the following statements.

  1. $34\%$ of females lie between $157$ cm and $163$ cm and ____% of females lie between $163$ cm and $169$ cm.

  2. $1$ standard deviation = ____ cm.

Question 4

The following figure shows the approximate percentage of scores lying within various standard deviations from the mean of a normal distribution. The heights of $600$ boys are found to approximately follow such a distribution, with a mean height of $145$ cm and a standard deviation of $20$ cm. Find the number of boys with heights between:

  1. $125$ cm and $165$ cm

  2. $105$ cm and $185$ cm

  3. $85$ cm and $205$ cm (to the nearest whole number)

  4. $145$ cm and $165$ cm

  5. $165$ cm and $185$ cm (to the nearest whole number)

Question 5

The operating times of phone batteries are approximately normally distributed with a mean of $34$ hours and a standard deviation of $4$ hours. Answer the following questions using the empirical rule:

  1. Approximately what percentage of batteries last between $22$ and $42$ hours?

  2. Approximately what percentage of batteries last between $30$ hours and $42$ hours?

  3. Any battery that lasts less than $22$ hours is deemed faulty. If a company manufactured $51000$ batteries, approximately how many batteries would they be able to sell? Round your answer to the nearest integer.

Outcomes

U34.AoS4.3

continuous random variables:
  • construction of probability density functions from non-negative functions of a real variable
  • specification of probability distributions for continuous random variables using probability density functions
  • calculation and interpretation of mean, $\mu$, variance, $\sigma^2$, and standard deviation of a continuous random variable and their use
  • standard normal distribution, $N(0,1)$, and transformed normal distributions, $N(\mu,\sigma^2)$, as examples of a probability distribution for a continuous random variable
  • effect of variation in the value(s) of defining parameters on the graph of a given probability density function for a continuous random variable
  • calculation of probabilities for intervals defined in terms of a random variable, including conditional probability (the cumulative distribution function may be used but is not required)

U34.AoS4.4

statistical inference, including definition and distribution of sample proportions, simulations and confidence intervals:
  • distinction between a population parameter and a sample statistic and the use of the sample statistic to estimate the population parameter
  • simulation of random sampling, for a variety of values of $p$ and a range of sample sizes, to illustrate the distribution of $\hat{P}$ and variations in confidence intervals between samples
  • concept of the sample proportion as a random variable whose value varies between samples, where $X$ is a binomial random variable which is associated with the number of items that have a particular characteristic and $n$ is the sample size
  • approximate normality of the distribution of $\hat{P}$ for large samples and, for such a situation, the mean $p$ (the population proportion) and standard deviation
  • determination and interpretation of, from a large sample, an approximate confidence interval for a population proportion where $z$ is the appropriate quantile for the standard normal distribution, in particular the $95\%$ confidence interval as an example of such an interval where $z\approx1.96$ (the term standard error may be used but is not required).
