
10.03 Approximating the distribution of sample proportions

Lesson

Let's recap a few important things we have noticed about sample proportions from our previous lesson and also our investigation on simulations and sample proportions.

  • We know that when we take samples from a parent population in which a proportion $p$ exhibits a particular characteristic, the samples vary, and so do their $\hat{p}$ values.
  • We know that when we are sampling from a small population, the number in the sample exhibiting a particular characteristic, $X$, can be modelled using our knowledge of selections.
  • We know that when we are sampling from a large population, the number in the sample exhibiting a particular characteristic, $X$, can be modelled using a binomial random variable.
  • Whether the sample is from a small or from a large population, the distribution of the sample proportions is a linear scaling of the distribution of $X$, since $\hat{P}=\frac{X}{n}$.
  • After repeated sampling we notice that the distribution of the sample proportions is approximately normal when $n$ is sufficiently large and the population proportion is not too close to $0$ or $1$.

Let's focus more on this final point as this will be the emphasis of this chapter.

 

Approximating the distribution of the sample proportions

Let's go back to sampling from a large population. In the Australian population, approximately $9\%$ have extras-only private health cover.

Exploration

Scenario 1. Let's take $100$ samples of size $20$ from the population, calculate the associated sample proportions, plot them, and summarise our observations.

From our work in the previous chapter, and from our knowledge of binomial graphs, we should already have an idea of what we might observe. What shape do you think the graph will have?

Of course we're not really going to go and take $100$ samples of $20$ people each, so we'll simulate it using technology, as demonstrated in the investigation.

If $X\sim B\left(20,0.09\right)$ and $\hat{P}=\frac{X}{20}$, and we simulate sampling $100$ times, a possible graph of the distribution is as follows:

As expected, the graph of the sample proportions is positively skewed.
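The simulation for Scenario 1 can be sketched in Python. This is a minimal illustration rather than the technology used in the investigation: the function name `simulate_sample_proportions` and the fixed seed are assumptions made for this example, and a different seed will give a different set of $100$ sample proportions.

```python
import random


def simulate_sample_proportions(n, p, num_samples, seed=0):
    """Return num_samples simulated sample proportions X/n, with X ~ B(n, p)."""
    rng = random.Random(seed)  # seeded (arbitrary choice) so the run is reproducible
    proportions = []
    for _ in range(num_samples):
        # Count how many of the n sampled people have the characteristic.
        x = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(x / n)
    return proportions


p_hats = simulate_sample_proportions(n=20, p=0.09, num_samples=100)

# Tally how often each sample proportion occurred (a text version of the graph).
counts = {}
for p_hat in p_hats:
    counts[p_hat] = counts.get(p_hat, 0) + 1

for value in sorted(counts):
    print(f"{value:.2f} {'*' * counts[value]}")
```

Each row of the printed tally plays the role of one column in the graph, so a long tail to the right shows up as a run of short rows at the larger proportions.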

Scenario 2. Let's now take $100$ samples, this time of $50$ people each.

Now $X\sim B\left(50,0.09\right)$ and $\hat{P}=\frac{X}{50}$. A possible graph of the sample proportions is:

Notice that the obvious positive skew in the first scenario is no longer as evident. Instead, the graph looks more symmetrical.

Scenario 3. Let's now take a final $100$ samples, this time of $200$ people each.

Now $X\sim B\left(200,0.09\right)$ and $\hat{P}=\frac{X}{200}$. A possible graph of the sample proportions is:

Once again, the graph appears more symmetrical and begins to resemble a bell-shaped curve. If we took more samples, the shape of the distribution would become more apparent and approach the normal distribution curve shown in the graph below.
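One way to put a number on the fading skew across the three scenarios is a skewness coefficient (the third standardised moment). This measure is not part of the lesson; it is used here only as an illustration. The sketch below also takes $10\,000$ samples per scenario instead of $100$, an assumption made so that the estimates are less noisy. The helper names are hypothetical.

```python
import random


def simulate_sample_proportions(n, p, num_samples, rng):
    """Simulate num_samples sample proportions X/n, with X ~ B(n, p)."""
    return [sum(rng.random() < p for _ in range(n)) / n
            for _ in range(num_samples)]


def sample_skewness(values):
    """Third standardised moment: a positive value means a longer right tail."""
    m = sum(values) / len(values)
    second = sum((v - m) ** 2 for v in values) / len(values)
    third = sum((v - m) ** 3 for v in values) / len(values)
    return third / second ** 1.5


rng = random.Random(1)  # fixed seed (an arbitrary choice) for reproducibility
skewness_by_n = {}
for n in (20, 50, 200):
    # 10 000 samples rather than 100, so the estimate is less noisy.
    p_hats = simulate_sample_proportions(n, 0.09, 10_000, rng)
    skewness_by_n[n] = sample_skewness(p_hats)
    print(f"n = {n:3d}: skewness of simulated sample proportions = {skewness_by_n[n]:.2f}")
```

A skewness close to $0$ indicates a roughly symmetrical distribution, so we would expect the printed values to shrink as $n$ grows.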

The mean and standard deviation of the normal approximation shown here will match the mean and standard deviation of the distribution of sample proportions found in our previous lesson:

$\mu=p=0.09$

and

$\sigma=\sqrt{\frac{p\times(1-p)}{n}}=\sqrt{\frac{0.09\times0.91}{200}}=0.02$

So for this last example we could quite closely approximate the distribution of sample proportions using a normal distribution.
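To see how close the approximation is for $n=200$, we can compare an exact binomial probability with the corresponding normal probability. The sketch below uses only the Python standard library; the cut-off of $0.10$ is an arbitrary value chosen for illustration, and `binomial_cdf` is a helper written for this example.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 200, 0.09

# Parameters of the normal approximation to the sample proportions.
mu = p
sigma = sqrt(p * (1 - p) / n)
print(f"mu = {mu}, sigma = {sigma:.4f}")


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


# Illustrative cut-off (an assumption): P(P_hat <= 0.10), which is P(X <= 20).
cutoff = 0.10
exact = binomial_cdf(20, n, p)
approx = NormalDist(mu, sigma).cdf(cutoff)
print(f"exact binomial: {exact:.3f}, normal approximation: {approx:.3f}")
```

The two probabilities will not match exactly, partly because the binomial distribution is discrete while the normal distribution is continuous, but they should be reasonably close.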

Normal approximation

When the sample size, $n$, is sufficiently large, the distribution of the sample proportions, $\hat{P}$, is approximately normal with $\mu=p$ and $\sigma=\sqrt{\frac{p\times(1-p)}{n}}$.

 

Is $n$ large enough?

In the above scenarios we experimented with different values of $n$ to show that the distribution of the sample proportions approaches normality as $n$ gets larger. For what values of $n$ can we be confident that we can use a normal distribution to approximate the distribution of the sample proportions?

In fact, the normality of the distribution depends on both the sample size and the population proportion: a population proportion close to $0$ or $1$ gives rise to a skewed distribution. A rough rule of thumb for normality is to require $n$ and $p$ such that $n\times p\ge5$ and also $n\times(1-p)\ge5$.
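The rule of thumb is easy to check in code. Here is a small sketch applied to the three scenarios above; the function name `normal_approximation_ok` is made up for this example.

```python
def normal_approximation_ok(n, p, threshold=5):
    """Rule of thumb: both n*p and n*(1-p) must be at least the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold


for n in (20, 50, 200):
    print(f"n = {n}: n*p = {n * 0.09:.1f}, rule satisfied: {normal_approximation_ok(n, 0.09)}")
```

With $p=0.09$, only $n=200$ satisfies the rule, since $20\times0.09=1.8$ and $50\times0.09=4.5$ both fall below $5$. This agrees with the skew we saw in the first two scenarios.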

Worked examples

Example 1

In the population of Australians who own a subscription to a video-streaming site, it is known that $32\%$ watched El Camino on the day it was released.

In a sample of $453$ subscribers, it was found that $172$ watched El Camino on the day it was released.

(a) State $\hat{p}$, the proportion of subscribers in the sample who watched El Camino on its release date.

Think: To find the sample proportion, we simply divide the number exhibiting the characteristic by the total number in the sample.

Do: $\hat{p}=\frac{172}{453}=0.3797$

(b) Determine the approximate probability that, in a random sample of $453$ Australian Netflix subscribers, the proportion who watched El Camino on the release date is less than the proportion observed in the sample above.

Think: Before we calculate the probability, we can confirm the distribution of the sample proportions, $\hat{P}$, can be modelled by a normal distribution. Our $n$ certainly appears large enough, and both $453\times0.32$ and $453\times0.68$ are greater than $5$. Therefore we can approximate the distribution of $\hat{P}$ as a normal distribution with $\mu=0.32$ and $\sigma=\sqrt{\frac{0.32\times0.68}{453}}=0.0219$.

As noted above, we use the population proportion to best model the normal distribution, when we know it.

Do: Using the normal distribution, we use the normal cumulative distribution function in our calculator to determine $P(\hat{P}<0.3797)=0.9968$.
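If a calculator is not to hand, the same calculation can be sketched in Python, where `statistics.NormalDist` from the standard library stands in for the calculator's normal cumulative distribution function. The variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

p = 0.32          # known population proportion
n = 453           # sample size
p_hat = 172 / n   # observed sample proportion from part (a)

# Standard deviation of the sample proportions under the normal approximation.
sigma = sqrt(p * (1 - p) / n)

# P(P_hat < p_hat) using the normal cumulative distribution function.
probability = NormalDist(mu=p, sigma=sigma).cdf(p_hat)
print(f"sigma = {sigma:.4f}, P(P_hat < {p_hat:.4f}) = {probability:.4f}")
```

The code uses unrounded values of $\hat{p}$ and $\sigma$, so it may differ very slightly from a calculation that starts from the rounded figures.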

Example 2

The time taken to upload a video is uniformly distributed between $2$ minutes and $6$ minutes.

(a) Calculate the probability that a video takes less than $3$ minutes to upload.

Think: Using our work on continuous random variables and the continuous uniform distribution, we can visualise or sketch the distribution and calculate the area under the rectangle.

Do: $1\times\frac{1}{4}=0.25$

(b) Random samples of $70$ video uploads are taken and each sample is used to calculate a point estimate for the proportion of videos that took less than $3$ minutes to upload. State the parameters of the distribution that best approximates the distribution of the sample proportions.

Think: Our $n$ appears large enough and our conditions $n\times p\ge5$ and $n\times(1-p)\ge5$ hold. Thus, the distribution that will best approximate the distribution of the sample proportions is a normal distribution. We can now calculate the mean and standard deviation of the distribution.

Do: $\mu=0.25$, the proportion in the parent distribution that take less than $3$ minutes to upload.

$\sigma=\sqrt{\frac{0.25\times0.75}{70}}=0.0518$

Thus, $\hat{P}\sim N\left(0.25,0.0518^2\right)$

(c) Hence, calculate the probability that a randomly chosen sample has a sample proportion of videos taking less than $3$ minutes to upload greater than $0.3$.

Think: Using our normal distribution defined above, we use the normal cumulative distribution function in our calculator to determine $P(\hat{P}>0.3)$.

Do: $P(\hat{P}>0.3)=0.167$
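The three parts of this example can be reproduced with a short Python sketch. It is one possible way to check the working, with `statistics.NormalDist` in place of the calculator's normal cumulative distribution function; the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

low, high = 2, 6   # upload time is uniform on [2, 6] minutes
n = 70             # size of each sample of uploads

# (a) Area of the rectangle between 2 and 3 minutes: width times height.
p = (3 - low) * (1 / (high - low))

# (b) Parameters of the approximating normal distribution.
mu = p
sigma = sqrt(p * (1 - p) / n)

# (c) Upper-tail probability P(P_hat > 0.3).
prob_above = 1 - NormalDist(mu, sigma).cdf(0.3)

print(f"p = {p}, sigma = {sigma:.4f}, P(P_hat > 0.3) = {prob_above:.3f}")
```

The upper-tail probability is found as one minus the cumulative probability, because the cumulative distribution function gives the area to the left of $0.3$.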

Practice questions

Question 1

A survey was carried out to investigate the number of teachers in Australian schools who like using the chalkboard to teach. This survey found that in a sample of $1222$ teachers, $321$ liked using the chalkboard, while the rest did not.

  1. Calculate the sample proportion, $\hat{p}$, of those teachers surveyed who like using the chalkboard.

    Write your answer correct to $3$ decimal places.

  2. Using your answer from part 1, estimate the standard deviation, $\hat{\sigma}$, of the random variable $\hat{P}$, for such samples of size $1222$.

    Round your answer to three decimal places.

Question 2

The probability of an adult having fair hair is $p=0.2$. A random sample of size $381$ was taken from the population.

  1. Many samples of size $381$ were taken. Which of the following are true about the distribution of the sample proportions of adults having fair hair? Choose all true statements.

    A. The mean of this empirical distribution of sample proportions approximates $p=0.2$.

    B. The graph of the sample proportions is skewed.

    C. The distribution of the sample proportions is approximately normal.

    D. The graph of the sample proportions is approximately symmetrical and bell shaped.
  2. State the standard deviation, $\sigma$, of $\hat{P}$, the proportion of adults with fair hair in a sample of size $381$.

    Round your answer to three decimal places.

  3. Using the normal approximation, determine the probability, $q$, that in any random sample of $381$ adults, there are between $70$ and $84$ people with fair hair.

    Round your answer to two decimal places.

Question 3

A proportion $p$ of Tasmanians live in rural areas. The standard deviation of sample proportions of Tasmanians who live in rural areas in a sample of size $200$ is $\frac{1}{30}$.

  1. Find the possible values of $p$.

    Enter your solutions on the same line, separated by a comma.

Outcomes

4.5.2.2

consider the approximate normality of the distribution of $\hat{p}$ for large samples

4.5.2.3

simulate repeated random sampling, for a variety of values of $p$ and a range of sample sizes, to illustrate the distribution of $\hat{p}$ and the approximate standard normality of $\frac{\hat{p}-p}{\sqrt{\hat{p}(1-\hat{p})/n}}$, where the closeness of the approximation depends on both $n$ and $p$
