 New Zealand
Level 8 - NCEA Level 3

The approximate normal distribution of the sample proportions

Lesson

Some surveys are designed to estimate the proportion of a population exhibiting a certain characteristic.

For example, a survey of $200$ residents might be conducted to gauge their feelings about a proposed retail development in their area. The proportion of surveyed residents who are in favour of the development, which we call $\hat{p}$, is used as an estimate of the proportion $p$ of all residents who are in favour of it.

The sample proportion $\hat{p}$ has a probability distribution. That is to say, $\hat{p}$ is a random variable that can take on a range of values, each with a certain probability.

Like any random variable, $\hat{p}$ has a mean $\mu_{\hat{p}}$ and a variance $\sigma_{\hat{p}}^2$ around that mean. These parameters depend on the sample size and the true proportion $p$, and they can be estimated using the sample proportion $\hat{p}$.

Before we determine them, we need to quickly review a property concerning means and variances generally.

Dividing a data set by k

Consider the effect on the mean and variance of any set of numbers when each number is divided by a non-zero constant.

As a simple example, consider the data set $\{3,5,9,11,12\}$.

The mean is $\overline{x}=8$ and, using software, the sample variance is $\sigma_x^2=15$.

What happens to these values if we divide each number in the set by $10$?

The new data set becomes $\{0.3,0.5,0.9,1.1,1.2\}$.

The new mean is $\overline{x}=0.8$ and, from software, the new variance is $\sigma_x^2=0.15$.

The lesson here is that dividing the data by $10$ reduced the mean by a factor of $10$ and the variance by a factor of $100$.

In general, dividing a data set by a non-zero constant $k$ divides the mean by $k$ and the variance by $k^2$.
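The scaling rule can be checked numerically. The following is a minimal sketch in Python, assuming the standard library's `statistics` module (its `variance` function computes the sample variance); the variable names are illustrative only.

```python
from statistics import mean, variance

data = [3, 5, 9, 11, 12]
k = 10  # the non-zero constant we divide by

# Divide every value in the data set by k.
scaled = [x / k for x in data]

# Ratio of the original statistic to the scaled statistic.
mean_ratio = mean(data) / mean(scaled)              # expected to be close to k
variance_ratio = variance(data) / variance(scaled)  # expected to be close to k**2

print("original mean and variance:", mean(data), variance(data))
print("scaled mean and variance:  ", mean(scaled), variance(scaled))
print("ratios:", mean_ratio, variance_ratio)
```

Because of floating-point rounding, the printed ratios may differ from $10$ and $100$ in the last few decimal places.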

This fact plays an important role in determining the mean and standard deviation of the distribution of sample proportions.

The parameters of the distribution of sample proportions

The Central Limit Theorem can be applied to proportions, but the parameters of the sampling distribution change to reflect the fact that we are dealing with proportions.

Our estimate of the proportion $p$ in a binomial experiment is given by the statistic $\hat{p}=\frac{x}{n}$, where $x$ represents the number of successes in $n$ trials.

If we designate a failure in each binomial trial by the value $0$ and a success by the value $1$, then the random variable $x$ can be interpreted as the sum of $n$ values consisting only of zeros and ones, and $\hat{p}$ is just the sample mean of these $n$ values.

Hence, by the Central Limit Theorem, for $n$ sufficiently large, $\hat{p}$ is approximately normally distributed.
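The zero-and-one coding can be illustrated with a short Python sketch. The list `outcomes` below is made-up data for illustration only, with $1$ marking a success and $0$ a failure.

```python
from statistics import mean

# Hypothetical results of n = 10 binomial trials (1 = success, 0 = failure).
outcomes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

n = len(outcomes)  # number of trials
x = sum(outcomes)  # number of successes is the sum of the 0/1 values
p_hat = x / n      # sample proportion

# The sample proportion is the same number as the mean of the 0/1 values.
print(p_hat, mean(outcomes))
```

This is why results about sample means, such as the Central Limit Theorem, carry over to sample proportions.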

The mean of $\hat{p}$ equals the mean of $x$ divided by $n$, and the variance of $\hat{p}$ equals the variance of $x$ divided by $n^2$. Because $x$ is binomially distributed, its mean is $np$ and its variance is $np(1-p)$.

This means that $\mu_{\hat{p}}=\frac{np}{n}=p$, as expected.

The variance of $\hat{p}$ is given by $\sigma_{\hat{p}}^2=\frac{np(1-p)}{n^2}=\frac{p(1-p)}{n}$.

These are the parameters of the sampling distribution of $\hat{p}$.
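A simulation gives a rough check of these two parameters. The sketch below uses only Python's standard library; the true proportion `p = 0.3`, the sample size `n = 200` and the number of repeated samples are assumed values chosen for illustration, not figures from this lesson.

```python
import random
from statistics import mean, variance

random.seed(1)  # fixed seed so that repeated runs give the same numbers

p = 0.3             # assumed true proportion (illustrative)
n = 200             # size of each sample (illustrative)
num_samples = 5000  # number of repeated samples (illustrative)


def sample_proportion(n, p):
    """Simulate n binomial trials and return the proportion of successes."""
    successes = sum(1 for _ in range(n) if random.random() < p)
    return successes / n


# Build an empirical sampling distribution of p-hat.
p_hats = [sample_proportion(n, p) for _ in range(num_samples)]

simulated_mean = mean(p_hats)
simulated_variance = variance(p_hats)
theoretical_variance = p * (1 - p) / n

print("simulated mean:      ", simulated_mean, " theory:", p)
print("simulated variance:  ", simulated_variance)
print("theoretical variance:", theoretical_variance)
```

The simulated values should sit close to $p$ and $\frac{p(1-p)}{n}$, with small differences due to random variation.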

A full example

A survey was conducted to ascertain the proportion of primary school teachers in Melbourne who own a household pet. Of the $300$ sampled, $218$ owned a pet.

1. Calculate the sample proportion $\hat{p}$ of surveyed teachers who own a pet.
2. Estimate the standard deviation of the random variable $\hat{p}$ for samples of size $300$.
3. Estimate the probability that, in a future survey involving $300$ teachers, the value of a second estimate $\hat{p}$ will be somewhere between $0.7$ and $0.75$.
4. How confident can we be that the true proportion of teachers owning a pet is somewhere between $0.675$ and $0.778$?

To answer these questions we take, as our best estimate of the true proportion $p$, the sample proportion $\hat{p}=\frac{218}{300}\approx0.7267$.

We use this value as our estimate of the mean $\mu_{\hat{p}}$ of the distribution of all possible sample proportions.

The variance of the distribution around $\mu_{\hat{p}}$ is estimated by $\sigma_{\hat{p}}^2=\frac{\hat{p}(1-\hat{p})}{n}=\frac{0.7267\times0.2733}{300}\approx0.00066202$, and so the standard deviation is approximately $\sqrt{0.00066202}\approx0.02573$.

So the estimated mean and standard deviation of the sampling distribution are $\mu_{\hat{p}}=0.7267$ and $\sigma_{\hat{p}}=0.02573$.

Because the sampling distribution of $\hat{p}$ is expected to be approximately normal, we can use it to estimate probabilities for future samples of size $300$. Notice that $\mu_{\hat{p}}\pm1\times\sigma_{\hat{p}}$ gives an interval with approximate endpoints $0.7$ and $0.75$.

Since the sampling distribution is approximately normal, we can use the empirical rule to state that there is about a $68\%$ chance that a second estimate will fall within the interval specified in the question.
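The empirical-rule estimate can be compared with a direct normal calculation. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; it recomputes $\hat{p}$ and its estimated standard deviation from the survey counts rather than using the rounded values.

```python
from math import sqrt
from statistics import NormalDist

n = 300        # teachers surveyed
owners = 218   # teachers who own a pet

p_hat = owners / n                    # sample proportion
sd = sqrt(p_hat * (1 - p_hat) / n)    # estimated standard deviation of p-hat

# Approximate the sampling distribution of p-hat by a normal distribution.
sampling_dist = NormalDist(mu=p_hat, sigma=sd)

# Probability that a second estimate lies between 0.7 and 0.75.
probability = sampling_dist.cdf(0.75) - sampling_dist.cdf(0.7)

print(round(p_hat, 4), round(sd, 5), round(probability, 2))
```

The result should be close to the $68\%$ given by the empirical rule, since $0.7$ and $0.75$ are only approximately one standard deviation from the mean.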

Provided $n$ is large, we can also make approximate probability statements about the true proportion $p$.

That is to say, if $n$ is large enough that we are willing to treat $\hat{p}$ as approximately equal to the true proportion $p$, then we can make fairly good probability statements about where $p$ lies.

With that in mind, we should calculate how many standard deviations away from the mean each endpoint is. We notice that $\mu_{\hat{p}}\pm2\times\sigma_{\hat{p}}$ is an interval with endpoints $0.675$ and $0.778$. Hence, using the empirical rule again, we can be about $95\%$ confident that the true proportion is within this interval.
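The interval endpoints can be reproduced with a few lines of Python. This is a sketch under the same normal approximation, using only the standard library (`statistics.NormalDist` requires Python 3.8 or later); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 300        # teachers surveyed
owners = 218   # teachers who own a pet

p_hat = owners / n
sd = sqrt(p_hat * (1 - p_hat) / n)  # estimated standard deviation of p-hat

# Interval reaching two standard deviations either side of the estimate.
lower = p_hat - 2 * sd
upper = p_hat + 2 * sd

# Probability that a normal variable lies within two standard deviations
# of its mean, which the empirical rule rounds to 95%.
coverage = NormalDist().cdf(2) - NormalDist().cdf(-2)

print(round(lower, 3), round(upper, 3), round(coverage, 3))
```

The printed endpoints should agree with $0.675$ and $0.778$ to three decimal places.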

Worked Examples

QUESTION 1

A survey was carried out to investigate the number of teachers in Australian schools who like using the chalkboard to teach. This survey found that in a sample of $1222$ teachers, $321$ liked using the chalkboard, while the rest did not.

1. Calculate the sample proportion, $\hat{p}$, of those teachers surveyed who like using the chalkboard.

Write your answer correct to $3$ decimal places.

2. Using your answer from part 1, estimate the standard deviation, $\hat{\sigma}$, of the random variable $\hat{P}$ for samples of size $1222$.

Round your answer to $3$ decimal places.

QUESTION 2

The proportion of the population exhibiting a certain characteristic is $p=0.66$. A sample of size $50$ was taken from the population.

1. Determine the standard deviation, $\sigma_{\hat{P}}$, of the random variable $\hat{P}$, which measures the sample proportion of those exhibiting the characteristic. Round your answer to $3$ decimal places.

2. Using a normal approximation, what is the probability that $\hat{P}$ would lie between $0.64$ and $0.78$? Round your answer to $2$ decimal places.

QUESTION 3

The probability of an adult having fair hair is $p=0.2$. A random sample of size $381$ was taken from the population.

1. Many samples of size $381$ were taken. Which of the following are true about the distribution of the sample proportions of adults having fair hair? Choose all true statements.

A. The distribution of the sample proportions is approximately normal.

B. The mean of this empirical distribution of sample proportions approximates $p=0.2$.

C. The graph of the sample proportions is skewed.

D. The graph of the sample proportions is approximately symmetrical and bell shaped.

2. State the standard deviation, $\sigma$, of $\hat{P}$, the proportion of adults with fair hair in a sample of size $381$. Round your answer to $3$ decimal places.

3. Using the normal approximation, determine the probability, $q$, that in any random sample of $381$ adults, there are between $70$ and $84$ people with fair hair. Write your answer to $2$ decimal places.

Outcomes

S8-2

Make inferences from surveys and experiments: A determining estimates and confidence intervals for means, proportions, and differences, recognising the relevance of the central limit theorem B using methods such as resampling or randomisation to assess

91582

Use statistical methods to make a formal inference