
10.04 Confidence intervals for sample proportions

Lesson

We see and hear a lot of statistics in the media, published by polls or researchers. However, as we have discovered, these are often taken from samples of a population.

For example, in 2019, The Guardian held its final poll of Australian voters a few days before the Federal Election, and found that in a sample of $1201$ voters, $51.5%$ would be voting for Labor on Saturday, 18th May. As we now know, the Federal Election was won by the Liberal Party, not the Labor Party. The true two-party-preferred population proportion turned out to be $p=0.4847$, compared with our sample proportion $\hat{p}=0.515$.

Does this mean this was a "bad" sample? Should The Guardian have conducted a survey with a larger sample size? Are there other factors that may play a part here?

As we saw in the previous chapter, the distribution of sample proportions is approximately normal when the sample size is large enough (in this case, a sample of $1201$ is considered large enough). How can we use this information to interpret our point estimate of the population proportion?

We can use our sample proportion together with the normal approximation to the distribution of sample proportions to calculate a range of values within which we can have a certain level of confidence that the true population proportion is contained. We call this range of values a confidence interval and it provides an interval estimate for the population proportion.

 

Calculating a confidence interval

When calculating a confidence interval there are a few things to remember. Firstly, recall that the distribution of the sample proportions is modelled by a normal distribution with a mean of $p$ and a standard deviation of $\sqrt{\frac{p\times(1-p)}{n}}$. Since we don't know $p$, we will use $\hat{p}$ in its place to estimate this normal distribution.
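As a rough illustration, the estimated standard deviation can be computed in a few lines of Python. This is only a sketch: the function name `estimated_sd` is our own choice, and the values used are the Guardian poll figures quoted earlier in the lesson.

```python
import math

def estimated_sd(p_hat, n):
    """Estimate the standard deviation of the sample proportion,
    using p_hat in place of the unknown p."""
    return math.sqrt(p_hat * (1 - p_hat) / n)

# Guardian poll figures from the lesson: p_hat = 0.515, n = 1201
sd = estimated_sd(0.515, 1201)
print(round(sd, 5))  # 0.01442
```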

Now we choose a level of confidence. Often the level of confidence is either $90%$, $95%$ or $99%$. Why don't we choose a confidence level of $100%$? The way we calculate our confidence interval is directly related to the normal distribution and, as we saw in our study of the normal distribution, its tails extend towards positive and negative infinity. This means we cannot find interval endpoints that capture $100%$ confidence unless we choose them as $0$ and $1$, which covers every possible proportion but tells us nothing about what the true population proportion is.

What does it mean to calculate a $90%$ interval? Visually it looks like this:

So, we are calculating two values, one on either side of $\hat{p}$, each $k$ standard deviations away from $\hat{p}$. How many standard deviations exactly? That depends on our confidence level.

For a $90%$ confidence level, we use $k=1.6449$ standard deviations from the mean, since, using the standard normal distribution, $z\approx-1.6449$ and $z\approx1.6449$ are the $z$-scores between which $90%$ of scores lie.
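If we want the $z$-score for other confidence levels, one option is Python's standard library. The sketch below assumes a two-sided interval, so the leftover area is split evenly between the two tails; the function name `z_for_confidence` is our own.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Return k such that the central `level` of the standard normal
    distribution lies between -k and k."""
    tail = (1 - level) / 2  # area left over in each tail
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_for_confidence(level), 4))
```

For a level of $0.90$ this gives approximately $1.6449$, matching the value used above.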

Worked example

Example 1

Calculate a $90%$ confidence interval for the true population proportion of voters predicted to vote for Labor in the 2019 election using the Guardian Poll held just before the Federal Election.

Think: Although we now know, after the fact, what the true population proportion was for that election, in the days prior to the event our best bet was to use polling information to estimate the true value of $p$. In this example we will calculate an interval estimate for $p$ with $90%$ confidence. To do this we need a $\hat{p}$, which we know is $0.515$, a sample size, which we know is $1201$, and a $z$-score associated with the level of confidence we want, which we know is $1.6449$.

Do:

Lower bound of interval $=0.515-1.6449\times\sqrt{\frac{0.515\times(1-0.515)}{1201}}$ (the mean minus $1.6449$ standard deviations)

$=0.515-1.6449\times0.01442$ (calculating the standard deviation)

$=0.4913$ (calculating the lower bound)

Upper bound of interval $=0.515+1.6449\times\sqrt{\frac{0.515\times(1-0.515)}{1201}}$ (the mean plus $1.6449$ standard deviations)

$=0.515+1.6449\times0.01442$ (calculating the standard deviation)

$=0.5387$ (calculating the upper bound)

 

We can now say that we are $90%$ confident that the true population proportion of people voting for Labor is in the interval $(0.4913,0.5387)$.

We also now know that $p=0.4847$ does not lie in that interval.

What if we increased our confidence level to $95%$?

Following the same procedure, this time with a $z$-score of $1.95996$, we obtain an interval of $(0.4867,0.5433)$. Once again, the true population proportion is slightly outside this interval.

Calculating a $99%$ confidence interval, we obtain $(0.4779,0.5521)$, and this time we see that the true population proportion is contained within the confidence interval.
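The same procedure can be sketched in Python to recompute the intervals for all three confidence levels directly from $\hat{p}=0.515$ and $n=1201$. The function name `confidence_interval` is our own, and the $z$-scores are taken from the standard library rather than rounded by hand, so the last decimal place may differ slightly from a hand calculation.

```python
import math
from statistics import NormalDist

def confidence_interval(p_hat, n, level):
    """Approximate two-sided confidence interval for a proportion."""
    k = NormalDist().inv_cdf(1 - (1 - level) / 2)      # z-score for this level
    margin = k * math.sqrt(p_hat * (1 - p_hat) / n)    # k standard deviations
    return p_hat - margin, p_hat + margin

p_true = 0.4847  # two-party-preferred result quoted in the lesson
for level in (0.90, 0.95, 0.99):
    lower, upper = confidence_interval(0.515, 1201, level)
    contains = lower <= p_true <= upper
    print(f"{level:.0%}: ({lower:.4f}, {upper:.4f}) contains p? {contains}")
```

Only the $99%$ interval should report that it contains $p$.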

 

What does a confidence interval really mean?

Let's return to a $90%$ confidence interval.

We say that we are $90%$ confident that the true population proportion lies within the confidence interval.

And as we saw above, the higher the confidence level, the more confident we become that $p$ will be contained within the interval. This is because we widen the net of our interval, due to the larger $z$-score associated with the higher level of confidence.

The three confidence intervals above, and the true proportion $p$, can be visualised below.

Let's explore this further with the following Geogebra applet.

Set the confidence level to $90%$. Then set $n$ to be larger than $50$. Select a proportion $p$ of around $0.45$.

What do you notice? What do the green lines represent? What do the red lines represent?

You should be able to conclude, after playing around with the applet for a while, that the red lines represent the confidence intervals that do not contain $p$ and the green lines represent the confidence intervals that do contain $p$. When we set the confidence level at $90%$ we see that approximately $90%$ of all confidence intervals do contain $p$.

What we can and cannot say about a confidence interval

When considering a $90%$ confidence interval, or any other confidence interval, it is incorrect to say:

“The probability that the true proportion lies in the interval $\left(\hat{p}-1.6449\times\sqrt{\frac{\hat{p}\times\left(1-\hat{p}\right)}{n}},\hat{p}+1.6449\times\sqrt{\frac{\hat{p}\times\left(1-\hat{p}\right)}{n}}\right)$ is $90%$.”

Why is this incorrect?

The probability that the population proportion is within the interval estimate we created is either $1$ or $0$, since it is either within our created interval or it is not. Here are two analogies that may help you understand.

  1. What is the probability that the first day of 2017 was a Tuesday?
    You may think at first that it is $\frac{1}{7}$, since there are $7$ days in a week. However, there is no question of likelihood in this statement: either the first of January 2017 was a Tuesday, or it wasn’t. So we can say that this probability is either $0$ or $1$. (In this case it’s $0$, since it was a Sunday).
  2. You pick any real number, and I mark out a finite interval on the real line, as big as I like, but finite. From your perspective, what is the probability that the interval I just marked out contains your number?
    Well, as soon as I mark out my interval, you will know for sure whether my guess is correct or not. Either I’m right (so the probability is $1$), or I’m wrong (so the probability is $0$).

So what can we say?

We can say that we're $90%$ confident that $p$ lies within the $90%$ confidence interval that we've calculated.

We can also say that, given numerous samples and their corresponding sample proportions $\hat{p}$, we could create a separate $90%$ confidence interval for each $\hat{p}$, and we would expect approximately $90%$ of such intervals to contain the population proportion and $10%$ not to.

As we saw in the Geogebra applet, approximately $90%$ of the confidence intervals calculated, for various values of $\hat{p}$ obtained from various samples, contained $p$.
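This repeated-sampling idea can also be explored with a small simulation. The sketch below is not the applet itself: it assumes a population with $p=0.45$, draws many samples of size $100$ using Python's pseudo-random number generator, and counts how often the $90%$ interval built from each sample contains $p$. The function name `coverage` and the chosen settings are our own.

```python
import math
import random
from statistics import NormalDist

def coverage(p, n, level, trials, seed=0):
    """Fraction of simulated confidence intervals that contain the true p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    k = NormalDist().inv_cdf(1 - (1 - level) / 2)
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n and compute its sample proportion
        successes = sum(rng.random() < p for _ in range(n))
        p_hat = successes / n
        margin = k * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p <= p_hat + margin:
            hits += 1
    return hits / trials

print(coverage(p=0.45, n=100, level=0.90, trials=2000))
```

The printed fraction should be close to $0.90$, though it will vary a little with the seed and the number of trials.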

Practice questions

Question 1

$500$ samples, each of size $250$, have their two-sided $99%$ confidence interval calculated. How many of these intervals would we expect to contain the true population proportion?

Question 2

Ten samples, each of size $150$, have their two-sided $90%$ confidence interval calculated. Select the diagram of confidence intervals which best matches our expectation of how many should contain the true population proportion.

  1. A

    B

    C

    D

Question 3

A sample of size $160$ is taken from the population, and the sample proportion is found to be $0.48$.

  1. Calculate the approximate two-sided $95%$ confidence interval for the true proportion.

    Give your answer in the form $\left(a,b\right)$, and round your answer to two decimal places.

  2. Which of the following statements about the confidence interval are correct?

    Select all that apply.

    There is a $95%$ probability that the true proportion lies between $0.40$ and $0.56$.

    A

    We have $95%$ confidence that the true proportion lies between $0.40$ and $0.56$.

    B

    The probability that the true proportion lies within $\left(0.40,0.56\right)$ is $0$ or $1$.

    C

    The true proportion lies between $0.40$ and $0.56$.

    D

 

Margin of error

As we saw on the diagram above, the margin of error is the distance from $\hat{p}$ to one of the interval endpoints. We know from our calculations of a confidence interval that this is obtained by multiplying the standard deviation by the associated $z$-score.

Confidence intervals and the margin of error

A confidence interval is calculated in the following way:

$\left(\hat{p}-k\times\sqrt{\frac{\hat{p}\times(1-\hat{p})}{n}},\hat{p}+k\times\sqrt{\frac{\hat{p}\times(1-\hat{p})}{n}}\right)$

Where $\hat{p}$ is the sample proportion taken from our sample, $n$ is the size of our sample, and $k$ is the $z$-score associated with the level of confidence we wish to achieve.

The margin of error is the distance from $\hat{p}$ to either end of the confidence interval. This means we can calculate it in a number of ways:

  • Using the formula:

$\text{Margin of error}=k\times\sqrt{\frac{\hat{p}\times(1-\hat{p})}{n}}$, where $k$ and $n$ are as noted above.

Or if we have already calculated a confidence interval, $\left(a,b\right)$, the margin of error can be calculated as:

  • Half the width of the confidence interval: $\frac{b-a}{2}$
  • The distance between our point estimate and an endpoint of the confidence interval: $b-\hat{p}$ or $\hat{p}-a$
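To see that these approaches agree, here is a short Python sketch using the Guardian poll figures and the $90%$ interval $(0.4913,0.5387)$ from Example 1. The function and variable names are our own, and the interval endpoints are the rounded values from the worked example, so agreement is only expected to about four decimal places.

```python
import math

def margin_of_error(p_hat, n, k):
    """Margin of error: k standard deviations of the sample proportion."""
    return k * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat, n, k = 0.515, 1201, 1.6449
a, b = 0.4913, 0.5387  # rounded interval endpoints from Example 1

from_formula = margin_of_error(p_hat, n, k)
half_width = (b - a) / 2
from_upper = b - p_hat
from_lower = p_hat - a

print(round(from_formula, 4), round(half_width, 4),
      round(from_upper, 4), round(from_lower, 4))
```

Each of the four values rounds to $0.0237$.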

Practice questions

Question 4

A two-sided $99%$ confidence interval is calculated for a sample proportion $\hat{p}$ and $\left(0.672,0.841\right)$ is the confidence interval. Calculate the margin of error.

Question 5

A trout farm will only harvest fish if they are at least $41$ cm long. $600$ trout are caught and measured, and $75%$ were found to be at least the minimum length.

Find the approximate margin of error for a two-sided $95%$ confidence interval.
Give your answer as a percentage, correct to two decimal places.

 

Size of the interval and margin of error

As you might imagine, we often want to know a population proportion within a given level of accuracy. For example, it is not particularly useful to know the population proportion is between $0.25$ and $0.85$, because this interval is too wide to provide a useful estimate of the population proportion. Consider how the parameters of the margin of error formula influence the size of the margin of error and hence the width of the confidence interval. To reduce the margin of error and have a relatively narrow interval estimate for the population proportion, we can either reduce the confidence level or increase the sample size.

 

Changing the level of confidence

As we increase the level of confidence, we increase the size of the $z$-score, and we therefore increase the size of the margin of error, since we are multiplying the standard deviation by a larger $z$-value. Thus, as we increase the level of confidence, we need to accept a larger margin of error and a wider confidence interval. Conversely, we can accept a reduced level of confidence to obtain a narrower confidence interval.

 

Changing the sample size

As we increase the sample size $n$, we are decreasing the size of the variance $\frac{\hat{p}\times\left(1-\hat{p}\right)}{n}$ and hence decreasing the size of the standard deviation. The smaller the standard deviation, the smaller the margin of error and the narrower the confidence interval.

 

Different samples and their different $\hat{p}$ values

In our investigation on the graphs of the binomial distribution and the mean and standard deviation, we observed the effect of the proportion $p$ on the standard deviation. The standard deviation used to calculate our confidence intervals has very similar properties.

When $\hat{p}=0.5$ our standard deviation is at its maximum. However, when $\hat{p}$ tends towards $0$ or $1$, our standard deviation decreases. Thus, the closer our $\hat{p}$ value is to $0$ or $1$, the smaller our standard deviation and the smaller our margin of error.
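The three effects described in this section (confidence level, sample size and the value of $\hat{p}$) can be checked numerically. The sketch below varies one quantity at a time while holding the others fixed; the particular values of $n$, $\hat{p}$ and the confidence levels are arbitrary choices made for illustration only.

```python
import math
from statistics import NormalDist

def margin_of_error(p_hat, n, level):
    """Margin of error for an approximate two-sided confidence interval."""
    k = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return k * math.sqrt(p_hat * (1 - p_hat) / n)

# Higher confidence level -> larger margin of error
by_level = [margin_of_error(0.5, 1000, level) for level in (0.90, 0.95, 0.99)]

# Larger sample size -> smaller margin of error
by_size = [margin_of_error(0.5, n, 0.95) for n in (100, 400, 1600)]

# p_hat nearer 0 or 1 -> smaller margin of error; largest at p_hat = 0.5
by_p_hat = [margin_of_error(p, 1000, 0.95) for p in (0.1, 0.3, 0.5, 0.7, 0.9)]

print([round(m, 4) for m in by_level])
print([round(m, 4) for m in by_size])
print([round(m, 4) for m in by_p_hat])
```

Because $n$ sits under a square root, quadrupling the sample size only halves the margin of error.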

Using the margin of error to calculate the sample size

As we discussed earlier, real-life constraints often mean that we cannot collect a large number of samples, so instead we often have to decide how large our sample should be so that, within our level of confidence, the margin of error is at a level we are comfortable with. As we noted above, the larger the $n$ value, the smaller the margin of error.

Worked example

Example 2

Let's revisit our Guardian poll example.

(a) Calculate the margin of error for the $90%$ confidence interval.

Think: Remember that the margin of error is the distance from $\hat{p}$ to one endpoint of the interval, or half the width of the confidence interval.

Do: Let's find the margin of error by calculating half the width of the confidence interval.

Margin of error $=\frac{0.5387-0.4913}{2}$
  $=0.0237$

 

(b) Determine the size of the sample if the Guardian poll wanted a margin of error less than $0.02$ for a $90%$ level of confidence.

Think: Using the formula for the margin of error, we'll be able to write an equation that allows us to solve for $n$.

Do:

$0.02=1.6449\times\sqrt{\frac{0.515\times(1-0.515)}{n}}$ (substituting the relevant values into the margin of error formula)

$n=1689.54$ (rearranging, or solving for $n$ on our CAS)

Rounding up to an integer sample size so that the margin of error is less than $0.02$, we have $n=1690$.
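If a CAS is not available, the margin of error formula can be rearranged by hand to $n=\frac{k^2\times\hat{p}\times(1-\hat{p})}{E^2}$, where $E$ is the desired margin of error, and then evaluated. The Python sketch below does this and rounds up to the next whole number; the function name `required_sample_size` is our own.

```python
import math

def required_sample_size(p_hat, k, max_margin):
    """Smallest whole-number n for which k * sqrt(p_hat * (1 - p_hat) / n)
    does not exceed max_margin."""
    exact_n = k ** 2 * p_hat * (1 - p_hat) / max_margin ** 2
    return math.ceil(exact_n)  # round up: rounding down would exceed the margin

# Guardian poll: p_hat = 0.515, 90% confidence (k = 1.6449), margin 0.02
n = required_sample_size(0.515, 1.6449, 0.02)
print(n)  # 1690
```

We round up rather than to the nearest integer because a sample of $1689$ would give a margin of error slightly above $0.02$.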

 

Practice questions

Question 6

Two samples, $A$ and $B$, have the same $\hat{p}$. Each sample has a two-sided $90%$ confidence interval calculated. Sample $A$ is of size $100$. Sample $B$ is of size $500$.

  1. Which sample will have the larger margin of error?

    Sample $A$

    A

    Sample $B$

    B

Question 7

Use technology to determine the minimum sample size required to achieve a margin of error of $3%$ in an approximate two-sided $95%$ confidence interval for the proportion $p$ of primary school children in Australia who play competitive sport. The sample proportion $\hat{p}$ is found to be $0.6$.

 

Outcomes

4.5.3.1

understand the concept of an interval estimate for a parameter associated with a random variable

4.5.3.2

use the approximate confidence interval $\left(\hat{p}-z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\hat{p}+z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$ as an interval estimate for $p$, where $z$ is the appropriate quantile for the standard normal distribution

4.5.3.3

define the approximate margin of error $E=z\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ and understand the trade-off between margin of error and level of confidence
