We see and hear a lot of statistics in the media, published by polls or researchers. However, as we have discovered, these are often taken from samples of a population.
For example, in 2019, The Guardian held its final poll of Australian voters a few days before the Federal Election and found that, in a sample of $1201$ voters, $51.5\%$ would be voting for Labor on Saturday, 18th May. As we now know, the Federal Election was won by the Liberal Party, not the Labor Party. On a two-party-preferred basis, the true population proportion was $p=0.4847$, compared with our sample proportion $\hat{p}=0.515$.
Does this mean this was a "bad" sample? Should The Guardian have conducted a survey with a larger sample size? Are there other factors that may play a part here?
As we saw in the previous chapter, the distribution of sample proportions is approximately normal when the sample size is large enough (in this case, a sample of $1201$ is considered large enough). How can we use this information to interpret our point estimate of the population proportion?
We can use our sample proportion together with the normal approximation to the distribution of sample proportions to calculate a range of values within which we can have a certain level of confidence that the true population proportion is contained. We call this range of values a confidence interval and it provides an interval estimate for the population proportion.
When calculating a confidence interval there are a few things to remember. Firstly, we recall that the distribution of the sample proportions is modelled by a normal distribution with a mean of $p$ and a standard deviation of $\sqrt{\frac{p\times(1-p)}{n}}$. Since we don't know $p$, we will use $\hat{p}$ in its place to estimate this normal distribution.
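As a minimal sketch of this step, the estimated standard deviation can be computed directly from the sample. The function name `standard_error` is an illustrative choice, not something defined in this lesson, and the inputs are the Guardian poll values quoted above.

```python
from math import sqrt

def standard_error(p_hat: float, n: int) -> float:
    """Estimated standard deviation of the sample proportion,
    using p_hat in place of the unknown p."""
    return sqrt(p_hat * (1 - p_hat) / n)

# Guardian poll: p_hat = 0.515 from a sample of n = 1201 voters
se = standard_error(0.515, 1201)
print(round(se, 5))  # 0.01442
```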
Now we choose a level of confidence. Often the level of confidence is $90\%$, $95\%$ or $99\%$. Why don't we choose a confidence level of $100\%$? The way we calculate our confidence interval is directly related to the normal distribution and, as we saw in our study of the normal distribution, its tails extend towards positive and negative infinity. We therefore cannot find finite interval endpoints that capture $100\%$ confidence unless we choose them as $0$ and $1$, which covers every possible proportion but gives us no information about the true population proportion.
What does it mean to calculate a $90\%$ interval? Visually it looks like this:
So, we are calculating two values, one on either side of $\hat{p}$, each $k$ standard deviations away from $\hat{p}$. How many standard deviations exactly? That depends on our confidence level.
For a $90\%$ confidence level, we use $k=1.6449$ standard deviations from the mean, since $z\approx-1.6449$ and $z\approx1.6449$ are the $z$-scores of the standard normal distribution between which $90\%$ of scores lie.
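The $z$-score for a given confidence level can also be looked up with technology. The sketch below assumes Python's standard-library `statistics.NormalDist`; the helper name `z_score` is hypothetical and only used for illustration.

```python
from statistics import NormalDist

def z_score(confidence: float) -> float:
    """Return z such that the central `confidence` share of the
    standard normal distribution lies between -z and z."""
    tail = (1 - confidence) / 2          # area left in each tail
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_score(level), 4))
```

For a $90\%$ level, $5\%$ is left in each tail, so the function asks for the value below which $95\%$ of the standard normal distribution lies.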
Calculate a $90\%$ confidence interval for the true population proportion of voters predicted to vote for Labor in the 2019 election using the Guardian poll held just before the Federal Election.
Think: Although we now know, after the fact, what the true population proportion was for that election, in the days prior to the event our best bet was to use polling information to estimate the true value of $p$. In this example we will calculate an interval estimate for $p$ with $90\%$ confidence. To do this we need a sample proportion $\hat{p}$, which we know is $0.515$; a sample size, which we know is $1201$; and a $z$-score associated with the level of confidence we want, which we know is $1.6449$.
Do:
| Lower bound of interval | $=$ | $0.515-1.6449\times\sqrt{\frac{0.515\times(1-0.515)}{1201}}$ | The mean minus $1.6449$ standard deviations |
| | $=$ | $0.515-1.6449\times0.01442$ | Calculating the standard deviation |
| | $=$ | $0.4913$ | Calculating the lower bound |
| Upper bound of interval | $=$ | $0.515+1.6449\times\sqrt{\frac{0.515\times(1-0.515)}{1201}}$ | The mean plus $1.6449$ standard deviations |
| | $=$ | $0.515+1.6449\times0.01442$ | Calculating the standard deviation |
| | $=$ | $0.5387$ | Calculating the upper bound |
We can now say that we are $90\%$ confident that the true population proportion of people voting for Labor is in the interval $(0.4913,0.5387)$.
We also now know that $p=0.4847$ does not lie in that interval.
What if we increased our confidence level to $95\%$?
Following the same procedure, this time with a $z$-score of $1.95996$, we obtain an interval of $(0.4867,0.5433)$. Once again, the true population proportion is slightly outside this interval.
Calculating a $99\%$ confidence interval (with a $z$-score of $2.5758$), we obtain $(0.4779,0.5521)$, and this time the true population proportion is contained within the confidence interval.
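The three calculations follow the same steps, so they can be sketched as one small function. This is an illustration under stated assumptions rather than a prescribed method: the name `confidence_interval` is hypothetical, the $z$-score comes from Python's `statistics.NormalDist`, and $\hat{p}=0.515$, $n=1201$ and $p=0.4847$ are the values quoted in this lesson.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(p_hat: float, n: int, confidence: float):
    """Approximate two-sided confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

p_true = 0.4847  # two-party-preferred result, known only after the election
for level in (0.90, 0.95, 0.99):
    lower, upper = confidence_interval(0.515, 1201, level)
    contains_p = lower <= p_true <= upper
    print(f"{level:.0%}: ({lower:.4f}, {upper:.4f}) contains p: {contains_p}")
```

Only the $z$-score changes between the three levels, which is why the interval widens as the confidence level rises.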
Let's return to a $90\%$ confidence interval.
We say that we are $90\%$ confident that the true population proportion lies within the confidence interval.
As we saw above, the higher the confidence level, the more confident we are that $p$ is contained within the interval. This is because we widen the net of our interval, due to the larger $z$-score associated with the higher level of confidence.
The three confidence intervals above, and the true proportion $p$p, can be visualised below.
Let's explore this further with the following Geogebra applet.
Set the confidence level to $90\%$. Then set $n$ to be larger than $50$. Select a proportion $p$ of around $0.45$.
What do you notice? What do the green lines represent? What do the red lines represent?
You should be able to conclude, after playing around with the applet for a while, that the red lines are the confidence intervals that do not contain $p$ and the green lines are the confidence intervals that do contain $p$. When we set the confidence level at $90\%$, we see that approximately $90\%$ of all confidence intervals contain $p$.
When considering a $90\%$ confidence interval, or any other confidence interval, it is incorrect to say:
“The probability that the true proportion lies in the interval $\left(\hat{p}-1.6449\times\sqrt{\frac{\hat{p}\times\left(1-\hat{p}\right)}{n}},\hat{p}+1.6449\times\sqrt{\frac{\hat{p}\times\left(1-\hat{p}\right)}{n}}\right)$ is $90\%$.”
Why is this incorrect?
Once the interval has been calculated, the probability that the population proportion is within it is either $1$ or $0$, since $p$ is a fixed value that either is within our interval or is not.
So what can we say?
We can say that we're $90\%$ confident that $p$ lies within the $90\%$ confidence interval that we've calculated.
We can also say that, given numerous samples and their corresponding sample proportions $\hat{p}$, we could create a separate $90\%$ confidence interval for each $\hat{p}$, and we would expect approximately $90\%$ of these intervals to contain the population proportion and $10\%$ not to.
As we saw in the Geogebra applet, approximately $90\%$ of the confidence intervals calculated, for various values of $\hat{p}$, obtained from various samples, contained $p$.
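The repeated-sampling idea can be imitated with a short simulation. The sketch below is not the applet itself: the function name `coverage`, the fixed random seed, and the choices $p=0.45$, $n=100$ and $2000$ repetitions are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def coverage(p: float, n: int, confidence: float, trials: int, seed: int = 0) -> float:
    """Draw `trials` samples of size n from a population with true
    proportion p and return the fraction of confidence intervals
    that contain p."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    hits = 0
    for _ in range(trials):
        # Each member of the sample is a "success" with probability p
        successes = sum(rng.random() < p for _ in range(n))
        p_hat = successes / n
        margin = z * sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p <= p_hat + margin:
            hits += 1
    return hits / trials

rate = coverage(p=0.45, n=100, confidence=0.90, trials=2000)
print(rate)  # expected to be close to 0.90, not exactly 0.90
```

Each individual interval either contains $p$ or it does not; the $90\%$ describes the long-run share of intervals that do.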
$500$ samples, each of size $250$, have their two-sided $99\%$ confidence interval calculated. How many of these intervals would we expect to contain the true population proportion?
Ten samples, each of size $150$, have their two-sided $90\%$ confidence interval calculated. Select the diagram of confidence intervals which best matches our expectation of how many should contain the true population proportion.
A sample of size $160$ is taken from the population, and the sample proportion is found to be $0.48$.
Calculate the approximate two-sided $95\%$ confidence interval for the true proportion.
Give your answer in the form $\left(a,b\right)$, and round your answer to two decimal places.
Which of the following statements about the confidence interval are correct?
Select all that apply.
There is a $95\%$ probability that the true proportion lies between $0.40$ and $0.56$.
We have $95\%$ confidence that the true proportion lies between $0.40$ and $0.56$.
The probability that the true proportion lies within $\left(0.40,0.56\right)$ is $0$ or $1$.
The true proportion lies between $0.40$ and $0.56$.
As we saw on the diagram above, the margin of error is the distance from $\hat{p}$ to one of the interval endpoints. We know from our calculations of a confidence interval that this is obtained by multiplying the standard deviation by the associated $z$-score.
A confidence interval is calculated in the following way:
$\left(\hat{p}-k\times\sqrt{\frac{\hat{p}\times(1-\hat{p})}{n}},\hat{p}+k\times\sqrt{\frac{\hat{p}\times(1-\hat{p})}{n}}\right)$
where $\hat{p}$ is the sample proportion taken from our sample, $n$ is the size of our sample, and $k$ is the $z$-score associated with the level of confidence we wish to achieve.
The margin of error is the distance from $\hat{p}$ to either end of the confidence interval. This means we can calculate it in either of two ways:
$\text{Margin of error}=k\times\sqrt{\frac{\hat{p}\times(1-\hat{p})}{n}}$, where $\hat{p}$, $k$ and $n$ are as noted above.
Or, if we have already calculated a confidence interval $\left(a,b\right)$, the margin of error can be calculated as half its width: $\text{Margin of error}=\frac{b-a}{2}$.
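Both routes to the margin of error can be checked against each other. This is a small sketch using the Guardian poll figures from earlier in the lesson; the function names `margin_from_formula` and `margin_from_interval` are hypothetical, and the second one simply takes half the width of an interval $(a,b)$.

```python
from math import sqrt

def margin_from_formula(p_hat: float, n: int, k: float) -> float:
    """Margin of error as k standard deviations."""
    return k * sqrt(p_hat * (1 - p_hat) / n)

def margin_from_interval(a: float, b: float) -> float:
    """Margin of error as half the width of the interval (a, b)."""
    return (b - a) / 2

# 90% interval for the Guardian poll: (0.4913, 0.5387), with k = 1.6449
print(round(margin_from_formula(0.515, 1201, 1.6449), 4))  # 0.0237
print(round(margin_from_interval(0.4913, 0.5387), 4))      # 0.0237
```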
A two-sided $99\%$ confidence interval is calculated for a sample proportion $\hat{p}$ and $\left(0.672,0.841\right)$ is the confidence interval. Calculate the margin of error.
A trout farm will only harvest fish if they are at least $41$ cm long. $600$ trout are caught and measured, and $75\%$ were found to be at least the minimum length.
Find the approximate margin of error for a two-sided $95\%$ confidence interval.
Round your answer as a percentage correct to two decimal places.
As you might imagine, we often want to know a population proportion to within a given level of accuracy. For example, it is not particularly useful to know that the population proportion is between $0.25$ and $0.85$: this interval is too wide to provide a useful estimate of the population proportion. Consider how the parameters of the margin of error formula influence the size of the margin of error and hence the width of the confidence interval. To reduce the margin of error and obtain a relatively narrow interval estimate for the population proportion, we can either reduce the confidence level or increase the sample size.
As we increase the level of confidence, we increase the size of the $z$-score, and we therefore increase the size of the margin of error, since we are multiplying the standard deviation by a larger $z$-value. Thus, as we increase the level of confidence, we need to accept a larger margin of error and a wider confidence interval. Conversely, we can accept a reduced level of confidence to obtain a narrower confidence interval.
As we increase the sample size $n$, we decrease the size of the variance $\frac{\hat{p}\times\left(1-\hat{p}\right)}{n}$ and hence decrease the size of the standard deviation. The smaller the standard deviation, the smaller the margin of error and the narrower the confidence interval.
In our investigation on the graphs of the binomial distribution and the mean and standard deviation, we observed the effect of the proportion $p$ on the standard deviation. The standard deviation used to calculate our confidence intervals has very similar properties.
When $\hat{p}=0.5$ our standard deviation is at its maximum. However, when $\hat{p}$ tends towards $0$ or $1$, our standard deviation decreases. Thus, the closer our $\hat{p}$ value is to $0$ or $1$, the smaller our standard deviation and the smaller our margin of error.
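To see these three effects side by side, we can vary one quantity at a time while holding the others fixed. The sketch below is illustrative only: the function name `margin_of_error` and the particular values of $n$, $\hat{p}$ and the confidence level are assumptions chosen to make the pattern easy to see.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(p_hat: float, n: int, confidence: float) -> float:
    """Margin of error of an approximate two-sided confidence interval."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p_hat * (1 - p_hat) / n)

# Vary the confidence level (p_hat = 0.5, n = 400)
by_level = [margin_of_error(0.5, 400, c) for c in (0.90, 0.95, 0.99)]
# Vary the sample size (p_hat = 0.5, 95% confidence)
by_size = [margin_of_error(0.5, n, 0.95) for n in (100, 400, 1600)]
# Vary the sample proportion (n = 400, 95% confidence)
by_p_hat = [margin_of_error(p, 400, 0.95) for p in (0.1, 0.3, 0.5, 0.7, 0.9)]

print([round(m, 4) for m in by_level])   # grows with the confidence level
print([round(m, 4) for m in by_size])    # shrinks as n grows
print([round(m, 4) for m in by_p_hat])   # largest at p_hat = 0.5
```

Because $n$ sits under a square root, quadrupling the sample size only halves the margin of error.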
As we discussed earlier, real-life constraints often mean that we cannot collect a large number of samples. Instead, we often have to decide how large a single sample should be so that, at our chosen level of confidence, the margin of error is at a level we are comfortable with. As we noted above, the larger the value of $n$, the smaller the margin of error.
Let's revisit our Guardian poll example.
(a) Calculate the margin of error for the $90\%$ confidence interval.
Think: Remember that the margin of error is the distance from $\hat{p}$ to either end of the interval, or half the width of the confidence interval.
Do: Let's find the margin of error by calculating half the width of the confidence interval.
| Margin of error | $=$ | $\frac{0.5387-0.4913}{2}$ |
| | $=$ | $0.0237$ |
(b) Determine the sample size the Guardian poll would need for a margin of error less than $0.02$ at a $90\%$ level of confidence.
Think: Using the formula for the margin of error, we'll be able to write an equation that allows us to solve for $n$.
Do:
| $0.02$ | $=$ | $1.6449\times\sqrt{\frac{0.515\times(1-0.515)}{n}}$ | Substituting the relevant values into the margin of error formula |
| $n$ | $=$ | $1689.54$ | Rearrange or solve for $n$ on our CAS |
Rounding up to an integer sample size so that the margin of error is less than $0.02$, we have $n=1690$.
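Rearranging the margin of error formula for $n$ gives $n=\frac{k^2\times\hat{p}\times(1-\hat{p})}{E^2}$, where $E$ is the target margin of error (a symbol introduced here only for this sketch). Assuming that rearrangement, the calculation can be reproduced as follows; the function name `min_sample_size` is hypothetical.

```python
from math import ceil, sqrt

def min_sample_size(p_hat: float, k: float, max_margin: float) -> int:
    """Smallest whole-number n whose margin of error does not exceed
    max_margin, found by solving k*sqrt(p_hat*(1-p_hat)/n) = max_margin
    for n and rounding up."""
    return ceil(k ** 2 * p_hat * (1 - p_hat) / max_margin ** 2)

n = min_sample_size(0.515, 1.6449, 0.02)
print(n)  # 1690

# Check: the margin of error at this n is below the 0.02 target
print(1.6449 * sqrt(0.515 * (1 - 0.515) / n) < 0.02)  # True
```

We round up rather than to the nearest integer because rounding down would leave the margin of error slightly above the target.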
Two samples, $A$ and $B$, have the same $\hat{p}$. Each sample has a two-sided $90\%$ confidence interval calculated. Sample $A$ is of size $100$. Sample $B$ is of size $500$.
Which sample will have the larger margin of error?
Sample $A$
Sample $B$
Use technology to determine the minimum sample size required to achieve a margin of error of $3\%$ in an approximate two-sided $95\%$ confidence interval for the proportion $p$ of primary school children in Australia who play competitive sport. The sample proportion $\hat{p}$ is found to be $0.6$.