We see and hear a lot of statistics in the media, published by polls or researchers. However, as we have discovered, these are often taken from samples of a population.
For example, in 2009, the World Bank reported that $59.45\%$ of adult males in Russia smoke. This would have been calculated from a sample of males from across Russia, not the whole population. So how reliable is this? Perhaps the males sampled were much more likely to smoke than the population as a whole. Can we really infer from the sample that almost $60\%$ of adult males in Russia are smokers? Can a single value accurately estimate the prevalence of male smokers?
We have seen that when the sample size is large, the sample proportion has an approximately normal distribution. So what we can look for is a range of values that we can be fairly certain contains the true population proportion.
Use this GeoGebra app to explore what happens when you change the sample proportion, the population proportion, and the sample size. The true proportion is represented by the vertical line. See how these changes affect how many of the sample intervals contain the true population proportion.
An approximate $95\%$ confidence interval for the true population proportion $p$ is given by
$\hat{p}\pm1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
where:
$\hat{p}$ is a calculated value of the sample proportion
$n$ is the size of the sample from which $\hat{p}$ was calculated.
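As a minimal sketch of this calculation, assuming the usual normal-approximation interval $\hat{p}\pm1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$, the function below returns the two endpoints. The function name `approx_95_ci` and the sample figures are illustrative assumptions, not values from the lesson.

```python
import math

def approx_95_ci(p_hat, n):
    """Approximate 95% confidence interval for a population proportion."""
    # Estimated standard deviation of the sample proportion.
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = 1.96 * se
    return p_hat - margin, p_hat + margin

# Hypothetical sample: 120 "successes" out of 200 people surveyed.
lower, upper = approx_95_ci(120 / 200, 200)
print(f"({lower:.3f}, {upper:.3f})")  # (0.532, 0.668)
```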
So if we construct a $95\%$ confidence interval, we expect approximately $95\%$ of such intervals to contain $p$. However, we do not know whether the particular confidence interval obtained is one of the $95\%$ that contain $p$ or one of the $5\%$ that do not.
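The "approximately $95\%$ of such intervals" idea can be explored with a small simulation. This is only a sketch: the true proportion, sample size, number of trials, and the helper name `covers` are all assumed for illustration, and in practice the true proportion is not known.

```python
import math
import random

def covers(p, n, rng, z=1.96):
    """Draw one sample of size n and report whether its interval contains p."""
    successes = sum(rng.random() < p for _ in range(n))
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin <= p <= p_hat + margin

rng = random.Random(0)  # fixed seed so the run is repeatable
p_true, n, trials = 0.6, 500, 2000  # illustrative values, not from the lesson
coverage = sum(covers(p_true, n, rng) for _ in range(trials)) / trials
print(f"{coverage:.1%} of the {trials} intervals contain p")
```

Each individual interval either contains the true proportion or it does not; the figure near $95\%$ describes the long-run behaviour of the procedure across many samples.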
Looking at it another way, we can say we are $5\%$ confident that the population proportion lies outside this interval - above or below. Halving this percentage tells us that we can be $2.5\%$ confident that it lies above the interval, and $2.5\%$ confident that it lies below it.
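If we wanted to be more confident, we could calculate at a $99\%$ confidence level:
$\hat{p}\pm2.58\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
If we are fine with being less confident, we can go down to a $90\%$ confidence level:
$\hat{p}\pm1.65\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$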
Note that the higher we set the confidence level, the wider the confidence interval we calculate, and the lower we set it, the narrower the interval. The wider we cast the net, the more sure we can be that the true proportion lies within it.
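To see how the interval widens with the confidence level, here is a hedged sketch that looks up the multiplier from the standard normal distribution using Python's `statistics.NormalDist` and compares three levels. The sample proportion $0.4$ and sample size $150$ are made-up values, and `confidence_interval` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def confidence_interval(p_hat, n, level):
    """Normal-approximation interval for a proportion at the given level."""
    # A two-sided interval leaves (1 - level) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical sample: proportion 0.4 from 150 observations.
widths = {}
for level in (0.90, 0.95, 0.99):
    lower, upper = confidence_interval(0.4, 150, level)
    widths[level] = upper - lower
    print(f"{level:.0%}: ({lower:.3f}, {upper:.3f}), width {widths[level]:.3f}")
```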
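However, it is wrong to say that
“The probability that the true proportion lies in the interval
$\left(\hat{p}-1.65\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\hat{p}+1.65\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right)$
is $90\%$”.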
There is a big difference between "probability" and "confidence"! The true value of $p$ doesn't take on a range of likely values - it's not a variable at all, it is fixed (even though it is unknowable). Here are two analogies that may help you understand.
1. What is the probability that the first day of 2017 was a Tuesday?
You may think at first that it is $\frac{1}{7}$, since there are $7$ days in a week. However, there is no question of likelihood in this statement - either the first of January 2017 was a Tuesday, or it wasn’t. So we can say that this probability is either $0$ or $1$.
(In this case it’s $0$, since it was a Sunday).
2. You pick any real number, and I mark out a finite interval on the real line - as big as I like, but finite. From your perspective, what is the probability that the interval I just marked out contains your number?
Well, as soon as I mark out my interval, you will know for sure whether my guess is correct or not. Either I’m right (so the probability is $1$), or I’m wrong (so the probability is $0$).
In summary, we can be reasonably confident that the true proportion lies within the confidence interval, with a degree of confidence equal to the confidence level. Additional, or larger, samples will allow us to narrow the confidence interval around the fixed value of the true proportion.
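The effect of sample size on the width of the interval can be sketched as follows. The sample proportion $0.5$ and the sample sizes are assumed values chosen for illustration, and `margin_of_error` is a hypothetical helper; the multiplier $1.96$ corresponds to a $95\%$ confidence level.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Same hypothetical sample proportion, increasing sample sizes.
sizes = (100, 400, 1600)
margins = [margin_of_error(0.5, n) for n in sizes]
for n, m in zip(sizes, margins):
    print(f"n = {n:4d}: margin of error {m:.4f}")
```

Because $n$ sits under a square root, quadrupling the sample size only halves the margin of error.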
For a random sample of $33$ lollies, $12$ were found to contain food colouring.
a) Find a point estimate for $p$, the proportion of lollies that contain food colouring.
Think: this is a simple, single number that shows the proportion of lollies that contain food colouring, which you can express as a fraction.
Do: $\frac{12}{33}$
b) Calculate a $90\%$ confidence interval for $p$.
Think: substitute the relevant values into the correct version of the formula.
Do: $\frac{12}{33}\pm1.65\sqrt{\frac{\frac{12}{33}(1-\frac{12}{33})}{33}}$
= $\frac{12}{33}\pm1.65\sqrt{\frac{\frac{12}{33}(\frac{21}{33})}{33}}$
= $\frac{12}{33}\pm1.65\sqrt{\frac{\frac{252}{1089}}{33}}$
= $\frac{12}{33}\pm1.65\sqrt{\frac{28}{3993}}$
= $\frac{12}{33}\pm1.65(0.083739306)$
= $\frac{12}{33}\pm0.138169855$
= $0.225466507$, $0.501806219$
This means we can be $90\%$ confident that the true population proportion of lollies that contain food colouring is between about $0.23$ and $0.50$.
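The arithmetic in this worked example can be checked with a few lines of Python. This is just a verification sketch; it uses the same rounded multiplier $1.65$ as the working above, and the variable names are arbitrary.

```python
import math

p_hat = 12 / 33  # point estimate from part a)
n = 33
z = 1.65  # rounded 90% multiplier used in the worked example

margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
lower, upper = p_hat - margin, p_hat + margin
print(f"({lower:.2f}, {upper:.2f})")  # (0.23, 0.50)
```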
A sample of size $170$ is taken from the population, and the sample proportion is found to be $0.55$.
Standard Normal Probability | z-value |
---|---|
$0.9$ | $1.282$ |
$0.925$ | $1.440$ |
$0.95$ | $1.645$ |
$0.975$ | $1.960$ |
$0.99$ | $2.326$ |
$0.995$ | $2.576$ |
State the $z$-value that corresponds to a $90\%$ confidence interval.
Use the table of values to calculate the $90\%$ confidence interval for the true proportion.
Express your answer in the form $\left(a,b\right)$, and give your answer to two decimal places.
Which of the following statements about the confidence interval are correct? Select all that apply.
There is a $90\%$ probability that the true proportion lies between $0.49$ and $0.61$.
The probability that the true proportion lies within $\left(0.49,0.61\right)$ is $0$ or $1$.
We have $90\%$ confidence that the true proportion lies between $0.49$ and $0.61$.
The true proportion lies between $0.49$ and $0.61$.
In a sample of $60$ students from a school, $24$ of them would prefer different school hours.
Standard Normal Probability | z-value |
---|---|
$0.9$ | $1.282$ |
$0.925$ | $1.440$ |
$0.95$ | $1.645$ |
$0.975$ | $1.960$ |
$0.99$ | $2.326$ |
$0.995$ | $2.576$ |
Estimate the probability that a student in the school would prefer different school hours.
Estimate the standard deviation ($\hat{\sigma}$) of the sampling distribution.
Round your answer to 2 decimal places.
Use the table of values and the result of the previous part to find the $90\%$ confidence interval for the probability of a student at the school preferring different school hours.
Express your answer in the form $\left(a,b\right)$, and give your answer to two decimal places.
Which of the following statements about the proportion of students of the school preferring different school hours are correct? Select all that apply.
There is a $90\%$ probability that the true proportion lies between $0.30$ and $0.50$.
The true proportion lies between $0.30$ and $0.50$.
The probability that the true proportion lies within $\left(0.30,0.50\right)$ is $0$ or $1$.
We have $90\%$ confidence that the true proportion lies between $0.30$ and $0.50$.
In a car manufacturing plant, the brake pads of $100$ cars are tested and $10$ of them fail the test. State the $95\%$ confidence interval for the proportion of cars produced in the plant whose brakes fail the test.
Express your answer in the form $\left(a,b\right)$, and give your answer to two decimal places.
Standard Normal Probability | z-value |
---|---|
$0.9$ | $1.282$ |
$0.925$ | $1.440$ |
$0.95$ | $1.645$ |
$0.975$ | $1.960$ |
$0.99$ | $2.326$ |
$0.995$ | $2.576$ |