Populations and Samples

Lesson

Depicted below is a uniform probability distribution on $[1,9]$. The area contained within the blue rectangle is $1$, and the probability density function $y=f(x)$ is defined as:

$f(x)=\frac{1}{9-1}=\frac{1}{8}$ for $1\le x\le9$ and $0$ everywhere else.

In general, the uniform probability distribution is a continuous distribution over the interval $[a,b]$ with mean located at $\mu=\frac{b+a}{2}$ and variance $\sigma^2=\frac{(b-a)^2}{12}$.

Therefore, this particular distribution has a mean given by $\mu=\frac{9+1}{2}=5$ and a variance given by $\sigma^2=\frac{(9-1)^2}{12}=5\frac{1}{3}$. This implies that $\sigma=\sqrt{\sigma^2}\approx2.309$.
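The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `uniform_mean_variance` is an illustrative choice, not something defined in the lesson.

```python
import math

def uniform_mean_variance(a: float, b: float) -> tuple[float, float]:
    """Mean and variance of a continuous uniform distribution on [a, b]."""
    mean = (a + b) / 2
    variance = (b - a) ** 2 / 12
    return mean, variance

# The distribution from the lesson: uniform on [1, 9].
mu, variance = uniform_mean_variance(1, 9)
sigma = math.sqrt(variance)

print(mu, round(variance, 3), round(sigma, 3))  # prints: 5.0 5.333 2.309
```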

Suppose we now randomly sample from this distribution. Specifically, we will take $10$ samples, each of size $12$, using computer software. The results are shown here:

Notice that the $10$ sample means vary from the distribution mean. This is to be expected. We can show that this variation reduces with larger sample sizes.

The mean of these $10$ sample means can be easily determined as $\frac{6.22+4.9+\dots+5.52}{10}=5.193$, which is fairly close to $\mu=5$.

The standard deviation of the sample means can also be easily determined using the familiar sample standard deviation formula (with $n-1$ in the denominator). Using a scientific calculator we find in this case that the value is $0.712134$.

In a later chapter we will show that this value is approximately equal to the standard deviation of the distribution (in this instance $\sigma=2.309$) divided by the square root of the sample size (here, $n=12$).

That is, $0.712134\approx\frac{\sigma}{\sqrt{n}}=\frac{2.309}{\sqrt{12}}\approx0.66655$.

This is a remarkable result, but more will be said on this later.
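The sampling experiment can be imitated with Python's standard library. The sketch below is not the software used for the lesson's table: the seed and all variable names are assumptions, so the printed numbers will differ from $5.193$ and $0.712134$, although they should be of similar size.

```python
import math
import random
import statistics

random.seed(1)  # arbitrary seed, chosen only so the run is repeatable

A, B = 1, 9          # the interval from the lesson
SAMPLE_SIZE = 12     # values in each sample
NUM_SAMPLES = 10     # number of samples drawn

# Draw the samples and keep only each sample's mean.
sample_means = []
for _ in range(NUM_SAMPLES):
    sample = [random.uniform(A, B) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.mean(sample))

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)   # n - 1 in the denominator

# Theoretical value: sigma / sqrt(n), with sigma = (b - a) / sqrt(12).
sigma = (B - A) / math.sqrt(12)
standard_error = sigma / math.sqrt(SAMPLE_SIZE)

print("mean of sample means:", round(mean_of_means, 3))
print("sd of sample means:  ", round(sd_of_means, 3))
print("sigma / sqrt(n):     ", round(standard_error, 3))
```

Changing the seed gives a different set of $10$ sample means, which is a quick way to see how much the standard deviation of so few means can move around.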

The quantity $\frac{\sigma}{\sqrt{n}}$ is known as the standard error, often denoted as $\sigma_{\overline{x}}$.

As the sample size increases, the standard error decreases, getting closer and closer to $0$. What this is really telling us is that as the sample size increases there is less and less variation in the sample means, so that they cluster ever more tightly around the distribution mean.
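To see how quickly the standard error shrinks, the short sketch below evaluates $\frac{\sigma}{\sqrt{n}}$ for a few sample sizes, using $\sigma=2.309$ from the uniform example. The function name and the particular sample sizes are illustrative assumptions.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean for samples of size n."""
    return sigma / math.sqrt(n)

sigma = 2.309  # standard deviation of the uniform distribution on [1, 9]

# Each sample size is four times the previous one.
for n in (12, 48, 192, 768):
    print(n, round(standard_error(sigma, n), 4))
```

Because of the square root, quadrupling the sample size only halves the standard error.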

A fair, standard six-sided dice, numbered $1$ through $6$, is to be rolled $15$ times.

We can construct a probability distribution for the number of times an odd prime faces upward on the $15$ rolls. As there are two odd primes on the dice ($3$ and $5$), the probability of a success is $p=\frac{2}{6}=\frac{1}{3}$.

Clearly then we expect to see about $5$ successes in $15$ trials, but we may of course see none. We may also see an odd prime $15$ times, but the probability of either of these outcomes is small.

Using the formulae $\mu=np$ and $\sigma^2=np(1-p)$ we can determine for this binomial experiment that $\mu=15\times\frac{1}{3}=5$ and $\sigma^2=15\times\frac{1}{3}\times\frac{2}{3}=\frac{10}{3}$. This means that $\sigma\approx1.826$.

Here is the probability distribution with the probability of the expected value highlighted in red:
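The theoretical distribution can also be tabulated directly from the binomial formula $P(X=k)=\binom{n}{k}p^k(1-p)^{n-k}$. The sketch below is one way to do this in Python; the names `binomial_pmf`, `N_ROLLS` and `P_SUCCESS` are illustrative, not taken from the lesson.

```python
from math import comb, sqrt

N_ROLLS = 15       # number of rolls
P_SUCCESS = 1 / 3  # probability of rolling a 3 or a 5

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for a binomial variable with n trials and success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

pmf = [binomial_pmf(k, N_ROLLS, P_SUCCESS) for k in range(N_ROLLS + 1)]

# Mean and variance computed from the distribution itself.
mean = sum(k * prob for k, prob in enumerate(pmf))
variance = sum((k - mean) ** 2 * prob for k, prob in enumerate(pmf))

for k, prob in enumerate(pmf):
    print(f"{k:2d}  {prob:.5f}")
print("mean:", round(mean, 3), " sd:", round(sqrt(variance), 3))
```

The values computed this way agree with $\mu=np$ and $\sigma^2=np(1-p)$, and the largest probability sits at $5$ successes.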

Armed with the theoretical distribution, we now begin rolling.

Initially, we roll the dice $15$ times and count the number of times that either a $3$ or a $5$ (both representing a success) occurs. This number will be the frequency of our first set. The reason we pick $15$ is that the expected number of successes will be $5$ (we could, however, pick say $30$ rolls per set, but we would need to divide the frequencies by $2$).

For our second set, we again roll the dice $15$ times and, as before, record the number of odd primes that occur. We continue this procedure to obtain, say, $10$ sample frequencies of odd primes corresponding to the $10$ sets of data. We then determine the *average frequency* across the $10$ sets and record it.

This is our *first sample mean* corresponding to the sample $S1$.

Using computer software and a few conditional statements, the following table was constructed and the first sample mean was determined as $4.2$.

We now repeat the entire procedure to produce a second sample mean. Then a third, and a fourth, and so on. We could continue indefinitely like this, producing more and more sample means, but of course we would run out of time and space, and so we might stop the collection of sample means at, say, $20$.

Here they are together with the average and standard deviation of the collection. Note that the first sample mean has been highlighted.

The average of the sample means, which we could call $\mu_{\overline{x}}$, is $5.09$, which is quite close to the theoretical mean of $5$. The standard deviation of the means is $0.49$, which is of similar size to the theoretical standard error. Since each sample mean is the average of $10$ sets, this standard error is $\sigma_{\overline{x}}=\frac{\sigma}{\sqrt{10}}=\frac{1.826}{\sqrt{10}}\approx0.577$.
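The whole rolling procedure can be imitated in a few lines of Python. This is a sketch under stated assumptions rather than the software used for the lesson's tables: the seed, the constants and the function names are all illustrative, so the output will not reproduce $5.09$ and $0.49$ exactly.

```python
import random
import statistics

random.seed(2)  # arbitrary seed so the run is repeatable

ROLLS_PER_SET = 15       # rolls in one set
SETS_PER_SAMPLE = 10     # sets averaged to give one sample mean
NUM_SAMPLE_MEANS = 20    # how many sample means we collect
SUCCESS_FACES = {3, 5}   # the odd primes on the dice

def count_odd_primes(rolls: int) -> int:
    """Roll the dice `rolls` times and count how often a 3 or a 5 appears."""
    return sum(1 for _ in range(rolls) if random.randint(1, 6) in SUCCESS_FACES)

def sample_mean() -> float:
    """Average number of successes across the sets in one sample."""
    counts = [count_odd_primes(ROLLS_PER_SET) for _ in range(SETS_PER_SAMPLE)]
    return statistics.mean(counts)

sample_means = [sample_mean() for _ in range(NUM_SAMPLE_MEANS)]

print("sample means:", sample_means)
print("average of sample means:", round(statistics.mean(sample_means), 3))
print("sd of sample means:     ", round(statistics.stdev(sample_means), 3))
```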

In most instances, a sample of this size is a little too small to guarantee that the standard deviation of the means lies this close to the standard error, so we were a little fortunate here. A larger sample size, perhaps as large as $30$, reduces the likelihood of a bad fit.

Suppose we didn't stop at just $20$ sample means. Suppose instead we set the computer to work to determine a million sample means and graph them as a histogram of values. If we did this, the distribution of the sample means would look bell shaped, with most of the means crowded around the centre at $\mu=5$ and tapering away from the centre symmetrically. In fact, given a sample size of $30$ or more, the distribution would be approximately normal with its standard deviation very close to the standard error. This distribution is known as the sampling distribution, and it will be discussed in a later chapter.
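A scaled-down version of this experiment runs comfortably on an ordinary computer. The sketch below assumes a sample size of $30$ sets and only $5000$ sample means (rather than a million) to keep the run short. The seed and all names are illustrative assumptions.

```python
import math
import random
import statistics

random.seed(3)  # arbitrary seed so the run is repeatable

ROLLS_PER_SET = 15        # rolls in one set, as in the lesson
P_SUCCESS = 1 / 3         # probability of a 3 or a 5 on one roll
SETS_PER_SAMPLE = 30      # assumed sample size (sets averaged per sample mean)
NUM_SAMPLE_MEANS = 5_000  # assumed; far fewer than a million, to keep the run short

def successes_in_set() -> int:
    """Number of odd primes seen in one set of rolls."""
    return sum(1 for _ in range(ROLLS_PER_SET) if random.random() < P_SUCCESS)

def one_sample_mean() -> float:
    """Average number of successes over the sets in one sample."""
    counts = [successes_in_set() for _ in range(SETS_PER_SAMPLE)]
    return sum(counts) / len(counts)

sample_means = [one_sample_mean() for _ in range(NUM_SAMPLE_MEANS)]

# Theoretical centre and spread of the sampling distribution.
mu = ROLLS_PER_SET * P_SUCCESS
sigma = math.sqrt(ROLLS_PER_SET * P_SUCCESS * (1 - P_SUCCESS))
standard_error = sigma / math.sqrt(SETS_PER_SAMPLE)

# For a normal distribution, roughly 95% of values lie within two
# standard deviations of the centre.
within_two_se = sum(1 for m in sample_means if abs(m - mu) <= 2 * standard_error)

print("mean of sample means:", round(statistics.mean(sample_means), 3))
print("sd of sample means:  ", round(statistics.stdev(sample_means), 3))
print("standard error:      ", round(standard_error, 3))
print("fraction within 2 SE:", within_two_se / NUM_SAMPLE_MEANS)
```

If the sampling distribution is close to normal, the last figure should be in the neighbourhood of $0.95$.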

The normal variable $X$ has a mean of $120$ and a standard deviation of $15$.

Two samples, $A$ and $B$, each of size $10$, are taken from $X$ and tabulated below.

| Sample $A$ | Sample $B$ |
|---|---|
| $141.88$ | $121.04$ |
| $131.53$ | $116.66$ |
| $126.36$ | $108.68$ |
| $108.49$ | $130.62$ |
| $116.79$ | $106.74$ |
| $123.34$ | $134.58$ |
| $110.09$ | $108.83$ |
| $90.37$ | $111.65$ |
| $115.13$ | $131.4$ |
| $123.46$ | $133.1$ |

Calculate the mean for sample $A$.

Calculate the mean for sample $B$.

Calculate the standard deviation for sample $A$ to three decimal places.

Calculate the standard deviation for sample $B$ to three decimal places.

Which sample is more like the population?

A. Sample $A$

B. Sample $B$

Consider a binomial distribution with $15$ trials and probability of success $0.7$.

Ten random samples, each of size $10$, were taken from the distribution and the results are tabulated below.

(a) Fill in the missing values.

| Sample | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
|  | $12$ | $10$ | $13$ | $11$ | $11$ | $14$ | $11$ | $10$ | $11$ | $13$ |
|  | $9$ | $11$ | $11$ | $10$ | $11$ | $11$ | $11$ | $9$ | $10$ | $11$ |
|  | $9$ | $9$ | $11$ | $11$ | $12$ | $11$ | $10$ | $7$ | $10$ | $9$ |
|  | $11$ | $11$ | $6$ | $9$ | $6$ | $11$ | $9$ | $13$ | $10$ | $11$ |
|  | $11$ | $12$ | $9$ | $10$ | $10$ | $10$ | $11$ | $10$ | $13$ | $10$ |
|  | $9$ | $11$ | $11$ | $10$ | $11$ | $9$ | $12$ | $11$ | $11$ | $11$ |
|  | $11$ | $13$ | $11$ | $9$ | $6$ | $9$ | $11$ | $7$ | $10$ | $10$ |
|  | $9$ | $11$ | $10$ | $13$ | $8$ | $12$ | $11$ | $14$ | $11$ | $10$ |
|  | $13$ | $11$ | $12$ | $12$ | $11$ | $13$ | $\editable{}$ | $11$ | $10$ | $8$ |
|  | $8$ | $13$ | $8$ | $14$ | $13$ | $9$ | $12$ | $7$ | $10$ | $14$ |
| Sample Mean | $10.2$ | $11.2$ | $10.2$ | $10.9$ | $9.9$ | $10.9$ | $11$ | $\editable{}$ | $10.6$ | $10.7$ |

(b) Calculate the mean of the sample means.

(c) Calculate the difference between the mean of the sample means from part (b) and the theoretical mean of the sample means.

(d) Calculate the sample standard deviation of the sample means. Round your answer to two decimal places.

(e) Calculate the difference between the sample standard deviation from part (d) and the theoretical standard deviation of the sample means ($\frac{\sigma}{\sqrt{n}}$), correct to two decimal places.

$X$ has a discrete uniform distribution across the integers $1,2,3,4,5,6$ and $7$.

Calculate the mean and standard deviation of the distribution to two decimal places if necessary.

$\mu=\editable{}$

$\sigma=\editable{}$

A sample of size $25$ is taken from this distribution and the graph and table of results are shown below.

| Value | Frequency |
|---|---|
| $1$ | $2$ |
| $2$ | $4$ |
| $3$ | $7$ |
| $4$ | $3$ |
| $5$ | $4$ |
| $6$ | $3$ |
| $7$ | $2$ |

Calculate the mean and standard deviation of this sample to two decimal places if necessary.

$\overline{x}=\editable{}$

$s=\editable{}$

A sample of size $100$ is taken from this distribution and the graph and table of results are shown below.

| Value | Frequency |
|---|---|
| $1$ | $16$ |
| $2$ | $7$ |
| $3$ | $13$ |
| $4$ | $21$ |
| $5$ | $18$ |
| $6$ | $13$ |
| $7$ | $12$ |

Calculate the mean and standard deviation of this sample to two decimal places if necessary.

$\overline{x}=\editable{}$

$s=\editable{}$

Another sample of size $100$ is taken from this distribution and the graph and table of results are shown below.

| Value | Frequency |
|---|---|
| $1$ | $19$ |
| $2$ | $16$ |
| $3$ | $8$ |
| $4$ | $13$ |
| $5$ | $19$ |
| $6$ | $14$ |
| $7$ | $11$ |

Calculate the mean and standard deviation of this sample to two decimal places if necessary.

$\overline{x}=\editable{}$

$s=\editable{}$

By considering the three samples and the original distribution, select all statements that apply.

A. If another sample of size $25$ was taken, the mean and standard deviation would be the same as the above sample of size $25$.

B. The larger the sample, the more likely the graph is close to the graph of the population distribution.

C. The larger the sample, the more likely the mean and standard deviation are close to the population distribution.

D. A larger sample must have a closer mean and standard deviation to the population than a smaller sample.

Make inferences from surveys and experiments: A determining estimates and confidence intervals for means, proportions, and differences, recognising the relevance of the central limit theorem B using methods such as resampling or randomisation to assess

Use statistical methods to make a formal inference