Univariate Data

*Variance* is a statistic calculated from a univariate data set to measure how much the data vary about the mean. When the variance is low, the data points tend to be clustered closely around the mean; when it is high, they are more widely spread.

To calculate the variance of a data set, we first calculate its mean. When dealing with data that is a sample from a population, we use the symbol $\overline{x}$ for the sample mean. This number is used as an estimator for the population mean, for which we use the symbol $\mu$ (Greek letter 'mu').

If $x_1,x_2,...,x_n$ are the values of the observations, then their mean is $\frac{1}{n}\Sigma x_i$, where $\Sigma$ is the summation symbol and the terms are added over the subscripts $i$, which run from $1$ to $n$.

The essential idea in the definition of the variance is that of finding the difference of each observation from the mean. Some of these differences will be negative and some positive, so if they were simply averaged they would cancel one another exactly and the result would always be zero. Instead, all the differences from the mean are squared to make them non-negative.
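The cancellation described above can be checked with a minimal sketch. The list `data` is a hypothetical set of observations chosen only for illustration; it does not come from the lesson.

```python
# Hypothetical observations, chosen only to illustrate the idea.
data = [2, 4, 4, 5, 10]

# Mean of the observations.
mean = sum(data) / len(data)

# Signed differences from the mean: some negative, some positive.
deviations = [x - mean for x in data]

# Squaring removes the signs, so the terms can no longer cancel.
squared_deviations = [d ** 2 for d in deviations]

print(deviations)               # the signed differences
print(sum(deviations))          # these add up to zero
print(sum(squared_deviations))  # the squared differences do not
```

The sum of the squared differences is the quantity that is averaged to obtain the variance.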

The squared differences from the mean are then averaged to obtain the variance. Thus, the formula for a population variance is

$\sigma^2=\frac{1}{n}\Sigma\left(x_i-\mu\right)^2$

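If a sample from a population is being used to estimate the population variance, the formula is modified slightly to avoid a bias: the sample mean $\overline{x}$ takes the place of $\mu$, and the sum of squared differences is divided by $n-1$ rather than $n$.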

$s^2=\frac{1}{n-1}\Sigma\left(x_i-\overline{x}\right)^2$
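As a rough sketch, the two formulas can be written as small Python functions. The function names and the example list `data` are assumptions made for illustration, not part of the lesson.

```python
def mean(values):
    """Arithmetic mean: (1/n) * sum of the observations."""
    return sum(values) / len(values)


def population_variance(values):
    """sigma^2: average of squared differences from the mean, dividing by n."""
    mu = mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """s^2: squared differences from the sample mean, dividing by n - 1."""
    if len(values) < 2:
        raise ValueError("a sample variance needs at least two observations")
    x_bar = mean(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


# Hypothetical observations, chosen only to illustrate the two formulas.
data = [2, 4, 4, 5, 10]
print(population_variance(data))
print(sample_variance(data))
```

For the same data, the sample formula always gives the larger value because it divides the same sum by $n-1$ instead of $n$.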

When a variance is calculated automatically with a calculator or computer application, the user needs to specify whether the data set is to be considered as a population or as a sample from a population in order to choose the correct formula.
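Python's standard `statistics` module is one example of this choice: it offers `pvariance` for data treated as a whole population and `variance` for data treated as a sample. The list `data` below is a hypothetical data set used only for illustration.

```python
import statistics

# Hypothetical observations, chosen only for illustration.
data = [2, 4, 4, 5, 10]

# Treat the data as an entire population (divides by n).
pop_var = statistics.pvariance(data)

# Treat the data as a sample from a population (divides by n - 1).
samp_var = statistics.variance(data)

print(pop_var, samp_var)
```

Calling the wrong function gives a plausible-looking but different answer, so it is worth deciding first which interpretation of the data applies.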

The variance is the square of the *standard deviation*, which is a commonly used measure of the spread of a data set. So, a population standard deviation is calculated by $\sigma=\sqrt{\sigma^2}$ and a sample standard deviation is $s=\sqrt{s^2}$.
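The relationship between variance and standard deviation can be sketched by taking square roots of the two variances and comparing them with the `pstdev` and `stdev` functions of Python's `statistics` module. The list `data` is again a hypothetical example.

```python
import math
import statistics

# Hypothetical observations, chosen only for illustration.
data = [2, 4, 4, 5, 10]

# Standard deviation as the square root of the variance.
sigma = math.sqrt(statistics.pvariance(data))  # population
s = math.sqrt(statistics.variance(data))       # sample

# The library's own standard deviation functions should agree.
print(sigma, statistics.pstdev(data))
print(s, statistics.stdev(data))
```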

For the purposes of illustration, we calculate the variance of the set of prime numbers from $2$ to $101$.

The prime numbers are $2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101$.

There are $26$ of them and their sum is $1161$. Thus $\mu=\frac{1161}{26}\approx44.65$.

Using a spreadsheet, we find that the sum of the squared differences from the mean is approximately $24153.88$. Dividing this by $26$, we have $\sigma^2=\frac{24153.88}{26}\approx929$.

Thus the standard deviation is $\sigma=\sqrt{929}\approx30.48$.
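The same calculation can be reproduced in a few lines of Python instead of a spreadsheet. This is a minimal sketch that treats the primes as a population; the variable names are chosen here for illustration.

```python
import math

# The prime numbers from 2 to 101, as listed in the lesson.
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

n = len(primes)        # number of observations
total = sum(primes)    # sum of the observations
mu = total / n         # population mean

# Sum of the squared differences from the mean.
sum_sq_diff = sum((p - mu) ** 2 for p in primes)

variance = sum_sq_diff / n      # population variance
std_dev = math.sqrt(variance)   # population standard deviation

print(n, total, round(mu, 2))
print(round(sum_sq_diff, 2), round(variance), round(std_dev, 2))
```

The rounded results should agree with the figures quoted in the worked example.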

For this data set, with a standard deviation of about $30.48$ against a mean of about $44.65$, we could *not* say that the observations are clustered about the mean.