When we examine data, we want to know the location of the centre and we also want to know to what degree the data points are clustered near the centre. The amount of variability within the observed data allows us to make predictions about the likelihood of future observations falling near the central value.
The remarks below apply equally well to discrete and to continuous random variables when we are dealing with data. However, there are differences when we consider theoretical measures of variability derived from probability distributions and density functions.
When dealing with data, we can use measures like the range, the interquartile range or the variance to get an idea of the variability of a random variable derived from the data. The square root of the variance, called the standard deviation, is also frequently used for this purpose.
The word range is used in the sense of locating the minimum and maximum values of the observed data. Thus, it indicates approximately where the distribution would be seen to lie if a very large number of observations were to be made. It does not give any information about how the data points are distributed within the general location.
We also use the term range to mean the distance between the smallest and largest observation.
The first quartile of a data set is a number below which one-quarter of the observed values fall. The third quartile, similarly, is a number below which three-quarters of the observed values occur. (There are small differences in the precise way the quartiles are calculated by various statistical software products.)
The median is the second quartile, a number below which half the observations occur.
The interquartile range is the difference between the third and first quartiles. If it is small relative to the overall range, we conclude that there is clustering around the median value. However, if the quartiles and the median divide the range into four roughly equal intervals, then no particular clustering is evident.
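These sample measures are quick to compute. The sketch below uses a small hypothetical data set and NumPy's default quartile convention (one of the several conventions mentioned above):

```python
import numpy as np

# A small hypothetical sample of observations.
data = np.array([4.1, 4.8, 5.0, 5.2, 5.3, 5.9, 6.4, 7.0, 9.5])

data_range = data.max() - data.min()                # distance between extremes
q1, median, q3 = np.percentile(data, [25, 50, 75])  # quartiles (NumPy's default convention)
iqr = q3 - q1                                       # interquartile range

# An IQR that is small relative to the range suggests clustering near the median.
print(data_range, iqr, median)
```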
We find the variance of a data set by first calculating the mean. Then, the squared distances of each data point from the mean are summed and the result is divided by $n-1$, where $n$ is the number of observations. We write

$\text{var}(X)=\frac{1}{n-1}\sum_{i=1}^n(x_i-\overline{x})^2$

Here, we have used the symbol $\overline{x}$ to represent the mean value of the data. The expression for $\text{var}(X)$ is almost the ordinary average of the squared differences from the mean. Division by $n$ rather than $n-1$ turns out to give a slightly too small result, and this is related to the fact that if we know the average of $n$ numbers and have $n-1$ of them written down, we can deduce the $n$th number. That is, only $n-1$ of the numbers are independent.
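In NumPy the divisor is controlled by the `ddof` ("delta degrees of freedom") argument of `var`. The following sketch, with made-up numbers, checks the formula above against it:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

# Sample variance with the n-1 divisor, exactly as in the formula above.
manual = np.sum((x - x.mean()) ** 2) / (n - 1)

# NumPy computes the same quantity when ddof=1.
assert np.isclose(manual, x.var(ddof=1))

# Dividing by n instead (ddof=0, NumPy's default) gives the slightly
# smaller value mentioned in the text.
print(x.var(ddof=1), x.var(ddof=0))
```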
It may be that we have evidence, apart from the shape of a sample, of a particular probability density function thought to be associated with a continuous random variable. We can deduce from the probability density function what the mean and variance of observed values of the random variable should be.
The mean, $\mu_X$, is the expected value of the random variable. It is given by

$E(X)=\int_{-\infty}^{\infty}t\,f(t)\,\mathrm{d}t.$

The variance $\sigma_X^2=\text{var}(X)$ is given by the expected value of the squared difference from the mean:

$\sigma_X^2=E\left[(X-E(X))^2\right]$

If we had infinitely many values of $X$ to work with, this would be equivalent to what is called the population variance: $\text{var}(X)=\frac{1}{n}\sum_{i=1}^n(x_i-\overline{x})^2$. Note the difference between this and the sample variance described above.
To make sense of the expected value formulation, we can begin by expanding the squared bracket. Thus,
$\sigma_X^2=E\left[X^2-2X\,E(X)+\left(E(X)\right)^2\right]$
Then, using some properties of the expected value operation, we can write
$$\begin{aligned}
\sigma_X^2 &= E(X^2)-2E\left[X\,E(X)\right]+E\left[\left(E(X)\right)^2\right] \\
&= E(X^2)-2\left[E(X)\right]^2+\left[E(X)\right]^2 \\
&= E(X^2)-\left[E(X)\right]^2
\end{aligned}$$
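This shortcut identity can be checked numerically on simulated data. A quick sketch (the normal distribution chosen here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=100_000)

# Definition: the mean squared deviation from the mean...
direct = np.mean((x - x.mean()) ** 2)

# ...versus the shortcut E(X^2) - [E(X)]^2 derived above.
shortcut = np.mean(x ** 2) - x.mean() ** 2

print(direct, shortcut)  # the two agree up to floating-point rounding
```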
In this form, it is relatively easy to deduce $\sigma^2$ for a particular PDF, $f(x)$. (Again, we are using a fact about the expected value of a function of a random variable that we have not derived here.) If a random variable $X$ has probability density function $f$ and domain $(a,b)$, we have

$\sigma_X^2=\int_a^b x^2 f(x)\,\mathrm{d}x-\left[\int_a^b x\,f(x)\,\mathrm{d}x\right]^2$
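When the integrals are awkward to do by hand, the same recipe works numerically. The sketch below uses a simple midpoint-rule integration on the density $f(x)=\frac{1}{5}-\frac{x}{50}$ on $(0,10)$, a density that also appears later in this chapter:

```python
import numpy as np

# Midpoint-rule approximation of E(X) and E(X^2) for the density
# f(x) = 1/5 - x/50 on (0, 10).
dx = 10 / 1_000_000
x = np.arange(0.5 * dx, 10, dx)   # midpoints of the subintervals
f = 1 / 5 - x / 50

ex = np.sum(x * f) * dx           # E(X); exact value is 10/3
ex2 = np.sum(x ** 2 * f) * dx     # E(X^2); exact value is 50/3
var = ex2 - ex ** 2               # var(X); exact value is 50/9

print(ex, var)
```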
Given a uniform probability density defined on the domain $(a,b)$, what variance would be expected in a very large sample of values of a random variable $X$ drawn from this distribution?

The density function is $f(x)=\frac{1}{b-a}$. Therefore,

$$\begin{aligned}
\sigma_X^2 &= \int_a^b\frac{x^2}{b-a}\,\mathrm{d}x-\left[\int_a^b\frac{x}{b-a}\,\mathrm{d}x\right]^2 \\
&= \frac{1}{b-a}\left[\frac{x^3}{3}\right]_a^b-\frac{1}{(b-a)^2}\left[\left[\frac{x^2}{2}\right]_a^b\right]^2 \\
&= \frac{1}{b-a}\cdot\frac{b^3-a^3}{3}-\frac{1}{(b-a)^2}\left[\frac{b^2-a^2}{2}\right]^2 \\
&= \frac{b^2+ab+a^2}{3}-\frac{b^2+2ab+a^2}{4} \\
&= \frac{1}{12}\left(b^2-2ab+a^2\right) \\
&= \frac{1}{12}(b-a)^2
\end{aligned}$$
We can calculate the standard deviation $\sigma_X$ from the variance. Thus, $\sigma_X=\frac{b-a}{2\sqrt{3}}$.
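A quick simulation (with arbitrarily chosen endpoints) agrees with the $\frac{1}{12}(b-a)^2$ result:

```python
import numpy as np

a, b = 2.0, 8.0
rng = np.random.default_rng(1)
sample = rng.uniform(a, b, size=1_000_000)

theory = (b - a) ** 2 / 12     # derived above; here (8 - 2)^2 / 12 = 3
print(sample.var(), theory)    # the sample variance should be close to 3
```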
From a PDF, we can also calculate the likely values of the quartiles in a sample of observations of a random variable.
In another chapter, we considered a random variable $X$ with PDF $f(x)=\frac{1}{5}-\frac{x}{50}$, where $x$ is in the interval $(0,10)$.

We ask: what is the point $a\in(0,10)$ such that $P(X<a)=\frac{1}{4}$? And also, what is the point $b\in(0,10)$ such that $P(X<b)=\frac{3}{4}$? Then the probability that $X$ is between $a$ and $b$ is $\frac{1}{2}$, and we can take $a$ and $b$ to be the likely quartiles in a sample.
So, we have $\frac{1}{4}=\int_0^a\left(\frac{1}{5}-\frac{x}{50}\right)\mathrm{d}x$. Therefore,

$$\begin{aligned}
\frac{1}{4} &= \left[\frac{x}{5}-\frac{x^2}{100}\right]_0^a \\
&= \frac{a}{5}-\frac{a^2}{100}
\end{aligned}$$
This is a quadratic in $a$, and the only solution that lies within the interval $(0,10)$ is $a=10-5\sqrt{3}$.

Similarly, if we solve $\frac{3}{4}=\int_0^b\left(\frac{1}{5}-\frac{x}{50}\right)\mathrm{d}x$, we find $b=5$ is the only useful solution.

So, we would expect the quartiles to be at approximately $1.3$ and $5$.
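These values are easy to check against the CDF $F(x)=\frac{x}{5}-\frac{x^2}{100}$ obtained in the working above. A short sketch:

```python
import numpy as np

# CDF of f(x) = 1/5 - x/50 on (0, 10), from the integration above.
def F(x):
    return x / 5 - x ** 2 / 100

a = 10 - 5 * np.sqrt(3)   # claimed first quartile, roughly 1.34
b = 5.0                   # claimed third quartile

print(F(a), F(b))         # should be 0.25 and 0.75
```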
Consider the probability density function $p$, where $p\left(x\right)=\frac{1}{20}$ when $25\le x\le45$ and $p\left(x\right)=0$ otherwise.

Use integration to determine the expected value of a random variable $X$ distributed according to $p\left(x\right)$.

Use integration to determine the variance of a random variable $X$ distributed according to $p\left(x\right)$.

Round your answer to two decimal places if necessary.
Consider the probability density function $p$, where $p\left(x\right)=kx^2$ when $2\le x\le5$ and $p\left(x\right)=0$ otherwise.

Use integration to determine the value of $k$.

Round your answer to four decimal places if necessary.

Use integration to determine the expected value of a random variable $X$ if it is distributed according to $p\left(x\right)$.

Round to four decimal places if necessary.

By performing an integration similar to the one in part (b), we can find that $E\left(X^2\right)=\frac{1031}{65}$.

Hence, calculate the variance, $V\left(X\right)$, of a random variable $X$ distributed according to $p\left(x\right)$.

Round to four decimal places if necessary.
Consider the probability density function $p$, where $p\left(x\right)=k\cos\left(\frac{\pi}{4}x\right)$ when $0\le x\le2$ and $p\left(x\right)=0$ otherwise.

Use integration to find the value of $k$.

Using the product rule, find $\frac{d}{dx}\left(x\sin\left(\frac{\pi}{4}x\right)\right)$.

You may let $u=x$ and $v=\sin\left(\frac{\pi}{4}x\right)$ in your working.

Hence determine the expected value, $E\left(X\right)$, of a random variable $X$ distributed according to $p\left(x\right)$.

By performing an integration similar to the one in part (c), we can find that $E\left(X^2\right)=\frac{\pi^2-8}{\pi^2/4}$.

Hence calculate the standard deviation, $\sigma$, of a random variable $X$ distributed according to $p\left(x\right)$. Give your answer correct to two decimal places.