Univariate Data

When we find the median value of a data set, we are finding a central value with the property that there are as many data points below this value as there are above it. When the data set contains an odd number of elements, there is only one number that can be called the median, according to this definition. But, for a data set with an even number of elements, the median, by the above definition, could be any number that lies between the two central numbers. Consider the following examples.

In the ordered set $14, 19, 23, 24, 31, 33, 40, 42, 56$, there are as many numbers below $31$ as there are above it. So, the median is $31$.

Now consider this set,

*Any* number between $33$ and $35$ would have as many data points below it as there are above. To get a definite value for the median in this case, we adopt the rule that the median is the *arithmetic mean* of the two central numbers, that is, the value exactly midway between them.

The median must be $\frac{33+35}{2}=34$.
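The rule for odd and even data sets can be sketched in Python. This is a minimal illustration, not part of the original lesson: the function name `median` is our own, and the six-number list is a hypothetical set chosen only so that its two central values are $33$ and $35$.

```python
def median(values):
    """Return the median: the middle value of the sorted data, or the
    arithmetic mean of the two central values when the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: exactly one central value.
        return ordered[mid]
    # Even count: average the two central values.
    return (ordered[mid - 1] + ordered[mid]) / 2


# The nine-number data set from this lesson (odd count).
odd_set = [14, 19, 23, 24, 31, 33, 40, 42, 56]
# Hypothetical six-number set whose central values are 33 and 35.
even_set = [20, 28, 33, 35, 41, 50]

print(median(odd_set))   # 31
print(median(even_set))  # 34.0
```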

We extend this idea in order to define *quartiles*. These are three numbers that split a data set into four subsets of equal size. Roughly speaking, the first quartile is the median of the lower half of the set and the third quartile is the median of the upper half. (The second quartile is, of course, just the median of the whole set.)

The locations of the first and third quartiles are often used to gauge the spread of the data. We define the *interquartile range* for this purpose: $IQR=Q_3-Q_1$, where $Q_1$ and $Q_3$ are the first and third quartiles respectively. So the interquartile range is the distance between these two quartiles, and the middle 50% of the data falls within it.

In this data set that we used before,

$14$ | $19$ | $23$ | $24$ | $31$ | $33$ | $40$ | $42$ | $56$

we might say the first quartile is $23$ and the third quartile is $40$. There could be doubts about this because we included the median, $31$, in both the lower and the upper halves.

The more common approach, however, is to ignore the median value. The lower values are then $14, 19, 23$ and $24$, and the first quartile is the median of this lower set: the average of $19$ and $23$, which is $21$. Repeating this for the upper values $33, 40, 42, 56$, we take the average of $40$ and $42$ to get $41$ as the third quartile.
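The two ways of splitting the data described above can be compared side by side in a short Python sketch. The function names and the `include_median` flag are our own, introduced only for illustration. The sketch also reports the interquartile range that results from each choice.

```python
def median(values):
    """Median of a list: middle value, or mean of the two central values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles_by_halves(values, include_median=False):
    """Return (Q1, Q2, Q3) as medians of the lower half, whole set and upper half.

    For an odd number of values, include_median decides whether the
    overall median is counted in both halves or in neither.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower = ordered[:half + 1]   # up to and including the median
        upper = ordered[half:]       # from the median onwards
    else:
        lower = ordered[:half]       # strictly below the middle
        upper = ordered[n - half:]   # strictly above the middle
    return median(lower), median(ordered), median(upper)


data = [14, 19, 23, 24, 31, 33, 40, 42, 56]

q1, q2, q3 = quartiles_by_halves(data, include_median=True)
print(q1, q3, q3 - q1)   # 23 40 17

q1, q2, q3 = quartiles_by_halves(data, include_median=False)
print(q1, q3, q3 - q1)   # 21.0 41.0 20.0
```

The choice of convention changes the interquartile range here from $17$ to $20$, which shows why it matters to state which method is being used.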

In fact, there is no single definition that would allow quartiles to be located consistently by everyone. Statistical software packages use various slightly different methods.

The spreadsheet program Excel, for example, provides two different functions for calculating the quartiles. For this data set, one of them gives $Q_1=23$ and $Q_3=40$, matching the values found above when the median was included in both halves. The other gives $Q_1=21$ and $Q_3=41$, matching the values found when the median was ignored.
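Python's standard library offers a similar choice of methods. The sketch below uses `statistics.quantiles`, whose `method` argument accepts `"inclusive"` or `"exclusive"`. It is shown here only as another example of software giving two answers for the same data; the lesson itself does not discuss Python, and we are not claiming that these options match Excel's functions in every case.

```python
from statistics import quantiles

data = [14, 19, 23, 24, 31, 33, 40, 42, 56]

# n=4 asks for the three cut points that split the data into quarters.
inclusive = quantiles(data, n=4, method="inclusive")
exclusive = quantiles(data, n=4, method="exclusive")

print(inclusive)  # [23.0, 31.0, 40.0]
print(exclusive)  # [21.0, 31.0, 41.0]
```

Both methods agree on the median, $31$, but disagree on the first and third quartiles in the same way as the two hand calculations above.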

This idea, extended still further, leads to the concept of *quantiles*. If there are $N$ numbers in the data and we are given a fraction $k$ between $0$ and $1$, the associated quantile is a number such that, as nearly as possible, $kN$ of the numbers are below it in the ordered set and $\left(1-k\right)N$ are above it.

Most often we let $k\in\left\{\frac{1}{10},\frac{2}{10},...,\frac{9}{10}\right\}$ and the resulting quantiles are called *deciles*.

Or, we allow $k\in\left\{\frac{1}{100},\frac{2}{100},...,\frac{99}{100}\right\}$ and we call the resulting quantiles *percentiles*.

Clearly, the median should be the same as the $50$th percentile, the first quartile should be the same as the $25$th percentile and the third quartile should equal the $75$th percentile. Similarly, for example, the $4$th decile is the same as the $40$th percentile. Thus, if we know how to calculate the percentiles, we automatically have a way of determining the quartiles and the deciles.
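These relationships can be checked numerically. The sketch below is an illustration using Python's `statistics.quantiles` with its default method, which interpolates between data values and is therefore only one of several possible definitions. The variable names are our own.

```python
from statistics import quantiles

data = [14, 19, 23, 24, 31, 33, 40, 42, 56]

# Cut points for halves, quarters, tenths and hundredths of the data.
halves = quantiles(data, n=2)         # 1 cut point: the median
quartiles = quantiles(data, n=4)      # 3 cut points
deciles = quantiles(data, n=10)       # 9 cut points
percentiles = quantiles(data, n=100)  # 99 cut points

# List positions start at 0, so the 50th percentile is percentiles[49].
print(halves[0] == percentiles[49])     # median = 50th percentile
print(quartiles[0] == percentiles[24])  # Q1 = 25th percentile
print(quartiles[2] == percentiles[74])  # Q3 = 75th percentile
print(deciles[3] == percentiles[39])    # 4th decile = 40th percentile
```

Each comparison prints `True` for this data set, because all four calls apply the same rule at the same fraction of the way through the ordered data.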

Again, there are different methods for determining the percentiles of a data set, each giving slightly different results. The differences become negligible when the data sets are large.

The simplest method is the following. To find the $p$th percentile of a data set with $N$ elements, calculate $\frac{p}{100}\times N$. The smallest integer that is greater than or equal to the result is the rank, in the ordered data, of the number that will be taken to be the required percentile.

Find the $30$th percentile of the list of nine numbers given above: $14,19,23,24,31,33,40,42,56$.

We calculate $\frac{30}{100}\times9=2.7$. The nearest integer above this is $3$. So, we take the third number in the list to be the $30$th percentile. Hence, the $30$th percentile is $23$.

The $25$th percentile would also be $23$ in this case, since $\frac{25}{100}\times9=2.25$ also rounds up to a rank of $3$. In more sophisticated methods, the two percentiles are different.
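The simple rank method described above can be sketched in Python as follows. The function name `percentile_by_rank` is our own, and the sketch assumes $p$ is a whole number from $1$ to $100$. The rounding up is done with integer arithmetic so that floating-point error cannot push the rank to the wrong value.

```python
def percentile_by_rank(values, p):
    """Return the p-th percentile using the simple rank rule:
    rank = smallest integer >= (p / 100) * N, counted in the sorted data."""
    if not isinstance(p, int) or not 1 <= p <= 100:
        raise ValueError("p must be a whole number from 1 to 100")
    ordered = sorted(values)
    n = len(ordered)
    # Ceiling of p * n / 100 without using floating point.
    rank = -(-p * n // 100)
    # Ranks start at 1, list positions start at 0.
    return ordered[rank - 1]


data = [14, 19, 23, 24, 31, 33, 40, 42, 56]

print(percentile_by_rank(data, 30))  # 23
print(percentile_by_rank(data, 25))  # 23
print(percentile_by_rank(data, 50))  # 31
print(percentile_by_rank(data, 75))  # 40
```

With this rule, the $25$th, $50$th and $75$th percentiles of the nine-number set are $23$, $31$ and $40$, which agree with the quartiles found when the median was included in both halves.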

Answer the following, given this set of scores:

$33,38,50,12,33,48,41$

Sort the scores in ascending order.

Find the number of scores.

Find the median.

Find the first quartile of the set of scores.

Find the third quartile of the set of scores.

Find the interquartile range.

The attached chart shows the range of heights for boys aged $2$ to $18$ years.

(Credit: State Government of Victoria, Department of Education and Training)

Lachlan is $15$ and his height is at the $9$th decile. How tall could he be?

A. $177-181$ cm

B. $167-172$ cm

C. $158-161$ cm

Consider the data set $9,5,6,3,9,8,4,2,3,2$.

Calculate the mean to two decimal places.

Calculate the median.

Calculate the value of quartile $1$.

Calculate the value of quartile $3$.

Calculate the value of decile $2$.

Calculate the value of decile $8$.

Calculate the value of the percentile $43$.

Calculate the value of the percentile $88$.