NZ Level 8 (NZC) Level 3 (NCEA) [In development]

Standardisation and quantiles

Lesson

Another chapter explains how *z*-scores, or standardised scores, are obtained. We standardise data sets in order to make them comparable.

Consider two mathematics classes in the same year level at the same school. The classes are given different tests.

In one class the test scores have a mean of $54$ and a standard deviation of $14$, while in the other class the mean is $72$ and the standard deviation is $3$.

Can we compare the achievement of a student from the first class who gained $96\%$ with a student from the second class whose mark was $82\%$?

The $96\%$ mark is $3$ standard deviations above the mean for the class. So, that student's *z*-score is $3.0$. The $82\%$ mark is slightly more than $3$ standard deviations above the mean of $72$. In fact, the second student's *z*-score is $\frac{82-72}{3}\approx3.3$.
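The two *z*-scores above amount to a one-line calculation each. A minimal sketch (the function name is illustrative):

```python
# z-score: number of standard deviations a mark lies above its class mean
def z_score(mark, mean, sd):
    return (mark - mean) / sd

# First class: mean 54, standard deviation 14; the student scored 96
z1 = z_score(96, 54, 14)   # (96 - 54) / 14 = 3.0

# Second class: mean 72, standard deviation 3; the student scored 82
z2 = z_score(82, 72, 3)    # (82 - 72) / 3 ≈ 3.33

print(round(z1, 2), round(z2, 2))  # → 3.0 3.33
```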

Relative to their respective classes, it appears that the second student achieved a slightly better result than the first. However, we can be less certain about their achievements relative to each other.

The information about the means and standard deviations of the test scores may be more informative about the tests than about the students who sat them. Since its mean was higher, the test done by the second class may have been easier than the one done by the first class. Also, judging by the standard deviations, the second test did not differentiate between the students' abilities as effectively as the first test did.

To get better information about the students and the tests, we would need to have both classes doing the same test or the same class doing both tests.

For a proportion $p$ between $0$ and $1$, the $p$ quantile of a data set is a number at or below which the proportion $p$ of the data points are found.

For example, the $0.6$ quantile is a number such that $0.6$ of the data points are at or below that number.

The median of a data set is the $0.5$ quantile, and the first and third quartiles are the $0.25$ and $0.75$ quantiles respectively.

The $90$th percentile is a number below which $90\%$ of the observations are found. It is the $0.9$ quantile.

Unfortunately, this definition is imprecise and different software applications calculate the quantiles of a data set in slightly different ways.

Consider the set of measurements: $13, 13.5, 14.5, 14.5, 15, 17, 17, 18.5, 19, 22$.

Half of the observations are at or below the number $15$. So, according to the given definition, $15$ could be called the $0.5$ quantile. But, so could any number between $15$ and $17$.

By convention, we choose $16$, the average of $15$ and $17$, to be the $0.5$ quantile, also called the median.
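The dependence on convention shows up directly in software. Python's standard library, for instance, implements two different conventions in `statistics.quantiles`; on this data set they agree about the median but not about the quartiles. A sketch:

```python
import statistics

data = [13, 13.5, 14.5, 14.5, 15, 17, 17, 18.5, 19, 22]

# Both conventions agree that the median is 16, the average of 15 and 17
print(statistics.median(data))  # → 16.0

# But the quartiles depend on the interpolation convention:
# 'exclusive' treats the data as a sample from a larger population,
# 'inclusive' treats the data as the whole population.
q_excl = statistics.quantiles(data, n=4, method="exclusive")
q_incl = statistics.quantiles(data, n=4, method="inclusive")
print(q_excl)  # → [14.25, 16.0, 18.625]
print(q_incl)  # → [14.5, 16.0, 18.125]
```

Neither answer is wrong; they are simply different resolutions of the imprecision noted above.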

A somewhat different problem arises when we consider the $0.25$ quantile, also called the first quartile. By convention, we would say the third number in the list, $14.5$, is the first quartile. But, according to the definition above, this means $\frac{1}{4}$ of the $10$ data points are at or below $14.5$. That is, $2\frac{1}{2}$ data points are at or below $14.5$, and it is not easy to see what this could mean.

The difficulties mentioned in Example $1$ can lead to thoughts about the distinction between a data set and an assumed underlying probability distribution.

In Example $1$ we had $10$ observations in the data set. But, suppose we had $10000$ observations or more. The difficulties would still be there, but they would be far less significant and we would feel much more confident that the numbers chosen for the various quantiles were in some sense correct.

We might imagine the $10$ observations, or the $10000$, to be samples drawn from an infinitely large population of possible observations, with the precise numbers found in the samples reflecting an underlying abstract probability distribution. Thus, the precise locations of the quantiles calculated for a given sample are governed by the assumed probability distribution.

We explore this idea in the next paragraphs.

The *z*-score of a data point is its number of standard deviations above the mean of the data set.

If we assume the data is a sample from a normal distribution, then we can deduce from the *z*-score the probable proportion of observations at or below this level. Thus, there is a correspondence between a *z*-score and a quantile if we know that the data set has a normal distribution.

The proportion of observations below a particular *z*-score is represented by the area under the probability density curve up to the given value of $z$.

For a *z*-score of $1$, it is likely that $\frac{0.68}{2}=0.34$ of the scores will be between $0$ and this level. For a *z*-score of $2$, about $\frac{0.95}{2}=0.475$ of the observations will be between $0$ and this level, and for a *z*-score of $3$, about $0.4985$ of the data points will have *z*-scores between $0$ and this level. In fact, for every *z*-score, there is a likely proportion of observations that will be found between $0$ and this score.

These proportions (or probabilities) are available in tables that give the likely proportion of observations between $0$0 and the particular *z*-score. Examples of such tables are given in the questions that follow this chapter.

A *z*-score of $1$ in a normal distribution has beneath it the scores that are less than $0$ and the scores that are between $0$ and $1$. As proportions of the data set, these are $0.5$ and $0.34$ respectively. So, the proportion of scores less than $1$ is $0.5+0.34=0.84$. Thus, the *z*-score of $1$ corresponds to the $0.84$ quantile.

Similarly, a *z*-score of $-1$ corresponds to the $0.5-0.34=0.16$ quantile.
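Instead of reading a table, the correspondence between a *z*-score and a quantile can be computed from the standard normal cumulative distribution function. A sketch using Python's standard library:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Proportion of observations below z = 1: 0.5 (below the mean)
# plus about 0.34 (between 0 and 1), i.e. roughly the 0.84 quantile
print(round(std_normal.cdf(1), 4))   # → 0.8413

# Proportion below z = -1: 0.5 - 0.34, roughly the 0.16 quantile
print(round(std_normal.cdf(-1), 4))  # → 0.1587
```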

Suppose we have a reason to think that the ten observations in Example $1$ may be from a normal distribution.

The mean of the measurements is $16.4$ and the standard deviation is $2.69$.

When standardised, the scores are:

$-1.26,-1.08,-0.71,-0.71,-0.52,0.22,0.22,0.78,0.97,2.08$
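The mean, the (population) standard deviation, and the standardised scores above can be reproduced with the standard library. A sketch:

```python
import statistics

data = [13, 13.5, 14.5, 14.5, 15, 17, 17, 18.5, 19, 22]

mean = statistics.mean(data)   # 16.4
sd = statistics.pstdev(data)   # population standard deviation ≈ 2.69

# Standardise each observation: subtract the mean, divide by the sd
z_scores = [round((x - mean) / sd, 2) for x in data]
print(round(sd, 2))  # → 2.69
print(z_scores)
# → [-1.26, -1.08, -0.71, -0.71, -0.52, 0.22, 0.22, 0.78, 0.97, 2.08]
```

Note that `pstdev` (population standard deviation) is used here, matching the value $2.69$ in the text; the sample standard deviation `stdev` would be slightly larger.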

Now, a *z*-score of $0$ corresponds to the mean and also to the $0.5$ quantile. So, the $0.5$ quantile is $16.4$, which is between the $5\text{th}$ and $6\text{th}$ observations, as expected. This is close to the value of $16$ found for the median in Example $1$.

In the table, which you can find in the questions with this chapter, we see that the probability $0.25$ is associated with a *z*-score of about $0.67$. The first quartile has $0.25$ of the distribution below it, which is $0.25$ less than the proportion below the mean, so it should be at the *z*-score $-0.67$. Similarly, the third quartile should be at the *z*-score $0.67$.

To reverse the standardisation process we multiply by the standard deviation and then add the mean. So, the first quartile should be about $-0.67\times2.69+16.4=14.6$ and the third quartile about $0.67\times2.69+16.4=18.2$.

In fact, by the conventional method we determined the first quartile to be $14.5$ and the third quartile to be $18.5$.

The quartiles determined under the assumption that the data set has a normal distribution are close to, but not identical with, the quartiles determined in the conventional manner. With only $10$ observations, a comparison like this cannot tell us with much confidence whether or not the data came from a normal distribution.
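An inverse normal function can check this calculation directly and avoids rounding a *z*-score read from a table. A sketch, assuming the fitted normal has mean $16.4$ and standard deviation $2.69$:

```python
from statistics import NormalDist

# Standard normal: the z-scores of the first and third quartiles
z_q1 = NormalDist(0, 1).inv_cdf(0.25)  # ≈ -0.6745
z_q3 = NormalDist(0, 1).inv_cdf(0.75)  # ≈ 0.6745

# Reverse the standardisation: multiply by the sd, then add the mean
mean, sd = 16.4, 2.69
q1 = z_q1 * sd + mean
q3 = z_q3 * sd + mean
print(round(q1, 1), round(q3, 1))  # → 14.6 18.2
```

Equivalently, `NormalDist(16.4, 2.69).inv_cdf(0.25)` gives the first quartile in one step.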

If Dave scores $96$ in a test that has a mean score of $128$ and a standard deviation of $16$, what is his *z*-score?

The given table gives us the area between $0$ and a given *z*-score.

Using the table, find the percentage of data that is less than $z=0.72$.

Give your answer as a percentage to two decimal places.

For the standard normal variable $X\sim N\left(0,1\right)$, use a graphics calculator to determine the following values.

Round your answers to three decimal places.

The $0.7$ quantile

The $65$th percentile

The lowest score in the top $20$ percent

Investigate situations that involve elements of chance:
A. calculating probabilities of independent, combined, and conditional events
B. calculating and interpreting expected values and standard deviations of discrete random variables
C. applying distributions such as the Poisson, binomial, and normal

Apply probability distributions in solving problems