When we describe the shape of data sets, we focus on how the scores are distributed and whether the shape is symmetrical or not.
Data may be described as symmetrical or asymmetrical.
In many cases the data tend to sit around a central value with no bias to the left or right. In other words, about $50\%$ of scores will be above the mean and about $50\%$ of scores will be below the mean.
The normal distribution is the most common example of symmetrical data. The normal distribution looks like this:
The picture below shows how the normal distribution occurs on a histogram. The dark line shows the nice, symmetrical pattern in the histogram.
The '0' point right in the middle of the distribution represents the mean, the median and the mode: all these measures of central tendency are equal. If we use that '0' line as our axis of symmetry, do you notice how the left-hand side is a perfect reflection of the right-hand side? That means it's symmetrical.
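As a quick check of the idea that the mean, median and mode coincide for symmetrical data, here is a minimal sketch using Python's standard `statistics` module. The scores are made up for illustration (they are not read from the histogram) and were chosen so that the left side mirrors the right side around $5$.

```python
from statistics import mean, median, mode

# Hypothetical scores: each value below 5 has a matching value above 5
scores = [2, 3, 4, 4, 5, 5, 5, 6, 6, 7, 8]

centre_mean = mean(scores)      # arithmetic average of the scores
centre_median = median(scores)  # middle score once the data are sorted
centre_mode = mode(scores)      # most frequently occurring score

print(centre_mean, centre_median, centre_mode)
```

All three measures come out as $5$ for this data set, which is what we expect when the shape is symmetrical.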
If your data is asymmetrical (i.e. it isn't symmetrical), it may be described as skewed.
A positive skew means that the majority of the scores are low and the long tail is on the positive side of the peak of the graph. This typically means the mean is greater than the median, which is greater than the mode.
MODE < MEDIAN < MEAN
A positively skewed graph looks something like this.
Notice how most of the scores are in the lower half of the graph?
The tail to the right pulls the mean up. This is the positive skew.
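To see the ordering MODE < MEDIAN < MEAN with actual numbers, here is a small sketch. The data set is invented for illustration: most scores are low, with a couple of large scores forming the tail on the right.

```python
from statistics import mean, median, mode

# Hypothetical positively skewed scores: bunched low, with a long right tail
scores = [1, 2, 2, 2, 3, 3, 4, 5, 9, 12]

skew_mode = mode(scores)      # 2, the most common score
skew_median = median(scores)  # 3.0, the average of the two middle scores
skew_mean = mean(scores)      # 4.3, pulled upwards by 9 and 12

print(skew_mode, skew_median, skew_mean)
```

Only two of the ten scores sit far to the right, but they are enough to drag the mean above both the median and the mode.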
A negative skew means that the majority of the scores are high and the long tail is on the negative side of the peak. This typically means the mode is greater than the median, which is greater than the mean.
MEAN < MEDIAN < MODE
A negatively skewed graph looks something like this.
Notice how most of the scores are in the higher half of the graph?
The tail to the left pulls the mean down. This is the negative skew.
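Since the tail pulls the mean towards it, comparing the mean with the median gives a rough, informal way to label the shape of a data set. The sketch below is one possible rule of thumb, not a formal measure of skewness: the function name `describe_shape` and the `tolerance` cut-off are assumptions made up for this example, and the data set is invented.

```python
from statistics import mean, median

def describe_shape(scores, tolerance=0.05):
    """Informal label based on where the mean sits relative to the median.

    `tolerance` is a hypothetical cut-off: the gap between the mean and the
    median is compared against this fraction of the range of the data.
    """
    spread = max(scores) - min(scores)
    if spread == 0:
        # Every score is identical, so there is no tail on either side
        return "approximately symmetrical"
    gap = (mean(scores) - median(scores)) / spread
    if gap > tolerance:
        return "positively skewed"   # mean pulled above the median
    if gap < -tolerance:
        return "negatively skewed"   # mean pulled below the median
    return "approximately symmetrical"

# Hypothetical scores: bunched high, with a long tail to the left
high_scores = [1, 4, 8, 9, 10, 10, 11, 11, 11, 12]
print(describe_shape(high_scores))
```

For `high_scores` the mean is $8.7$, the median is $10$ and the mode is $11$, so MEAN < MEDIAN < MODE and the function labels the data as negatively skewed. A histogram is still the best way to judge shape; a check like this only supports what the graph shows.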
A cluster is a number of similar things collected together. Similarly, in data displays, if lots of the scores in a data set are grouped together within a very small range, we also call this clustering.
The shape of a data set also shows us whether there are any outliers, that is, unusually high or low scores.
For example, in the dot plot below, do you see how all the scores range between $12$ and $14$ except one? This means that $24$ is an outlier.
In this case the score is clearly far outside the range of the rest of the data set.
More formally, we define a score to be an outlier if it is more than $1.5\times$ IQR above the upper quartile, or more than $1.5\times$ IQR below the lower quartile.
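The formal rule can be sketched in a few lines of Python. Two assumptions are worth flagging: the scores below are invented to resemble the dot plot described above (the real plot's frequencies are not given), and the quartiles are found with the "median of each half" method, leaving out the middle score when the number of scores is odd. Other quartile conventions can give slightly different fences. The function name `find_outliers` is made up for this example.

```python
from statistics import median

def find_outliers(scores):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 x IQR rule.

    Assumes at least two scores. Quartiles are the medians of the lower and
    upper halves of the sorted data (middle score excluded when n is odd).
    """
    ordered = sorted(scores)
    half = len(ordered) // 2
    lower_quartile = median(ordered[:half])
    upper_quartile = median(ordered[len(ordered) - half:])
    iqr = upper_quartile - lower_quartile

    lower_fence = lower_quartile - 1.5 * iqr
    upper_fence = upper_quartile + 1.5 * iqr
    outliers = [s for s in ordered if s < lower_fence or s > upper_fence]
    return lower_fence, upper_fence, outliers

# Hypothetical scores: everything between 12 and 14, plus a single 24
scores = [12, 12, 13, 13, 13, 13, 14, 14, 24]
lower_fence, upper_fence, outliers = find_outliers(scores)
print(lower_fence, upper_fence, outliers)
```

Here the lower quartile is $12.5$ and the upper quartile is $14$, so the IQR is $1.5$ and the fences sit at $10.25$ and $16.25$. The score $24$ is above the upper fence, so it is the only outlier.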
Now let's try and use this knowledge to describe some data sets!
State whether the scores in each histogram are positively skewed, negatively skewed or symmetrical (approximately).
For the stem-and-leaf plot below:
| $3$ | $0$ $4$ $6$ $7$ $8$ $9$ |
| $4$ | $1$ $3$ $5$ $8$ $8$ $8$ |
Are there any outliers?
Identify the outlier.
Is there any clustering of data?
Where does the clustering occur?
What is the modal class(es)?
The distribution of the data is:
Plan and conduct investigations using the statistical enquiry cycle:
A. justifying the variables and measures used
B. managing sources of variation, including through the use of random sampling
C. identifying and communicating features in context (trends, relationships between variables, and differences within and between distributions), using multiple displays
D. making informal inferences about populations from sample data
E. justifying findings, using displays and measures.
Investigate a given multivariate data set using the statistical enquiry cycle