Measures of Spread

The dot diagram below plots mock scores from a class test (total score: 10) for two classes of 20 students each.

Which of the two classes did better?

This question is vague. What do we mean by “better”? We would usually like to make this decision according to some summarised data (i.e. statistics). Previously, we have learned that the most computationally convenient statistic to describe the centre of the data is the mean, given by the formula

\displaystyle \bar x = \frac{\Sigma x}{n}.

Running the calculations, Class Epsilon has mean 7.15 and Class Delta has mean 7.45.

Using the mean as our measurement of aura, we might conclude that Class Delta is stronger than Class Epsilon in the exam.

But you can, and should, object: Class Epsilon has not just one, but two students who scored full marks! Furthermore, it seems that Class Epsilon’s mean score is lowered by some poor-performing outliers. That is, Class Epsilon has a larger spread of data than Class Delta.

The tool that statisticians use to measure this spread is called the standard deviation. The intuitive idea is to average the deviations of the data points from the sample mean. Since the raw deviations always sum to zero, and to keep the calculation mathematically convenient, we square the deviations before averaging them.

Definition 1. For each data point x, determine its squared deviation by \varepsilon := (x - \bar x)^2. The sample variance s_x^2 is then simply defined by s_x^2 := \bar \varepsilon, and the standard deviation is defined by s_x := \sqrt{\bar{\varepsilon}}.
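To make Definition 1 concrete, here is a minimal Python sketch that follows it step by step. The function names and the list `scores` are my own illustrative choices (the scores are hypothetical, not the class data from the dot diagram), and the variance divides by n, exactly as the definition does.

```python
from math import sqrt

def mean(xs):
    """Arithmetic mean: the sum of the data divided by the number of points."""
    return sum(xs) / len(xs)

def variance(xs):
    """Mean of the squared deviations from the mean (divides by n, as in Definition 1)."""
    x_bar = mean(xs)
    squared_deviations = [(x - x_bar) ** 2 for x in xs]
    return mean(squared_deviations)

def standard_deviation(xs):
    """Square root of the variance."""
    return sqrt(variance(xs))

# Hypothetical scores, chosen only so the arithmetic is easy to follow.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(scores))                # 5.0
print(variance(scores))            # 4.0
print(standard_deviation(scores))  # 2.0
```

The squared deviations here are 9, 1, 1, 1, 0, 0, 4, 16, which sum to 32, so the variance is 32/8 = 4 and the standard deviation is 2.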

Remark 1. This squared-deviation idea underlies least-squares linear regression, a fundamental algorithm in modern machine learning.

Theorem 1. The formula to compute the standard deviation s_x of the sample is given by

\displaystyle s_x = \sqrt{ \frac{ \Sigma x^2 }{n} - \left( \frac{\Sigma x}{n} \right)^2 }.

Proof. Denote the data set by \{x_1, x_2,\dots, x_n\}. Compute the squared deviations by

\begin{aligned} \varepsilon_1 &= (x_1 - \bar x)^2 = x_1^2 - 2 \cdot x_1 \cdot \bar x + \bar x^2, \\ \varepsilon_2 &= (x_2 - \bar x)^2 = x_2^2 - 2 \cdot x_2 \cdot \bar x + \bar x^2, \\ \vdots & \quad \quad \quad \vdots  \quad \quad \quad  \quad \quad \quad \quad \vdots \\ \varepsilon_n &= (x_n - \bar x)^2 = x_n^2 - 2 \cdot x_n \cdot \bar x + \bar x^2. \end{aligned}

By definition, \bar{\varepsilon} = (\varepsilon_1 + \varepsilon_2 + \dots + \varepsilon_n)/n. Therefore,

\begin{aligned} n \cdot \bar{\varepsilon} &= \varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_n \\ &= (x_1 - \bar x)^2 + (x_2 - \bar x)^2 + \cdots + (x_n - \bar x)^2 \\ &= (x_1^2 + x_2^2 + \cdots + x_n^2) - 2 \cdot \underbrace{ (x_1 + x_2 + \cdots  + x_n) }_{n \bar x} \cdot\, \bar x + n\bar x^2 \\ &= (x_1^2 + x_2^2 + \cdots + x_n^2) - 2 \cdot n \bar x^2 + n\bar x^2 \\ &= (x_1^2 + x_2^2 + \cdots + x_n^2) - n\bar x^2. \end{aligned}

Dividing both sides by n,

\displaystyle \bar{\varepsilon} = \frac{x_1^2 + x_2^2 + \cdots + x_n^2}{n} - \bar x^2 = \frac{\Sigma x^2}{n} - \left( \frac{ \Sigma x }{n} \right)^2.

Taking square roots,

\displaystyle s_x = \sqrt{\frac{\Sigma x^2}{n} - \left( \frac{ \Sigma x }{n} \right)^2}.
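As a quick numerical sanity check of Theorem 1, the sketch below computes the standard deviation both ways, from the definition and from the shortcut formula, and compares the results. The function names and the data are hypothetical, chosen only for illustration.

```python
from math import sqrt, isclose

def sd_by_definition(xs):
    """Square root of the mean squared deviation from the mean."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sqrt(sum((x - x_bar) ** 2 for x in xs) / n)

def sd_by_theorem_1(xs):
    """Shortcut: square root of (mean of the squares) minus (square of the mean)."""
    n = len(xs)
    return sqrt(sum(x * x for x in xs) / n - (sum(xs) / n) ** 2)

# Hypothetical scores, not the class data.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(sd_by_definition(scores))  # 2.0
print(sd_by_theorem_1(scores))   # 2.0
print(isclose(sd_by_definition(scores), sd_by_theorem_1(scores)))  # True
```

One caveat: with floating-point numbers, the shortcut subtracts two nearly equal quantities when the spread is tiny relative to the mean, so it can lose precision there. For hand calculation with exact numbers, this is not a concern.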

Remark 2. If we have a collection of paired data \{(x_1,y_1),\dots, (x_n,y_n)\}, we can compute the sample covariance c_{x,y} between the data sets \{x_1,\dots, x_n\} and \{y_1,\dots, y_n\} by

\displaystyle c_{x,y} = \frac{ \Sigma (x - \bar x)(y - \bar y)}{n}.

Observe that \bar{\varepsilon} = c_{x,x}. In this regard, the sample covariance generalises the sample variance. Here, the covariance measures how strongly the two data sets vary together, that is, their linear association.
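The covariance formula in Remark 2 can be sketched in a few lines of Python. The name `covariance` and the paired data are hypothetical, and the division is by n, to match the formula above.

```python
def covariance(xs, ys):
    """Mean of the products of paired deviations (divides by n, as in Remark 2)."""
    if len(xs) != len(ys):
        raise ValueError("paired data sets must have the same length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

# Hypothetical paired data in which y is always twice x.
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(covariance(xs, ys))  # 2.5
print(covariance(xs, xs))  # 1.25, which is the variance of xs
```

Feeding the same data set in twice recovers the variance, which is the observation c_{x,x} = \bar{\varepsilon} in code form.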

Example 1. Using the standard deviation as the measure of spread, Class Epsilon has a standard deviation of approximately 2.01 and Class Delta has a standard deviation of approximately 1.28.

Since the former is larger, Class Epsilon has a larger spread of scores than Class Delta.

In layperson’s terms, the scores of students in Class Delta are more “bunched” together, and thus we can say that the students in Class Delta perform more consistently than the students in Class Epsilon.

However, we should object to this conclusion once again: why did we use the mean and the standard deviation? These statistics are sensitive to outlier data, be it exceedingly high-performing students or exceedingly low-performing students. Why not use the median?

We can, and should: in this case, Class Epsilon has a median score of 7 and Class Delta has a median score of 7. Not helpful. How would we measure the spread of the data?

Definition 2. Sort the data set in non-decreasing order

x_1 \leq x_2 \leq \cdots \leq x_n.

Denote:

  • the minimum by Q_0,
  • the median by Q_2,
  • the maximum by Q_4.

Define the range of the data set by Q_4 - Q_0.

Obviously, Q_0 = x_1 and Q_4 = x_n. If n is even, then

\displaystyle Q_2 = \textstyle \frac 12 \cdot (x_{n/2} + x_{n/2 + 1}).

If n is odd, then Q_2 = x_{(n+1)/2}.
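Here is a small Python sketch of Definition 2. The function name and the data are hypothetical. The one thing to watch is indexing: the definition counts from 1, while Python lists count from 0, so for even n the two middle values x_{n/2} and x_{n/2+1} sit at list positions n/2 - 1 and n/2.

```python
def min_median_max_range(xs):
    """Return (Q_0, Q_2, Q_4, range) for a list of numbers."""
    data = sorted(xs)  # non-decreasing order
    n = len(data)
    q0, q4 = data[0], data[-1]
    if n % 2 == 0:
        # The two middle values are x_{n/2} and x_{n/2 + 1} in 1-based terms.
        q2 = (data[n // 2 - 1] + data[n // 2]) / 2
    else:
        # The single middle value is x_{(n+1)/2} in 1-based terms.
        q2 = data[n // 2]
    return q0, q2, q4, q4 - q0

# Hypothetical data sets, one with odd n and one with even n.
print(min_median_max_range([3, 1, 4, 1, 5]))  # (1, 3, 5, 4)
print(min_median_max_range([7, 2, 9, 4]))     # (2, 5.5, 9, 7)
```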

Remark 3. The letter Q denotes the word ‘quartile’. Therefore, the minimum can be thought of as the “zeroth” quartile, the median as the second quartile, and the maximum as the fourth quartile.

Example 2. The range in Class Epsilon is 8 and the range in Class Delta is 5. Therefore, there is larger spread in Class Epsilon than Class Delta.

But you should, once again, object to this conclusion. This measure of spread is determined entirely by the two most extreme values, so it is dominated by outliers! Can we obtain a measure of spread that disregards outliers, just like how the median disregards outliers?

Definition 3. Suppose a data set \{x_1,\dots, x_n\}, sorted in non-decreasing order, has n odd, so that its median is x_{(n+1)/2}. Define:

  • the lower quartile Q_1 by the median of the data set \{x_1,\dots, x_{(n+1)/2}\},
  • the upper quartile Q_3 by the median of the data set \{x_{(n+1)/2},\dots, x_n\},
  • the interquartile range by Q_3 - Q_1.
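Definition 3 translates into Python as follows. This is only a sketch for odd n, with hypothetical function names and data. Note that both halves include the median itself, as the definition specifies. Other textbooks and software packages use slightly different quartile conventions, so their answers may differ a little.

```python
def median_of_sorted(data):
    """Median of a list that is already sorted."""
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

def quartiles_odd(xs):
    """Return (Q_1, Q_2, Q_3, interquartile range) for an odd number of points."""
    data = sorted(xs)
    n = len(data)
    if n % 2 == 0:
        raise ValueError("Definition 3 only covers an odd number of data points")
    mid = n // 2                             # 0-based position of the median
    q1 = median_of_sorted(data[: mid + 1])   # lower half, median included
    q2 = data[mid]
    q3 = median_of_sorted(data[mid:])        # upper half, median included
    return q1, q2, q3, q3 - q1

# Hypothetical data with seven points.
print(quartiles_odd([1, 2, 3, 4, 5, 6, 7]))  # (2.5, 4, 5.5, 3.0)
```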

Question 1. How would you define the interquartile range if n were even?

Example 3. By definition,

  • Class Epsilon has a lower quartile of 6.5 and upper quartile of 8.5, and hence, an interquartile range of 2.
  • Class Delta has a lower quartile of 7 and upper quartile of 8.5, and hence, an interquartile range of 1.5.

Since the former is larger than the latter, we conclude that there is larger spread in Class Epsilon than Class Delta.

We can visualise the ordered information using box-and-whisker diagrams. The endpoints denote the minimum and maximum, the box denotes the interquartile range, and the centre line denotes the median. We can plot both box-and-whisker diagrams below.

Therefore, the box-and-whisker diagram helps us visualise the data in a sufficiently meaningful manner. The distinct vertical lines denote Q_0, Q_1, Q_2, Q_3, Q_4 respectively.

Remark 4. For Class Delta, Q_1 = Q_2 = 7, explaining why it appears to have only four lines instead of the expected five.

I have one more idea to discuss: large data sets. So far, our class sizes are small, just 20 sample points each. However, if we consider all of the students in the school, we would need to deal with large data sets, say 1000 data points. Suppose also that the total score of the assessment is 100, rather than 10. How do we interpret such data? We can use a cumulative frequency diagram.

The y-axis denotes the number of data points, with 0 \leq y \leq 1000. The x-axis denotes the score of the assessment, out of 100. The curve y = f(x) plots the following information: (x,y) lies on the curve precisely when y students scored at most x marks in the assessment.
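The rule "y students scored at most x marks" is easy to sketch in Python. The function names and the ten scores below are hypothetical and far smaller than the 1000-point data set in the text. The second function mimics reading a score off the diagram at a chosen height on the y-axis, restricted to whole-number scores.

```python
from bisect import bisect_right

def cumulative_frequency(sorted_scores, x):
    """Number of data points that are at most x (the list must be sorted)."""
    return bisect_right(sorted_scores, x)

def read_off(sorted_scores, target_count, max_score=100):
    """Smallest whole-number score x at which at least target_count points are at most x."""
    for x in range(max_score + 1):
        if cumulative_frequency(sorted_scores, x) >= target_count:
            return x
    raise ValueError("target_count exceeds the number of data points")

# Hypothetical scores out of 100.
scores = sorted([35, 50, 50, 62, 70, 70, 70, 88, 91, 100])
print(cumulative_frequency(scores, 50))   # 3
print(cumulative_frequency(scores, 69))   # 4
print(cumulative_frequency(scores, 70))   # 7
print(read_off(scores, len(scores) / 2))  # 70, a rough median estimate
```

Reading off at a quarter and three quarters of the total count gives rough estimates of Q_1 and Q_3 in the same way.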

Remark 5. Cumulative frequency diagrams, being discrete, tend to be more jagged than what we see displayed. Nevertheless, this smooth approximation turns out to be mostly accurate relative to our original data.

Example 4. Estimate the median, range, and interquartile range of the data. Use your estimates to represent the data using a box-and-whisker diagram.

Solution. It is clear that Q_0 = 0 and Q_4 = 100, so that the range is 100. We estimate Q_1,Q_2,Q_3 as follows.

Therefore, we estimate the median to be 69 marks, and the interquartile range to be 18 marks.

Example 5. Using intervals of 10 marks each, estimate the mean and the standard deviation of the data.

Solution. We leave it as an exercise to tabulate the following summarised data.

In particular,

n = 1000,\quad \Sigma x = 68\ 550,\quad \Sigma x^2 = 4\ 897\ 000.

Therefore, we estimate the mean of the data to be

\displaystyle \bar x = \frac{68\ 550}{1000} = 68.55,

and by Theorem 1, the standard deviation of the data to be

\displaystyle s_x = \sqrt{\frac{4\ 897\ 000}{1000} - \left(\frac{68\ 550}{1000}\right)^2} \approx 14.1.
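The arithmetic in Example 5 can be checked with a short Python sketch. The first part uses the totals that appear in the two displayed calculations above (n = 1000, \Sigma x = 68 550, \Sigma x^2 = 4 897 000). The second part shows how such totals could be estimated from a grouped table by treating every score in an interval as if it sat at the interval's midpoint. The three-row table is hypothetical, since the original table is not reproduced here, and the function names are my own.

```python
from math import sqrt

def mean_and_sd_from_totals(n, sum_x, sum_x_squared):
    """Mean and standard deviation from n, the sum of x and the sum of x^2 (Theorem 1)."""
    mean = sum_x / n
    return mean, sqrt(sum_x_squared / n - mean ** 2)

# Totals taken from the displayed calculations in Example 5.
mean, sd = mean_and_sd_from_totals(1000, 68_550, 4_897_000)
print(round(mean, 2))  # 68.55
print(round(sd, 1))    # 14.1

def grouped_totals(midpoints, frequencies):
    """Estimate n, the sum of x and the sum of x^2 from interval midpoints and counts."""
    n = sum(frequencies)
    sum_x = sum(f * m for m, f in zip(midpoints, frequencies))
    sum_x_squared = sum(f * m * m for m, f in zip(midpoints, frequencies))
    return n, sum_x, sum_x_squared

# Hypothetical grouped table: intervals 0-10, 10-20, 20-30 with these counts.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]
print(grouped_totals(midpoints, frequencies))  # (10, 160, 3050)
print(mean_and_sd_from_totals(*grouped_totals(midpoints, frequencies)))  # (16.0, 7.0)
```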

Remark 6. In the era of Microsoft Excel and Python, software can compute means and standard deviations of large datasets without using the grouped data approach. They can handle millions of computations—we can’t.
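For instance, Python's built-in `statistics` module can do these computations directly. The scores below are hypothetical. One detail worth knowing: `pstdev` divides by n, which matches Definition 1 in this post, whereas `stdev` divides by n - 1 and so gives a slightly larger answer.

```python
import statistics

# Hypothetical scores, not the class data.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

print(statistics.fmean(scores))   # 5.0
print(statistics.median(scores))  # 4.5
print(statistics.pstdev(scores))  # 2.0, divides by n as in Definition 1
print(statistics.stdev(scores))   # divides by n - 1, so slightly larger
```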

Would you still object? In the spirit of inquiry and scepticism, why not? However, I think my job here is done—I have introduced the key calculations required in secondary school statistics!

Just for fun, here is something for those of you curious about quantitative finance, where mathematics and statistics are used to try to beat the stock market or even the cryptocurrency market. Individuals working in these fields, called quants, use the Sharpe ratio, which in its simplest form (ignoring the risk-free rate) is \bar x / s_x, to measure an asset’s return relative to its risk. Another criterion, known as the mean-variance trade-off and given in its simplest form by \bar x - s_x^2, helps quants optimise the proportion of their assets in order to balance return against risk.

Finally, for Singaporeans who (or whose parents) remember the notion of a t-score in the high-stakes Primary School Leaving Examinations (PSLE), the student’s final score for a particular subject is computed using the formula

\displaystyle 50 + 10 \times \frac{x - \bar x}{s_x},

and these numbers are summed over the four subjects: English, Mother Tongue, Mathematics, and Science. My PSLE score was 242—make of that as you will. Contrary to popular expectation, I did *not* get A* for Mathematics due to less-than-academically-important reasons.
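As a rough illustration of the t-score formula above, here is a Python sketch. The function name, the marks, and the cohort means and standard deviations are all hypothetical. This is only the formula as stated in this post, not an official computation.

```python
def t_score(x, cohort_mean, cohort_sd):
    """50 plus 10 times the number of standard deviations x lies above the mean."""
    return 50 + 10 * (x - cohort_mean) / cohort_sd

# Hypothetical (mark, cohort mean, cohort standard deviation) for four subjects.
subjects = {
    "English": (80, 65, 12),
    "Mother Tongue": (70, 70, 10),
    "Mathematics": (90, 60, 15),
    "Science": (55, 65, 10),
}

scores = {name: t_score(*values) for name, values in subjects.items()}
print(scores["English"])     # 62.5
print(sum(scores.values()))  # 222.5
```

A mark exactly at the cohort mean earns 50, and each standard deviation above or below the mean adds or subtracts 10.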

All of these statistical analyses arise from random phenomena, and are general grasps of otherwise un-graspable realities. But can we at least quantify such uncertainty? Our attempt at doing so is probability theory, and we will visit this idea briefly the next time.

—Joel Kindiak, 18 Mar 26, 1435H
