The Real Standard Deviation

In high school, the standard deviation \sigma of a dataset (x_1,\dots, x_n) is usually introduced informally as a measure of the spread of the dataset, quantified through the unwieldy formula

\displaystyle \sigma = \sqrt{\frac{\Sigma fx^2}{\Sigma f} - \left( \frac{\Sigma fx}{\Sigma f} \right)^2} = \frac{1}{\sqrt n} \cdot \sqrt{\Sigma x^2 - \frac{\left( \Sigma x \right)^2}{n}}.

Today, we will approach the standard deviation from a far more intuitive perspective, and recover the above formula as a special case. Its cousin \sigma^2 is called the variance and is of greater mathematical interest.

We first define the covariance between two random variables X, Y, assuming all quantities involved exist. Intuitively, it should measure how X and Y jointly deviate from their expectations \mathbb E[X] and \mathbb E[Y].

Definition 1. The covariance of X, Y, denoted \mathrm{Cov}(X,Y), is defined by

\mathrm{Cov}(X, Y) = \mathbb E[(X - \mathbb E[X]) \cdot (Y - \mathbb E[Y])].

Lemma 1. By expectation properties, we obtain

\mathrm{Cov}(X, Y) = \mathbb E[XY] - \mathbb E[X]\cdot \mathbb E[Y].

Proof. Expanding within the expectation,

\begin{aligned}(X - \mathbb E[X]) \cdot (Y - \mathbb E[Y]) &= XY - \mathbb E[X] \cdot Y - \mathbb E[Y] \cdot X + \mathbb E[X] \cdot \mathbb E[Y].\end{aligned}

Applying the linearity of \mathbb E[\cdot] (we will formally justify this in measure theory),

\begin{aligned} \mathrm{Cov}(X, Y) &= \mathbb E[(X - \mathbb E[X]) \cdot (Y - \mathbb E[Y])] \\ &=\mathbb E[XY - \mathbb E[X] \cdot Y - \mathbb E[Y] \cdot X + \mathbb E[X] \cdot \mathbb E[Y]] \\ &= \mathbb E[XY] - \mathbb E[\mathbb E[X] \cdot Y] - \mathbb E[\mathbb E[Y] \cdot X] + \mathbb E[\mathbb E[X] \cdot \mathbb E[Y]] \\ &= \mathbb E[XY] - \mathbb E[X] \cdot \mathbb E[Y] - \mathbb E[Y] \cdot \mathbb E[X] + \mathbb E[X] \cdot \mathbb E[Y] \cdot \mathbb E[1] \\ &= \mathbb E[XY] - \mathbb E[X] \cdot \mathbb E[Y]. \end{aligned}
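
As a quick sanity check, here is a minimal numpy sketch verifying that Definition 1 and Lemma 1 give the same number on a finite dataset carrying the uniform distribution (the sample values below are arbitrary illustrative choices):

import numpy as np

# Arbitrary illustrative data; each point carries probability 1/n.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
y = 2 * x + rng.normal(size=1_000)

# Definition 1: E[(X - E[X]) * (Y - E[Y])]
cov_def = np.mean((x - x.mean()) * (y - y.mean()))

# Lemma 1: E[XY] - E[X] * E[Y]
cov_lemma = np.mean(x * y) - x.mean() * y.mean()

print(cov_def, cov_lemma)             # the two values agree
assert np.isclose(cov_def, cov_lemma)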

Lemma 2. The covariance satisfies the following properties:

  • \mathrm{Cov}(X, X) \geq 0,
  • \mathrm{Cov}(X, X) = 0 if and only if X = \mathbb E[X],
  • \mathrm{Cov}(X, Y) = \mathrm{Cov}(Y,X),
  • \mathrm{Cov}(kX, Y) = k \cdot \mathrm{Cov}(X,Y) for k \in \mathbb R,
  • \mathrm{Cov}(W+X, Y) = \mathrm{Cov}(W,Y) + \mathrm{Cov}(X, Y).

Proof. We prove the fifth property to illustrate:

\begin{aligned}\mathrm{Cov}(W+X, Y) &= \mathbb E[(W+X)Y] - \mathbb E[W+X] \cdot \mathbb E[Y] \\ &= \mathbb E[WY+XY] - (\mathbb E[W]+\mathbb E[X]) \cdot \mathbb E[Y] \\ &= \mathbb E[WY]+\mathbb E[XY] - (\mathbb E[W] \cdot \mathbb E[Y]+\mathbb E[X] \cdot \mathbb E[Y]) \\ &= \mathbb E[WY] - \mathbb E[W] \cdot \mathbb E[Y] +\mathbb E[XY] - \mathbb E[X] \cdot \mathbb E[Y] \\ &= \mathrm{Cov}(W,Y) + \mathrm{Cov}(X,Y).  \end{aligned}
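
The fourth (scaling) property is a one-line consequence of Lemma 1; for completeness,

\begin{aligned}\mathrm{Cov}(kX, Y) &= \mathbb E[(kX) \cdot Y] - \mathbb E[kX] \cdot \mathbb E[Y] \\ &= k \cdot \mathbb E[XY] - k \cdot \mathbb E[X] \cdot \mathbb E[Y] \\ &= k \cdot \mathrm{Cov}(X,Y). \end{aligned}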

These properties resemble those of an inner product, but not exactly: positive-definiteness would require \mathrm{Cov}(X,X) = 0 to force X = 0, whereas here it only forces X to be a constant. To make such assertions and more requires technical qualifiers addressed in measure theory. Lemma 2 outlines the main practical properties that we need from the covariance.

Lemma 3. Given any constant k, \mathrm{Cov}(X + k, Y) = \mathrm{Cov}(X, Y).

Proof. By Lemma 2, it suffices to prove that \mathrm{Cov}(k, Y) = 0:

\mathrm{Cov}(k, Y) = \mathbb E[k Y] - \mathbb E[k] \cdot \mathbb E[Y] = k \cdot \mathbb E[Y] - k \cdot \mathbb E[Y] = 0.

Lemma 4. For any random variable X, define its variance

\mathrm{Var}(X) := \mathrm{Cov}(X,X) = \mathbb E[X^2] - \mathbb E[X]^2

whenever the right-hand side exists. Then \mathrm{Var}(kX) = k^2 \cdot \mathrm{Var}(X) for k \in \mathbb R.

Proof. We have

\begin{aligned}\mathrm{Var}(kX) &= \mathrm{Cov}(kX,kX) = k \cdot k \cdot \mathrm{Cov}(X,X) = k^2 \cdot \mathrm{Var}(X). \end{aligned}

Lemma 5. Denote \mu_X := \mathbb E[X]. Define the standard deviation of X by \sigma_X := \sqrt{\mathrm{Var}(X)}. Define the standardised random variable \hat X := (X - \mu_X)/\sigma_X whenever \sigma_X \neq 0. Then \mathrm{Var}(\hat X) = 1.

Proof. By Lemmas 2 and 3,

\begin{aligned} \mathrm{Var}(\hat X) &= \mathrm{Cov}(\hat X, \hat X) \\ &= \mathrm{Cov}\left( \frac{ X - \mu_X }{\sigma_X}, \frac{ X - \mu_X }{\sigma_X} \right) \\ &= \frac{1}{\sigma_X^2} \cdot \mathrm{Cov}(X- \mu_X, X- \mu_X) \\  &= \frac{1}{\sigma_X^2} \cdot \mathrm{Cov}(X, X) = \frac{1}{\mathrm{Var}(X)} \cdot \mathrm{Var}(X) = 1.\end{aligned}

The covariance helps us measure the correlation between the random variables, which we formally define below.

Definition 2. The correlation between X, Y with nonzero variances is defined by

\displaystyle \rho(X, Y) := \mathrm{Cov}(\hat X, \hat Y) \equiv \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)} \cdot \sqrt{\mathrm{Var}(Y)}}.
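
To see Lemma 5 and Definition 2 in action numerically, here is a short numpy sketch (again with arbitrary illustrative data) checking that the standardised variable has variance 1 and that the two expressions for \rho(X, Y) coincide:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=2_000)
y = 0.5 * x + rng.normal(size=2_000)   # deliberately correlated with x

def cov(a, b):
    # population covariance: E[(A - E[A]) * (B - E[B])] with 1/n weights
    return np.mean((a - a.mean()) * (b - b.mean()))

x_hat = (x - x.mean()) / np.sqrt(cov(x, x))   # standardised X (Lemma 5)
y_hat = (y - y.mean()) / np.sqrt(cov(y, y))   # standardised Y

assert np.isclose(cov(x_hat, x_hat), 1.0)     # Var(X_hat) = 1

rho_via_hats = cov(x_hat, y_hat)
rho_direct = cov(x, y) / (np.sqrt(cov(x, x)) * np.sqrt(cov(y, y)))
print(rho_via_hats, rho_direct)               # identical up to rounding
assert np.isclose(rho_via_hats, rho_direct)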

Corollary 1. Given the dataset ( (x_1,y_1),\dots,(x_n,y_n)) equipped with the uniform distribution, define the discrete random variables X and Y by X((x_i,y_i)) = x_i and Y((x_i,y_i)) = y_i. Then X,Y follow uniform distributions on the datasets (x_1,\dots,x_n) and (y_1,\dots,y_n) respectively, and

\displaystyle \mathrm{Cov}(X,Y) = \frac{ 1 }{n} \cdot \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) = \frac{ 1 }{n} \cdot \sum_{i=1}^n x_i y_i - \frac{ 1 }{n^2} \cdot \sum_{i=1}^n x_i \cdot \sum_{i=1}^n y_i,

where \bar x := (x_1+\dots+x_n)/n. When there is no confusion, we suppress the indices and write

\displaystyle \mathrm{Cov}(X,Y) = \frac 1n \cdot \Sigma (x - \bar x)(y - \bar y) = \frac{\Sigma x y}{n} - \frac{ \Sigma x \Sigma y }{n^2}

for brevity.

Proof. By an equivalent definition for expectation,

\begin{aligned} \mathbb E[XY] &= \sum_{\omega \in \Omega} X(\omega) \cdot Y(\omega) \cdot \mathbb P(\{\omega\}) \\ &= \sum_{i=1}^n x_i \cdot y_i \cdot \frac 1n = \frac{ 1 }{n} \sum_{i=1}^n x_i y_i. \end{aligned}

The other quantities can be computed in a similar manner.

Corollary 2. Setting Y=X in Corollary 1, we obtain

\displaystyle \mathrm{Var}(X) = \frac{ \Sigma (x - \bar x)^2 }{n} = \frac{\Sigma  x^2}{n} - \left( \frac{ \Sigma x }{n} \right)^2.

Furthermore, combining Definition 2 and Corollaries 1 and 2, the product moment correlation coefficient r := \rho(X, Y) is computed via

\begin{aligned} r &= \frac{\Sigma (x - \bar x)(y - \bar y)}{\sqrt{ \Sigma (x - \bar x)^2 } \sqrt{ \Sigma (y - \bar y)^2 } } = \frac{ \Sigma  x y - \frac{ \Sigma x \Sigma y }{ n }}{\sqrt{ \Sigma  x^2 - \frac{ (\Sigma x)^2 }{n}  } \sqrt{ \Sigma  y^2 - \frac{ (\Sigma y)^2 }{n}  }}. \end{aligned}

Finally, we recover the opening formula:

\displaystyle \sigma_X = \sqrt{\frac{\Sigma x^2}{n} - \left( \frac{ \Sigma x }{n} \right)^2} = \frac{1}{\sqrt n} \cdot \sqrt{ \Sigma x^2 - \frac{ ( \Sigma x )^2 }{n}  }.
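
As a check on Corollary 2, both algebraic forms of \sigma_X agree with numpy's standard deviation, which also uses the divide-by-n convention by default (the integer data below are arbitrary):

import numpy as np

# Arbitrary integer data, treated as a dataset with the uniform distribution.
rng = np.random.default_rng(2)
x = rng.integers(0, 10, size=50).astype(float)
n = len(x)

sigma_a = np.sqrt(np.sum(x**2) / n - (np.sum(x) / n) ** 2)
sigma_b = np.sqrt(np.sum(x**2) - np.sum(x) ** 2 / n) / np.sqrt(n)

# np.std divides by n by default (ddof=0), matching Corollary 2.
print(sigma_a, sigma_b, np.std(x))
assert np.isclose(sigma_a, np.std(x)) and np.isclose(sigma_b, np.std(x))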

Given two random variables X, Y, we know that \mathbb E[X+Y] = \mathbb E[X] + \mathbb E[Y]. What can we deduce about \mathrm{Var}(X + Y)? Interpreting the variance of X + Y as its covariance with itself and applying the covariance properties (where most of the action happens),

\begin{aligned}\mathrm{Var}(X+Y) &= \mathrm{Cov}(X+Y, X+Y) \\ &= \mathrm{Cov}(X,X) + \mathrm{Cov}(X,Y) + \mathrm{Cov}(Y,X) + \mathrm{Cov}(Y,Y) \\ &= \mathrm{Cov}(X,X) + \mathrm{Cov}(Y,Y) + 2\cdot \mathrm{Cov}(X,Y) \\ &= \mathrm{Var}(X) + \mathrm{Var}(Y) + 2 \cdot \mathrm{Cov}(X,Y).\end{aligned}

So it’s not as simple as \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y); there’s an extra term involving \mathrm{Cov}(X,Y). However, the equality does hold if \mathrm{Cov}(X,Y) = 0. In this case, we call X and Y uncorrelated. Equivalently,

\mathbb E[XY] = \mathbb E[X] \cdot \mathbb E[Y].
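
Numerically, the expansion of \mathrm{Var}(X+Y) is easy to confirm on deliberately correlated data (arbitrary illustrative values again):

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=5_000)
y = x + rng.normal(size=5_000)         # correlated with x on purpose

def var(a):
    return np.mean((a - a.mean()) ** 2)

def cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

lhs = var(x + y)
rhs = var(x) + var(y) + 2 * cov(x, y)
print(lhs, rhs)                        # equal up to floating-point rounding
assert np.isclose(lhs, rhs)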

Lemma 6. If X, Y are independent, then X and Y are uncorrelated. In particular,

\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).

Proof. By definition, since X,Y are independent,

\begin{aligned} \mathbb E[XY] &= \sum_{x \in \mathbb Z} \sum_{y \in \mathbb Z} xy \cdot \mathbb P_{X,Y}(x,y) \\ &= \sum_{x \in \mathbb Z} \sum_{y \in \mathbb Z} x \cdot y \cdot \mathbb P_X(x) \cdot \mathbb P_Y(y) \\ &= \sum_{x \in \mathbb Z} x \cdot \mathbb P_X(x) \cdot \sum_{y \in \mathbb Z} y \cdot \mathbb P_Y(y) = \mathbb E[X] \cdot \mathbb E[Y]. \end{aligned}

Therefore, we recover the core variance properties of interest.

Theorem 1. If X, Y are independent random variables, then

\mathrm{Var}(X \pm Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).

Proof. For the minus case,

\begin{aligned} \mathrm{Var}(X - Y) &= \mathrm{Var}(X + (-Y)) \\ &= \mathrm{Var}(X) + \mathrm{Var}(-Y) \\ &= \mathrm{Var}(X) + (-1)^2 \cdot \mathrm{Var}(Y) \\ &= \mathrm{Var}(X) + \mathrm{Var}(Y). \end{aligned}
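
Lemma 6 and Theorem 1 can be verified exactly for two independent fair dice by enumerating the 36 equally likely outcomes:

from itertools import product

# Exact check of Lemma 6 and Theorem 1 for two independent fair dice.
outcomes = list(product(range(1, 7), repeat=2))
p = 1 / len(outcomes)

def expect(f):
    # E[f(X, Y)] under the uniform distribution on the 36 outcomes
    return sum(f(x, y) * p for x, y in outcomes)

def var_of(f):
    return expect(lambda x, y: f(x, y) ** 2) - expect(f) ** 2

EX = expect(lambda x, y: x)
EY = expect(lambda x, y: y)
EXY = expect(lambda x, y: x * y)
assert abs(EXY - EX * EY) < 1e-12        # uncorrelated: 12.25 = 3.5 * 3.5

VX = var_of(lambda x, y: x)              # 35/12
VY = var_of(lambda x, y: y)              # 35/12
V_plus = var_of(lambda x, y: x + y)
V_minus = var_of(lambda x, y: x - y)
assert abs(V_plus - (VX + VY)) < 1e-9    # both sides equal 35/6
assert abs(V_minus - (VX + VY)) < 1e-9
print(EXY, V_plus, V_minus)              # 12.25 5.833... 5.833...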

Example 1. For any \xi \sim \mathrm{Ber}(p), \mathrm{Var}( \xi) = p(1-p). Likewise, for X \sim \mathrm{Bin}(n, p), \mathrm{Var}(X) = np(1-p).

Proof. Since \xi takes only the values 0 and 1, we observe that \xi^2 = \xi \sim \mathrm{Ber}(p), so that

\mathrm{Var}( \xi) = \mathbb E[ \xi^2] - \mathbb E[ \xi]^2 = p - p^2 = p(1-p).

Writing X = \sum_{i=1}^n \xi_i for i.i.d. \xi_i \sim \mathrm{Ber}(p) and applying Theorem 1 repeatedly,

\displaystyle \mathrm{Var}(X) = \mathrm{Var}\left( \sum_{i=1}^n \xi_i \right) = \sum_{i=1}^n \mathrm{Var}(\xi_i) = \sum_{i=1}^n p(1-p) = np(1-p).
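
The binomial claim can also be checked directly from the probability mass function, without the decomposition into Bernoulli trials (n = 10 and p = 0.3 below are arbitrary choices):

from math import comb

n, p = 10, 0.3                                   # arbitrary parameters
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

EX = sum(k * pmf[k] for k in range(n + 1))       # 3.0
EX2 = sum(k * k * pmf[k] for k in range(n + 1))  # 11.1
var_X = EX2 - EX**2

print(var_X, n * p * (1 - p))                    # both 2.1 (up to rounding)
assert abs(var_X - n * p * (1 - p)) < 1e-12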

Example 2. Let X_1,\dots, X_n be i.i.d. with mean \mu and variance \sigma^2. Define the sample mean by

\displaystyle \bar X_n := \frac{X_1+ \cdots + X_n}{n}.

Then \mathbb E[\bar X_n] = \mu and \mathrm{Var}(\bar X_n) = \sigma^2/n.
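
For completeness, the calculation combines the linearity of \mathbb E[\cdot], Lemma 4 with k = 1/n, and Theorem 1 applied repeatedly to the independent summands:

\begin{aligned} \mathbb E[\bar X_n] &= \frac 1n \cdot \sum_{i=1}^n \mathbb E[X_i] = \frac{n \cdot \mu}{n} = \mu, \\ \mathrm{Var}(\bar X_n) &= \frac{1}{n^2} \cdot \mathrm{Var}\left( \sum_{i=1}^n X_i \right) = \frac{1}{n^2} \cdot \sum_{i=1}^n \mathrm{Var}(X_i) = \frac{n \cdot \sigma^2}{n^2} = \frac{\sigma^2}{n}. \end{aligned}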

Thus, at least intuitively, we should get \mathrm{Var}(\bar X_n) \to 0 as n \to \infty, so that \bar X_n \to \mu. We can formalise this idea as follows.

Lemma 7. For any Y \geq 0 and \delta > 0,

\displaystyle \mathbb P(Y \geq \delta) \leq \frac{ \mathbb E[Y] }{ \delta }.

This result is known as Markov’s inequality.

Proof. For any Y \geq 0, we first note that for \delta > 0,

\begin{aligned} \mathbb E[Y] &= \sum_{y \in \mathbb Z} y \cdot \mathbb P_Y(y) \\ &= \sum_{y < \delta} y \cdot \mathbb P_Y(y) + \sum_{y \geq \delta} y \cdot \mathbb P_Y(y) \\ &\geq \sum_{y < \delta} 0 \cdot \mathbb P_Y(y) + \sum_{y \geq \delta} \delta \cdot \mathbb P_Y(y) \\ &= \delta \cdot \sum_{y \geq \delta} \mathbb P_Y(y) = \delta \cdot \mathbb P(Y \geq \delta). \end{aligned}

Hence, \mathbb P(Y \geq \delta) \leq \mathbb E[Y]/\delta.
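
A tiny exact illustration with a fair six-sided die Y and \delta = 5: \mathbb P(Y \geq 5) = 1/3, while the Markov bound gives \mathbb E[Y]/5 = 0.7. In code,

# Markov's inequality for a fair six-sided die and delta = 5.
values = range(1, 7)
p = 1 / 6

EY = sum(y * p for y in values)                  # 3.5
delta = 5
tail = sum(p for y in values if y >= delta)      # P(Y >= 5) = 1/3

print(tail, EY / delta)                          # 0.333... <= 0.7
assert tail <= EY / delta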

Lemma 8. For any random variable X with mean \mu and variance \sigma^2 and \delta > 0,

\displaystyle \mathbb P(|X - \mu| \geq \delta) \leq \frac{ \sigma^2 }{ \delta^2 }.

This result is known as Chebyshev’s inequality.

Proof. Setting Y := |X - \mu|^2 \geq 0 and applying Markov’s inequality, since \mathbb E[Y] = \mathrm{Var}(X) = \sigma^2,

\begin{aligned}\mathbb P(|X - \mu| \geq \delta) &= \mathbb P(|X - \mu|^2 \geq \delta^2) = \mathbb P(Y \geq \delta^2) \leq \frac{\mathbb E[Y]}{\delta^2} = \frac{\sigma^2}{\delta^2}.\end{aligned}

Theorem 2. Fix \epsilon > 0 and a distribution \nu with mean \mu and variance \sigma^2. Then for i.i.d. random variables X_1,\dots, X_n \sim \nu,

\displaystyle \mathbb P(|\bar X_n - \mu| > \epsilon)  \leq \frac{\sigma^2}{n \cdot \epsilon^2}.

Hence, with more technical caveats, we can conclude that

\displaystyle \lim_{n \to \infty} \mathbb P(|\bar X_n - \mu| > \epsilon)  = 0.

This result is known as the weak law of large numbers.

Proof. Applying Chebyshev’s inequality,

\begin{aligned}\mathbb P(|\bar X_n - \mu| > \epsilon) &\leq \frac{\mathrm{Var}(\bar X_n)}{\epsilon^2} = \frac{\sigma^2}{n \cdot \epsilon^2}.\end{aligned}
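
A small Monte Carlo sketch of Theorem 2 for fair dice (\mu = 3.5, \sigma^2 = 35/12) with \epsilon = 0.1: the empirical tail probability shrinks as n grows and stays below the Chebyshev bound. The number of trials and the values of n are arbitrary choices.

import numpy as np

rng = np.random.default_rng(4)
mu, sigma2, eps, trials = 3.5, 35 / 12, 0.1, 5_000

for n in (10, 100, 1_000):
    rolls = rng.integers(1, 7, size=(trials, n))    # trials x n fair dice
    means = rolls.mean(axis=1)                      # sample means
    empirical = np.mean(np.abs(means - mu) > eps)   # P(|X_bar_n - mu| > eps)
    bound = sigma2 / (n * eps**2)
    print(n, empirical, min(bound, 1.0))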

This writeup brings us to the limit, pun not intended, of our intuitive ideas of probability. Up till now, we assumed that X is discrete (i.e. X(\Omega) is at most countable). But for a more serious discussion, we will need X to be continuous in some sense, so that X(\Omega) can be a genuinely uncountable subset of \mathbb R. We’ll take an even more general direction: the measure-theoretic formulation of probability theory.

—Joel Kindiak, 3 Jul 25, 0019H
