The Real Standard Deviation

In high school, the standard deviation \sigma of a dataset (x_1,\dots, x_n) is usually introduced informally as a measure of the spread of the dataset, quantified through the unwieldy formula

\displaystyle \sigma = \sqrt{\frac{\Sigma fx^2}{\Sigma f} - \left( \frac{\Sigma fx}{\Sigma f} \right)^2} = \frac{1}{\sqrt n} \cdot \sqrt{\Sigma x^2 - \frac{\left( \Sigma x \right)^2}{n}}.

Today, we will approach the standard deviation from a far more intuitive perspective, and recover the above formula as a special case. Its cousin \sigma^2 is called the variance and is of greater mathematical interest.

We first define the covariance between two random variables X, Y, assuming all quantities involved exist. Intuitively, it should measure how X and Y jointly deviate from their expectations \mathbb E[X] and \mathbb E[Y].

Definition 1. The covariance of X, Y, denoted \mathrm{Cov}(X,Y), is defined by

\mathrm{Cov}(X, Y) = \mathbb E[(X - \mathbb E[X]) \cdot (Y - \mathbb E[Y])].

Lemma 1. By expectation properties, we obtain

\mathrm{Cov}(X, Y) = \mathbb E[XY] - \mathbb E[X]\cdot \mathbb E[Y].

Proof. Expanding within the expectation,

\begin{aligned}(X - \mathbb E[X]) \cdot (Y - \mathbb E[Y]) &= XY - \mathbb E[X] \cdot Y - \mathbb E[Y] \cdot X + \mathbb E[X] \cdot \mathbb E[Y].\end{aligned}

Applying the linearity of \mathbb E[\cdot] (we will formally justify this in measure theory),

\begin{aligned} \mathrm{Cov}(X, Y) &= \mathbb E[(X - \mathbb E[X]) \cdot (Y - \mathbb E[Y])] \\ &=\mathbb E[XY - \mathbb E[X] \cdot Y - \mathbb E[Y] \cdot X + \mathbb E[X] \cdot \mathbb E[Y]] \\ &= \mathbb E[XY] - \mathbb E[\mathbb E[X] \cdot Y] - \mathbb E[\mathbb E[Y] \cdot X] + \mathbb E[\mathbb E[X] \cdot \mathbb E[Y]] \\ &= \mathbb E[XY] - \mathbb E[X] \cdot \mathbb E[Y] - \mathbb E[Y] \cdot \mathbb E[X] + \mathbb E[X] \cdot \mathbb E[Y] \cdot \mathbb E[1] \\ &= \mathbb E[XY] - \mathbb E[X] \cdot \mathbb E[Y]. \end{aligned}
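
As a quick sanity check, here is a minimal numpy sketch verifying that Definition 1 and Lemma 1 give the same number on a finite dataset carrying the uniform distribution (the sample values below are arbitrary illustrative choices):

import numpy as np

# Arbitrary illustrative data; each point carries probability 1/n.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
y = 2 * x + rng.normal(size=1_000)

# Definition 1: E[(X - E[X]) * (Y - E[Y])]
cov_def = np.mean((x - x.mean()) * (y - y.mean()))

# Lemma 1: E[XY] - E[X] * E[Y]
cov_lemma = np.mean(x * y) - x.mean() * y.mean()

print(cov_def, cov_lemma)             # the two values agree
assert np.isclose(cov_def, cov_lemma)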

Lemma 2. The covariance satisfies the following properties:

  • \mathrm{Cov}(X, X) \geq 0,
  • \mathrm{Cov}(X, X) = 0 if and only if X = \mathbb E[X],
  • \mathrm{Cov}(X, Y) = \mathrm{Cov}(Y,X),
  • \mathrm{Cov}(kX, Y) = k \cdot \mathrm{Cov}(X,Y) for k \in \mathbb R,
  • \mathrm{Cov}(W+X, Y) = \mathrm{Cov}(W,Y) + \mathrm{Cov}(X, Y).

Proof. We prove the fifth property to illustrate:

\begin{aligned}\mathrm{Cov}(W+X, Y) &= \mathbb E[(W+X)Y] - \mathbb E[W+X] \cdot \mathbb E[Y] \\ &= \mathbb E[WY+XY] - (\mathbb E[W]+\mathbb E[X]) \cdot \mathbb E[Y] \\ &= \mathbb E[WY]+\mathbb E[XY] - (\mathbb E[W] \cdot \mathbb E[Y]+\mathbb E[X] \cdot \mathbb E[Y]) \\ &= \mathbb E[WY] - \mathbb E[W] \cdot \mathbb E[Y] +\mathbb E[XY] - \mathbb E[X] \cdot \mathbb E[Y] \\ &= \mathrm{Cov}(W,Y) + \mathrm{Cov}(X,Y).  \end{aligned}
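
The fourth (scaling) property is a one-line consequence of Lemma 1; for completeness,

\begin{aligned}\mathrm{Cov}(kX, Y) &= \mathbb E[(kX) \cdot Y] - \mathbb E[kX] \cdot \mathbb E[Y] \\ &= k \cdot \mathbb E[XY] - k \cdot \mathbb E[X] \cdot \mathbb E[Y] \\ &= k \cdot \mathrm{Cov}(X,Y). \end{aligned}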

These properties resemble those of an inner product, but not exactly: positive-definiteness would require \mathrm{Cov}(X,X) = 0 to force X = 0, whereas here it only forces X to be a constant. To make such assertions and more requires technical qualifiers addressed in measure theory. Lemma 2 outlines the main practical properties that we need from the covariance.

Lemma 3. Given any constant k, \mathrm{Cov}(X + k, Y) = \mathrm{Cov}(X, Y).

Proof. By Lemma 2, it suffices to prove that \mathrm{Cov}(k, Y) = 0:

\mathrm{Cov}(k, Y) = \mathbb E[k Y] - \mathbb E[k] \cdot \mathbb E[Y] = k \cdot \mathbb E[Y] - k \cdot \mathbb E[Y] = 0.

Lemma 4. For any random variable X, define its variance

\mathrm{Var}(X) := \mathrm{Cov}(X,X) = \mathbb E[X^2] - \mathbb E[X]^2

whenever the right-hand side exists. Then \mathrm{Var}(kX) = k^2 \cdot \mathrm{Var}(X) for k \in \mathbb R.

Proof. We have

\begin{aligned}\mathrm{Var}(kX) &= \mathrm{Cov}(kX,kX) = k \cdot k \cdot \mathrm{Cov}(X,X) = k^2 \cdot \mathrm{Var}(X). \end{aligned}

Lemma 5. Denote \mu_X := \mathbb E[X]. Define the standard deviation of X by \sigma_X := \sqrt{\mathrm{Var}(X)}. Define the standardised random variable \hat X := (X - \mu_X)/\sigma_X whenever \sigma_X \neq 0. Then \mathrm{Var}(\hat X) = 1.

Proof. By Lemmas 2 and 3,

\begin{aligned} \mathrm{Var}(\hat X) &= \mathrm{Cov}(\hat X, \hat X) \\ &= \mathrm{Cov}\left( \frac{ X - \mu_X }{\sigma_X}, \frac{ X - \mu_X }{\sigma_X} \right) \\ &= \frac{1}{\sigma_X^2} \cdot \mathrm{Cov}(X- \mu_X, X- \mu_X) \\  &= \frac{1}{\sigma_X^2} \cdot \mathrm{Cov}(X, X) = \frac{1}{\mathrm{Var}(X)} \cdot \mathrm{Var}(X) = 1.\end{aligned}

The covariance helps us measure the correlation between the random variables, which we formally define below.

Definition 2. The correlation between X, Y with nonzero variances is defined by

\displaystyle \rho(X, Y) := \mathrm{Cov}(\hat X, \hat Y) \equiv \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)} \cdot \sqrt{\mathrm{Var}(Y)}}.
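
To see Lemma 5 and Definition 2 in action numerically, here is a short numpy sketch (again with arbitrary illustrative data) checking that the standardised variable has variance 1 and that the two expressions for \rho(X, Y) coincide:

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=2_000)
y = 0.5 * x + rng.normal(size=2_000)   # deliberately correlated with x

def cov(a, b):
    # population covariance: E[(A - E[A]) * (B - E[B])] with 1/n weights
    return np.mean((a - a.mean()) * (b - b.mean()))

x_hat = (x - x.mean()) / np.sqrt(cov(x, x))   # standardised X (Lemma 5)
y_hat = (y - y.mean()) / np.sqrt(cov(y, y))   # standardised Y

assert np.isclose(cov(x_hat, x_hat), 1.0)     # Var(X_hat) = 1

rho_via_hats = cov(x_hat, y_hat)
rho_direct = cov(x, y) / (np.sqrt(cov(x, x)) * np.sqrt(cov(y, y)))
print(rho_via_hats, rho_direct)               # identical up to rounding
assert np.isclose(rho_via_hats, rho_direct)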

Corollary 1. Given the dataset ( (x_1,y_1),\dots,(x_n,y_n)) equipped with the uniform distribution, define the discrete random variables X and Y by X((x_i,y_i)) = x_i and Y((x_i,y_i)) = y_i. Then X,Y follow uniform distributions on the datasets (x_1,\dots,x_n) and (y_1,\dots,y_n) respectively, and

\displaystyle \mathrm{Cov}(X,Y) = \frac{ 1 }{n} \cdot \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) = \frac{ 1 }{n} \cdot \sum_{i=1}^n x_i y_i - \frac{ 1 }{n^2} \cdot \sum_{i=1}^n x_i \cdot \sum_{i=1}^n y_i,

where \bar x := (x_1+\dots+x_n)/n. When there is no confusion, we suppress the indices and write

\displaystyle \mathrm{Cov}(X,Y) = \frac 1n \cdot \Sigma (x - \bar x)(y - \bar y) = \frac{\Sigma x y}{n} - \frac{ \Sigma x \Sigma y }{n^2}

for brevity.

Proof. By an equivalent definition for expectation,

\begin{aligned} \mathbb E[XY] &= \sum_{\omega \in \Omega} X(\omega) \cdot Y(\omega) \cdot \mathbb P(\{\omega\}) \\ &= \sum_{i=1}^n x_i \cdot y_i \cdot \frac 1n = \frac{ 1 }{n} \sum_{i=1}^n x_i y_i. \end{aligned}

The other quantities can be computed in a similar manner.

Corollary 2. Setting Y=X in Corollary 1, we obtain

\displaystyle \mathrm{Var}(X) = \frac{ \Sigma (x - \bar x)^2 }{n} = \frac{\Sigma  x^2}{n} - \left( \frac{ \Sigma x }{n} \right)^2.

Furthermore, combining Definition 2 and Corollaries 1 and 2, the product moment correlation coefficient r := \rho(X, Y) is computed via

\begin{aligned} r &= \frac{\Sigma (x - \bar x)(y - \bar y)}{\sqrt{ \Sigma (x - \bar x)^2 } \sqrt{ \Sigma (y - \bar y)^2 } } = \frac{ \Sigma  x y - \frac{ \Sigma x \Sigma y }{ n }}{\sqrt{ \Sigma  x^2 - \frac{ (\Sigma x)^2 }{n}  } \sqrt{ \Sigma  y^2 - \frac{ (\Sigma y)^2 }{n}  }}. \end{aligned}

Finally, we recover the opening formula:

\displaystyle \sigma_X = \sqrt{\frac{\Sigma x^2}{n} - \left( \frac{ \Sigma x }{n} \right)^2} = \frac{1}{\sqrt n} \cdot \sqrt{ \Sigma x^2 - \frac{ ( \Sigma x )^2 }{n}  }.
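
As a check on Corollary 2, both algebraic forms of \sigma_X agree with numpy's standard deviation, which also uses the divide-by-n convention by default (the integer data below are arbitrary):

import numpy as np

# Arbitrary integer data, treated as a dataset with the uniform distribution.
rng = np.random.default_rng(2)
x = rng.integers(0, 10, size=50).astype(float)
n = len(x)

sigma_a = np.sqrt(np.sum(x**2) / n - (np.sum(x) / n) ** 2)
sigma_b = np.sqrt(np.sum(x**2) - np.sum(x) ** 2 / n) / np.sqrt(n)

# np.std divides by n by default (ddof=0), matching Corollary 2.
print(sigma_a, sigma_b, np.std(x))
assert np.isclose(sigma_a, np.std(x)) and np.isclose(sigma_b, np.std(x))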

Given two random variables X, Y, we know that \mathbb E[X+Y] = \mathbb E[X] + \mathbb E[Y]. What can we deduce about \mathrm{Var}(X + Y)? Interpreting the variance of X + Y as its covariance with itself and applying the covariance properties (where most of the action happens),

\begin{aligned}\mathrm{Var}(X+Y) &= \mathrm{Cov}(X+Y, X+Y) \\ &= \mathrm{Cov}(X,X) + \mathrm{Cov}(X,Y) + \mathrm{Cov}(Y,X) + \mathrm{Cov}(Y,Y) \\ &= \mathrm{Cov}(X,X) + \mathrm{Cov}(Y,Y) + 2\cdot \mathrm{Cov}(X,Y) \\ &= \mathrm{Var}(X) + \mathrm{Var}(Y) + 2 \cdot \mathrm{Cov}(X,Y).\end{aligned}

So it’s not as simple as \mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y); there’s an extra term involving \mathrm{Cov}(X,Y). However, the equality does hold if \mathrm{Cov}(X,Y) = 0. In this case, we call X and Y uncorrelated. Equivalently,

\mathbb E[XY] = \mathbb E[X] \cdot \mathbb E[Y].
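
Numerically, the expansion of \mathrm{Var}(X+Y) is easy to confirm on deliberately correlated data (arbitrary illustrative values again):

import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=5_000)
y = x + rng.normal(size=5_000)         # correlated with x on purpose

def var(a):
    return np.mean((a - a.mean()) ** 2)

def cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

lhs = var(x + y)
rhs = var(x) + var(y) + 2 * cov(x, y)
print(lhs, rhs)                        # equal up to floating-point rounding
assert np.isclose(lhs, rhs)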

Lemma 6. If X, Y are independent, then X and Y are uncorrelated. In particular,

\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).

Proof. By definition, since X,Y are independent,

\begin{aligned} \mathbb E[XY] &= \sum_{x \in \mathbb Z} \sum_{y \in \mathbb Z} xy \cdot \mathbb P_{X,Y}(x,y) \\ &= \sum_{x \in \mathbb Z} \sum_{y \in \mathbb Z} x \cdot y \cdot \mathbb P_X(x) \cdot \mathbb P_Y(y) \\ &= \sum_{x \in \mathbb Z} x \cdot \mathbb P_X(x) \cdot \sum_{y \in \mathbb Z} y \cdot \mathbb P_Y(y) = \mathbb E[X] \cdot \mathbb E[Y]. \end{aligned}

Therefore, we recover the core variance properties of interest.

Theorem 1. If X, Y are independent random variables, then

\mathrm{Var}(X \pm Y) = \mathrm{Var}(X) + \mathrm{Var}(Y).

Proof. For the minus case,

\begin{aligned} \mathrm{Var}(X - Y) &= \mathrm{Var}(X + (-Y)) \\ &= \mathrm{Var}(X) + \mathrm{Var}(-Y) \\ &= \mathrm{Var}(X) + (-1)^2 \cdot \mathrm{Var}(Y) \\ &= \mathrm{Var}(X) + \mathrm{Var}(Y). \end{aligned}
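
Lemma 6 and Theorem 1 can be verified exactly for two independent fair dice by enumerating the 36 equally likely outcomes:

from itertools import product

# Exact check of Lemma 6 and Theorem 1 for two independent fair dice.
outcomes = list(product(range(1, 7), repeat=2))
p = 1 / len(outcomes)

def expect(f):
    # E[f(X, Y)] under the uniform distribution on the 36 outcomes
    return sum(f(x, y) * p for x, y in outcomes)

def var_of(f):
    return expect(lambda x, y: f(x, y) ** 2) - expect(f) ** 2

EX = expect(lambda x, y: x)
EY = expect(lambda x, y: y)
EXY = expect(lambda x, y: x * y)
assert abs(EXY - EX * EY) < 1e-12        # uncorrelated: 12.25 = 3.5 * 3.5

VX = var_of(lambda x, y: x)              # 35/12
VY = var_of(lambda x, y: y)              # 35/12
V_plus = var_of(lambda x, y: x + y)
V_minus = var_of(lambda x, y: x - y)
assert abs(V_plus - (VX + VY)) < 1e-9    # both sides equal 35/6
assert abs(V_minus - (VX + VY)) < 1e-9
print(EXY, V_plus, V_minus)              # 12.25 5.833... 5.833...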

Example 1. For any \xi \sim \mathrm{Ber}(p), \mathrm{Var}( \xi) = p(1-p). Likewise, for X \sim \mathrm{Bin}(n, p), \mathrm{Var}(X) = np(1-p).

Proof. Since \xi takes only the values 0 and 1, we observe that \xi^2 = \xi \sim \mathrm{Ber}(p), so that

\mathrm{Var}( \xi) = \mathbb E[ \xi^2] - \mathbb E[ \xi]^2 = p - p^2 = p(1-p).

Writing X = \sum_{i=1}^n \xi_i for i.i.d. \xi_i \sim \mathrm{Ber}(p) and applying Theorem 1 repeatedly,

\displaystyle \mathrm{Var}(X) = \mathrm{Var}\left( \sum_{i=1}^n \xi_i \right) = \sum_{i=1}^n \mathrm{Var}(\xi_i) = \sum_{i=1}^n p(1-p) = np(1-p).
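
The binomial claim can also be checked directly from the probability mass function, without the decomposition into Bernoulli trials (n = 10 and p = 0.3 below are arbitrary choices):

from math import comb

n, p = 10, 0.3                                   # arbitrary parameters
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

EX = sum(k * pmf[k] for k in range(n + 1))       # 3.0
EX2 = sum(k * k * pmf[k] for k in range(n + 1))  # 11.1
var_X = EX2 - EX**2

print(var_X, n * p * (1 - p))                    # both 2.1 (up to rounding)
assert abs(var_X - n * p * (1 - p)) < 1e-12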

Example 2. Let X_1,\dots, X_n be i.i.d. with mean \mu and variance \sigma^2. Define the sample mean by

\displaystyle \bar X_n := \frac{X_1+ \cdots + X_n}{n}.

Then \mathbb E[\bar X_n] = \mu and \mathrm{Var}(\bar X_n) = \sigma^2/n.
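
For completeness, the calculation combines the linearity of \mathbb E[\cdot], Lemma 4 with k = 1/n, and Theorem 1 applied repeatedly to the independent summands:

\begin{aligned} \mathbb E[\bar X_n] &= \frac 1n \cdot \sum_{i=1}^n \mathbb E[X_i] = \frac{n \cdot \mu}{n} = \mu, \\ \mathrm{Var}(\bar X_n) &= \frac{1}{n^2} \cdot \mathrm{Var}\left( \sum_{i=1}^n X_i \right) = \frac{1}{n^2} \cdot \sum_{i=1}^n \mathrm{Var}(X_i) = \frac{n \cdot \sigma^2}{n^2} = \frac{\sigma^2}{n}. \end{aligned}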

Thus, at least intuitively, we should get \mathrm{Var}(\bar X_n) \to 0 as n \to \infty, so that \bar X_n \to \mu. We can formalise this idea as follows.

Lemma 7. For any Y \geq 0 and \delta > 0,

\displaystyle \mathbb P(Y \geq \delta) \leq \frac{ \mathbb E[Y] }{ \delta }.

This result is known as Markov’s inequality.

Proof. For any Y \geq 0, we first note that for \delta > 0,

\begin{aligned} \mathbb E[Y] &= \sum_{y \in \mathbb Z} y \cdot \mathbb P_Y(y) \\ &= \sum_{y < \delta} y \cdot \mathbb P_Y(y) + \sum_{y \geq \delta} y \cdot \mathbb P_Y(y) \\ &\geq \sum_{y < \delta} 0 \cdot \mathbb P_Y(y) + \sum_{y \geq \delta} \delta \cdot \mathbb P_Y(y) \\ &= \delta \cdot \sum_{y \geq \delta} \mathbb P_Y(y) = \delta \cdot \mathbb P(Y \geq \delta). \end{aligned}

Hence, \mathbb P(Y \geq \delta) \leq \mathbb E[Y]/\delta.
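
A tiny exact illustration with a fair six-sided die Y and \delta = 5: \mathbb P(Y \geq 5) = 1/3, while the Markov bound gives \mathbb E[Y]/5 = 0.7. In code,

# Markov's inequality for a fair six-sided die and delta = 5.
values = range(1, 7)
p = 1 / 6

EY = sum(y * p for y in values)                  # 3.5
delta = 5
tail = sum(p for y in values if y >= delta)      # P(Y >= 5) = 1/3

print(tail, EY / delta)                          # 0.333... <= 0.7
assert tail <= EY / delta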

Lemma 8. For any random variable X with mean \mu and variance \sigma^2 and \delta > 0,

\displaystyle \mathbb P(|X - \mu| \geq \delta) \leq \frac{ \sigma^2 }{ \delta^2 }.

This result is known as Chebyshev’s inequality.

Proof. Setting Y := |X - \mu|^2 \geq 0 and applying Markov’s inequality, since \mathbb E[Y] = \mathrm{Var}(X) = \sigma^2,

\begin{aligned}\mathbb P(|X - \mu| \geq \delta) &= \mathbb P(|X - \mu|^2 \geq \delta^2) = \mathbb P(Y \geq \delta^2) \leq \frac{\mathbb E[Y]}{\delta^2} = \frac{\sigma^2}{\delta^2}.\end{aligned}

Theorem 2. Fix \epsilon > 0 and a distribution \nu with mean \mu and variance \sigma^2. Then for i.i.d. random variables X_1,\dots, X_n \sim \nu,

\displaystyle \mathbb P(|\bar X_n - \mu| > \epsilon)  \leq \frac{\sigma^2}{n \cdot \epsilon^2}.

Hence, with more technical caveats, we can conclude that

\displaystyle \lim_{n \to \infty} \mathbb P(|\bar X_n - \mu| > \epsilon)  = 0.

This result is known as the weak law of large numbers.

Proof. Applying Chebyshev’s inequality,

\begin{aligned}\mathbb P(|\bar X_n - \mu| > \epsilon) &\leq \frac{\mathrm{Var}(\bar X_n)}{\epsilon^2} = \frac{\sigma^2}{n \cdot \epsilon^2}.\end{aligned}
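
A small Monte Carlo sketch of Theorem 2 for fair dice (\mu = 3.5, \sigma^2 = 35/12) with \epsilon = 0.1: the empirical tail probability shrinks as n grows and stays below the Chebyshev bound. The number of trials and the values of n are arbitrary choices.

import numpy as np

rng = np.random.default_rng(4)
mu, sigma2, eps, trials = 3.5, 35 / 12, 0.1, 5_000

for n in (10, 100, 1_000):
    rolls = rng.integers(1, 7, size=(trials, n))    # trials x n fair dice
    means = rolls.mean(axis=1)                      # sample means
    empirical = np.mean(np.abs(means - mu) > eps)   # P(|X_bar_n - mu| > eps)
    bound = sigma2 / (n * eps**2)
    print(n, empirical, min(bound, 1.0))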

This writeup brings us to the limit, pun not intended, of our intuitive ideas of probability. Up till now, we assumed that X is discrete (i.e. X(\Omega) is at most countable). But for a more serious discussion, we will need X to be continuous in some sense, so that X(\Omega) can be a genuinely uncountable subset of \mathbb R. We’ll take an even more general direction: the measure-theoretic formulation of probability theory.

—Joel Kindiak, 3 Jul 25, 0019H
