In high school, the standard deviation of a dataset $x_1, \dots, x_n$ is usually introduced informally as information about the spread of the dataset, quantified through the unwieldy formula
\[\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}, \quad \text{where } \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i.\]
Today, we will approach the standard deviation from a far more intuitive perspective, and recover the above formula as a special case. Its cousin, the variance, is of greater mathematical interest.
We first define the covariance between two random variables $X$ and $Y$, assuming all quantities exist. Intuitively, it should measure the overall deviation of some kind between $X$ and its expectation $E[X]$, and $Y$ and its expectation $E[Y]$.
Definition 1. The covariance of $X$ and $Y$, denoted $\operatorname{Cov}(X, Y)$, is defined by
\[\operatorname{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big].\]
Lemma 1. By expectation properties, we obtain
\[\operatorname{Cov}(X, Y) = E[XY] - E[X]\,E[Y].\]
Proof. Expanding within the expectation,
\[(X - E[X])(Y - E[Y]) = XY - X\,E[Y] - E[X]\,Y + E[X]\,E[Y].\]
Applying the linearity of $E$ (we will formally justify this in measure theory),
\[\operatorname{Cov}(X, Y) = E[XY] - E[X]\,E[Y] - E[X]\,E[Y] + E[X]\,E[Y] = E[XY] - E[X]\,E[Y].\]
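As a quick concrete check, take $Y = X$ where $X$ is a single fair coin flip taking values in $\{0, 1\}$. Since $X^2 = X$, we have $E[XY] = E[X^2] = E[X] = \frac{1}{2}$, so
\[\operatorname{Cov}(X, X) = E[X^2] - E[X]^2 = \frac{1}{2} - \frac{1}{4} = \frac{1}{4}.\]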
Lemma 2. The covariance satisfies the following properties:
1. $\operatorname{Cov}(X, X) \geq 0$,
2. $\operatorname{Cov}(X, X) = 0$ if and only if $X$ is a constant,
3. $\operatorname{Cov}(X, Y) = \operatorname{Cov}(Y, X)$,
4. $\operatorname{Cov}(aX + bY, Z) = a\operatorname{Cov}(X, Z) + b\operatorname{Cov}(Y, Z)$ for $a, b \in \mathbb{R}$.
Proof. We prove the fourth property to illustrate. By Lemma 1 and linearity,
\[\operatorname{Cov}(aX + bY, Z) = E[(aX + bY)Z] - E[aX + bY]\,E[Z] = a\big(E[XZ] - E[X]\,E[Z]\big) + b\big(E[YZ] - E[Y]\,E[Z]\big) = a\operatorname{Cov}(X, Z) + b\operatorname{Cov}(Y, Z).\]
It is similar to the properties of an inner product, but not exactly, since the second property should require $X = 0$, rather than just
$X$ being a constant. To make such an assertion and more requires technical qualifiers addressed in measure theory. Lemma 2 outlines the main (practical) properties that we need from the covariance.
Lemma 3. Given any constant $c$, $\operatorname{Cov}(X, c) = 0$.
Proof. By Lemma 2, it suffices to prove that $\operatorname{Cov}(c, X) = 0$:
\[\operatorname{Cov}(c, X) = E[cX] - E[c]\,E[X] = c\,E[X] - c\,E[X] = 0.\]
Lemma 4. For any random variable $X$, define its variance $\operatorname{Var}(X) = \operatorname{Cov}(X, X)$ whenever the right-hand side exists. Then for $a, b \in \mathbb{R}$,
\[\operatorname{Var}(aX + b) = a^2\operatorname{Var}(X).\]
Proof. We have, by Lemmas 2 and 3,
\[\operatorname{Var}(aX + b) = \operatorname{Cov}(aX + b, aX + b) = a^2\operatorname{Cov}(X, X) + 2a\operatorname{Cov}(X, b) + \operatorname{Cov}(b, b) = a^2\operatorname{Var}(X).\]
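As a quick illustration of Lemma 4, consider converting temperatures via $F = \frac{9}{5}C + 32$: then $\operatorname{Var}(F) = \left(\frac{9}{5}\right)^2 \operatorname{Var}(C)$, while the additive shift $32$ has no effect on the spread.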
Lemma 5. Denote $\mu = E[X]$ and $\sigma^2 = \operatorname{Var}(X)$. Define the standard deviation of $X$ by $\sigma = \sqrt{\operatorname{Var}(X)}$. Define the centered random variable
\[Z = \frac{X - \mu}{\sigma}\]
whenever $\sigma \neq 0$. Then $E[Z] = 0$ and $\operatorname{Var}(Z) = 1$.
Proof. By linearity of expectation, $E[Z] = \frac{E[X] - \mu}{\sigma} = 0$. By Lemma 2, together with Lemma 3,
\[\operatorname{Var}(Z) = \operatorname{Cov}\left(\frac{X - \mu}{\sigma}, \frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma^2}\operatorname{Cov}(X - \mu, X - \mu) = \frac{\operatorname{Var}(X)}{\sigma^2} = 1.\]
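For example, if exam scores have mean $\mu = 70$ and standard deviation $\sigma = 10$, then a score of $85$ standardises to $Z = \frac{85 - 70}{10} = 1.5$, i.e. one and a half standard deviations above the mean.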
The covariance helps us measure the correlation between random variables, which we formally define below.
Definition 2. The correlation between $X$ and $Y$ with nonzero variances is defined by
\[\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}.\]
Corollary 1. Given the dataset $\{(x_i, y_i)\}_{i=1}^n$ equipped with the uniform distribution on the indices $i \in \{1, \dots, n\}$, define the discrete random variables $X$ and $Y$ by $X(i) = x_i$ and $Y(i) = y_i$. Then $X$ and $Y$ follow uniform distributions on the datasets $\{x_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n$ respectively, and
\[\operatorname{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}),\]
where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$. When there is no confusion, we suppress the indices and write, say, $\sum x$ for $\sum_{i=1}^n x_i$ for brevity.
Proof. By an equivalent definition for expectation, $E[X] = \sum_{i=1}^n \frac{1}{n}\,x_i = \bar{x}$, so that
\[\operatorname{Cov}(X, Y) = E\big[(X - \bar{x})(Y - \bar{y})\big] = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).\]
The other quantities can be computed in a similar manner.
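To see Corollary 1 in action on a toy dataset, take $\{(1, 2), (2, 4), (3, 6)\}$, so that $\bar{x} = 2$ and $\bar{y} = 4$. Then
\[\operatorname{Cov}(X, Y) = \frac{1}{3}\big((-1)(-2) + (0)(0) + (1)(2)\big) = \frac{4}{3}.\]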
Corollary 2. Setting $Y = X$ in Corollary 1, we obtain
\[\operatorname{Var}(X) = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.\]
Furthermore, combining Definition 2 and Corollaries 1 and 2 (the factors of $\frac{1}{n}$ cancel), the product moment correlation coefficient is computed via
\[r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}.\]
Finally, taking the square root in Corollary 2 recovers the standard deviation formula from the introduction:
\[\sigma = \sqrt{\frac{1}{n}\sum (x - \bar{x})^2}.\]
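Continuing the toy dataset above: $\sum (x - \bar{x})(y - \bar{y}) = 4$, $\sum (x - \bar{x})^2 = 2$ and $\sum (y - \bar{y})^2 = 8$, so
\[r = \frac{4}{\sqrt{2 \cdot 8}} = 1,\]
consistent with the exact linear relationship $y = 2x$.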
Given two random variables $X$ and $Y$, we know that $E[X + Y] = E[X] + E[Y]$. What can we deduce about $\operatorname{Var}(X + Y)$? Interpreting the variance as the covariance with itself and applying covariance properties (where most of the action happens),
\[\operatorname{Var}(X + Y) = \operatorname{Cov}(X + Y, X + Y) = \operatorname{Var}(X) + 2\operatorname{Cov}(X, Y) + \operatorname{Var}(Y).\]
So it’s not as simple as $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$; there’s an extra term involving $\operatorname{Cov}(X, Y)$. However, the equality does hold if $\operatorname{Cov}(X, Y) = 0$. In this case, we call $X$ and $Y$ uncorrelated. Equivalently, by Lemma 1, $X$ and $Y$ are uncorrelated if and only if $E[XY] = E[X]\,E[Y]$.
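The extra term can be substantial: taking $Y = X$ (perfect correlation), Lemma 4 gives
\[\operatorname{Var}(X + X) = \operatorname{Var}(2X) = 4\operatorname{Var}(X),\]
which matches $\operatorname{Var}(X) + 2\operatorname{Cov}(X, X) + \operatorname{Var}(X)$ but not the naive $2\operatorname{Var}(X)$.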
Lemma 6. If $X$ and $Y$ are independent, then $X$ and $Y$ are uncorrelated. In particular,
\[\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y).\]
Proof. By definition, since $X$ and $Y$ are independent, $P(X = x, Y = y) = P(X = x)\,P(Y = y)$, so
\[E[XY] = \sum_{x, y} xy\, P(X = x, Y = y) = \Big(\sum_x x\, P(X = x)\Big)\Big(\sum_y y\, P(Y = y)\Big) = E[X]\, E[Y].\]
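The converse of Lemma 6 fails in general. A standard example: let $X$ be uniform on $\{-1, 0, 1\}$ and $Y = X^2$. Then $E[X] = E[X^3] = 0$, so
\[\operatorname{Cov}(X, Y) = E[X^3] - E[X]\,E[X^2] = 0\]
and $X, Y$ are uncorrelated, yet $Y$ is completely determined by $X$, so they are not independent.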
From Lemma 6, we recover the core variance properties of interest.
Theorem 1. If $X$ and $Y$ are independent random variables, then
\[\operatorname{Var}(X \pm Y) = \operatorname{Var}(X) + \operatorname{Var}(Y).\]
Proof. The plus case is Lemma 6. For the minus case, since $X$ and $-Y$ are also independent, Lemmas 4 and 6 give
\[\operatorname{Var}(X - Y) = \operatorname{Var}(X) + \operatorname{Var}(-Y) = \operatorname{Var}(X) + (-1)^2\operatorname{Var}(Y) = \operatorname{Var}(X) + \operatorname{Var}(Y).\]
Example 1. For any $X \sim \operatorname{Bernoulli}(p)$, $\operatorname{Var}(X) = p(1 - p)$. Likewise, for $X \sim \operatorname{Binomial}(n, p)$, $\operatorname{Var}(X) = np(1 - p)$.
Proof. We observe that $X^2 = X$ for $X \sim \operatorname{Bernoulli}(p)$, so that by Lemma 1,
\[\operatorname{Var}(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p).\]
Writing $X = X_1 + \dots + X_n$ for i.i.d. $X_1, \dots, X_n \sim \operatorname{Bernoulli}(p)$, Theorem 1 gives
\[\operatorname{Var}(X) = \sum_{i=1}^n \operatorname{Var}(X_i) = np(1 - p).\]
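Concretely, for $n = 100$ tosses of a fair coin ($p = \frac{1}{2}$), the number of heads $X$ has $\operatorname{Var}(X) = 100 \cdot \frac{1}{2} \cdot \frac{1}{2} = 25$, i.e. a standard deviation of $5$ heads.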
Example 2. Let $X_1, \dots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Define the sample mean by
\[\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i.\]
Then, by linearity of expectation together with Theorem 1 and Lemma 4, $E[\bar{X}_n] = \mu$ and
\[\operatorname{Var}(\bar{X}_n) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(X_i) = \frac{\sigma^2}{n}.\]
Thus, at least intuitively, we should get $\operatorname{Var}(\bar{X}_n) \to 0$ as $n \to \infty$, so that $\bar{X}_n \to \mu$. We can formalise this idea as follows.
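Numerically: averaging $n = 100$ i.i.d. measurements, each with standard deviation $\sigma = 10$, gives $\operatorname{Var}(\bar{X}_{100}) = \frac{100}{100} = 1$, i.e. the sample mean has standard deviation $1$: a tenfold reduction in spread.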
Lemma 7. For any $a > 0$ and nonnegative random variable $X$,
\[P(X \geq a) \leq \frac{E[X]}{a}.\]
This result is known as Markov’s inequality.
Proof. For any $a > 0$, we first note that for $X \geq 0$,
\[a \cdot \mathbf{1}_{\{X \geq a\}} \leq X,\]
where $\mathbf{1}_{\{X \geq a\}}$ denotes the indicator of the event $\{X \geq a\}$. Hence, $a\,P(X \geq a) = E\big[a \cdot \mathbf{1}_{\{X \geq a\}}\big] \leq E[X]$.
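For instance, if $X \geq 0$ has $E[X] = 1$, then Markov’s inequality yields $P(X \geq 3) \leq \frac{1}{3}$: no such random variable can place more than a third of its mass at or above $3$.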
Lemma 8. For any random variable $X$ with mean $\mu$ and variance $\sigma^2$, and any $k > 0$,
\[P(|X - \mu| \geq k) \leq \frac{\sigma^2}{k^2}.\]
This result is known as Chebyshev’s inequality.
Proof. Setting $Y = (X - \mu)^2 \geq 0$ and applying Markov’s inequality, since $|X - \mu| \geq k$ if and only if $(X - \mu)^2 \geq k^2$,
\[P(|X - \mu| \geq k) = P\big((X - \mu)^2 \geq k^2\big) \leq \frac{E[(X - \mu)^2]}{k^2} = \frac{\sigma^2}{k^2}.\]
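For instance, taking $k = 2\sigma$ shows that $P(|X - \mu| \geq 2\sigma) \leq \frac{1}{4}$: at most a quarter of the mass lies two or more standard deviations from the mean, regardless of the distribution.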
Theorem 2. Fix $\varepsilon > 0$ and a distribution $\mathcal{D}$ with mean $\mu$ and variance $\sigma^2$. Then for i.i.d. random variables $X_1, \dots, X_n \sim \mathcal{D}$,
\[P\big(|\bar{X}_n - \mu| \geq \varepsilon\big) \leq \frac{\sigma^2}{n\varepsilon^2} \to 0 \quad \text{as } n \to \infty.\]
Hence, with more technical caveats, we can conclude that
\[\bar{X}_n \to \mu \quad \text{as } n \to \infty.\]
This result is known as the weak law of large numbers.
Proof. Applying Chebyshev’s inequality to $\bar{X}_n$, which has mean $\mu$ and variance $\frac{\sigma^2}{n}$ by Example 2,
\[P\big(|\bar{X}_n - \mu| \geq \varepsilon\big) \leq \frac{\operatorname{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2}.\]
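To get a feel for the rate, consider fair coin flips, where $\mu = \frac{1}{2}$ and $\sigma^2 = \frac{1}{4}$. Taking $\varepsilon = 0.01$, the bound reads
\[P\big(|\bar{X}_n - \tfrac{1}{2}| \geq 0.01\big) \leq \frac{1/4}{n \cdot 10^{-4}} = \frac{2500}{n},\]
which is informative only once $n$ exceeds $2500$.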
This writeup brings us to the limit, pun not intended, of our intuitive ideas of probability. Up till now, we assumed that $X$ is discrete (i.e. the range of $X$ is at most countably infinite). But for a more serious discussion, we will need $X$ to be continuous of some kind, so that $P(X = x) = 0$ in a meaningful sense. We’ll take an even more general direction: the measure-theoretic formulation of probability theory.
—Joel Kindiak, 3 Jul 25, 0019H