In high school, the standard deviation of a dataset $x_1, \dots, x_n$ is usually introduced informally as information about the spread of the dataset, quantified through the unwieldy formula
\[\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}, \quad \text{where } \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i.\]
Today, we will approach the standard deviation from a far more intuitive perspective, and recover the above formula as a special case. Its cousin, the variance, is of greater mathematical interest.
We first define the covariance between two random variables $X$ and $Y$, assuming all quantities exist. Intuitively, it should measure the overall deviation of some kind between $X$ and its expectation $E[X]$, and $Y$ and its expectation $E[Y]$.
Definition 1. The covariance of $X$ and $Y$, denoted $\operatorname{Cov}(X, Y)$, is defined by
\[\operatorname{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big].\]
Lemma 1. By expectation properties, we obtain
\[\operatorname{Cov}(X, Y) = E[XY] - E[X]\,E[Y].\]
Proof. Expanding within the expectation,
\[(X - E[X])(Y - E[Y]) = XY - X\,E[Y] - E[X]\,Y + E[X]\,E[Y].\]
Applying the linearity of $E$ (we will formally justify this in measure theory),
\[\operatorname{Cov}(X, Y) = E[XY] - E[X]\,E[Y] - E[X]\,E[Y] + E[X]\,E[Y] = E[XY] - E[X]\,E[Y].\]
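As a quick concrete check, take $Y = X$ where $X$ is a single fair coin flip taking values in $\{0, 1\}$. Since $X^2 = X$, we have $E[XY] = E[X^2] = E[X] = \frac{1}{2}$, so
\[\operatorname{Cov}(X, X) = E[X^2] - E[X]^2 = \frac{1}{2} - \frac{1}{4} = \frac{1}{4}.\]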
Lemma 2. The covariance satisfies the following properties:
1. $\operatorname{Cov}(X, X) \geq 0$,
2. $\operatorname{Cov}(X, X) = 0$ if and only if $X$ is a constant,
3. $\operatorname{Cov}(X, Y) = \operatorname{Cov}(Y, X)$,
4. $\operatorname{Cov}(aX + bY, Z) = a\operatorname{Cov}(X, Z) + b\operatorname{Cov}(Y, Z)$ for $a, b \in \mathbb{R}$.
Proof. We prove the fourth property to illustrate. By Lemma 1 and linearity,
\[\operatorname{Cov}(aX + bY, Z) = E[(aX + bY)Z] - E[aX + bY]\,E[Z] = a\big(E[XZ] - E[X]\,E[Z]\big) + b\big(E[YZ] - E[Y]\,E[Z]\big) = a\operatorname{Cov}(X, Z) + b\operatorname{Cov}(Y, Z).\]
It is similar to the properties of an inner product, but not exactly, since the second property should require $X = 0$, rather than just
$X$ being a constant. To make such an assertion and more requires technical qualifiers addressed in measure theory. Lemma 2 outlines the main (practical) properties that we need from the covariance.
Lemma 3. Given any constant $c$, $\operatorname{Cov}(X, c) = 0$.
Proof. By Lemma 2, it suffices to prove that $\operatorname{Cov}(c, X) = 0$:
\[\operatorname{Cov}(c, X) = E[cX] - E[c]\,E[X] = c\,E[X] - c\,E[X] = 0.\]
Lemma 4. For any random variable $X$, define its variance $\operatorname{Var}(X) = \operatorname{Cov}(X, X)$ whenever the right-hand side exists. Then for $a, b \in \mathbb{R}$,
\[\operatorname{Var}(aX + b) = a^2\operatorname{Var}(X).\]
Proof. We have, by Lemmas 2 and 3,
\[\operatorname{Var}(aX + b) = \operatorname{Cov}(aX + b, aX + b) = a^2\operatorname{Cov}(X, X) + 2a\operatorname{Cov}(X, b) + \operatorname{Cov}(b, b) = a^2\operatorname{Var}(X).\]
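As a quick illustration of Lemma 4, consider converting temperatures via $F = \frac{9}{5}C + 32$: then $\operatorname{Var}(F) = \left(\frac{9}{5}\right)^2 \operatorname{Var}(C)$, while the additive shift $32$ has no effect on the spread.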
Lemma 5. Denote $\mu = E[X]$ and $\sigma^2 = \operatorname{Var}(X)$. Define the standard deviation of $X$ by $\sigma = \sqrt{\operatorname{Var}(X)}$. Define the centered random variable
\[Z = \frac{X - \mu}{\sigma}\]
whenever $\sigma \neq 0$. Then $E[Z] = 0$ and $\operatorname{Var}(Z) = 1$.
Proof. By linearity of expectation, $E[Z] = \frac{E[X] - \mu}{\sigma} = 0$. By Lemma 2, together with Lemma 3,
\[\operatorname{Var}(Z) = \operatorname{Cov}\left(\frac{X - \mu}{\sigma}, \frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma^2}\operatorname{Cov}(X - \mu, X - \mu) = \frac{\operatorname{Var}(X)}{\sigma^2} = 1.\]
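For example, if exam scores have mean $\mu = 70$ and standard deviation $\sigma = 10$, then a score of $85$ standardises to $Z = \frac{85 - 70}{10} = 1.5$, i.e. one and a half standard deviations above the mean.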
The covariance helps us measure the correlation between random variables, which we formally define below.
Definition 2. The correlation between $X$ and $Y$ with nonzero variances is defined by
\[\rho(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}}.\]
Corollary 1. Given the dataset $\{(x_i, y_i)\}_{i=1}^n$ equipped with the uniform distribution on the indices $i \in \{1, \dots, n\}$, define the discrete random variables $X$ and $Y$ by $X(i) = x_i$ and $Y(i) = y_i$. Then $X$ and $Y$ follow uniform distributions on the datasets $\{x_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n$ respectively, and
\[\operatorname{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}),\]
where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$. When there is no confusion, we suppress the indices and write, say, $\sum x$ for $\sum_{i=1}^n x_i$ for brevity.
Proof. By an equivalent definition for expectation, $E[X] = \sum_{i=1}^n \frac{1}{n}\,x_i = \bar{x}$, so that
\[\operatorname{Cov}(X, Y) = E\big[(X - \bar{x})(Y - \bar{y})\big] = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}).\]
The other quantities can be computed in a similar manner.
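To see Corollary 1 in action on a toy dataset, take $\{(1, 2), (2, 4), (3, 6)\}$, so that $\bar{x} = 2$ and $\bar{y} = 4$. Then
\[\operatorname{Cov}(X, Y) = \frac{1}{3}\big((-1)(-2) + (0)(0) + (1)(2)\big) = \frac{4}{3}.\]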
Corollary 2. Setting $Y = X$ in Corollary 1, we obtain
\[\operatorname{Var}(X) = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.\]
Furthermore, combining Definition 2 and Corollaries 1 and 2 (the factors of $\frac{1}{n}$ cancel), the product moment correlation coefficient is computed via
\[r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}.\]
Finally, taking the square root in Corollary 2 recovers the standard deviation formula from the introduction:
\[\sigma = \sqrt{\frac{1}{n}\sum (x - \bar{x})^2}.\]
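Continuing the toy dataset above: $\sum (x - \bar{x})(y - \bar{y}) = 4$, $\sum (x - \bar{x})^2 = 2$ and $\sum (y - \bar{y})^2 = 8$, so
\[r = \frac{4}{\sqrt{2 \cdot 8}} = 1,\]
consistent with the exact linear relationship $y = 2x$.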
Given two random variables $X$ and $Y$, we know that $E[X + Y] = E[X] + E[Y]$. What can we deduce about $\operatorname{Var}(X + Y)$? Interpreting the variance as the covariance with itself and applying covariance properties (where most of the action happens),
\[\operatorname{Var}(X + Y) = \operatorname{Cov}(X + Y, X + Y) = \operatorname{Var}(X) + 2\operatorname{Cov}(X, Y) + \operatorname{Var}(Y).\]
So it’s not as simple as $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$; there’s an extra term involving $\operatorname{Cov}(X, Y)$. However, the equality does hold if $\operatorname{Cov}(X, Y) = 0$. In this case, we call $X$ and $Y$ uncorrelated. Equivalently, by Lemma 1, $X$ and $Y$ are uncorrelated if and only if $E[XY] = E[X]\,E[Y]$.
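The extra term can be substantial: taking $Y = X$ (perfect correlation), Lemma 4 gives
\[\operatorname{Var}(X + X) = \operatorname{Var}(2X) = 4\operatorname{Var}(X),\]
which matches $\operatorname{Var}(X) + 2\operatorname{Cov}(X, X) + \operatorname{Var}(X)$ but not the naive $2\operatorname{Var}(X)$.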
Lemma 6. If $X$ and $Y$ are independent, then $X$ and $Y$ are uncorrelated. In particular,
\[\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y).\]
Proof. By definition, since $X$ and $Y$ are independent, $P(X = x, Y = y) = P(X = x)\,P(Y = y)$, so
\[E[XY] = \sum_{x, y} xy\, P(X = x, Y = y) = \Big(\sum_x x\, P(X = x)\Big)\Big(\sum_y y\, P(Y = y)\Big) = E[X]\, E[Y].\]
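The converse of Lemma 6 fails in general. A standard example: let $X$ be uniform on $\{-1, 0, 1\}$ and $Y = X^2$. Then $E[X] = E[X^3] = 0$, so
\[\operatorname{Cov}(X, Y) = E[X^3] - E[X]\,E[X^2] = 0\]
and $X, Y$ are uncorrelated, yet $Y$ is completely determined by $X$, so they are not independent.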
From Lemma 6, we recover the core variance properties of interest.
Theorem 1. If $X$ and $Y$ are independent random variables, then
\[\operatorname{Var}(X \pm Y) = \operatorname{Var}(X) + \operatorname{Var}(Y).\]
Proof. The plus case is Lemma 6. For the minus case, since $X$ and $-Y$ are also independent, Lemmas 4 and 6 give
\[\operatorname{Var}(X - Y) = \operatorname{Var}(X) + \operatorname{Var}(-Y) = \operatorname{Var}(X) + (-1)^2\operatorname{Var}(Y) = \operatorname{Var}(X) + \operatorname{Var}(Y).\]
Example 1. For any $X \sim \operatorname{Bernoulli}(p)$, $\operatorname{Var}(X) = p(1 - p)$. Likewise, for $X \sim \operatorname{Binomial}(n, p)$, $\operatorname{Var}(X) = np(1 - p)$.
Proof. We observe that $X^2 = X$ for $X \sim \operatorname{Bernoulli}(p)$, so that by Lemma 1,
\[\operatorname{Var}(X) = E[X^2] - E[X]^2 = p - p^2 = p(1 - p).\]
Writing $X = X_1 + \dots + X_n$ for i.i.d. $X_1, \dots, X_n \sim \operatorname{Bernoulli}(p)$, Theorem 1 gives
\[\operatorname{Var}(X) = \sum_{i=1}^n \operatorname{Var}(X_i) = np(1 - p).\]
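Concretely, for $n = 100$ tosses of a fair coin ($p = \frac{1}{2}$), the number of heads $X$ has $\operatorname{Var}(X) = 100 \cdot \frac{1}{2} \cdot \frac{1}{2} = 25$, i.e. a standard deviation of $5$ heads.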
Example 2. Let $X_1, \dots, X_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Define the sample mean by
\[\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i.\]
Then, by linearity of expectation together with Theorem 1 and Lemma 4, $E[\bar{X}_n] = \mu$ and
\[\operatorname{Var}(\bar{X}_n) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(X_i) = \frac{\sigma^2}{n}.\]
Thus, at least intuitively, we should get $\operatorname{Var}(\bar{X}_n) \to 0$ as $n \to \infty$, so that $\bar{X}_n \to \mu$. We can formalise this idea as follows.
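Numerically: averaging $n = 100$ i.i.d. measurements, each with standard deviation $\sigma = 10$, gives $\operatorname{Var}(\bar{X}_{100}) = \frac{100}{100} = 1$, i.e. the sample mean has standard deviation $1$: a tenfold reduction in spread.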
Lemma 7. For any $a > 0$ and nonnegative random variable $X$,
\[P(X \geq a) \leq \frac{E[X]}{a}.\]
This result is known as Markov’s inequality.
Proof. For any $a > 0$, we first note that for $X \geq 0$,
\[a \cdot \mathbf{1}_{\{X \geq a\}} \leq X,\]
where $\mathbf{1}_{\{X \geq a\}}$ denotes the indicator of the event $\{X \geq a\}$. Hence, $a\,P(X \geq a) = E\big[a \cdot \mathbf{1}_{\{X \geq a\}}\big] \leq E[X]$.
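For instance, if $X \geq 0$ has $E[X] = 1$, then Markov’s inequality yields $P(X \geq 3) \leq \frac{1}{3}$: no such random variable can place more than a third of its mass at or above $3$.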
Lemma 8. For any random variable $X$ with mean $\mu$ and variance $\sigma^2$, and any $k > 0$,
\[P(|X - \mu| \geq k) \leq \frac{\sigma^2}{k^2}.\]
This result is known as Chebyshev’s inequality.
Proof. Setting $Y = (X - \mu)^2 \geq 0$ and applying Markov’s inequality, since $|X - \mu| \geq k$ if and only if $(X - \mu)^2 \geq k^2$,
\[P(|X - \mu| \geq k) = P\big((X - \mu)^2 \geq k^2\big) \leq \frac{E[(X - \mu)^2]}{k^2} = \frac{\sigma^2}{k^2}.\]
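For instance, taking $k = 2\sigma$ shows that $P(|X - \mu| \geq 2\sigma) \leq \frac{1}{4}$: at most a quarter of the mass lies two or more standard deviations from the mean, regardless of the distribution.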
Theorem 2. Fix $\varepsilon > 0$ and a distribution $\mathcal{D}$ with mean $\mu$ and variance $\sigma^2$. Then for i.i.d. random variables $X_1, \dots, X_n \sim \mathcal{D}$,
\[P\big(|\bar{X}_n - \mu| \geq \varepsilon\big) \leq \frac{\sigma^2}{n\varepsilon^2} \to 0 \quad \text{as } n \to \infty.\]
Hence, with more technical caveats, we can conclude that
\[\bar{X}_n \to \mu \quad \text{as } n \to \infty.\]
This result is known as the weak law of large numbers.
Proof. Applying Chebyshev’s inequality to $\bar{X}_n$, which has mean $\mu$ and variance $\frac{\sigma^2}{n}$ by Example 2,
\[P\big(|\bar{X}_n - \mu| \geq \varepsilon\big) \leq \frac{\operatorname{Var}(\bar{X}_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2}.\]
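To get a feel for the rate, consider fair coin flips, where $\mu = \frac{1}{2}$ and $\sigma^2 = \frac{1}{4}$. Taking $\varepsilon = 0.01$, the bound reads
\[P\big(|\bar{X}_n - \tfrac{1}{2}| \geq 0.01\big) \leq \frac{1/4}{n \cdot 10^{-4}} = \frac{2500}{n},\]
which is informative only once $n$ exceeds $2500$.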
This writeup brings us to the limit, pun not intended, of our intuitive ideas of probability. Up till now, we assumed that $X$ is discrete (i.e. the range of $X$ is at most countably infinite). But for a more serious discussion, we will need $X$ to be continuous of some kind, so that $P(X = x) = 0$ in a meaningful sense. We’ll take an even more general direction: the measure-theoretic formulation of probability theory.
—Joel Kindiak, 3 Jul 25, 0019H