Bayes’ Theorem

The contemporary probabilist Rick Durrett claimed in his lecture notes that measure theory ends and probability begins with the definition of independence. We explore that idea in this topic, along with the machinery of conditional probability that generalises it and culminates in Bayes’ theorem.

Let (\Omega,  \mathcal F, \mathbb P) be a probability space.

Lemma 1. Let K be any event with positive probability: \mathbb P(K) > 0. Then the map

\displaystyle \mathbb P( \cdot \mid K) := \frac{ \mathbb P(\cdot \cap K) }{ \mathbb P(K) } : \mathcal F \to [0, 1]

is a well-defined probability measure. In particular,

  • \mathbb P(L \mid K) = 1 if K \subseteq L, and
  • \mathbb P(L \mid K) = 0 if K \cap L = \emptyset.

Equivalently,

\mathbb P(L \cap K) = \mathbb P(L \mid K) \cdot \mathbb P(K), \quad L \in \mathcal F.

Proof. Nonnegativity and countable additivity of \mathbb P(\cdot \mid K) are inherited from the corresponding properties of \mathbb P, since intersecting with K preserves countable disjoint unions and \mathbb P(K) > 0 is a fixed scaling constant. It thus suffices to prove that \mathbb P(\Omega \mid K) = 1, which holds since K \subseteq \Omega implies

\displaystyle \mathbb P( \Omega \mid K) := \frac{ \mathbb P( \Omega \cap K) }{ \mathbb P(K) } = \frac{ \mathbb P( K) }{ \mathbb P(K) } = 1.
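
To see Lemma 1 in action, here is a minimal Python sketch (the die-roll space and the events are hypothetical choices of ours) that builds \mathbb P(\cdot \mid K) on a finite sample space and confirms \mathbb P(\Omega \mid K) = 1:

    from fractions import Fraction

    # Hypothetical finite probability space: one fair die roll.
    omega = {1, 2, 3, 4, 5, 6}

    def prob(A):
        """Uniform counting measure: P(A) = |A| / |Omega|."""
        return Fraction(len(A & omega), len(omega))

    def cond_prob(L, K):
        """P(L | K) = P(L intersect K) / P(K); requires P(K) > 0."""
        assert prob(K) > 0
        return prob(L & K) / prob(K)

    K = {2, 4, 6}                       # the roll is even
    assert cond_prob(omega, K) == 1     # Lemma 1: P(Omega | K) = 1
    assert cond_prob({1, 2}, K) == Fraction(1, 3)   # only the outcome 2 survives conditioning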

Corollary 1 (Law of total probability). If \Omega = \bigsqcup_{i=1}^n K_i and each K_i has positive probability, then

\displaystyle \mathbb P(K) = \sum_{i=1}^n \mathbb P(K \mid K_i) \cdot \mathbb P(K_i).
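
As a quick sanity check of Corollary 1, a minimal Python sketch (again on a hypothetical die-roll space) that recovers \mathbb P(K) by conditioning on each piece of a partition:

    from fractions import Fraction

    # Hypothetical die-roll space with the uniform counting measure.
    omega = {1, 2, 3, 4, 5, 6}
    prob = lambda A: Fraction(len(A & omega), len(omega))
    cond = lambda L, K: prob(L & K) / prob(K)

    partition = [{1, 2}, {3, 4}, {5, 6}]    # Omega as a disjoint union
    K = {1, 3, 5}                           # the roll is odd

    # Corollary 1: P(K) = sum_i P(K | K_i) * P(K_i).
    assert sum(cond(K, Ki) * prob(Ki) for Ki in partition) == prob(K)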

What the quantity \mathbb P(L \mid K) as defined in Lemma 1 measures, roughly speaking, is the probability that the event L occurs under the assumption that the event K occurs. This reading fits the conclusion of Corollary 1 quite well: to evaluate \mathbb P(K), we condition on each part K_i of the partition, evaluate \mathbb P(K \mid K_i) for each i, and then sum the weighted contributions via the addition principle. Furthermore, we recover the following intuitions:

  • if K \subseteq L, then any outcome \omega \in K automatically yields \omega \in L,
  • if K \cap L = \emptyset, then any outcome \omega \in K automatically yields \omega \notin L.

In a sense, these two scenarios are the extremes: the first means that K is totally included in L, and the second means that K is totally excluded from L. But could there be a middle ground? What if, in a sense, event L “does not care” about event K? Assuming that \mathbb P(K) > 0, this amounts to the equality on the left below, which rearranges into the equality on the right:

\displaystyle \mathbb P(L \mid K) = \mathbb P(L) \quad \Rightarrow \quad \mathbb P(K \cap L) = \mathbb P(K) \cdot \mathbb P(L).

The equation on the left means that regardless of whether the event K occurs or not, the event L will hold with the same probability. The equation on the right is equivalent to the equation on the left if \mathbb P(K) > 0, but deserves its own attention since it handles the situation where \mathbb P(K) = 0. We use this definition for the notion of independence:

Definition 1. Two events K, L \in \mathcal F are independent if

\mathbb P(K \cap L) = \mathbb P(K) \cdot \mathbb P(L).

Example 1. Flip a fair coin twice so that its sample space is

\Omega := \{ \mathrm H, \mathrm T \}^2 = \{ (\mathrm H, \mathrm H), (\mathrm H, \mathrm T), (\mathrm T, \mathrm H), (\mathrm T, \mathrm T) \}.

Let \mathcal F := \mathcal P(\Omega) denote the discrete \sigma-algebra and \mathbb P denote the uniform counting measure \mathbb P(\cdot) = |\cdot|/4 on \mathcal F.

Let K := \{ (\mathrm H, \mathrm H), (\mathrm H, \mathrm T) \} denote the event that the first flip is a Head, and L := \{ (\mathrm H, \mathrm H), (\mathrm T, \mathrm H) \} denote the event that the second flip is a Head. Then

\displaystyle \mathbb P(K \cap L) = \mathbb P(\{ (\mathrm H, \mathrm H) \}) = \frac{1}{4} = \frac 12 \cdot \frac 12 = \mathbb P(K) \cdot \mathbb P(L).

Thus, the events K and L are independent; misunderstanding this fact (as the gambler’s fallacy does) may lead to financial ruin.
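
We can double-check Example 1 by brute-force enumeration; a minimal Python sketch (variable names are our own):

    from fractions import Fraction
    from itertools import product

    omega = set(product("HT", repeat=2))    # {(H,H), (H,T), (T,H), (T,T)}
    prob = lambda A: Fraction(len(A & omega), len(omega))

    K = {w for w in omega if w[0] == "H"}   # first flip is a Head
    L = {w for w in omega if w[1] == "H"}   # second flip is a Head

    # Independence: P(K intersect L) = 1/4 = P(K) * P(L).
    assert prob(K & L) == prob(K) * prob(L) == Fraction(1, 4)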

Independence of events is an insanely useful idea in probability theory, especially in the sense of independent random variables, which we will explore in future write-ups. In fact, many kinds of events are independent:

Theorem 1. We have the following independence (or not) properties:

  • for any K \in \mathcal F, K and \emptyset are independent,
  • if K, L have positive probability, then they cannot be mutually exclusive and independent at the same time,
  • if 0 < \mathbb P(K) < 1, then K cannot be independent of itself.

Furthermore, if K, L are independent, the following pairs of events are independent as well:

L,K, \quad K, \Omega \backslash L,\quad \Omega \backslash K, L, \quad \Omega \backslash K, \Omega \backslash L.

Proof. We illustrate the proof of K, \Omega \backslash L being independent and leave the rest as exercises. Since K, L are independent, we use finite additivity to obtain

\begin{aligned} \mathbb P(K) &= \mathbb P(K \cap (L \sqcup \Omega \backslash L)) \\ &= \mathbb P(K \cap L) + \mathbb P(K \cap \Omega \backslash L) \\ &= \mathbb P(K) \cdot \mathbb P(L) + \mathbb P(K \cap \Omega \backslash L). \end{aligned}

Since L \subseteq \Omega, we have \Omega \cap L = L, so that

\begin{aligned} \mathbb P(K \cap \Omega \backslash L) &= \mathbb P(K) - \mathbb P(K) \cdot \mathbb P(L) \\ &= \mathbb P(K) \cdot (1 - \mathbb P(L)) \\ &= \mathbb P(K) \cdot (\mathbb P(\Omega) - \mathbb P(\Omega \cap L)) \\ &= \mathbb P(K) \cdot \mathbb P(\Omega \backslash L). \end{aligned}
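
Here is a hedged numerical check of the complement claims, reusing the coin-flip space of Example 1:

    from fractions import Fraction
    from itertools import product

    omega = set(product("HT", repeat=2))    # coin-flip space from Example 1
    prob = lambda A: Fraction(len(A & omega), len(omega))
    indep = lambda A, B: prob(A & B) == prob(A) * prob(B)

    K = {w for w in omega if w[0] == "H"}   # first flip is a Head
    L = {w for w in omega if w[1] == "H"}   # second flip is a Head

    # Independence of K, L propagates to every pair of complements.
    assert all(indep(A, B) for A in (K, omega - K) for B in (L, omega - L))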

The reality, however, is that in most situations, the quantity \mathbb P(K \mid L) is nontrivial. The quantity is most natural to interpret when the events unfold sequentially: first the event L either occurs (or not), and then the event K occurs (or not).

For example, L could indicate the event “the patient has COVID-19” and K could indicate the event “the patient tests positive for COVID-19”. Since we are finite human beings, there is no harm assuming that \mathbb P(K) > 0 and \mathbb P(L) > 0. Now the quantity \mathbb P(K \mid L) has a natural interpretation: it measures the probability that, under the assumption that a patient has COVID-19, the patient tests positive for the virus. Summarised in intuitive terms, \mathbb P(K \mid L) measures the sensitivity of the COVID-19 test.

In a similar vein (no pun intended), \mathbb P(\Omega \backslash K \mid \Omega \backslash L) measures the specificity of the COVID-19 test. The quantities \mathbb P(K \mid \Omega \backslash L) and \mathbb P(\Omega \backslash K \mid L) then measure the false positive and false negative rates respectively, and are connected to the sensitivity and specificity measures as follows:

\begin{aligned} \mathbb P(K \mid \Omega \backslash L) = 1 - \text{specificity}, \quad \mathbb P(\Omega \backslash K \mid L) = 1 - \text{sensitivity}. \end{aligned}
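
To keep these four quantities straight, the following Python sketch derives them from a completely made-up table of test outcomes (the counts are invented for illustration only):

    from fractions import Fraction as F

    # Made-up outcomes from testing 1000 patients, 200 of whom are infected.
    true_pos, false_neg = 190, 10     # infected:     positive vs negative test
    false_pos, true_neg = 80, 720     # not infected: positive vs negative test

    sensitivity = F(true_pos, true_pos + false_neg)    # P(K | L)         = 19/20
    specificity = F(true_neg, true_neg + false_pos)    # P(not K | not L) = 9/10

    # The error rates are exactly the complements above:
    assert 1 - specificity == F(false_pos, false_pos + true_neg)  # P(K | not L)
    assert 1 - sensitivity == F(false_neg, true_pos + false_neg)  # P(not K | L)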

Now we present our case-study that generalises nicely into Bayes’ theorem.

Example 2. Suppose 100p\% of the population has been infected with COVID-19, and your COVID-19 test kit has a sensitivity of 95\% and a specificity of 90\%. Given that a patient tests positive using your test kit, what is the probability, in terms of p, that this patient actually has COVID-19?

Solution. Using K, L as our events, we observe that

\mathbb P(K \mid L) = 0.95,\quad \mathbb P(\Omega \backslash K \mid \Omega \backslash L) = 0.9,\quad \mathbb P(L) = p \in [0, 1].

We are interested in the quantity \mathbb P(L \mid K). Nothing in our problem-solving arsenal computes this directly; what we do know is \mathbb P(K \mid L). We note that K \cap L = L \cap K, so that

\mathbb P(K \mid L) \cdot \mathbb P(L) = \mathbb P(K \cap L) = \mathbb P(L \cap K) = \mathbb P(L \mid K) \cdot \mathbb P(K).

Doing some algebruh,

\begin{aligned} \mathbb P(L \mid K) &= \frac{\mathbb P(K \mid L) \cdot \mathbb P(L)}{ \mathbb P(K)}. \end{aligned}

Now taking advantage of the law of total probability and substituting the given values,

\begin{aligned} f(p) := \mathbb P(L \mid K) &= \frac{\mathbb P(K \mid L) \cdot \mathbb P(L)}{ \mathbb P(K \mid L) \cdot \mathbb P(L) + \mathbb P(K \mid \Omega \backslash L) \cdot \mathbb P(\Omega \backslash L) } \\ &= \frac{0.95 \cdot p}{ 0.95 \cdot p + (1-0.9) \cdot (1-p) }. \end{aligned}

The fascinating observation is that f(5\%) \approx 33\%, f(10\%) \approx 51\%, and f(20\%) \approx 70\%. This means that even if 20\% of the population has been infected with COVID-19, and the test keeps the sensitivity and specificity above, then a positive result gives only a 70\% chance that we actually carry the virus. For better or for worse, such figures inform governments on various public health policies and measures taken to curb the virus and “flatten the curve”.
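
The arithmetic is easy to reproduce; a short Python script using the document’s own f(p):

    def f(p, sens=0.95, spec=0.90):
        """P(L | K): chance of infection given a positive test, at prevalence p."""
        return sens * p / (sens * p + (1 - spec) * (1 - p))

    for p in (0.05, 0.10, 0.20):
        print(f"f({p:.0%}) = {f(p):.0%}")
    # prints: f(5%) = 33%, f(10%) = 51%, f(20%) = 70%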

Our computations above prove Bayes’ theorem:

Theorem 2 (Bayes). Given events K, L with positive probability,

\displaystyle \mathbb P(L \mid K) = \frac{\mathbb P(K \mid L) \cdot \mathbb P(L)}{\mathbb P(K)}.

Furthermore, if \Omega = \bigsqcup_{i=1}^n L_i and each L_i has positive probability, then

\displaystyle \mathbb P(L \mid K) = \frac{\mathbb P(K \mid L) \cdot \mathbb P(L)}{\sum_{i=1}^n \mathbb P(K \mid L_i) \cdot \mathbb P(L_i)}.

There is more to discuss on Bayes’ theorem, but we conclude with one more application. Notice that if we treat \mathbb P(K) as a fixed constant, then

\displaystyle \mathbb P(L \mid K) = \frac{1}{\mathbb P(K)} \cdot \mathbb P(K \mid L) \cdot \mathbb P(L).

Then the higher the values of \mathbb P(K \mid L) and \mathbb P(L), the higher the value of \mathbb P(L \mid K). This connection is often used in Bayesian statistics to model an update of belief: given a prior probability \mathbb P(L), the posterior probability \mathbb P(L \mid K) is the updated probability of L once we have observed new evidence, namely K.
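
As a sketch of this belief-updating loop, suppose (hypothetically) that repeated tests are conditionally independent given the infection status and keep the sensitivity and specificity of Example 2; then each positive result feeds the old posterior back in as the new prior:

    def update(prior, sens=0.95, spec=0.90):
        """One step of Bayes' theorem: prior -> posterior after a positive test."""
        return sens * prior / (sens * prior + (1 - spec) * (1 - prior))

    belief = 0.05                  # prior: believed prevalence of 5%
    for n in range(1, 4):          # three positive tests in a row
        belief = update(belief)    # yesterday's posterior is today's prior
        print(f"after positive test {n}: P(L | evidence) = {belief:.1%}")
    # roughly 33.3%, 82.6%, 97.8%: evidence rapidly sharpens the belief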

We need to introduce one more key star player of probability, namely that of random variables. This we will do next time.

—Joel Kindiak, 26 Jun 25, 1847H
