Category: Probability

The Ultimate Winner
Previously, we have looked at three proposed solutions for the $K$ -armed bandit problem: explore-then-commit (ETC), upper confidence bounds (UCB), and Thompson sampling (TS). The regret bounds in the first two algorithms work for general $1$ -sub-Gaussian bandit instances (i.e. $\nu_i$ is $1$ -sub-Gaussian for each $i$ ), while the latter assumes a Gaussian bandit instance (i.e. $\nu_i = \mathcal N(\mu_i, 1)$ for each $i$ ).

We will assume this stricter condition, since Gaussian distributions with variance $1$ are, in fact, $1$ -sub-Gaussian. Given this information, and the bandit instance $\nu$ , which algorithm wins out? We first recall the following regret upper bounds for ETC and UCB:

$\begin{aligned} \frac{\mathcal R_n(\text{ETC}, \nu) }{\log(n)} &\leq \sum_{i=1}^K \frac 1{\log(n)} \left( m + (n-mK) e^{ -m\Delta_i^2/4 } \right)\Delta_i, \\ \frac{\mathcal R_n(\text{UCB}, \nu) }{\log(n)} &\leq \sum_{i=1}^K \left( \frac 3{\log(n)} + \frac {16}{\Delta_i^2} \right)\Delta_i. \end{aligned}$

The ETC strategy, though simple, is a bit too simplistic. Disregarding the fact that our exploration phase $m$ directly impacts the regret bound, even if we assumed $m$ to be merely constant (and not growing in $n$ ), the right hand side yields an upper bound of the form $n/{\log(n)}$ , which tends to $\infty$ as $n \to \infty$ . The UCB algorithm, on the other hand, yields a long-run regret bound of $\leq 16/\Delta_i^2$ , which is constant and hence ideal.

Round 1: UCB wins against ETC.

Where does Thompson sampling come into the picture? Let’s compare their regret bounds:

$\begin{aligned} \frac{\mathcal R_n(\text{UCB}, \nu) }{\log(n)} &\leq \sum_{i=1}^K \left( \frac 3{\log(n)} + \frac {16}{\Delta_i^2} \right)\Delta_i, \\ \frac{\mathcal R_n(\text{TS}, \nu) }{\log(n)} &\leq \sum_{i=1}^K \frac {2}{\Delta_i^2} \cdot \Delta_i. \end{aligned}$

Trivially, $2/\Delta_i^2 \leq 16/\Delta_i^2 \leq 3/{\log(n)} + 16/\Delta_i^2$ , so that we get the following.

Round 2: Thompson sampling wins against UCB.

Remark 1. The reality is, however, a lot more subtle. It turns out that a clever choice of $m$ that depends on $n$ can allow the upper bound for ETC to converge. Furthermore, the UCB algorithm can be modified (Chapter 8 of Lattimore and Szepesvari (2020)) to recover the constant $2/\Delta_i^2$ , and arguably is more extensive than Thompson sampling, since the UCB algorithm works for a wider class of bandits (namely, $1$ –sub-Gaussians rather than just Gaussians with variance $1$ ). The UCB algorithm is even more straightforward to generalise from a theoretical perspective, while TS analyses become highly technical—even in the simple case of Gaussians with variance $1$ .

But here’s the problem—how do we know that Thompson sampling or the modified UCB in Remark 1 are really the best algorithms for a bandit instance of Gaussians with variance $1$ ? We need to develop the notion of asymptotic optimality. We take inspiration from Chapter 16 of Lattimore and Szepesvari (2020).

Denote $\mathcal N(1) := \{ \mathcal N(\mu, 1) : \mu \in \mathbb R\}$ and denote the environment of $K$ -armed $1$ -Gaussian bandits by $\mathcal M_{\mathcal N}^K(1) := \mathcal N(1)^K$ .

Let $\rho$ denote a risk functional.

Definition 1. A policy (i.e. bandit algorithm) is $\rho$ –consistent over a bandit environment $\mathcal E$ if for any $\nu \in \mathcal E$ and $\alpha > 0$ , $\mathcal R_n^{\rho}(\pi, \nu)/n^\alpha \to 0$ as $n \to \infty$ .

In particular, UCB and TS are $\rho_{\mathbb E}$ -consistent over $\mathcal E_{\mathcal N}^K(1)$ since for any $\nu \in \mathcal E_{\mathcal N}^K(1)$ and $\alpha > 0$ they have constants $C$ (depending only on $\nu$ ) such that

$\displaystyle \frac{ \mathcal R_n^{\rho_{\mathbb E}}(\pi, \nu)}{ n^\alpha } = \frac{ \mathcal R_n^{\rho_{\mathbb E}}(\pi, \nu) }{ \log(n) } \cdot \frac{\log(n)}{n^\alpha} \leq C \cdot 0 = 0.$

For any policy consistent over the $K$ -armed bandit environment

$\mathcal E = \mathcal M_1 \times \cdots \times \mathcal M_K,$

it turns out that we have a very elegant regret lower bound, that even extends to the generalised risk setting. To describe this lower-bound, however, requires us to explore some baby information theory, and in particular, the Kullback-Leibler divergence.

Definition 2 (KL-Divergence). Let $\lambda$ denote the Lebesgue measure on $(\mathbb R, \mathfrak{B}(\mathbb R))$ . For probability measures $\nu \ll \lambda, \kappa \ll \lambda$ , if $\kappa \ll \nu$ , define the KL-divergence of $\nu$ relative to $\kappa$ by

$\displaystyle \mathrm{KL}(\nu, \kappa) := \int_{\Omega} \frac{\mathrm d\nu}{\mathrm d\lambda} \cdot \log \left( \frac{\mathrm d\nu}{\mathrm d\kappa} \right)\, \mathrm d\lambda = \mathbb E_{\nu}\left[\log\left( \frac{\mathrm d\nu}{\mathrm d\kappa} \right)\right],$

where $\displaystyle \mathbb E_{\nu}[g(X)] \equiv \int_{\mathbb R} g(x)\, \mathrm d \nu(x)$ .

Example 1. Given the Gaussian distributions $\nu_i = \mathcal N(\mu_i, \sigma_i^2)$ ,

$\displaystyle \mathrm{ KL }(\nu_1, \nu_2) = \frac 12 \left( \frac{(\mu_1 - \mu_2)^2}{\sigma_2^2} + \frac{\sigma_1^2}{\sigma_2^2} - \log \frac{\sigma_1^2}{\sigma_2^2} - 1\right).$

In the special case $\sigma_1 = \sigma_2 = \sigma$ ,

$\displaystyle \mathrm{ KL }(\nu_1, \nu_2) = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2} .$

Furthermore, if $\sigma = 1$ , then $\mathrm{KL}(\nu_1, \nu_2) = (\mu_1 - \mu_2)^2/2$ .

Proof. See this page.

Definition 3. Fix $r \in \mathbb R$ . Define

$\displaystyle \mathcal K_{\inf}^{\rho}(\nu, r, \mathcal M) = \inf_{ \kappa \in \mathcal M } \{ \mathrm{KL}(\nu, \kappa) : \rho(\kappa) > r\}.$

Now let $\mathcal E = \mathcal M_1 \times \cdots \times \mathcal M_K$ denote a bandit environment.

Example 2. For $\nu_i = \mathcal N(\mu_i, 1)$ and $\mu_1 > \mu_i$ ,

$\displaystyle \mathcal K_{\inf}^{\rho_{\mathbb E}}(\nu_i, \mu_1, \mathcal N(1)) = \frac{ \Delta_i^2 }{ 2 },$

where $\Delta_i := \mu_1 - \mu_i$ .

Proof. By definition,

$\displaystyle \mathcal K_{\inf}^{\rho_{\mathbb E}}(\nu_i, \mu_1, \mathcal N(1)) = \inf_{ \kappa \in \mathcal N(1) } \{ \mathrm{KL}(\nu_i, \kappa) : \rho_{\mathbb E}(\kappa) > \mu_1\}.$

By Example 1, given $\kappa = \mathcal N(\mu_2, 1) \in \mathcal M(1)$ such that $\rho_{\mathbb E}(\kappa) > \mu_1$ , we have $\epsilon := \mu_2 - \mu_1 > 0$ , so that

$\begin{aligned} \mathrm{ KL }(\nu_i, \kappa) &= \frac{(\mu_i - (\mu_1 + \epsilon))^2}{2} = \frac{ (\Delta_i+ \epsilon)^2 }{2}. \end{aligned}$

Taking $\epsilon \to 0^+$ ,

$\displaystyle \mathcal K_{\inf}^{\rho_{\mathbb E}}(\nu_i, \mu_1, \mathcal N(1)) = \frac{ (\Delta_i + 0)^2 }{ 2 } = \frac{ \Delta_i^2 }{ 2 }.$

Why do we care about the optimised KL-divergence? Because it plays a fundamental role in (asymptotically) lower-bounding the cumulative regret!

Theorem 1. Let $\pi$ be a consistent policy over the $K$ -armed bandit environment

$\mathcal E = \mathcal M_1 \times \cdots \times \mathcal M_K.$

Then for any $\nu \in \mathcal E$ with unique $\rho$ -optimal arm $1$ ,

$\displaystyle \liminf_{n \to \infty} \frac{\mathcal R_n^{\rho}(\pi, \nu)}{\log(n)} \geq \sum_{i : \Delta_i^{\rho} > 0} \frac{1}{ \mathcal K_{\inf}^{\rho} ( \nu_i, \rho(\nu_1) , \mathcal M_i ) } \cdot \Delta_i^{\rho}.$

Proof. We follow the proof of Theorem 16.2 in Lattimore and Szepesvari (2020). Denote the bandit instance by $\nu = (\nu_1,\dots, \nu_K) \in \mathcal E$ . Fix $\epsilon > 0$ arbitrary. Denote $K_i := \mathcal K_{\inf}^{\rho} ( \nu_i, \rho(\nu_1) , \mathcal M_i )$ for brevity. By Definition 3, there exists $\nu_i' \in \mathcal M_i$ such that $\rho(\nu_i') > r$ and

$\mathrm{KL}(\nu_i, \nu_i') < K_i + \epsilon.$

Define the alternate bandit instance $\nu' \in \mathcal E$ by

$\nu' = (\nu_1,\dots, \nu_{i-1}, \nu_i', \nu_{i+1}, \dots \nu_K)$

for a suitably chosen $\nu_i' \in \mathcal M_i$ . We claim that $\mathbb E[T_i(n)] \geq 1/(K_i + \epsilon)$ . By the divergence decomposition lemma in Lattimore and Szepesvari (2020) (Lemma 15.1),

$\mathrm{KL}(\mathbb P_{\nu}, \mathbb P_{\nu'}) = \mathbb E[T_i(n)] \cdot \mathrm{KL}(\nu_i, \nu_i'),$

which implies that

$\displaystyle \mathbb E[T_i(n)] > \frac{ \mathrm{KL}(\mathbb P_{\nu}, \mathbb P_{\nu'}) }{ K_i + \epsilon }.$

Denote $R_n = \mathcal R_n^{\rho}(\pi, \nu)$ and $R_n' = \mathcal R_n^{\rho}(\pi, \nu')$ . Then similar to the writeup,

$\displaystyle R_n + R_n' \geq \frac n4 \cdot \min \{ \Delta_i^\rho, \rho(\nu_i') - \rho(\nu_1)\} \cdot \exp(- \mathbb E[T_i(n)] \cdot (K_i +\epsilon)).$

Performing algebra,

$\displaystyle \frac{ \mathbb E[T_i(n)] }{ \log(n) } \geq \frac 1{K_i +\epsilon} \cdot \left(1 - \frac{ \log\left( \frac{ 4 \cdot (R_n + R_n') }{ \min \{ \Delta_i^\rho, \rho(\nu_i') - \rho(\nu_1)\} } \right) }{ \log(n) } \right).$

Since $\pi$ is consistent, $R_n/{\log(n)} \to 0$ and $R_n'/{\log(n)} \to 0$ as $n \to \infty$ , which yields $1 - ( \cdots ) \to 1$ , so that

$\displaystyle \liminf_{n \to \infty} \frac{ \mathcal R_n^{\rho}(\pi, \nu) }{ \log(n) } = \sum_{i : \Delta_i^{\rho}} \frac {\mathbb E[T_i(n)]}{\log(n)} \cdot \Delta_i^{\rho} \geq \sum_{i : \Delta_i^{\rho}} \frac 1{K_i +\epsilon} \cdot \Delta_i^{\rho}.$

Taking $\epsilon \to 0^+$ yields the desired result.

Example 3. For any policy $\pi$ that is $\rho_{\mathbb E}$ -consistent over $\mathcal E_{\mathcal N}^K(1)$ , given that $\nu_1 = \mathcal N(\mu_1, 1) \in \mathcal N(1)$ is uniquely optimal, Example 3 yields

$\displaystyle \liminf_{n \to \infty} \frac{\mathcal R_n^{\rho_{\mathbb E}}(\pi, \nu)}{\log(n)} \geq \sum_{i : \Delta_i^{\rho} > 0} \frac{1}{ \mathcal K_{\inf}^{\rho_{\mathbb E}} ( \nu_i, \mu_1 , \mathcal N(1) ) } \cdot \Delta_i^{\rho} = \sum_{i : \Delta_i^{\rho} > 0} \frac{2}{ \Delta_i }.$

Definition 4. Using the notation in Theorem 1, we say that $\pi$ is asymptotically optimal over $\mathcal E$ if for any $\nu \in \mathcal E$ ,

$\displaystyle \limsup_{n \to \infty} \frac{\mathcal R_n^{\rho}(\pi, \nu)}{\log(n)} \leq \sum_{i : \Delta_i^{\rho} > 0} \frac{1}{ \mathcal K_{\inf}^{\rho} ( \nu_i, \rho(\nu_1) , \mathcal M_i ) } \cdot \Delta_i^{\rho}.$

Corollary 1. Thompson sampling is asymptotically optimal over $\mathcal M_{\mathcal N}^K(1)$ .

Proof. By previous discussions,

$\displaystyle \limsup_{n \to \infty} \frac{ \mathcal R_n(\mathrm{TS}, \nu) }{ \log(n) } \leq \sum_{i=1}^K \frac 2{\Delta_i},$

so that Thompson sampling is $\rho_{\mathbb E}$ -consistent over $\mathcal M_{\mathcal N}^K(1)$ . By Example 3,

$\displaystyle \liminf_{n \to \infty} \frac{ \mathcal R_n(\mathrm{TS}, \nu) }{ \log(n) } \geq \sum_{i=1}^K \frac 2{\Delta_i}.$

By Definition 4, Thompson sampling is asymptotically optimal over $\mathcal M_{\mathcal N}^K(1)$ :

$\displaystyle \lim_{n \to \infty} \frac{ \mathcal R_n(\mathrm{TS}, \nu) }{ \log(n) } = \sum_{i=1}^K \frac 2{\Delta_i}.$

In that sense, Thompson sampling is the ultimate winner! And of course, the modified UCB that achieves the same regret bound is also the ultimate winner over $\mathcal M_{\mathcal N}^K(1)$ . Yes, Definition 4 admits multiple ultimate winners.

And the discussion is far from over. Denoting $\mathrm{Ber} \equiv \{\mathrm{Ber}(p) : p \in [0, 1]\}$ , my joint work with my research supervisor shows that $\rho$ -TS is the ultimate winner over the bandit environment $\mathrm{Ber}^K$ for many types of risk functionals, including but not limited to the following famous ones:
- Expectation: $\rho_{\mathbb{E}}(\nu) = \mathbb{E}[X], X \sim \nu$ ,
- Variance: $\rho_{\mathbb{V}}(\nu) = \mathbb{V}(X), X \sim \nu$ ,
- Conditional value-at-risk: $\rho_{\mathrm{CVaR}_\alpha}(\nu) = \mathrm{CVaR}_\alpha(X), X \sim \nu$ ,
- Mean-variance: $\rho_{\mathrm{MV}_\alpha} := \alpha \rho_{\mathbb E} - \rho_{\mathbb V}$ ,
- Entropic risk: $\rho_{\mathrm{ER}_\theta}(\nu) := -\frac 1{\theta} \log(\mathbb E[\exp(-\theta X)])$ , $X \sim \nu$ .
Denoting the collection of probability distributions with support bounded in $[0,1]$ by $\mathcal M_{[0,1]}$ , my ambition was to show that $\rho$ -TS is the ultimate winner over the bandit environment $\mathcal M_{[0,1]}^K$ . Sadly, it was too ambitious, and is left as a highly non-trivial (possibly impossible?) exercise for the aspiring undergraduate. Perhaps it might work for the various possible $\mathcal M^K$ ?
- $\mathcal M := \mathcal C_{[0,1]}$ , i.e. the collection of $\mathcal M_{[0,1]}$ with continuous p.d.f.s.,
- $\mathcal M := \mathcal N(1)$ , i.e. the collection of $1$ -Gaussian distributions,
- $\mathcal M := \mathrm{SG}(1)$ , i.e. the collection of $1$ -sub-Gaussian distributions?
I honestly don’t know, and don’t have enough funding in time or money to find out. But I have to say that the nostalgia of revisiting this topic four years later brought me a much-needed closure from those research years.

For now, we are done.

Happy Valentine’s Day!

Remark 2. As of 9 June 2026, the task has been answered in the affirmative, namely for $\mathcal C_{[0,1]}$ , $\mathcal N(1)$ , and the subset of $\mathrm{SG}(1)$ with bounded and continuous p.d.f.s. Check out the pre-print here.

—Joel Kindiak, 14 Feb 26, 1602H
May 15, 2026
The Thompson Sampling Template
Let’s finally discuss Thompson sampling.

Previously, we looked at the UCB algorithm that incorporates the dynamic reward history into its exploration-exploitation tradeoff, yielding an upper-bound on its regret of the form

$\displaystyle A \sum_{i=1}^K \Delta_i + f(n) \sum_{i : \Delta_i > 0} \frac{1}{\Delta_i}.$

After an initial pull that yields the reward history $(A_1,X_1,\dots, A_K, X_K)$ , subsequent arm pulls are computed using said information in a deterministic manner relative to the reward history:

$\displaystyle A_{t+1} = \underset{i=1,\dots, K}{\arg \max} \left( \hat{\mu}_{i,T_i(t)} + \sqrt{ \frac{2 \sigma^2 \log(1/\delta) }{ T_i(t) } } \right),$

where the expression being arg-maxxed is motivated by, given $s$ , a tail concentration bound satisfied by an arm $i$ with $\sigma_i$ -sub-Gaussian distribution:

$\displaystyle \mathbb P\left( \mu_i > \hat{\mu}_{i,s} + \sqrt{ \frac{2 \sigma^2 \log(1/\delta) }{ s } }\right) \leq \delta$

Thompson sampling aims to randomise this selection. To simplify our discussion, we will be given knowledge that each $\nu_i$ has a distribution of $\mathcal N(\mu_i, 1)$ , and our goal is to learn $\mu_i$ .

When we initialise the algorithm, we are completely ignorant of the distribution of any $\nu_i$ . So we get to choose our initial guess. Let $\mathbb P_{i, t}$ denote our guess of the distribution of $\nu_i$ , so $\mathbb P_{i, 0}$ denotes our initial guess. Intuitively, after $t$ pulls of arm $i$ , we want to update our guess $\mathbb P_{i, t+1}$ given the previous guess $\mathbb P_{i, t}$ and the reward $X_{ t+1} \sim \nu_i$ obtained from said arm, i.e. we set

$\mathbb P_{i, t+1}(\cdot \mid X_1,\dots, X_t, X_{t+1}) := \mathrm{Update}(\mathbb P_{i, t}(\cdot \mid X_1,\dots, X_t), X_{t+1})$

where $\mathrm{Update}(\cdot , \cdot)$ is carefully curated for our purposes.

The Thompson sampling algorithm then proceeds as follows. As an initial exploration phase, for each $t = 1,\dots, K$ , pull the arm $A_t = t$ , then update the arm distributions:

$\mathbb P_{A_t, 1}(\cdot \mid X_{A_t}) := \mathrm{Update}(\mathbb P_{A_t, 0}, X_{A_t}).$

For each $t \geq K+1$ , select

$A_{t+1} = \underset{i=1,\dots, K}{\arg \max}\ \theta_{i, t+1},$

where $\theta_{i, t+1} \sim \mathbb P_{i, T_i(t)}$ independently, and finally, collect the reward $X_{A_{t+1}} \sim \nu_{A_{t+1}}$ . Then update the arms just like before:

$\mathbb P_{A_{t+1}, T_i(t+1)} = \mathbb P_{A_{t+1}, T_i(t)+1} := \mathrm{Update}(\mathbb P_{A_t, T_i(t)}, X_{A_{t+1}}),$

where the various probability measures are still conditioned on the reward history, but omitted for brevity.

Our current discussion has brought us far too high into the stratosphere, and we need to return to earth. It stands to reason then that, in the absence of additional information, a meaningful “initial guess” would be

$\mathbb P_{i, 0} = \mathcal N(0, 1) \equiv \mathcal N(\theta_{i,0}, \sigma_{i,0}^2),\quad \theta_{i,0} := 0,\quad \sigma_{i,0}^2 := 1.$

After pulling arm $i$ once and receiving a reward $X_{i,1} \sim \nu_i$ , our next guess should, intuitively, take the form of $\mathbb P_{i,1} = \mathcal N(\theta_{i,1}, \sigma_{i,1}^2)$ , where $(\theta_{i,1}, \sigma_{i,1}^2)$ should be intimately related to $( \theta_{i,0}, \sigma_{i,0}^2, X_{i,1} )$ . We can accomplish that goal using conjugate priors, justified by a measure-theoretic application of Bayes’ theorem.

Lemma 1. Suppose for any $s \geq 0$ ,

$\mathbb P_{i,s+1}( \cdot \mid X_{i,1} \dots, X_{i,s}) = \mathcal N(\theta_{i,s+1}, \sigma_{i,s+1}^2).$

Using conjugate priors,

$\displaystyle \frac{ \theta_{i,s+1} }{ \sigma_{i,s+1}^2 } = \frac{ \theta_{i,s} }{ \sigma_{i,s}^2 } + X_{i,s+1},\quad \frac 1{\sigma_{i,s+1}^2} = \frac 1{\sigma_{i,s}^2}+1.$

In particular, if $\mathbb P_{i, 0} = \mathcal N(0, 1)$ , then

$\displaystyle \theta_{i,s+1} = \frac{X_{i,1} + \cdots + X_{i,s+1}}{s+1} =: \hat{\mu}_{i,s+1},\quad \frac 1{\sigma_{i,s+1}^2} = s+1.$

Proof. See Section 34.3 and Example 34.7 in Lattimore and Szepesvari (2020).

Therefore, we would use the $\mathrm{Update}(\cdot,\cdot)$ procedure as follows:

$\mathrm{Update}(\mathcal N(\hat{\mu}_{i,t}, 1/t), X_{i, t+1}) = \mathcal N(\hat{\mu}_{i,t+1}, 1/(t+1)),$

with all relevant quantities defined in Lemma 1.

Remark 1. Observe that $\rho_{\mathbb E}(\mathcal N(\mu, \sigma^2)) = \mu$ . More generally, since $\mathcal N(\mu, \sigma^2)$ is characterised by its mean and variance, for any risk functional $\rho$ , there exists a unique function $\tilde \rho : \mathbb R \times (0,\infty) \to \mathbb R$ such that $\rho(\mathcal N(\mu, \sigma^2)) = \tilde \rho(\mu, \sigma^2)$ .

Remark 2. If all we know is that $\nu_i = \mathcal N(\mu_i, \sigma_i^2)$ , then we would require a stronger update procedure as outlined by Zhu and Tan (2020) (Figure 2). The upside of this approach is that for any $t$ , we can sample

$\theta_{i, t + 1} \sim \mathcal N( \hat{\mu}_{i, T_i(t)}, 1/T_i(t)),\quad \tau_{i,t+1} \sim \Gamma(\alpha_{i,T_i(t)}, \beta_{i,T_i(t)} )$

to form the randomised distribution $\tilde{\nu}_{i,t+1} = \mathcal N(\theta_{i, t + 1}, 1/\tau_{i, t+1})$ . Here, the parameters $(\alpha_{i,t}, \beta_{i,t})$ are updated according to the rules

$\begin{aligned} \alpha_{i,t+1} &= \alpha_{i,t} +1/2, \\ \beta_{i,t+1} &= \beta_{i,t} + \frac{T_i(t)}{T_i(t)+1} \cdot \frac{(X_{i,T_i(t+1)} - \hat{\mu}_i(t) )^2}{2}, \end{aligned}$

similar in spirit to Lemma 1. Denoting

$\mathbb P_{i,t+1} := \mathcal N( \hat{\mu}_{i, T_i(t)}, 1/T_i(t)) \times \Gamma(\alpha_{i,T_i(t)}, \beta_{i,T_i(t)} ),$

we sampled $\kappa_{i,t} = (\theta_{i, t + 1}, \tau_{i,t+1}) \sim \mathbb P_{i,t+1}$ and defined $\tilde{\nu}_{i,t+1} := \Pi(\kappa_{i,t})$ , where $\Pi (x,y) = \mathcal N(x,1/y)$ . Then the arm-selection criterion would be similarly modified to

$A_t := \underset{i=1,\dots, K}{\arg \max}\ \rho(\tilde{\nu}_{i,t+1} ).$

Now we ask the crucial question: given $\nu = (\nu_1,\dots,\nu_K)$ with $\nu_i \sim \mathcal N(\mu_i, \sigma_i^2)$ , how do we upper bound the following cumulative regret?

$\displaystyle \mathcal R_n(\rho\text{-}\mathrm{TS},\nu) = \sum_{i : \Delta_i^{\rho} > 0} \mathbb E[T_i(n)] \Delta_i^{\rho}$

My attempt at solving this problem for the risk functional $\rho$ being the conditional value-at-risk (CVaR) was the substance of my undergraduate research, inspired by Zhu and Tan (2020) and Baudry et al (2021). In this post, we will ultimately only solve the simple case $\sigma_i = 1$ and $\rho = \rho_{\mathbb E}$ , while we chart the way for future progress on the generalised problem.

Remark 3. There are many other distributions that $\nu_i$ could take. In the case $\nu_i$ are all Bernoulli bandits, my final year project solves the problem in the affirmative for general risk measures $\rho$ , adapted from seminal works by Riou and Honda (2010) and Baudry et al (2021). It even solves the problem if $\nu_i$ is simply known to have finite support, albeit only guaranteed for a stricter subset of (nonetheless useful) risk functionals. Unfortunately, it has not solved the general case of $\nu_i$ with support $[0, 1]$ , or $\nu_i$ being Gaussian, or even $\nu_i$ being sub-Gaussian. I was tempted to restart this project, but shall relegate it as an exercise. Doing these posts on bandits brought me a much-needed closure that I never knew I needed.

Lemma 2. Assume that arm $1$ is uniquely optimal. Fix $\epsilon > 0$ . Define

$E_{i,t} := \{ \rho( \tilde{\nu}_{i,t} ) \leq \rho(\nu_1) - \epsilon \}$

and $G_{i,s} := \mathbb P( E_{i,s+1}^c \mid X_{i,1}, \dots, X_{i,s})$ , where:
- $X_{i,t} \sim \nu_i$ independently,
- $\tilde{\nu}_{i,s+1} \sim \mathbb P_{i, s}(\cdot \mid X_{i,1}, \dots, X_{i,s})$ ,
- $\mathbb P_{i, t+1}(\cdot \mid X_{i,1},\dots, X_{i,t}, X_{i,t+1}) = \mathrm{Update}(\mathbb P_{i,t}(\cdot \mid X_{i,1},\dots, X_{i,t}), X_{i,t+1})$ ,
- $\rho$ is a risk functional (i.e. $\rho_{\mathbb E}(\nu_i) = \mu_i$ ).
Then

$\displaystyle \mathbb E[T_i(n)] \leq \mathbb E\left[ \sum_{s=0}^{n-1} \left( \frac 1{G_{1,s}} - 1 \right) \right] + \mathbb E\left[ \sum_{s=0}^{n-1} \mathbb I\{ G_{i,s} > 1/n \} \right] + 1.$

Hence, we remove the randomness of $T_i(t)$ in our subsequent analyses.

Proof. The result and proof are all a mild generalisation of Theorem 36.2 in Lattimore and Szepesvari (2020). By construction,

$\begin{aligned} \mathbb P(E_{1,t}^c \mid \mathcal F_{t-1}) &= \mathbb P( E_{1,t}^c \mid X_{1,1},\dots, X_{1,T_1(t-1)}) = G_{1, T_1(t-1)} \end{aligned}$

almost surely. Therefore, we now decompose $\mathbb E[T_i(n)]$ :

$\begin{aligned} \mathbb E[T_i(n)] &= \mathbb E\left[ \sum_{t=1}^n \mathbb I\{ A_t = i \} \right] \\ &= \underbrace{ \mathbb E\left[ \sum_{t=1}^n \mathbb I\{ A_t = i, E_{i,t} \} \right] }_{M_1} + \underbrace{ \mathbb E\left[ \sum_{t=1}^n \mathbb I\{ A_t = i, E_{i,t}^c \} \right] }_{M_2}. \end{aligned}$

To upper-bound $M_2$ , we refer to the solutions to Problem 36.5 of Lattimore and Szepesvari (2020). Define

$\mathcal T := \{ t \in \{1,\dots, n\} : G_{i, T_i(t-1)} > 1/n\}$

Using a union bound,

$\begin{aligned} \mathbb E\left[ \sum_{t=1}^n \mathbb I \{ A_t = i, E_{i,t}^c \} \right] &= \mathbb E\left[ \sum_{t \in \mathcal T} \mathbb I \{ A_t = i, E_{i,t}^c \} \right] + \mathbb E\left[ \sum_{t \notin \mathcal T} \mathbb I \{ A_t = i, E_{i,t}^c \} \right] \\ &\leq \underbrace{\mathbb E\left[ \sum_{t \in \mathcal T} \mathbb I \{ A_t = i \} \right] }_{M_{2,1}} + \underbrace{ \mathbb E\left[ \sum_{t \notin \mathcal T} \mathbb I \{ E_{i,t}^c \} \right] }_{M_{2,2}}.\end{aligned}$

Therefore, it suffices to upper-bound

$\mathbb E[T_i(n)] \leq M_1 + M_{2,1} + M_{2,2}.$

To upper-bound $M_1$ , first denote the chosen non-optimal arm by $B_t$ :

$B_{t+1} = \underset{i\neq 1}{\arg \max}\ \rho(\tilde{\nu}_{i,t}).$

Then

$\begin{aligned} \mathbb P(A_t = 1, E_{i,t} \mid \mathcal F_{t-1}) &\geq \mathbb P(B_t = i, E_{i,t} , E_{1,t}^c \mid \mathcal F_{t-1}) \\ &= \mathbb P(E_{1,t}^c \mid \mathcal F_{t-1}) \cdot \mathbb P(B_t = i, E_{i,t} \mid \mathcal F_{t-1}) \\ &= \mathbb P(E_{1,t}^c \mid \mathcal F_{t-1}) \cdot \mathbb P(B_t = i, E_{i,t} \mid \mathcal F_{t-1}), \end{aligned}$

since $E_{1,t}^c$ is conditionally independent of $B_t = i, E_{i,t}$ given $\mathcal F_{t-1}$ . To lower-bound the second term further,

$\begin{aligned} \{A_t = i \} \cap E_{i,t} \subseteq (\{ B_t = i \} \cap E_{i, t}) \cap E_{1,t} \end{aligned}$

and conditional independence given $\mathcal F_{t-1}$ implies that

$\begin{aligned} \mathbb P(A_t = i, E_{i,t} \mid \mathcal F_{t-1}) &\leq \mathbb P(B_t = i, E_{i,t} \mid \mathcal F_{t-1}) \cdot \mathbb P(E_{1,t} \mid \mathcal F_{t-1}) \\ &\leq \frac{\mathbb P(E_{1,t} \mid \mathcal F_{t-1})}{\mathbb P(E_{1,t}^c \mid \mathcal F_{t-1})} \cdot \mathbb P(A_t = 1, E_{i,t} \mid \mathcal F_{t-1}) \\ &\leq \frac{\mathbb P(E_{1,t} \mid \mathcal F_{t-1})}{\mathbb P(E_{1,t}^c \mid \mathcal F_{t-1})} \cdot \mathbb P(A_t = 1 \mid \mathcal F_{t-1}) \\ &= \left( \frac{1}{\mathbb P(E_{1,t}^c \mid \mathcal F_{t-1})} -1 \right) \cdot \mathbb P(A_t = 1 \mid \mathcal F_{t-1}) \\ &= \left( \frac{1}{G_{1,T_1(t-1)}} -1 \right) \cdot \mathbb P(A_t = 1 \mid \mathcal F_{t-1}). \end{aligned}$

Therefore, we can upper-bound $M_1$ by

$\begin{aligned} M_1 &= \mathbb E\left[ \sum_{t=1}^n \mathbb I\{ A_t = i, E_{i,t}\} \right] \\ &\leq \mathbb E \left[ \sum_{t=1}^n \mathbb P ( A_t = i, E_{i,t} \mid \mathcal F_{t-1})\right] \\ &\leq \mathbb E \left[ \sum_{t=1}^n \left( \frac{1}{G_{1,T_1(t-1)}} -1 \right) \cdot \mathbb P(A_t = 1 \mid \mathcal F_{t-1})\right] \\ &=\mathbb E \left[ \sum_{t=1}^n \left( \frac{1}{G_{1,T_1(t-1)}} -1 \right) \cdot \mathbb I \{ A_t = 1 \} \right] \\ &\leq \mathbb E \left[ \sum_{s=0}^{n-1} \left( \frac{1}{G_{1,s}} -1 \right) \right]. \end{aligned}$

To upper-bound $M_{2,1}$ , recall that

$\mathcal T := \{ t \in \{1,\dots, n\} : G_{i,T_i(t-1)} > 1/n\}$

Therefore,

$\begin{aligned} \sum_{t \in \mathcal T} \mathbb I\{ A_t = i \} &\leq \sum_{t=1}^n \sum_{s=1}^n \mathbb I\{ T_i(t) = s, T_i(t-1) = s-1 , G_{i,T_i(t-1)} > 1/n \} \\ &\leq \sum_{s=1}^n \sum_{t=1}^n \mathbb I\{ T_i(t) = s, T_i(t-1) = s-1 , G_{i,s-1} > 1/n \} \\ &\leq \sum_{s=1}^n \mathbb I\{ G_{i,s-1} > 1/n \} \underbrace{ \sum_{t=1}^n \mathbb I\{ T_i(t) = s, T_i(t-1) = s-1 \} }_{\leq 1} \\ &\leq \sum_{s=1}^n \mathbb I\{ G_{i,s-1} > 1/n \} \\ &= \sum_{s=0}^{n-1} \mathbb I\{ G_{i,s} > 1/n \}. \end{aligned}$

Taking expectations,

$\begin{aligned} M_{2,1} = \mathbb E\left[ \sum_{t \in \mathcal T} \mathbb I\{ A_t = i \} \right] &\leq \mathbb E\left[ \sum_{s=0}^{n-1} \mathbb I\{ G_{i,s} > 1/n \} \right]. \end{aligned}$

To upper-bound $M_{2,2}$ , $t \notin \mathcal T$ if and only if $G_{i, T_i(t-1)} \leq 1/n$ :

$\begin{aligned} M_{2,2} &= \mathbb E\left[ \sum_{t \notin \mathcal T} \mathbb I\{E_{i,t}^c\} \right] \\ &= \sum_{t=1}^n \mathbb E[ \mathbb I\{E_{i,t}^c, G_{i, T_i(t-1)} \leq 1/n\} ] \\ &= \sum_{t=1}^n \mathbb E[ \mathbb E[\mathbb I\{E_{i,t}^c, G_{i, T_i(t-1)} \leq 1/n\}\mid \mathcal F_{t-1}] ] \\ &= \sum_{t=1}^n \mathbb E[ \mathbb I\{ G_{i, T_i(t-1)} \leq 1/n \}] \cdot \mathbb P(E_{i,t}^c) ] \\ &= \sum_{t=1}^n \mathbb E[ \mathbb I\{ G_{i, T_i(t-1)} \leq 1/n \}] \cdot G_{i,T_i(t-1)} ] \\ &\leq \sum_{t=1}^n \mathbb E[ \mathbb I\{ G_{i, T_i(t-1)} \leq 1/n \}] \cdot 1/n ] \\ &= \sum_{t=1}^n \mathbb P(G_{i, T_i(t-1)} \leq 1/n ) \cdot 1/n \\ &\leq \sum_{t=1}^n 1 \cdot 1/n = 1. \end{aligned}$

Therefore,

$\begin{aligned} \mathbb E[T_i(n)] &\leq M_1 + M_{2,1} + M_{2,2} \\ &\leq \mathbb E\left[ \sum_{s=0}^{n-1} \left(\frac 1{G_{1,s}} - 1 \right) \right] + \mathbb E\left[ \sum_{s=1}^n \mathbb I\{G_{i,s-1} > 1/n\} \right] + 1. \end{aligned}$

Lemma 2 gives us the general recipe to evaluate the regret bound for $\rho$ -TS, but we don’t get any concrete result until we consider specific bandit instances. For simplicity, let’s suppose $\nu_i = \mathcal N(\mu_i, 1)$ for each $i$ . We initialise $\mathbb P_{i,0} := \mathcal N(0, 1)$ and use Lemma 1 to update our belief distribution $\mathbb P_{i,t+1} = \mathcal N(\hat{\mu}_{i,t+1}, 1/(t+1))$ of arm $i$ at time $t$ after observing the data $X_{i,1},\dots, X_{i,T_i(t)} \sim \nu_i$ independently.

How do we upper-bound the cumulative regret $\mathcal R_n (\rho_{\mathbb E}\text{-}\mathrm{TS}, \nu)$ ?

Fix $\epsilon > 0$ and consider the regret bound recipe in Lemma 2:

$\displaystyle \mathbb E[T_i(n)] \leq \underbrace{ \mathbb E\left[ \sum_{s=0}^{n-1} \left( \frac 1{G_{1,s}} - 1 \right) \right] }_{C_1(n,\epsilon)} + \underbrace{ \mathbb E\left[ \sum_{s=0}^{n-1} \mathbb I\{ G_{i,s} > 1/n \} \right] }_{C_2(n,\epsilon)} + 1,$

where $G_{i,s} = \mathbb P(E_{i,s+1}^c \mid X_{i,1}, \dots, X_{i,s})$ and

$E_{i,t} = \{\rho_{\mathbb E}(\tilde{\nu}_{i,t} \leq \rho_{\mathbb E}(\nu_1) - \epsilon\} = \{ \theta_{i,t} \leq \mu_1 - \epsilon\},$

where $\theta_{i,s+1} \sim \mathbb P_{i,s}(\cdot \mid X_{i,1},\dots, X_{i,s}) = \mathcal N(\hat{\mu}_{i,s} , 1/s)$ independently.

We aim to upper-bound $C_1(n,\epsilon)$ and $C_2(n,\epsilon)$ then eventually take $\epsilon \to 0^+$ .

Lemma 3. $C_1(n,\epsilon)/{\log(n)} \to 0$ as $n \to \infty$ .

Proof. To upper-bound $C_1(n,\epsilon)$ , we first use linearity:

$\displaystyle C_1(n,\epsilon) = \sum_{s=0}^{n-1} \mathbb E\left[ \frac{1}{ G_{1,s} } - 1 \right].$

The key idea is to note that the value of

$G_{1,s} = \mathbb P(\theta_{1,s+1} > \mu_1 - \epsilon \mid X_{i,1},\dots, X_{i,s})$

is deterministic when conditioned on $(X_{i,1},\dots, X_{i,s})$ , since

$\theta_{i,s+1} \sim \mathcal N(\hat{\mu}_{i,s}, 1/s),$

and the distribution of $\hat{\mu}_{i,s}$ , in turn, is determined by the reward history:

$\displaystyle \hat{\mu}_{i,s} = \frac{X_{i,1} + \cdots + X_{i,s}}{s} \sim \mathcal N(\mu_i, 1/s).$

Therefore, we can let $f$ denote the p.d.f of $\mathcal N(0, 1/s)$ for flexibility:

$\displaystyle f(y) = \sqrt{ \frac { s }{ 2\pi } } \exp\left(-\frac{s y^2}{2} \right),$

and $F$ denote its c.d.f.:

$\displaystyle F(y) = \int_{-\infty}^y f(x)\, \mathrm dx.$

Conditioned on $\hat{\mu}_{i,s} := y$ ,

$\theta_{i,s+1} \sim \mathcal N(\hat{\mu}_{i,s}, 1/s) \quad \Rightarrow \quad \theta_{i,s+1} - \hat{\mu}_{i,s} \sim \mathcal N(0, 1/s)$

so that

$\begin{aligned} G_{1,s} &= \mathbb P(\theta_{1,s+1} > \mu_1 - \epsilon \mid \hat{\mu}_{1,s} = y)\\ &= 1 - \mathbb P(\theta_{1,s+1} \leq \mu_1 - \epsilon \mid \hat{\mu}_{1,s} = y)\\ &= 1 - \mathbb P(\theta_{1,s+1} - \hat{\mu}_{1,s} \leq \mu_1 - \hat{\mu}_{1,s} - \epsilon \mid \hat{\mu}_{1,s} = y)\\ &= 1 - \mathbb P(\theta_{1,s+1} - y \leq \mu_1 - y - \epsilon)\\ &= 1 - F(\mu_1 - y - \epsilon) \end{aligned}$

almost surely. Therefore, by taking advantage of the marginal distribution and using the change of variables $y \to \mu_1 -y - \epsilon$ ,

$\begin{aligned}\mathbb E\left[ \frac{1}{ G_{1,s} } - 1 \right] &= \int_{\mathbb R} \left( \frac 1{1-F( \mu_1 - y - \epsilon)} - 1 \right) f(\mu_1-y)\, \mathrm dy \\ &= \int_{\mathbb R} \left( \frac 1{1-F( y )} - 1 \right) f(y + \epsilon )\, \mathrm dy \\ &= \int_{\mathbb R} \frac {F(y)}{1 - F(y)} \cdot f( y + \epsilon)\, \mathrm dy \\ &= \left( \int_{-\infty}^0 + \int_0^\infty \right) \frac {F(y)}{1 - F(y)} \cdot f( y + \epsilon)\, \mathrm dy. \end{aligned}$

Therefore, we need to upper-bound two integrals. For the first integral, $1-F_s(y) \geq 1/2$ so that given $W \sim \mathcal N(0, 1/s)$ and the Gaussian tail bound

$\displaystyle \mathbb P(W \leq -\epsilon) \leq \exp(-s\epsilon^2/2).$

we have

$\begin{aligned} \int_{-\infty}^0 \frac {F(y)}{1 - F(y)} \cdot f( y + \epsilon)\, \mathrm dy &\leq 2 \int_{-\infty}^0 F_{s}(y) \cdot f_{s}( y + \epsilon)\, \mathrm dy \\ &\leq 2 \int_{-\infty}^0 \mathbb P(W \leq y - \epsilon) \cdot f( y + \epsilon)\, \mathrm dy \\ &\leq 2 \int_{-\infty}^0 \mathbb P(W \leq -\epsilon) \cdot f( y + \epsilon)\, \mathrm dy \\ &\leq 2 \int_{-\infty}^0 \exp(-s\epsilon^2/2) \cdot f( y + \epsilon)\, \mathrm dy \\ &\leq 2 \exp(-s\epsilon^2/2) \underbrace{\int_{\mathbb R} f_{s}( y + \epsilon)\, \mathrm dy }_1 \\ &= 2 \exp(-s\epsilon^2/2). \end{aligned}$

For the second integral, $F_s(y) \leq 1$ and using a tail lower bound by Abramowitz and Stegun (1964),

$\displaystyle 1 - F_s(y) \geq \frac{ \exp(-sy^2/2) }{ y\sqrt s + \sqrt{sy^2 + 4} },$

followed by algebra and integration by parts,

$\begin{aligned} &\int_0^\infty \frac {F(y)}{1 - F(y)} \cdot f( y + \epsilon)\, \mathrm dy \\ &\leq \int_0^\infty \exp(sy^2/2) \cdot (y\sqrt s + \sqrt{sy^2 + 4}) \cdot \sqrt{ \frac{s}{2\pi} } \exp(-s(y+\epsilon)^2/2)\, \mathrm dy \\ &\leq \sqrt{ \frac{s}{2\pi} } \int_0^\infty (y\sqrt s +y\sqrt s + 2) \cdot \exp(s(y^2-(y+\epsilon)^2)/2)\, \mathrm dy \\ &\leq 2\exp(-s\epsilon^2/2)\sqrt{ \frac{s}{2\pi} }\int_0^\infty (y\sqrt s + 1) \cdot \exp(-s y \epsilon)\, \mathrm dy \\ &\leq 2\exp(-s\epsilon^2/2)\sqrt{ \frac{s}{2\pi} } \cdot \frac{1+\epsilon \sqrt s}{\epsilon s\sqrt s} \\ &= 2\cdot\frac{1+\epsilon \sqrt{s}}{\epsilon s\sqrt{2\pi}} \cdot \exp(-s\epsilon^2/2). \end{aligned}$

Combining both results,

$\begin{aligned} C_1(n,\epsilon) &= \sum_{s=0}^{n-1} \mathbb E\left[ \frac 1{G_{1,s}} - 1 \right] \\ &\leq 1 + 2 \cdot \sum_{s=1}^{\infty} \left( \exp(-s\epsilon^2/2) + \frac{1+\epsilon \sqrt{s}}{\epsilon s\sqrt{2\pi}} \cdot \exp(-s\epsilon^2/2) \right) \\ &\leq 1 + 2 \cdot \left(1 + \frac{1+\epsilon }{\epsilon \sqrt{2\pi}}\right) \cdot \sum_{s=0}^{\infty} \exp(-s\epsilon^2/2) \\ &\leq 1 + 2 \cdot \left(1 + \frac{1+\epsilon }{\epsilon \sqrt{2\pi}}\right) \cdot \frac{1}{1 - \exp(-\epsilon^2/2)}. \end{aligned}$

In particular, $C_1(n,\epsilon)/{\log(n)} \to 0$ as $n \to \infty$ .

Remark 4. In the general setting, we would also aim to prove that $C_1(n, \epsilon) \leq C_1'(\epsilon)$ , so that $C_1(n, \epsilon)/{\log(n)} \to 0$ as $n \to \infty$ . In the proof of Lemma 3, we have the dependency

$\mathbf X_{i,s-1} := (X_{i,1},\dots, X_{i,s-1}) \to \hat{\mu}_{i,s} \to G_{i,s},$

and the intermediate dependency has a relatively characterisable distribution conditioned on the reward history. More generally, at any time $s$ , we would obtain the distribution

$\mathbf X_{i,s-1}\to \mathbb P_{i,s}(\cdot \mid \mathbf X_{i,s-1}) \to \mathbb P(\rho(\tilde{\nu}_{i,s}) > \rho(\nu_i) - \epsilon \mid \mathbf X_{i,s-1}) =: G_{i,s},$

where $\tilde{\nu}_{i,s} = \Pi(\kappa_{i,s})$ for some deterministic function $\Pi$ , and $\kappa_{i,s} \sim \mathbb P_{i,s-1}(\cdot \mid \mathbf X_{i,s-1})$ . In the vanilla case:
- $\mathbb P_{i,s}(\cdot \mid \mathbf X_{i,s-1}) = \mathcal N(\hat{\mu}_{i,s-1}, 1/(s-1))$ ,
- $\kappa_{i,s} = \theta_{i,s} \sim \mathbb P_{i,s-1}(\cdot \mid \mathbf X_{i,s-1})$ ,
- $\Pi(x) = \mathcal N(x, 1)$ ,
- $\tilde{\nu}_{i,s} = \Pi(\kappa_{i,s}) = \mathcal N(\kappa_{i,s}, 1)$ ,
- $\rho_{\mathbb E}(\tilde{\nu}_{i,s}) = \kappa_{i,s} = \theta_{i,s}$ .
If we adopt the same strategy as in Lemma 3, then we would like $\mathbb P_{i,s}$ to be sufficiently computationally familiar with Radon-Nikodým derivative $f$ with respect to the measure $\mu$ . Taking the integral over the appropriate measure space $(\Omega, \mathcal F, \mu)$ that denotes the possible realisations of $\kappa_{i,s}$ , conditional expectation should, once again, yield

$\begin{aligned} \mathbb E\left[ \frac 1{G_{1,s}} - 1 \right] &= \int_{\Omega} \frac{1 - G_{1,s}(\omega)}{G_{1,s}(\omega)} f(\omega)\, \mathrm d\mu(\omega). \end{aligned}$

Hence, we would require tail lower bounds on $G_{1,s}(\omega)$ of the form

$\displaystyle G_{1,s}(\omega) \geq \frac 1{\psi_1(s,\epsilon) + 1} , \quad G_{1,s}(\omega) \geq \frac{f(\omega)}{\psi_2(s,\epsilon)},$

for non-negative functions $\psi_1,\psi_2$ such that

$\displaystyle \sum_{s=1}^\infty \psi_1(s,\epsilon) =: C_{1,1}(\epsilon) < \infty,\quad \sum_{s=1}^\infty \psi_2(s,\epsilon) =: C_{1,2}(\epsilon) < \infty$

and

$\displaystyle \Omega = \underbrace{ \{\omega : 1 - G_{1,s}(\omega) \leq \psi_1(s,\epsilon)\} }_{\Omega_1} \cup \underbrace{ \left\{\omega : \frac{f(\omega)}{G_{1,s}(\omega)} \leq \psi_2(s,\epsilon)\right\} }_{\Omega_2},$

so that

$\begin{aligned} \mathbb E\left[ \frac 1{G_{1,s}} - 1 \right] &= \left( \int_{\Omega_1} + \int_{\Omega_2}\right) \frac{1 - G_{1,s}(\omega)}{G_{1,s}(\omega)} f(\omega)\, \mathrm d\mu(\omega) \\ &\leq \psi_1(s,\epsilon) + \psi_2(s,\epsilon), \end{aligned}$

and $C_1(n,\epsilon) \leq C_{1,1}(\epsilon) + C_{1,2}(\epsilon)$ yields $C_1(n,\epsilon)/{\log(n)} \to 0$ . It suffices to say that the lion’s share of research in this field would go toward discovering and deriving useful tail lower bounds for

$G_{i,s}(\omega) = \mathbb P( \rho( \tilde{\nu}_{i,s} ) \geq \rho(\nu_i) - \epsilon \mid \mathbf X_{i,s-1}),$

where we are reminded that

$\tilde{\nu}_{i,s} = \Pi(\kappa_{i,s}),\quad \kappa_{i,s} \sim \mathbb P_{i,s-1}( \cdot \mid \mathbf X_{i,s-1}).$

Lemma 4. $C_2(n,\epsilon)/{\log(n)} \to 2/(\Delta_i - \epsilon)^2$ as $n \to \infty$ .

Proof. To upper-bound $C_2(n,\epsilon)$ , we cheat. We first note that

$\begin{aligned} \{G_{i,s} > 1/n\} &= \left\{\mathbb P(\theta_{i,s} > \mu_1 - \epsilon \mid X_{i,1},\dots, X_{i,s}) > 1/n\right\} \\ &= \left\{\mathbb P(\theta_{i,s} - \hat{\mu}_{i,s} > \mu_1 - \hat{\mu}_{i,s} - \epsilon \mid X_{i,1},\dots, X_{i,s}) > 1/n\right\} \\ &\subseteq \left\{\exp(-s(\mu_1 - \hat{\mu}_{i,s} - \epsilon)^2/2) > 1/n\right\} \\ &=\left\{ \hat{\mu}_{i,s} > \mu_1 - \epsilon - \sqrt{\frac{2\log(n)}{s}}\right\} \\ &=\left\{ \hat{\mu}_{i,s} -\mu_i > \Delta_i - \epsilon - \sqrt{\frac{2\log(n)}{s}} \right\}. \end{aligned}$

Define $u \in \mathbb N^+$ by

$\displaystyle u \geq \left\lceil \frac{2\log(n)}{(\Delta_i - \epsilon)^2} \right\rceil \quad \iff \quad \Delta_i - \epsilon - \sqrt{\frac{2\log(n)}{u}} > 0.$

Using the Gaussian tail bound again,

$\begin{aligned} C_2(n,\epsilon) &= \sum_{s=0}^{n-1} \mathbb P(G_{i,s} > 1/n) \\&\leq u + \sum_{s=u+1}^\infty \mathbb P\left( \hat{\mu}_{i,s} -\mu_i > \Delta_i - \epsilon - \sqrt{\frac{2\log(n)}{s}} \right) \\ &\leq 1 + \frac{2\log(n)}{(\Delta_i - \epsilon)^2} + \sum_{s=u+1}^\infty \exp\left( - \frac{s\left(\Delta_i - \epsilon - \sqrt{\frac{2\log(n)}{s}}\right)^2}{2} \right) \\ &\leq 1 + \frac{2\log(n)}{(\Delta_i - \epsilon)^2} + \frac{2}{(\Delta_i - \epsilon)^2}(\sqrt{\pi \log(n)} + 1), \end{aligned}$

where the last line follows from the proof of Theorem 8.1 in Lattimore and Szepesvari (2020). In particular,

$\displaystyle \lim_{n \to \infty} \frac{C_2(n,\epsilon)}{\log(n)} = \frac {2}{(\Delta_i - \epsilon)^2}.$

Remark 5. For the general case, we would like to obtain a concentration bound of the form

$\begin{aligned} \mathbb P(G_{i,s} > 1/n) \leq \psi_3(\epsilon) \cdot \exp(-s \cdot \psi_4(\epsilon)), \end{aligned}$

where $\psi_3$ is a non-negative function and $0 < \psi_4(\epsilon) < K_i$ for some desired constant such that $\psi_4(\epsilon) \to K^-$ as $\epsilon \to 0^+$ . Then for judiciously chosen $u$ ,

$\begin{aligned} C_2(n,\epsilon) &\leq u + \sum_{s=u+1}^{n-1} \mathbb P(G_{i,s} > 1/n) \\ &\leq u + \psi_3(\epsilon) \cdot n\exp(-u \cdot \psi_4(\epsilon)). \end{aligned}$

In particular, setting

$\displaystyle u \geq \frac{ \log (n) }{ \psi_4(\epsilon) } \quad \iff \quad n\exp(-u \cdot \psi_4(\epsilon) ) \leq 1$

yields

$\begin{aligned} C_2(n,\epsilon) &\leq u + \sum_{s=u+1}^{n-1} \mathbb P(G_{i,s} > 1/n) \\ &\leq 1 + \frac{ \log (n )}{\psi_4(\epsilon) } + \psi_3(\epsilon), \end{aligned}$

so that $C_2(n,\epsilon)/{\log(n)} \to 1/\psi_4(\epsilon)$ as $n \to \infty$ .

Theorem 1. Given that $\nu_i = \mathcal N(\mu_i, 1)$ for each $i$ , the regret bound of $\rho_{\mathbb E}$ -TS is given by

$\displaystyle \limsup_{n \to \infty} \frac{\mathcal R_n (\rho_{\mathbb E}\text{-}\mathrm{TS}, \nu) }{\log(n)} \leq \sum_{i : \Delta_i > 0} \frac{2}{\Delta_i}.$

Proof. Combining the displays in Lemmas 2, 3 and 4,

$\begin{aligned} \limsup_{n \to \infty} \frac{\mathbb E[T_i(n)]}{\log(n)} &\leq \lim_{n \to \infty} \frac{C_1(n,\epsilon) + C_2(n,\epsilon) + 1}{\log(n)} \\ &\leq \lim_{n \to \infty} \frac{C_1(n,\epsilon) }{\log(n)} + \lim_{n \to \infty} \frac{C_2(n,\epsilon) }{\log(n)} + \lim_{n \to \infty} \frac{1 }{\log(n)} \\ &= 0 + \frac {2}{(\Delta_i - \epsilon)^2} + 0 = \frac {2}{(\Delta_i - \epsilon)^2} . \end{aligned}$

Plugging into the regret decomposition,

$\displaystyle \limsup_{n \to \infty} \frac{\mathcal R_n (\rho_{\mathbb E}\text{-}\mathrm{TS}, \nu) }{\log(n)} \leq \sum_{i : \Delta_i > 0} \frac{2}{(\Delta_i - \epsilon)^2} \cdot \Delta_i.$

Taking $\epsilon \to 0^+$ ,

$\displaystyle \lim_{n \to \infty} \frac{\mathcal R_n (\rho_{\mathbb E}\text{-}\mathrm{TS}, \nu) }{\log(n)} \leq \sum_{i : \Delta_i > 0} \frac{2}{(\Delta_i - 0)^2} \cdot \Delta_i = \sum_{i : \Delta_i > 0} \frac{2}{\Delta_i}.$

Remark 6. Assuming the presence of sufficiently nice concentration bounds in Remarks 4 and 5, we can recover the generalised regret bound

$\displaystyle \limsup_{n \to \infty} \frac{ \mathcal R_n(\rho\text{-}\mathrm{TS}, \nu) }{ \log(n) } \leq \sum_{i : \Delta_i^{\rho} > 0} \frac{ \Delta_i^{\rho} }{ \psi_{i, 4}(\epsilon)} .$

Taking $\epsilon \to 0^+$ ,

$\displaystyle \limsup_{n \to \infty} \frac{ \mathcal R_n(\rho\text{-}\mathrm{TS}, \nu) }{ \log(n) } \leq \sum_{i : \Delta_i^{\rho} > 0} \frac{ \Delta_i^{\rho} }{K_i} .$

That was…intense. I took at least six hours to type this out. And several more days to include the generalising remarks. And this was supposedly the warm-up for my undergraduate research projects. I have to say…I probably would have progressed much further if I took the time then to write this proof out as I did today.

Before we can finally lay our short detour into multi-armed bandits, we must ask the question: is this Thompson sampling algorithm, in a sense, the best? We will answer this question in the affirmative next time.

—Joel Kindiak, 13 Feb 26, 2117H
May 14, 2026
Automated Confidence
Previously, we looked at the explore-then-commit algorithm, and asked how we can “automate” the optimisation between exploring and exploiting arms.

Lemma 1. Let $\nu$ be $\sigma$ -sub-Gaussian with mean $\mu$ and sample $X_1,\dots, X_n \sim \nu$ independently. Define $\hat{\mu} := (X_1 + \dots + X_n)/n$ . Then for any $\delta > 0$ ,

$\displaystyle \mathbb P\left( \mu \geq \hat{\mu} + \sqrt{ \frac{2\sigma^2\log(1/\delta)}{n} } \right) \leq \delta.$

Proof. Since $\nu$ is $\sigma$ -sub-Gaussian, $( \mu - \hat{\mu} )$ has zero mean and is also $(\sigma/\sqrt{n})$ -sub-Gaussian. Using the tail concentration for sub-Gaussian random variables, for any $\epsilon > 0$ ,

$\displaystyle \mathbb P\left( \mu - \hat{\mu} \geq \epsilon \right) \leq \exp\left( -\frac{\epsilon^2}{2(\sigma/\sqrt n)^2} \right) = \exp \left( -\frac{n\epsilon^2}{2\sigma^2} \right).$

Setting the right-hand side equals $\delta$ and solving for $\epsilon$ ,

$\displaystyle \exp\left( -\frac{n\epsilon^2}{2\sigma^2} \right) = \delta \quad \iff \quad \epsilon = \sqrt{ \frac{2\sigma^2\log(1/\delta)}{n} }.$

Therefore,

$\displaystyle \mathbb P\left( \mu - \hat{\mu} \geq \sqrt{ \frac{2\sigma^2\log(1/\delta)}{n} } \right) \leq \delta.$

We want to use Lemma 1 to design our algorithm. What Lemma 1 communicates to us is that for small $\delta$ and large $n$ ,

$\displaystyle \mathbb P\left( \mu \leq \hat{\mu} + \sqrt{ \frac{2\sigma^2\log(1/\delta)}{n} } \right) \geq 1 - \delta.$

That is, with probability close to $1$ , $\mu$ can be estimated using $\hat{\mu}$ and some upper confidence bound (UCB). We can use this observation to design our UCB algorithm.

Let $\nu = (\nu_1,\dots, \nu_K)$ denote a $K$ -armed bandit with $\nu_1$ having the uniquely largest mean, and suppose for simplicity that $\nu_i$ is known to be $\sigma_i$ -sub-Gaussian. Recall that for any policy $\pi$ , we can upper-bound the cumulative regret by

$\displaystyle \mathcal R_n(\pi, \nu) \leq \sum_{i=1}^n \mathbb E[T_i(n)] \cdot \Delta_i,$

where $n$ denotes the time horizon, and $\Delta_i := \mu_1 - \mu_i$ denotes the sub-optimality gap. Before designing the UCB algorithm, we make several fixed-time notions for a given arm $i$ with distribution $\nu_i$ :
- For any $u_i \geq 1$ , define $\hat{\mu}_{i,u} = \displaystyle \frac 1u \sum_{s=1}^u X_{i,s}$ , where $X_{i,s} \sim \nu_i$ independently.
- For any $u_i \geq 1$ , define $\displaystyle \mathrm{UCB}_{i, u}(\delta) := \hat{\mu}_{i,u} + \sqrt{ \frac{ 2 \sigma_i^2 \log(1/\delta) }{u} }$ .
Hence, given the number of pulls $T_i(t) \geq 1$ of arm $i$ up to and including time $t$ , define

$\hat{\mu}_{i}(t):= \hat{\mu}_{i, T_i(t)},\quad \mathrm{UCB}_{i}(t, \delta) := \mathrm{UCB}_{i, T_i(t)}(\delta).$

We shall design the UCB algorithm as follows. Pull each arm once, so that $T_i(K) = 1$ . For time $t = K,\dots, n-1$ , pull the arm $A_{t+1} = i$ that maximises $\mathrm{UCB}_i (t)$ , that is,

$\displaystyle A_{t+1} := \underset{i=1,\dots, K}{\arg \max}\ \mathrm{UCB}_i(t, \delta).$

Lemma 2. For any event $G$ , the number of pulls of arm $i$ can be upper-bounded by

$\begin{aligned} \mathbb E[T_i(n)] &\leq \mathbb E[\mathbb E[T_i(n)] \mid G]] + n \cdot \mathbb P(G^c). \end{aligned}$

Proof. By the law of total expectation, since $T_i(n) \leq n$ is always true,

$\begin{aligned} \mathbb E[T_i(n)] &\leq \mathbb E[\mathbb E[T_i(n)] \mid G] \cdot \underbrace{ \mathbb P(G) }_{\leq 1} + \mathbb E[ \underbrace{\mathbb E[T_i(n)]}_{\leq n} \mid G^c] \cdot \mathbb P(G^c) \\ &\leq \mathbb E[\mathbb E[T_i(n)] \mid G]] + n \cdot \mathbb P(G^c).\end{aligned}$

For each $i$ , define the ‘good event’ $G_i$ that satisfies the following two conditions:
- For any $t$ , $\mu_1 < \mathrm{UCB}_1(t, \delta)$ .
- We have $\mathrm{UCB}_{i, u_i}(\delta) < \mu_1$ .
The first condition requires $\mathrm{UCB}_1(t)$ to always be ‘optimistic’ relative to $\mu_1$ , and the second condition requires that at time $u_i$ , $\mathrm{UCB}_{i,u_i}(\delta)$ is less optimistic than $\mu_1$ . Expressed in terms of events,

$\displaystyle G_i = \left\{ \mu_1 < \min_{t=1,\dots, n} \mathrm{UCB}_1(t, \delta) \right\} \cap \{ \mathrm{UCB}_{i,u_i}(\delta) < \mu_1 \}.$

occurs. By Lemma 2,

$\begin{aligned} \mathbb E[T_i(n)] &\leq \mathbb E[\mathbb E[T_i(n)] \mid G_i]] + n \cdot \mathbb P(G_i^c).\end{aligned}$

It remains to upper-bound $\mathbb E[T_i(n)]$ under $G_i$ , as well as $\mathbb P(G_i^c)$ .

Lemma 3. Under the good event $G_i$ , $T_i(n) \leq u_i$ .

Proof. Suppose $G_i$ happens, and assume for a contradiction that $T_i(n) > u_i$ . Then there exists some $t$ such that $T_i(t) = u_i$ and $A_{t+1} = i$ . At this point in time,

$\displaystyle \mathrm{UCB}_i(t, \delta) \geq \mathrm{UCB}_1(t, \delta).$

By the definition of $G_i$ ,

$\begin{aligned} \mathrm{UCB}_i(t, \delta) &= \mathrm{UCB}_{i, u_i}(\delta) < \mu_1 < \mathrm{UCB}_1(t,\delta),\end{aligned}$

a contradiction.

Lemma 4. There exists $c \in (0, 1)$ depending on $u_i, \sigma_i^2, \delta$ such that

$\displaystyle \mathbb P(G_i^c) \leq n\delta + \exp\left( -\frac{u_i c^2 \Delta_i^2}{2\sigma_i^2} \right).$

Proof. Taking the complement of $G_i$ ,

$\displaystyle G_i^c = \left\{ \mu_1 \geq \min_{t=1,\dots, n} \mathrm{UCB}_1(t, \delta) \right\} \cup \{ \mathrm{UCB}_{i,u_i}(\delta) \geq \mu_1 \}.$

Taking a union bound,

$\begin{aligned} \mathbb P(G_i^c) &\leq \underbrace{ \mathbb P\left( \mu_1 \geq \min_{t=1,\dots, n} \mathrm{UCB}_1(t, \delta) \right) }_{p_1} + \underbrace{ \mathbb P ( \mathrm{UCB}_{i,u_i}(\delta) \geq \mu_1 ) }_{p_2}. \end{aligned}$

We can upper-bound $p_1$ by containing the first event:

$\begin{aligned} \left\{ \mu_1 \geq \min_{t=1,\dots, n} \mathrm{UCB}_1(t, \delta) \right\} &\subseteq \left\{ \mu_1 \geq \min_{s=1,\dots, n} \mathrm{UCB}_{1,s}(\delta) \right\} \\ &\subseteq \bigcup_{s=1}^n \left\{ \mu_1 \geq \mathrm{UCB}_{1,s}(\delta) \right\} \\ &= \bigcup_{s=1}^n \left\{ \mu_1 \geq \hat{\mu}_{1,s} + \sqrt{\frac{2 \sigma_i^2 \log(1/\delta)}{s}} \right\}. \end{aligned}$

By taking a union bound and applying Lemma 1,

$\begin{aligned} p_1 &= \mathbb P\left( \mu_1 \geq \min_{t=1,\dots, n} \mathrm{UCB}_1(t, \delta) \right) \\ &\leq \sum_{s=1}^n \mathbb P \left( \mu_1 \geq \hat{\mu}_{1,s} + \sqrt{\frac{2 \sigma_i^2 \log(1/\delta)}{s}} \right) \\ &\leq \sum_{s=1}^n \delta = n\delta. \end{aligned}$

To upper-bound $p_2$ , we unravel its definition:

$\begin{aligned} \{ \mathrm{UCB}_{i,u_i}(\delta) \geq \mu_1 \} &= \left\{ \hat{\mu}_{i,u_i} + \sqrt{\frac{ 2 \sigma_i^2 \log(1/\delta) }{u_i}} \geq \mu_1 \right\} \\ &= \left\{ \hat{\mu}_{i,u_i} -\mu_i + \sqrt{\frac{ 2 \sigma_i^2 \log(1/\delta) }{u_i}} \geq \mu_1 - \mu_i \right\} \\ &= \left\{ \hat{\mu}_{i,u_i} -\mu_i \geq \Delta_i - \sqrt{\frac{ 2 \sigma_i^2 \log(1/\delta) }{u_i}} \right\} \\ &\subseteq \{ \hat{\mu}_{i,u_i} -\mu_i \geq c\Delta_i \}, \end{aligned}$

where the final inclusion is obtained by a choice of $c \in (0, 1)$ such that

$\displaystyle \Delta_i - \sqrt{\frac{ 2 \sigma_i^2 \log(1/\delta) }{u_i}} \geq c \Delta_i.$

Using the same tail concentration in Lemma 1,

$\begin{aligned} \mathbb P( \mathrm{UCB}_{i,u_i}(\delta) \geq \mu_1 ) &\leq \mathbb P( \hat{\mu}_{i,u_i} -\mu_i \geq c\Delta_i ) \leq \exp\left( -\frac{u_i c^2 \Delta_i^2}{2\sigma_i^2} \right). \end{aligned}$

Therefore,

$\displaystyle \mathbb P(C_i^c) \leq p_1 + p_2 \leq n\delta + \exp\left( -\frac{u_i c^2 \Delta_i^2}{2\sigma_i^2} \right).$

Theorem 1. Given the horizon $n$ , define the error threshold $\delta := 1/n^2$ . Then the cumulative regret can be upper-bounded by

$\displaystyle \mathcal R_n(\mathrm{UCB}, \nu) \leq \left( 2 + \frac 1n\right) \sum_{i=1}^K\Delta_i + 16 \log(n) \sum_{i : \Delta_i > 0} \frac{ \sigma_i^2 }{ \Delta_i }.$

If furthermore that $\sigma_i = 1$ for all $i$ , then we recover the following more visually-friendly regret bound:

$\displaystyle \mathcal R_n(\mathrm{UCB}, \nu) \leq 3\sum_{i=1}^K\Delta_i + 16 \log(n) \sum_{i=1}^K \frac{ 1 }{ \Delta_i }.$

Proof. Combining Lemmas 3 and 4, there exists $c \in (0, 1)$ such that

$\displaystyle \mathbb E[T_i(n)] \leq u_i + n \left[ n\delta + \exp\left( -\frac{u_i c^2 \Delta_i^2}{2\sigma_i^2} \right)\right].$

As per Lemma 4, we required

$\displaystyle \Delta_i - \sqrt{ \frac{ 2 \sigma_i^2 \log(1/\delta) }{u_i} } \geq c \Delta_i.$

For the arms $i$ such that $\Delta_i > 0$ , solving for $u_i$ ,

$\displaystyle u_i \geq \frac{ 2 \sigma_i^2 \log(1/\delta) }{ (1-c)^2 \Delta_i^2 } \quad \Rightarrow \quad \frac{u_ic^2\Delta_i^2}{2 \sigma_i^2} \geq \frac{ c^2 \log(1/\delta) }{ (1-c)^2 }.$

This lower bound allows us to upper-bound the exponential term as

$\displaystyle \exp\left( -\frac{u_i c^2 \Delta_i^2}{2\sigma_i^2} \right) \leq \delta^{c^2/(1-c)^2}.$

Hence, we set $u_i$ to be the smallest integer such that

$\displaystyle u_i = \left\lceil \frac{ 2 \sigma_i^2 \log(1/\delta) }{ (1-c)^2 \Delta_i^2 } \right\rceil \leq \frac{ 2 \sigma_i^2 \log(1/\delta) }{ (1-c)^2 \Delta_i^2 } + 1 .$

Therefore,

$\displaystyle \mathbb E[T_i(n)] \leq \frac{ 2 \sigma_i^2 \log(1/\delta) }{ (1-c)^2 \Delta_i^2 } + 1 + n^2 \delta + n\delta^{c^2/(1-c)^2}.$

For convenience, we choose $c = 1/2$ so that $c/(1-c) = 1$ :

$\displaystyle \mathbb E[T_i(n)] \leq \frac{ 8 \sigma_i^2 \log(1/\delta) }{ \Delta_i^2 } + 1 + n^2 \delta + n\delta.$

Setting $\delta = 1/n^2$ , $\log(1/\delta) = 2 \log(n)$ and

$\displaystyle 1 + n^2 \delta + n \delta = 2 + \frac 1n.$

Therefore,

$\displaystyle \mathcal R_n(\mathrm{UCB}, \nu) \leq \left( 2 + \frac 1n\right) \sum_{i=1}^K\Delta_i + 16 \log(n) \sum_{i : \Delta_i > 0} \frac{ \sigma_i^2 }{ \Delta_i }.$

My undergraduate research professor and his collaborators proved a generalised version for $\Delta_i^{\rho}$ , where $\rho$ denotes sufficiently well-behaved risk functionals.

Theorem 2 (Tan et al, 2022). By modifying the UCB algorithm to account for general risk measures $\rho$ , called $\rho$ -LCB,

$\displaystyle \mathcal R_n(\rho\text{-}\mathrm{LCB}, \nu) \leq 5 K \sum_{i = 1}^K \Delta_i^{\rho} + 4L^2 \sigma^2 (32 \sqrt{e \log(n)} + 256 \sqrt 2)^2 \sum_{i : \Delta_i^{\rho} > 0} \frac{ 1}{ \Delta_i^{\rho} },$

where $K, L, \sigma$ are pre-defined universal constants in the aforementioned paper. Here, $\Delta_i^{\rho} = \rho(\nu_1) - \rho(\nu_i)$ , and we assume $\rho(\nu_1)$ to be maximum.

Proof. See Theorem 51 in the paper here. The technical portion of this proof is deriving a concentration bound analogous to that of Lemma 1, and the rest of the proof follows the same basic strategy in that of Theorem 1.

Though more convoluted due to the generality of $\rho$ , the regret bound in Theorem 2 matches in spirit to Theorem 1, taking the form

$\displaystyle \mathcal R_n(\pi, \nu) \leq A \sum_{i : \Delta_i^{\rho} > 0} \Delta_i^{\rho} + f(n) \sum_{i : \Delta_i^{\rho} > 0} \frac{ 1}{ \Delta_i^{\rho} }$

for the computable constant $A$ and function $f(n)$ .

The next algorithm is a modification of the UCB algorithm and is the one closest to my heart—Thompson sampling. In the UCB algorithm, we compute our “optimism” of arm $i$ using its upper confidence bound $\mathrm{UCB}_i(t+1,\delta)$ . Conditioned on the number of pulls $T_i(t)$ and the reward history $(X_{i,1},\dots, X_{i,T_i(t)})$ , $\mathrm{UCB}_i(t+1,\delta)$ is a deterministic value.

The idea for Thompson sampling is to randomise this process in a mathematically meaningful manner: Conditioned on the number of pulls $T_i(t)$ and the reward history $(X_{i,1},\dots, X_{i,T_i(t)})$ , we sample a random distribution $\tilde{\nu}_i$ , and pull the arm that maximises $\rho(\tilde{\nu}_i)$ . In the vanilla case $\rho = \rho_{\mathbb E}$ , we have $\rho(\tilde{\nu}_i) = \mathbb E[Y]$ for $Y \sim \tilde{\nu}_i$ .

My work aimed to solve this modified $K$ -armed bandit problem by using Thompson sampling. Spanning sufficiently many commonly used risk functionals $\rho$ , the algorithm does work for Bernoulli distributions. For a non-singular subset of these functionals, it even works for any distribution with finite support. However, just like how the heart of Theorem 2 is in the vanilla case of Theorem 1, the heart of the risk-averse Thompson sampling approach lies in the vanilla expectation approach.

We will explore this approach the next time, and also eventually address what we mean for an algorithm to be optimal.

—Joel Kindiak, 11 Feb 26, 1735H
May 13, 2026
A Zeroth-Order Solution
What would be a zero-th order solution to the $K$ -armed bandit problem? Given the $K$ -armed bandit $\nu \equiv (\nu_1,\dots, \nu_K)$ and a time horizon $n$ , our goal is to minimise the quantity

$\displaystyle \mathcal R_n(\pi, \nu) = \sum_{i=1}^K \mathbb E[T_i(n)] \Delta_i,$

where:
- $\pi$ denotes our choice of arms, i.e., our solution,
- $T_i(n)$ denotes the total number of pulls of arm $i$ as of time $n$ ,
- and $\Delta_i = \rho_{\mathbb E}(\nu_1) - \rho_{\mathbb E}(\nu_i)$ denotes the suboptimality gap.
Here, we assumed that $\rho_{\mathbb E}(\nu_1)$ is the unique maximum of $\{ \rho_{\mathbb E}(\nu_1), \dots, \rho_{\mathbb E}(\nu_K)\}$ , so that $\Delta_1 = 0$ and $\Delta_i > 0$ for $i \neq 1$ .

As a first pass, let’s explore, then commit.
1. We fix an exploration time $m \in \mathbb N$ , and we explore each arm $m$ times first. During this exploration phase, we would obtain the quantity $\hat{\mu}_i = (X_{i,1} + \dots + X_{i,m})/m$ , where each $X_{i,t} \sim \nu_i$ independently.
2. For the remaining $(n - mK)$ turns we pull the arm $i$ that corresponds to $\hat{\mu}_{\max} := \max_i \hat{\mu}_i$ .
This algorithm is called the explore-then-commit algorithm.

Lemma 1. For each arm $i$ ,

$\displaystyle \mathbb E[T_i(n)] \leq m + (n - mK) \cdot \mathbb P((\hat{\mu}_i - \mu_i) - (\hat{\mu}_1 - \mu_1) \geq \Delta_i).$

Proof. For any arm $i \neq 1$ , after the initial exploration phase,

$\begin{aligned} E_i &:= \{T_i(n) = n - mK + m\} \\ &= \{\hat{\mu}_i = \hat{\mu}_{\max}\} \\ &= \bigcap_{j \neq i} \{\hat{\mu}_i \geq \hat{\mu}_j\} \\ &\subseteq \{\hat{\mu}_i \geq \hat{\mu}_1\} \\ &=\{ (\hat{\mu}_i - \mu_i) - (\hat{\mu}_1 - \mu_1) \geq \mu_1 - \mu_i\} \\ &=\{ (\hat{\mu}_i - \mu_i) - (\hat{\mu}_1 - \mu_1) \geq \Delta_i\} \end{aligned}$

Therefore, for each arm $i$ ,

$\begin{aligned} \mathbb E[T_i(n)] &= \mathbb E[T_i(n) \mid E_i] \cdot \mathbb P(E_i) \\ &= m + (n-mK) \cdot \mathbb P(E_i) \\ &\leq m + (n - mK) \cdot \mathbb P((\hat{\mu}_i - \mu_i) - (\hat{\mu}_1 - \mu_1) \geq \Delta_i). \end{aligned}$

Therefore, the efficacy of the explore-then-commit algorithm depends on our ability to bound the term

$\displaystyle p_i := \mathbb P((\hat{\mu}_i - \mu_i) - (\hat{\mu}_1 - \mu_1) \geq \Delta_i),$

which itself depends on the underlying distributions $\nu$ . There are many probability distributions that we could consider, such as the Bernoulli distribution or the Gaussian distribution. The Gaussian distribution, in particular, yields particularly pleasant concentration results.

Lemma 2. For any $\sigma > 0$ , if $X \sim \mathcal N(0, \sigma^2)$ , then

$\mathbb E[\exp(\lambda X)] = \exp(\lambda^2 \sigma^2/2).$

Proof. Fix $\lambda \in \mathbb R$ and $Z \sim \mathcal N(0, 1)$ . By definition,

$\begin{aligned} \mathbb E[\exp(\lambda X)] &= \int_{\mathbb R} e^{\lambda x} \cdot \frac 1{\sqrt{2\pi}} e^{-x^2/2}\, \mathrm dx \\ &= \frac 1{\sqrt{2\pi}} \int_{\mathbb R} e^{-((x - \lambda)^2 - \lambda^2)/2}\, \mathrm dx \\ &= e^{\lambda^2/2} \cdot \frac 1{\sqrt{2\pi}} \int_{\mathbb R} e^{-(x - \lambda)^2/2}\, \mathrm dx \\ &=e^{\lambda^2/2} \cdot 1 = e^{\lambda^2/2}. \end{aligned}$

Therefore, $\mathcal N(0, 1)$ is $1$ -sub-Gaussian.

For the general case, write $X = \sigma Z \sim \mathcal N(0, \sigma^2)$ . Then

$\mathbb E[\exp(\lambda X)] = \mathbb E[\exp((\sigma \lambda) Z)] = e^{(\sigma \lambda)^2/2} = e^{\lambda^2 \sigma^2/2}.$

Therefore, $\mathcal N(0, \sigma^2)$ is $\sigma$ -sub-Gaussian.

Hence, we generalise this inequality property. Any distribution that satisfies this inequality property shall hence-forth be called sub-Gaussian.

Definition 1. Let $\nu$ denote a probability distribution. We say that $\nu$ is $\sigma$ -sub-Gaussian if for any $X \sim \nu$ and $\lambda \in \mathbb R$ ,

$\mathbb E[\exp(\lambda X)] \leq \exp(\lambda^2 \sigma^2/2).$

Unsurprisingly, $\mathcal N(0, \sigma^2)$ is $\sigma$ -sub-Gaussian. Intuitively, its tails behave like a Gaussian.

Theorem 1. Let $\nu$ be $\sigma$ -sub-Gaussian. Then for any $X \sim \nu$ and $\epsilon > 0$ ,

$\begin{aligned} \mathbb P(X \geq \epsilon) &\leq \exp\left( -\frac{\epsilon^2}{2\sigma^2} \right). \end{aligned}$

Proof. We use the Cramer-Chernoff method by taking exponentials followed by applying Markov’s inequality: for any $\lambda > 0$ ,

$\begin{aligned} \mathbb P(X \geq \epsilon) & = \mathbb P(\exp(\lambda X) \geq \exp(\lambda \epsilon)) \\ &\leq \frac{\mathbb E[\exp(\lambda X)]}{\exp(\lambda \epsilon)} \\ &\leq \frac{ \exp(\lambda^2 \sigma^2/2) }{\exp(\lambda \epsilon)} \\ &= \exp(\lambda^2 \sigma^2/2 - \lambda \epsilon). \end{aligned}$

In particular, $\lambda^2 \sigma^2/2 - \lambda$ is minimised by $\lambda = \frac 12 (2\epsilon/\sigma^2)) = \epsilon/\sigma^2$ , so that

$\displaystyle \frac{\lambda^2 \sigma^2}2 - \lambda \epsilon = \frac{\epsilon^2}{\sigma^4} \cdot \frac{\sigma^2}{2} - \frac{\epsilon}{\sigma^2} \cdot \epsilon = -\frac{\epsilon^2}{2\sigma^2}.$

Therefore, particularising to $\lambda = \epsilon^2/\sigma^2$ ,

$\begin{aligned} \mathbb P(X \geq \epsilon) &\leq \exp\left( -\frac{\epsilon^2}{2\sigma^2} \right). \end{aligned}$

Remark 1. Given $X \sim \nu$ with zero mean, we say that $X$ is $\sigma$ -sub-Gaussian if $\nu$ is $\sigma$ -sub-Gaussian.

Sub-Gaussian distributions can produce new distributions.

Lemma 3. Let $X$ be $\sigma$ -sub-Gaussian, $X_1$ be $\sigma_1$ -sub-Gaussian, and $X_2$ be $\sigma_2$ -sub-Gaussian, all independent of each other. Then the following hold:
- For any $\alpha \in \mathbb R$ , $\alpha X$ is $( |\alpha| \sigma)$ -sub-Gaussian.
- $X_1 + X_2$ is $\sqrt{\sigma_1^2 + \sigma_2^2}$ -sub-Gaussian.
Proof. By a similar argument in Lemma 2, for any $\lambda \in \mathbb R$ ,

$\begin{aligned} \mathbb E[\exp(\lambda (\alpha X))] &= \mathbb E[\exp( (\lambda \alpha) X)] \\ &\leq \exp((\lambda \alpha)^2 \sigma^2/2) \\ &\leq \exp(\lambda^2 (|\alpha| \sigma)^2/2), \end{aligned}$

as required. The second result follows similarly by denoting $\sigma^2 = \sigma_1^2 + \sigma_2^2$ and the independence of the random variables:

$\begin{aligned} \mathbb E[\exp(\lambda (X_1 + X_2))] &= \mathbb E[\exp(\lambda(X_1 + X_2))] \\ &= \mathbb E[\exp(\lambda X_1)\exp(\lambda X_2)] \\ &= \mathbb E[\exp(\lambda X_1)] \cdot \mathbb E[\exp(\lambda X_2)] \\ &\leq \exp(\lambda^2 \sigma_1^2/2) \cdot \exp(\lambda^2 \sigma_2^2/2) \\ &\leq \exp(\lambda^2 (\sigma_1^2 + \sigma_2^2)/2) \\ &\leq \exp(\lambda^2 \sigma^2/2), \end{aligned}$

as required.

Definition 2.We say that a probability distribution $\nu$ has zero mean if for any $X \sim \nu$ , $\mathbb E[X] = \nu$ .

Corollary 1. Suppose for simplicity that each $\nu_i$ is $1$ -sub-Gaussian. Then the explore-then-commit (i.e. ETC) algorithm yields a regret upper bound of

$\displaystyle \mathcal R_n(\text{ETC}, \nu) \leq m \sum_{i=1}^K \Delta_i + (n-mK) \cdot \sum_{i=1}^K \exp\left( -\frac{m \Delta_i^2}{4} \right) \Delta_i.$

Proof. Suppose each $\nu_i$ is $1$ -sub-Gaussian. By Lemma 3, the random variable $\hat{\mu}_i - \mu_i$ has zero mean and is $(1/\sqrt m)$ -sub-Gaussian. Likewise,

$(\hat{\mu}_i - \mu_i) - (\hat{\mu}_1 - \mu_1)$

has zero mean and is $\sqrt{2/m}$ -sub-Gaussian. By Theorem 1,

$\displaystyle \mathbb P(\hat{\mu}_i - \mu_i - (\hat{\mu}_1 - \mu_1) \geq \Delta_i) \leq \exp\left( -\frac{m \Delta_i^2}{4} \right).$

Substituting into Lemma 1 and the definition of $\mathcal R_n(\text{ETC}, \nu)$ , we have

$\displaystyle \mathcal R_n(\text{ETC}, \nu) \leq m \sum_{i=1}^K \Delta_i + (n-mK) \cdot \sum_{i=1}^K \exp\left( -\frac{m \Delta_i^2}{4} \right) \Delta_i.$

Having a result that works for basically any sub-Gaussian distribution $\nu$ , we must ask a question: what other distributions are sub-Gaussian? Obviously, Lemma 2 shows us that the Gaussian is sub-Gaussian par excellence. Bounded random variables ought not be too far off either.

Example 1. Given a probability distribution $\nu$ with zero mean, suppose that there exists $a < b$ such that for any $X \sim \nu$ , $X \in [a, b]$ almost surely. Then $\nu$ is $(b-a)/2$ -sub-Gaussian.

Proof. We follow this post. Denote $M_X(t) = \mathbb E[\exp(tX)]$ . By differentiating under the integral sign, we have

$M_X^{(n)}(t) = \mathbb E[X^n \exp(tX)]$

for any natural number $n$ . Denote $\psi_X(t) := \log M_X(t)$ , where the logarithm is always taken with base $e$ . Differentiating twice,

$\displaystyle \psi_X'(t) = \frac{M_X'(t)}{M_X(t)}$

and hence,

$\begin{aligned} \psi_X''(t) &= \frac{M_X''(t)}{M_X(t)} - (\psi_X'(t))^2 \\ &= \mathbb E\left[ X^2 \cdot \frac{\exp(tX)}{M_X(t)} \right] - \mathbb E\left[X \cdot \frac{\exp(tX)}{M_X(t)}\right]^2 \\ &= \mathbb E[X^2 \cdot g_t(X)] - \mathbb E[X g_t(X)]^2,\end{aligned}$

where we denoted $g_t(x) := \exp(tx)/M_X(t)$ for brevity. For each $t$ , define the probability measure $\mathbb P_t$ by

$\displaystyle \mathbb P_t(K) := \int_K g_t(X)\, \mathrm d\nu = \mathbb E[\mathbb I_K \cdot g_t(X)],$

so that $\mathrm d\mathbb P_t/\mathrm d\nu = g_t(X)$ . For any random variable $Y$ ,

$\displaystyle \mathbb E_t[Y] := \int_{\mathbb R}Y\, \mathrm{d}\mathbb P_t = \int_{\mathbb R}Y \cdot g_t(X)\, \mathrm{d}\nu = \mathbb E[Y \cdot g_t(X)].$

Therefore,

$\begin{aligned} \psi_X''(t) &= \mathbb E[X^2 \cdot g_t(X)] - \mathbb E[X g_t(X)]^2 \\ &= \mathbb E_t[X^2] - \mathbb E_t[X^2] \\ &= \mathbb E_t[ (X - \mathbb E_t[X])^2 ] \\ &= \min_{c \in \mathbb R} \mathbb E_t[ (X - c)^2 ] \\&\leq \mathbb E_t\left[ \left(X - \frac{a+b}{2} \right)^2 \right] \\&\leq \mathbb E_t\left[ \left(b - \frac{a+b}{2} \right)^2 \right] \\ &= \frac{(b-a)^2}{4} \cdot \underbrace{ \mathbb E[g_t(X)] }_1 = \frac{(b-a)^2}{4}.\end{aligned}$

Now fix $\lambda \in \mathbb R$ . Use Taylor’s theorem to construct $\eta$ between $0$ and $\lambda$ such that

$\begin{aligned} \psi(\lambda) &= \psi(0) + \psi'(0) \lambda + \frac{\psi''(\eta)}{2} \cdot \lambda^2 \\ &\leq \log(1) + \mathbb E[X] \cdot \lambda + \frac{(b-a)^2}{4} \cdot \frac 12 \cdot \lambda^2 \\ &= 0 + 0 + \frac{\lambda^2 (b-a)^2}{8}. \end{aligned}$

Therefore, $\mathbb E[\exp(\lambda X)] = \exp(\psi(\lambda)) \leq \exp(\lambda^2 (b-a)^2/8)$ , which implies that $\nu$ is $(b-a)/2$ -sub-Gaussian.

Remark 2. In the case where $X$ has nonzero finite mean $\mathbb E[X] \neq 0$ , we say that $X$ is $\sigma$ -sub-Gaussian if its noise $(X - \mathbb E[X])$ is.

Example 2. Given a probability distribution $\nu$ , suppose that there exists $a < b$ such that for any $X \sim \nu$ , $X \in [a, b]$ almost surely. Then $\nu$ is $(b-a)/2$ -sub-Gaussian.

Proof. $X - \mathbb E[X] \in [a - \mathbb E[X], b - \mathbb E[X]]$ has zero mean and by Example 1, is sub-Gaussian with parameter

$((b - \mathbb E[X]) - (a - \mathbb E[X]))/2 = (b-a)/2.$

Example 3. For any $p \in [0, 1]$ , the distribution $\mathrm{Ber}(p)$ is $1/2$ -sub-Gaussian.

Proof. Given $X \in \mathrm{Ber}(p)$ , $X \in \{0, 1\} \subseteq [0, 1]$ . By Example 4, it is $(1-0)/2 = 1/2$ -sub-Gaussian.

Returning to Corollary 1, we notice that the exploration time $m$ could plausibly be optimised to minimise the right-hand side. The smaller the $m$ term, the less exploration we do, yielding a higher probability of inefficient exploitation. On the other hand, the higher the $m$ term, the more exploration we do, potentially compromising on the magnitude of our exploitation. Could we balance this trade-off more organically?

The next algorithm aims to do just that—the upper confidence bound (UCB) algorithm. We turn to this algorithm next time.

—Joel Kindiak, 10 Feb 26, 2229H
May 12, 2026
Return to Bandits
This set of posts hit close to my heart, as the subject matter of my undergraduate research experience. Most of what I write references the “Bandit Bible” by Tor Lattimore and Csaba Szepesvári. In a sense, the goal of these posts is to communicate my research, and hopefully, accomplish its lofty potential (that did not happen while I was an undergraduate).

Suppose you entered a casino with 1000 one-dollar coins, and there are 5 slot machines in front of you. Each turn, you insert one coin into a machine of your choice. Your goal is to maximise your gains from the machine. How would you allocate your coins?

Definition 1. A stochastic $K$ -armed bandit is a collection of $K$ probability distributions $\nu := ( \nu_1,\dots, \nu_k )$ , called arms.

Suppose you pull the arms for a total of $n$ turns, called the time horizon of your experiment. At each time $t = 1,2,\dots, n$ , denote the arm pulled by $A_t \in \{1,\dots, k\}$ . Given the arm being pulled, you receive a reward $X_t \sim \nu_{A_t}$ . Your goal is to maximise your total reward:

$\displaystyle \sum_{t=1}^n X_t.$

Seeing that this task is rather ambitious, since we need to optimise over uncountably many possibilities, we can, and should, simplify our objectives. Instead of maximising our total reward, lets maximise the total expected reward:

$\displaystyle \mathbb E\left[ \sum_{t=1}^n X_t \right].$

For this goal to be a meaningful one, one of the arms should have the highest mean. Let’s formalise this intuition a bit more:
- For each $i = 1, \dots, n$ , assume that the mean of the $i$ -th arm $\mu_i := \mathbb E[Y_i]$ , where $Y_i \sim \nu_i$ , is finite.
- In this case, the quantity $\mu_{\max} := \max\{ \mu_1,\dots, \mu_n\}$ exists. Suppose $\mu_{\max} = \mu_1$ without loss of generality.
- Suppose furthermore that for $i \neq 1$ , $\mu_i < \mu_1$ . Define the suboptimality gap by $\Delta_i := \mu_1 - \mu_i$ .
Hence, we would like to choose our arms so that we can minimise our cumulative regret:

$\displaystyle \mathcal R_n(\nu) = n \mu_1 - \mathbb E\left[ \sum_{t=1}^n X_t \right].$

We call our choice of arms our policy, denoted $\pi$ . More formally, we observe that at each time $(t+1)$ , we have access to which arms we have pulled and collected rewards from. Due to our ignorance of arm $1$ ‘s optimality, we can, and should, select arms probabilistically:

$A_t \sim \pi_t( \cdot \mid A_1, X_1, \dots, A_t, X_t).$

Different policies $\pi$ would lead to different probability kernels $\pi_t(\cdot \mid \cdot)$ , and some distributions yield better (or worse) results. Hence, our regret should really by denoted

$\displaystyle \mathcal R_n(\pi, \nu) = n \mu_1 - \mathbb E\left[ \sum_{t=1}^n X_t \right] \geq 0,$

to emphasise that the rewards $X_t$ obtained depends on the policy $\pi$ . Furthermore, this quantity is randomised, since the choice of $X_t$ is randomised.

Theorem 1. Using our formalisations, we can decompose the regret as follows:

$\displaystyle \mathcal R_n(\pi, \nu) = \sum_{i=1}^K \mathbb E[T_n(i)] \Delta_i ,\quad T_n(i) := \sum_{t=1}^n \mathbb I\{A_t = i\}.$

Proof. At each $t$ ,

$\displaystyle X_t = \sum_{i=1}^K X_{A_t} \cdot \mathbb I\{A_t = i\}.$

Summing over $t = 1, \dots, n$ on both sides and interchanging the finite sums,

$\displaystyle \sum_{t=1}^n X_t = \sum_{i=1}^K \sum_{t=1}^n X_{A_t} \cdot \mathbb I\{A_t = i\}.$

Taking expectations,

$\begin{aligned} \mathbb E\left[ \sum_{t=1}^n X_t \right] &= \sum_{i=1}^K \sum_{t=1}^n \mathbb E[ X_{A_t} \cdot \mathbb I\{A_t = i\} ]. \end{aligned}$

Using the tower rule,

$\begin{aligned} \mathbb E[ X_{A_t} \cdot \mathbb I\{A_t = i\} ] &= \mathbb E[ \mathbb E[ X_{A_t} \cdot \mathbb I\{A_t = i\} \mid A_t]] \\ &= \mathbb E[ \mathbb I\{A_t = i\} \cdot \mathbb E[ X_{A_t} \mid A_t]] \\ &= \mathbb E[ \mathbb I\{A_t = i\} \cdot \mu_{A_t}] \\ &= \mathbb E[ \mathbb I\{A_t = i\} \cdot \mu_i].\end{aligned}$

Using the definition of $N_t(i)$ ,

$\begin{aligned} \mathcal R_n(\pi, \nu) &= n \mu_1 - \mathbb E\left[ \sum_{t=1}^n X_t \right] \\ &= n \mu_1 - \sum_{i=1}^K \sum_{t=1}^n \mathbb E[ X_{A_t} \cdot \mathbb I\{A_t = i\} ] \\ &= n \mu_1 - \sum_{i=1}^K \sum_{t=1}^n \mathbb E[ \mathbb I\{A_t = i\} \cdot \mu_{i}] \\ &= \sum_{i=1}^K \sum_{t=1}^n \mathbb E[ \mathbb I\{A_t = i\} \cdot (\mu_1 - \mu_i)] \\ &= \sum_{i=1}^K \sum_{t=1}^n \mathbb E[ \mathbb I\{A_t = i\} \cdot \Delta_i] \\ &= \sum_{i=1}^K\mathbb E\left[ \sum_{t=1}^n \mathbb I\{A_t = i\} \right]\Delta_i \\ &= \sum_{i=1}^K\mathbb E[T_n(i)] \Delta_i. \end{aligned}$

My work focused on risk-averse bandits, which basically asks for the same cumulative regret minimisation goal, except that the vanilla suboptimality gap $\Delta_i$ gets spiced up just a bit:

$\displaystyle \mathcal R_n^{\rho}(\pi, \nu) = \sum_{i=1}^K \mathbb E[T_n(i)] \Delta _i^\rho.$

Here, $\rho$ is not a number. Rather, it is a risk-functional. Let me elaborate on that. Given $X \sim \nu_i$ , we have $\mathbb E[X] = \mu_i$ . Furthermore, given that $X, Y \sim \nu_i$ , we have $\mathbb E[X] = \nu_i = \mathbb E[Y]$ . By denoting $\rho_{\mathbb E}(\nu_i) := \mathbb E[X]$ where $X \sim \nu_i$ , we have

$\Delta_i = \mu_1 - \mu_i = \rho_{\mathbb E}(\nu_1) - \rho_{\mathbb E}(\nu_i).$

Therefore, $\rho_{\mathbb E}$ takes probability measures as inputs and returns a real number as an output that measures the relative risk of the arm, and is called a risk functional. For instance, $\rho_{\mathbb{V}}$ could correspond to the variance risk:

$\rho_{\mathbb{V}}(\nu_i) := \mathrm{Var}(X_i) \equiv \mathbb E[(X-\mathbb E[X])^2],\quad X \sim \nu_i.$

A popular risk measure in modern portfolio theory is the mean-variance with parameter $r > 0$ :

$\rho_{\mathrm{MV}} := r \cdot \rho_{\mathbb E} - \rho_{\mathbb{V}}.$

The risk measure that I was initially interested in was the conditional value-at-risk.

In any case, given a risk functional $\rho$ , assuming that $\rho(\nu_1)$ is the uniquely maximum value, we can define the $\rho$ -suboptimality gap by

$\Delta_i^\rho := \rho(\nu_1) - \rho(\nu_i).$

Hence, our goal is to design a policy that minimises the cumulative $\rho$ -regret:

$\displaystyle \mathcal R_n^{\rho}(\pi, \nu) = \sum_{i=1}^K \mathbb E[T_n(i)] \Delta _i^\rho.$

We are going to first work with the usual case $\rho_{\mathbb E}(\nu_i) = \mathbb E[X]$ , where $X \sim \nu_i$ . Here, $\rho = \rho_{\mathbb E}$ corresponds to the usual expectation. We will explore the following three algorithms:
- Explore-then-Commit
- Upper Confidence Bound
- Thompson Sampling
It is this last algorithm that consumed my waking and sleeping hours in my undergraduate years.

—Joel Kindiak, 9 Feb 26, 1306H
May 11, 2026
Jordan Decomposition
This post is inspired by Professor Tom Fischer’s writeup.

Recall the Hahn decomposition theorem:

Theorem 1. If $\mu$ is a signed measure on $(\Omega, \mathcal F)$ , then there exist disjoint measurable subsets $P, N \in \mathcal F$ such that

$\mu (\cdot \cap P) \in [0, \infty),\quad \mu (\cdot \cap N) \in (-\infty, 0] ,\quad P \sqcup N = \Omega.$

We call $P$ a positive set, denoted $P \geq 0$ , and $N$ a negative set, denoted $N \leq 0$ , and we call the pair $(P, Q)$ a Hahn decomposition for $\mu$ .

Call a set $M$ is $\mu$ –null if it is positive and negative: $\mu( \cdot \cap M) \in \{0\}$ .

Define the symmetric difference by

$K\triangle L := (K \backslash L) \sqcup (L \backslash K).$

We leave it as an exercise to verify that

$K \cup L = (K \cap L) \sqcup (K\triangle L).$

Problem 1. Let $\mu$ be a signed measure on $(\Omega, \mathcal F)$ and $(P_1,N_1)$ , $(P_2,N_2)$ be two Hahn decompositions for $\mu$ . Show that $P_1 \triangle P_2$ and $N_1 \triangle N_2$ are $\mu$ -null.

(Click for Solution)

Solution. Fix $K \in \mathcal F$ . Then

$\begin{aligned} \mu(K \cap (P_1 \triangle P_2)) &= \mu(K \cap P_1 \backslash P_2) + \mu(K \cap P_2 \backslash P_1) \\ &= \mu(K \cap P_1 \cap N_2) + \mu(K \cap P_2 \cap N_2) \\ &= \mu(K \cap P_1 \cap N_2) + \mu(\emptyset) \\ &= \mu(K \cap P_1 \cap N_2). \end{aligned}$

Using Theorem 1,

$\begin{aligned} 0 &\leq \mu((K \cap N_2) \cap P_1) \\ &= \mu(K \cap P_1 \cap N_2) \\ &= \mu((K \cap P_1) \cap N_2) \leq 0, \end{aligned}$

so $\mu(K \cap (P_1 \triangle P_2)) = \mu(K \cap P_1 \cap N_2) = 0$ , as required.

We now state the Jordan decomposition theorem.

Problem 2. Let $\mu$ be a finite signed measure on $(\Omega, \mathcal F)$ . Construct unique finite measures $\mu^+, \mu^-$ on $(\Omega , \mathcal F)$ such that:
- $\mu = \mu^+ - \mu^-$ , and
- for any Hahn decomposition $(P, N)$ of $(\Omega, \mathcal F)$ , $\mu^+(K) = 0$ for $K \subseteq N$ and $\mu^-(K) = 0$ for $K \subseteq P$ .
We call $(\mu^+, \mu^-)$ the (unique) Jordan decomposition of $\mu$ .

(Click for Solution)

Solution. Use Theorem 1 to construct a Hahn decomposition $(P_1, N_1)$ . Define

$\begin{aligned} \mu^+ &:= \mu(\cdot \cap P_1) \in [0, \infty),\\ \mu^- &:= -\mu(\cdot \cap N_1) \in [0, \infty). \end{aligned}$

Then for any $K \in \mathcal F$ ,

$\begin{aligned} \mu(K) &= \mu(K \cap P_1) + \mu(K \cap N_1) \\ &= \mu^+(K) - \mu^-(K) \\ &= (\mu^+ - \mu^-)(K). \end{aligned}$

Hence, $\mu = \mu^+ - \mu^-$ .

Now fix any Hahn decomposition $(P_2, N_2)$ . Then

$\begin{aligned} P_1 &= P_1 \cap (P_1 \cup P_2) \\ &= P_1 \cap ((P_1 \cap P_2) \sqcup (P_1 \triangle P_2)) \\ &= (P_1 \cap (P_1 \cap P_2)) \sqcup (P_1 \cap (P_1 \triangle P_2)) \\ &= (P_1 \cap P_2) \sqcup (P_1 \cap (P_1 \triangle P_2)).\end{aligned}$

Fix $K \subseteq N_2$ . Then

$\begin{aligned} \mu^+(K) &= \mu(K \cap P_1) \\ &= \mu(K \cap P_1 \cap P_2) + \mu(K \cap P_1 \cap (P_1 \triangle P_2)) \\ &= \mu(K \cap \emptyset) + \mu((K \cap P_1) \cap (P_1 \triangle P_2)) \\ &= 0 + 0 = 0. \end{aligned}$

Similarly, $K \subseteq P_2$ implies $\mu^-(K) = 0$ .

Finally, we establish the uniqueness of the measures. Suppose

$\mu = \mu_1^+ - \mu_1^- = \mu_2^+ - \mu_2^-.$

We need to check that $\mu_1^+ = \mu_2^+$ . To that end, fix any Hahn decomposition $(P, N)$ . For any $K \subseteq P$ ,

$\mu_1^+(K \cap N) = 0 = \mu_2^+(K \cap N)$

and

$\mu_1^-(K) = 0 = \mu_2^-(K)$

so that

$\begin{aligned} \mu_1^+(K) &= \mu(P) - \mu_1^-(K) \\ &= \mu(P) - \mu_2^-(K) = \mu_2^+(K). \end{aligned}$

Hence, for any $K = (K \cap P) \sqcup (K \cap N)$ ,

$\begin{aligned} \mu_1^+(K) &= \mu_1^+(K \cap P) + \mu_1^+(K \cap N) \\ &= \mu_2^+(K \cap P) + \mu_2^+(K \cap N) = \mu_2^+(K). \end{aligned}$

—Joel Kindiak, 7 Jan 26, 1212H
March 30, 2026
Risky Bernoulli Bandits
In this appendix for the multi-armed bandit writeups, I thought I’d revisit my final year project in a relatively readable manner, demonstrating how $\rho$ -Thomoson Sampling is asymptotically optimal for Bernoulli bandits (that is, $\nu_i = \mathrm{Ber}(p_i)$ for each $i$ ). This is the instantiation $M = 1$ of my final year project.

As a set-up, we initialise $\mathbb P_{i,0} = \mathrm{Beta}(\alpha, \beta)$ with p.d.f.

$\displaystyle f_{\alpha , \beta}(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \cdot x^{\alpha - 1} \cdot (1-x)^{\beta - 1},$

since the conjugate prior of a Bernoulli distribution is the Beta distribution. For the update function, we use

$\mathrm{Update}(\Gamma(\alpha, \beta)) = \Gamma(\alpha+ 1,\beta)$

and $\mathbb P_{i,t+1} = \mathrm{Update}(\mathbb P_{i,t})$ . At each time $t$ , we sample $\theta_{i,t+1} \sim \mathbb P_{i,t}$ and pull the arm

$\displaystyle A_{t+1} = \underset{i=1,\dots, K}{\arg \max}\ \rho(\mathrm{Ber}( \theta_{i,t+1} )).$

Observe that the map $p \mapsto \mathrm{Ber}(p) \mapsto \rho(\mathrm{Ber}(p))$ induces a unique map $\tilde{\rho} : [0, 1] \to \mathbb R$ given by $\tilde{\rho} = \rho(\mathrm{Ber}(p))$ . Following the Thompson sampling strategy, we need to obtain useful tail concentration bounds. We will impose some technical conditions on the risk functional $\rho$ to make this happen.

Definition 1. Call a risk functional $\rho$ continuous if $\tilde{\rho} : [0,1] \to \mathbb R$ is continuous.

If $\rho$ is continuous, then by the extreme value theorem, $\tilde{\rho}([0,1])$ is compact. Hence, there exists $p_1, p_2 \in [0,1]$ such that for any $p \in [0, 1]$ ,

$\tilde{\rho}(p_1) \leq \tilde{\rho}(p) \leq \tilde{\rho}(p_2).$

Lemma 1. If $\rho$ is continuous, then for any $p \in [0,1]$ and $r \in \mathbb R$ , there exists $p_* \in [0, 1]$ such that

$\mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(p), r) \equiv \mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(p), r, \mathrm{Ber}) = \mathrm{KL}(\mathrm{Ber}(p), \mathrm{Ber}(p_*)).$

Proof. By definition,

$\displaystyle \mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(p), r, \mathrm{Ber}) = \inf_{q : \tilde{\rho}(q) > r} \underbrace{ \mathrm{KL}(\mathrm{Ber}(p), \mathrm{Ber}(q)) }_{f(q)}.$

By a direct computation,

$\displaystyle \mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(p), r, \mathrm{Ber}) = \inf_{q : \tilde{\rho}(q) > r} \underbrace{ p \log \frac pq + (1-p) \log \frac{1-p}{1-q} }_{f(q)},$

so that $f$ is continuous on $(0, 1)$ .

If $r < \tilde{\rho}(p_1)$ , then we may simply choose $p_* = p$ . If $r \geq \tilde{\rho}(p_2)$ , then the constraint set is empty and the left-hand side is infinite, so we choose $p_* = 0$ if $p \neq 0$ and $p_* = 1$ otherwise.

Suppose $\tilde{\rho}(p_1) \leq r < \tilde{\rho}(p_2)$ . If $p = 0$ or $p = 1$ , then $f(q) = 0$ identically, so that we can choose $p_* = p_2$ .

Suppose instead that $p \in (0, 1)$ . Fix $\epsilon \in (0, \tilde{\rho}(p_2) - r)$ . Then $[r + \epsilon, \tilde{\rho}(p_2)]$ is compact, so that

$\tilde{\rho}^{-1}( [r + \epsilon, \infty) ) = \tilde{\rho}^{-1}( [r+ \epsilon, \tilde{\rho}(p_2) ] )$

is closed and bounded, and thus compact by the Heine-Borel theorem. Since the sets

$\begin{aligned} C_1(\epsilon) &:= \tilde{\rho}^{-1}( [r+ \epsilon, \infty) ) \cap [0, p],\\ C_2 (\epsilon)&:= \tilde{\rho}^{-1}( [r+ \epsilon, \infty) ) \cap [p, 1], \end{aligned}$

are also compact, we can define their extrema

$q_1(\epsilon) := \max C_1(\epsilon), \quad q_2(\epsilon) := \min C_2(\epsilon).$

At least one of these sets will always be non-empty, since $r_2 \in [0, 1]$ and $\rho(r_2) > r+\epsilon > r$ . If $C_1(\epsilon)$ is always empty, then we can omit its discussion later on. However, if $C_1(\epsilon_0) \neq \emptyset$ for some $\epsilon_0$ , then $C_1(\epsilon) \neq \emptyset$ for any $\epsilon \in (0, \epsilon_0)$ . Without loss of generality then, we assume both sets are non-empty for some sufficiently small $\epsilon > 0$ .

Using monotonicity properties, $f(q_j(\epsilon)) \leq f(q)$ for any $q \in C_j(\epsilon)$ . Therefore,

$\displaystyle \inf_{q \in C_j(\epsilon)} f(q) = f(q_j(\epsilon)).$

Now it is clear that $C_j(\epsilon_1) \subseteq C_j(\epsilon_2)$ whenever $\epsilon_1 \geq \epsilon_2$ . By the monotone convergence theorem, there exist $q_j \in [0,1]$ such that $q_j(1/n) \to q_j$ and

$\displaystyle \inf_{q \in C_j(0)} f(q) = f(q_j).$

Define $C_j'(\epsilon) := C_j(\epsilon) \backslash \rho^{-1}(r)$ . Since smaller sets yield larger infimums, for sufficiently large $n$ ,

$\displaystyle f(q_j) = \inf_{q \in C_j(0)} f(q) \leq \inf_{q \in C_j'(0)} f(q) \leq \inf_{C_j(1/n)} f(q) \leq f(q_j(1/n)).$

Taking $n \to \infty$ , by the squeeze theorem,

$\displaystyle \inf_{q \in C_j'(0)} f(q) = f(q_j).$

Finally, choose

$\displaystyle p_* = \underset{j=1,2}{\arg \min}\ f(q_j),$

to deduce that

$\mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(p), r, \mathrm{Ber}) = f(p_*) = \mathrm{KL}(\mathrm{Ber}(p), \mathrm{Ber}(p_*)).$

Remark 1. Most arguments in Lemma 1 boils down to the compactness of $[a, b]$ , and by extension, the space $\mathrm{Ber}$ of Bernoulli distributions, as well as the relative continuity properties of $\tilde{\rho}$ and $\mathrm{KL}(\cdot, \cdot)$ . Here are some proposed steps for further exploration:
- Generalise the risk functional to $\rho : \mathcal C \to \mathbb R$ , where $\mathcal C$ is a space of probability distributions that is compact under a suitable metric or topology.
- We would probably need to partition $\mathcal P = \{M_1,\dots, M_n\}$ into $n$ compact sets, then define the compact sets $C_n(\epsilon) = \tilde{\rho}^{-1}([r+\epsilon, \infty)) \cap M_n$ .
- During the sequential argument, we could use sequential compactness to concoct a convergent subsequence $q_j(1/n) \to q$ in place of a by-default convergent subsequence. That way, we might still be able to infimise over $C_j(0)$ via $C_j(1/n)$ .
- Since we only care about infimising, we do not need the strength of full continuity for $\mathrm{KL}(\cdot,\cdot)$ ; rather we would only require its lower semi-continuity property, which could be conceived as the “lower half” of vanilla continuity.
Thanks to the continuity of $\rho$ , we can obtain the pleasant tail upper bounds below.

Theorem 1. Fix $r \in \mathbb R$ and natural numbers $\alpha, \beta$ , $n := \alpha + \beta$ . Then there exists a universal constant $C_1$ such that for any random variable with Beta distribution $X \sim \mathrm{Beta}(\alpha, \beta)$ ,

$\begin{aligned} \mathbb P(\tilde{\rho}(X) \geq r) &\leq C_1\sqrt{n} \cdot \exp( -n \cdot \mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(\alpha/n), r) ), \\ \mathbb P(\tilde{\rho}(X) \leq r) &\leq C_1\sqrt{n} \cdot \exp( -n \cdot \mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(\alpha/n), r) ). \end{aligned}$

Proof. Denote the closed (and thus, compact) sets $S_1 := \tilde{\rho}^{-1}([r,\infty))$ and $S_2 := \tilde{\rho}^{-1}((-\infty, r])$ . Use Lemma 1 to construct $p_* \in [0, 1]$ such that

$\mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(\alpha/n), r) ) = \mathrm{KL}(\mathrm{Ber}(\alpha/n), \mathrm{Ber}(p_*)) \equiv \mathrm{KL}(\alpha/n, p_*)$

for brevity. Using the conjugate-prior connection between Beta distributions and Bernoulli distributions, since the sum of i.i.d. Bernoullis yield a Binomial,

$\begin{aligned} f_{Y\mid X}(y\mid x) &= \mathbb P(Y = y \mid X = x) \\ &= {n \choose y} \cdot x^{y} (1-x)^{n-y}, \end{aligned}$

where $Y = \xi_1 + \dots + \xi_n \sim \mathrm{Bin}(n,x)$ for i.i.d. $\xi_i \sim \mathrm{Ber}(x)$ . By Bayes’ rule and the law of total probability, using the uniform distribution prior $\mathrm{Beta}(1, 1)$ with p.d.f. $1$ :

$\begin{aligned} f_{X|Y}(x \mid y) &= \frac{ f_{Y\mid X}(y \mid x) \cdot f_X(x) }{ f_Y(y) } \\ &= \frac{ f_{Y\mid X}(y \mid x) \cdot f_X(x) }{ \int_{[0,1]} f_{Y \mid X}(y \mid x) \cdot f_X(x) \, \mathrm dx} \\ &= \frac{ {n \choose y} \cdot x^{y} (1-x)^{n-y} \cdot 1 }{ \int_{[0,1]} {n \choose y} \cdot x^{y} (1-x)^{n-y} \cdot 1 \, \mathrm dx} \\ &= \frac{ x^{y} (1-x)^{n-y} }{ \int_{[0,1]} x^{y} (1-x)^{n-y} \, \mathrm dx}. \end{aligned}$

In particular,

$\begin{aligned} \mathbb P(X \in S \mid Y = \alpha) &= \int_{S} f_{X\mid Y}(x \mid \alpha)\, \mathrm dx \\ &= \int_{S} \frac{ x^{y} (1-x)^{n-y} }{ \int_{[0,1]} x^{\alpha} (1-x)^{n-\alpha} \, \mathrm dx}, \mathrm dx \\ &= \frac{ \int_{S} x^{\alpha} (1-x)^{n-\alpha} \, \mathrm dx }{ \int_{[0,1]} x^{\alpha} (1-x)^{n-\alpha} \, \mathrm dx} . \end{aligned}$

In what follows, set $M = 1$ and follow the proof of Lemma 13 in Riou and Honda (2020) to obtain the upper bound

$\begin{aligned} \mathbb P(\tilde{\rho}(X) \geq r) = \mathbb P(X \in S_1) &\leq C\sqrt{n} \cdot \exp( -n \cdot \mathrm{KL}(\alpha/n, p_*)), \\ \mathbb P(\tilde{\rho}(X) \leq r) = \mathbb P(X \in S_2) &\leq C\sqrt{n} \cdot \exp( -n \cdot \mathrm{KL}(\alpha/n, p_*)), \end{aligned}$

where $C = e^{1/12}/2\pi$ using Stirling’s approximation. We remark that in the live algorithm, at time $t+1$ ,

$\displaystyle \alpha = \sum_{s=1}^{T_i(t)} \mathbb I\{X_s = 1\}$

would be a somewhat confusing random variable, and the tail concentration bounds are conditioned on $\alpha$ .

Remark 2. The challenge to generalise this result comes in require concrete distributions to work with, and so our general version would need to be “simplified” into, or expressed in terms of, a more computationally tractable option. One possible area for exploration would be considering the compact set of bounded-mean Gaussian distributions

$\mathcal N_K := \{ \mathcal N(\mu, \sigma^2) : (\mu, \sigma) \in K \},$

where $K \subseteq \mathbb R \times (0, \infty)$ is compact, then applying meaningful tail-bounds of the Gaussian to derive the risk version of said concentration bounds.

These upper bounds are, with more technical bookkeeping, ultimately responsible for the asymptotically optimal regret bounds. And since contiunity is a relatively benign condition, many risk functionals enjoy the tail upper bound of Theorem 1, and potentially, the asymptotically optimal regret bound for $\rho$ -Thompson Sampling.

Example 1. Given continuous risk functionals $\rho_1, \rho_2$ and constants $c_1, c_2$ , the linear combination $\rho = c_1 \rho_1 + c_2 \rho_2$ is also a continuous risk functional. See Examples 2 and 3 for myriads of continuous risk functionals that can be combined.

However, a proper proof of the Thompson sampling algorithm requires tail lower bounds. To achieve that goal, we introduce the notion of a dominant risk functional.

Definition 2. For any $p \in [0, 1]$ , define

$V(p, 0) = [p, 1],\quad V(p, 1) = [0,p].$

We say that a risk functional $\rho$ is dominant if for any $p \in [0,1]$ and $r \in \mathbb R$ , there exists $q \in [0,1]$ and $j \in \{0, 1\}$ such that

$\mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(p), r, \mathrm{Ber}) = \mathrm{KL}(\mathrm{Ber}(p), \mathrm{Ber}(q))$

and $\tilde{\rho}(V(q,j)) \subseteq [ \tilde{\rho}(q), \infty)$ . We remark that by Lemma 1, if $\rho$ is continuous, then we are guaranteed the optimised KL-divergence result.

In the original version, it was this concept that I dreamt of while struggling to solve the bandit problem. I prayed long and hard, and solved the problem in my dream thrice. I was shocked, and said to myself, “I must be dreaming. I will wake up and write down my solution.” And so at 4.30am sometime in February 2022, I did just that, and after a sanity check at 8.30am the next morning, concluded that the solution was correct.

In any case, the dominant risk functional property guarantees for us a much-needed tail lower bound.

Theorem 2. Fix $r \in \mathbb R$ and natural numbers $\alpha, \beta$ , $n := \alpha + \beta$ . If $\rho$ is dominant, then there exists another universal constant $C_2$ such that for any random variable with Beta distribution $X \sim \mathrm{Beta}(\alpha, \beta)$ ,

$\begin{aligned} \mathbb P(\tilde{\rho}(X) \geq r) &\geq \frac{C_2}{n} \cdot \exp( -n \cdot \mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(\alpha/n), r) ). \end{aligned}$

Proof. Fix $r \in \mathbb R$ . Since $\rho$ is dominant, there exists $q \in [0,1]$ and $j \in \{0, 1\}$ such that

$\mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(p), r, \mathrm{Ber}) = \mathrm{KL}(\mathrm{Ber}(p), \mathrm{Ber}(q))$

and $\tilde{\rho}(V(q,j)) \subseteq [ \tilde{\rho}(q), \infty) \subseteq [r, \infty)$ , where the last inclusion holds by $\tilde{\rho}(q) \geq r$ . Assume $j = 0$ for simplicity. Taking probabilities, and denoting

$\begin{aligned} \mathbb P(\tilde{\rho}(X) \in [r, \infty)) &\geq \mathbb P( \tilde{\rho}(X) \in \tilde{\rho}( V(q,0) )) \\ &= \mathbb P( \tilde{\rho}(X) \in \tilde{\rho}( [q, 1] )) \\ &= \mathbb P(X \in [q, 1]) \\ &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_q^1 x^{\alpha-1} (1-x)^{\beta-1}\, \mathrm dx \\ &\geq \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot q^{\alpha-1} \int_q^1 x^{\beta-1}\, \mathrm dx \\ &=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot q^{\alpha-1} \cdot \frac{(1-q)^{\beta}}{\beta} \\ &=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \underbrace{ \frac{\beta}{q} }_{\geq \beta} \cdot \underbrace{ \frac{q^{\alpha}}{(\alpha/n)^{\alpha}} \cdot \frac{(1-q)^{\beta}}{(\beta/n)^{\beta}} }_{\exp(-n \cdot \mathrm{KL}( \alpha/n, q ) )} \cdot \frac{\alpha^{\alpha} \cdot \beta^{\beta}}{n^n} \\ &\geq \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta+1)} \cdot \frac{\alpha^{\alpha} \cdot \beta^{\beta}}{n^n} \cdot \exp(-n \cdot \mathrm{KL}( \alpha/n, q ) ) . \end{aligned}$

The rest of the calculation follows from the proof of Lemma 2 in Baudry et al (2021) by using Stirling’s approximation, and yields the desired lower bound

$\displaystyle \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta+1)} \cdot \frac{\alpha^{\alpha} \cdot \beta^{\beta}}{n^n} \geq \underbrace{ \sqrt{\frac{2\pi}{2.13}} }_{C_2} \cdot \frac 1n .$

Remark 3. This tail bound eventually bounds all other terms by constants, once the exponentials cancel other exponentials out in subsequent calculations. The dream dealt with the slightly more general case $p = (p_0,p_1,p_2)$ , where $p_0+p_1+p_2 = 1$ . I couldn’t dream of the higher-dimensional scenarios. Nor did I need to, to be very honest. See Remark 2 for the computational challenge of generalising Theorem 2 to more general forms for the theoretical use of $\rho$ -Thompson Sampling.

Remark 4. The bounds in Theorems 1 and 2 work precisely for a special class of distributions, namely Bernoulli bandits (or slightly more generally, multinomial bandits). For other common classes of distributions, like Gaussians for example, we would need different tail lower bounds. Moreover, we would need to work with specific conjugate pairs of distributions, and approximate non-parametric bandits using parametric ones, which leads to messier approximation-controlling calculations when evaluating the regret bound.

But you might wonder—what functionals could pass the dominant risk functional criteria?

Lemma 2. Let $\rho$ be continuous and $c \in [0, 1]$ be a constant pivot point. Suppose $\tilde{\rho}|_{[0,c]}$ is non-increasing and $\tilde{\rho}|_{[c,1]}$ is non-decreasing. Then $\rho$ is dominant.

Proof. Fix $p \in [0, 1]$ and $r \in \mathbb R$ . Use Lemma 1 to produce $p_* \in [0, 1]$ such that

$\mathcal K_{\inf}^{\rho}(\mathrm{Ber}(p), r) = \mathrm{KL}(\mathrm{Ber}(p), \mathrm{Ber}(p_*)).$

Set $q = p_*$ . First suppose $c \in (0, 1)$ . Either $q \in [0, c]$ or $q \in [c, 1]$ . In the former, $\tilde{\rho}(s) \geq \tilde{\rho}(q)$ for $s \in [0, q]$ , which implies

$\tilde{\rho}(V(q,1)) = \tilde{\rho}([0, q]) \subseteq [\tilde{\rho}(q), \infty),$

as required. In the latter, $\tilde{\rho}(s) \geq \tilde{\rho}(q)$ for $s \in [q,1]$ , which implies

$\tilde{\rho}(V(q,0)) = \tilde{\rho}([q, 1]) \subseteq [\tilde{\rho}(q), \infty),$

as required. If $c = 0$ , then $\tilde{\rho}$ is non-decreasing, and the second case holds. Finally, if $c = 1$ , then $\tilde{\rho}$ is non-increasing, and the previous argument holds.

Remark 5. For two risk functionals $\rho_1, \rho_2$ satisfying the hypotheses of Lemma 2 with the same pivot point $c$ and non-negative constants $\alpha_1, \alpha_2$ such that $\alpha_1 + \alpha_2 = 1$ , $\rho := \alpha_1 \cdot \rho_1 + \alpha_2 \cdot \rho_2$ is dominant.

Example 2. Given $\nu_p = \mathrm{Ber}(p)$ , denote $\tilde{\rho}(p) \equiv \rho(\nu_p)$ . The following risk functionals are continuous and satisfy the hypotheses of Lemma 2 (here, let $X \sim \nu_p$ , $\phi : [0, 1] \to [0,\infty)$ have total integral $1$ , and $\Phi$ denote the c.d.f. of the standard Gaussian):
- Expected value: $\rho_{\mathbb E}(\nu_p) = \mathbb E[X] = p$ ,
- Conditional value-at-risk: $\rho_\alpha(\nu_p) = \min\{p/(1-\alpha), 1\}$
- Proportional hazard: $\rho_\alpha(\nu_p) = p^{\alpha}$
- Lookback: $\rho_q(\nu_p) = p^q(1-q\log p)$
- Spectral risk: $\rho_\phi(\nu_p) = \int_0^p \phi(t)\, \mathrm dt$
- Entropic risk: $\rho_{\mathrm{ER},\theta}(\nu_p) = -\frac 1{\theta} \log((1-p) + pe^{-\theta})$
- Dual power distortion: $\rho_{\mathrm{DP},\gamma}(\nu_p) = 1 - (1-p)^{\gamma}$
- Wang transform: $\rho_{\mathrm{W},\sigma}(\nu_p) = \Phi(\Phi^{-1}(p) + \sigma)$
- Logarithmic distortion: $\rho_{\mathrm{L}, \beta}(\nu_p) = \log(1+\beta p)/{ \log(1+\beta) }$
- $\delta$ -Sharpe ratio: $\displaystyle \rho_{\mathrm{SR}, \delta}(\nu_p) = \rho_{\mathrm{SR}}(\nu_{(1-\delta)p}) = \sqrt{\frac{ (1-\delta)p }{ 1 - (1-\delta)p}}$
Remark 6. Since the Sharpe ratio is not well-defined at $\mathrm{Ber}(1)$ , we do not have the nice compactness property of Lemma 1 for the vanilla Sharpe ratio. The dilation by $(1-\delta)$ allows the Sharpe ratio to be defined for $[0, (1-\delta)^{-1}) \supseteq [0,1]$ . Given a $K$ -armed Bernoulli bandit with maximum probability $p_{\max} \in (0,1)$ ,

$\rho_{\mathrm{SR}} (\nu_p) = \rho_{\mathrm{SR}, 1-p_{\max}} (\nu_{p/p_{\max}})$

satisfies the requirements of Lemma 1, and we recover its useful tail bounds.

Lemma 3. If $\tilde{\rho}$ is differentiable on $[0,1]$ with non-decreasing derivative $\tilde{\rho}'$ , then $\rho$ is dominant.

Proof. We claim that no matter what, $\tilde{\rho}$ will satisfy the hypotheses of Lemma 2.
- If $\tilde{\rho}'(0) \geq 0$ , then $\tilde{\rho}$ is non-decreasing.
- If $\tilde{\rho}' \leq 0$ , then $\tilde{\rho}$ is non-increasing.
- If there exists $c_0 \in (0, 1]$ such that $\tilde{\rho}'(c_0) > 0$ , then the intermediate value property of derivatives yields $c \in (0, c_0)$ such that $\tilde{\rho}'(c) = 0$ . Since $\tilde{\rho}'$ is non-decreasing, $\tilde{\rho}|_{[0, c]}$ is non-increasing and $\tilde{\rho}|_{[c, 1]}$ .
In all three cases, $\rho$ satisfies the hypotheses of of Lemma 2, and thus are dominant.

Remark 7. More generally, if $\tilde{\rho}$ is convex (which generalises Lemma 3), i.e. for any $t \in [0, 1]$ and $x, y \in [0, 1]$ ,

$\tilde{\rho}(t \cdot x + (1-t) \cdot y) \leq t \cdot \tilde{\rho}(x) + (1-t) \cdot \tilde{\rho}(y),$

the risk functional $\rho$ will be dominant. Furthermore, for any two risk functionals $\rho_1, \rho_2$ satisfying the hypotheses of Lemma 3 and non-negative constants $\alpha_1, \alpha_2$ such that $\alpha_1 + \alpha_2 = 1$ , $\rho := \alpha_1 \cdot \rho_1 + \alpha_2 \cdot \rho_2$ is dominant.

Example 3. Given $\nu_p = \mathrm{Ber}(p)$ , denote $\tilde{\rho}(p) \equiv \rho(\nu_p)$ . The following risk functionals satisfy the hypotheses of Lemma 2 (here, let $X \sim \nu_p$ ):
- Second moment: $\rho_{\text{SM}}(\nu_p) = \mathbb E[X^2] = p$
- Negative variance: $\rho_{-\mathbb V}(\nu_p) = -p(1-p)$
- Mean-variance: $\rho_{\mathrm{MV}_\gamma} = \gamma \rho_{\mathbb E} + \rho_{-\mathbb V}$
- Target semi-variance: $\rho_{\mathrm{TSV}_r}(\nu) = -\mathbb E[\min\{ X-r , 0 \}^2 ]$
- Exponential tilt: $\rho_\lambda(\nu_p) = \exp(p\lambda)$
- Quadratic utility: $\rho_a(\nu_p) = ap^2 + bp + c$
Furthermore, for two risk functionals $\tilde{\rho}_1,\tilde{\rho}_2$ and non-negative constants $c_1, c_2$ satisfying the hypotheses of Lemma 2, $\rho := c_1 \tilde{\rho}_1 + c_2 \tilde{\rho}_2$ is dominant.

Therefore, most risk functionals as listed in Examples 1, 2, and 3 are the ones that most people care about are in fact continuous and dominant, and therefore, by passing the relevant arguments through much book-keeping, enjoy the asymptotically optimal regret bound for $\rho$ -TS on the Bernoulli bandit environment.

Theorem 3. The regret bound of $\rho$ -TS over a Bernoulli bandit environment is asymptotically optimal, and given by

$\displaystyle \lim_{n \to \infty} \frac{ \mathcal R_n(\mathrm{Ber}^K, \rho\text{-}\mathrm{TS}) }{\log(n)} = \sum_{i : \Delta_i^{\rho}} \frac{\Delta_i^{\rho}}{\mathcal K_{\inf}^{\rho} (\nu_i, \rho(\nu_1))}.$

Proof. Follow the proof of Theorem 1 in Chang and Tan (2022), and apply Theorems 1 and 2 in the analysis. Left as an exercise (effectively) in algebra and mildly clever calculus.

Oh, by the way, with the help of ChatGPT, here’s a Jupyter writeup of the implemented algorithm and some pretty pictures!

The red curve indicates the theoretical asymptotic lower bound, and each diagram reflects the algorithm running for a fixed $10$ -armed Bernoulli bandit, with different risk functionals, even combinations of them:
- $\rho_1 = 0.5 \cdot \mathrm{CVaR}_{0.05} + 0.5 \cdot \mathrm{ER}_{0.5}$
- $\rho_2(\nu_p) = \rho_{\mathrm{MV}_{2}}(\nu_p) + \exp(2p)$
And for them all, the asymptotic lower bound lies happily in their $1$ -sigma bands. (Yes, each algorithm was averaged across $100$ runs!)

And with that, we are truly done. Happy lunar new year!

—Joel Kindiak, 16 Feb 26, 1131H
February 17, 2026
Real-Life Hypothesis Tests
These problems arise from my actual experience, but numbers have been fudged to protect confidentiality.

Problem 1 (Population Mean). As I taught my classes, I noticed that students are exceedingly taller than I. My height is 160 cm, so I suspect that the average height $\mu$ of students is not 160 cm. By collecting the heights $x$ cm of 30 randomly chosen students, I obtained the following data:

$\Sigma x = 4840,\quad \Sigma x^2 = 781\, 176.$

Test at the 5% significance level to determine whether my suspicion is justified.
(Click for Solution)

Solution. Let $X$ denote the height of a randomly chosen student in cm, and $\mu = \mathbb E[X]$ .

We first set up the null and alternative hypotheses:

$\mathrm H_0 : \mu = 160,\quad \mathrm H_1 : \mu \neq 160.$

Denote the population variance by $\sigma^2$ and $n = 30$ . Assume $\mathrm H_0$ holds, so that $\mu = 60$ . Since $n \geq 30$ , by the central limit theorem,

$\displaystyle \frac{\bar X_n - \mu}{\sigma / \sqrt{n}} \approx Z \sim \mathcal N(0, 1).$

Since $\sigma^2$ is unknown, we need to estimate it using $s^2$ :

$\displaystyle s^2 = \frac 1{30-1} \left( 781\, 176 - \frac{4840^2}{30} \right) \approx 11.1.$

Furthermore, we estimate $\mu$ using $\bar x$ :

$\displaystyle \bar x = \frac{4840}{30} \approx 161.$

Hence, our calculated test statistic $c$ will be

$\displaystyle c := \frac{\bar x - \mu}{s/\sqrt{n}} = \frac{161.33 - 160}{\sqrt{11.126 / 30}} \approx 2.19.$

Since $n \geq 30$ , $t(n-1) \approx \mathcal N(0, 1)$ , so that using either a $z$ – or a $t$ -test would yield similar results. Denote $T \sim t(n-1)$ and the significance level $\alpha = 0.05$ .
- Using a $z$ -table, $p \leq \alpha \iff |z| \geq 1.96$ .
- Using a $t$ -table, $p \leq \alpha \iff |t| \geq 2.05$ .
Whether we let $z = c$ or $t = c$ , it is true that $p \leq \alpha$ . Therefore, there is sufficient evidence to reject $\mathrm H_0$ and conclude that Joel’s suspicion is justified, i.e. the average height of students is larger than $160$ cm.
Problem 2 (Confidence Intervals). Keep the scenario as Problem 1 but denote the true population mean by $\mu_0$ . Use the $t$ -test for simplicity. Determine the interval of values that $\mu_0$ can take such that there is insufficient evidence to reject the null hypothesis at the 5% significance.

(Click for Solution)

Solution. By definition,

$\displaystyle t = \frac{\bar x - \mu_0}{s / \sqrt n} \quad \iff \quad \mu_0 = \bar x - t \cdot \frac{s}{\sqrt n}.$

We do not reject $\mathrm H_0$ if and only if $p > \alpha = 0.05$ . Therefore,

$p > \alpha \quad \iff \quad |t| < 2.05 =: t_{\alpha/2} \quad \iff \quad -t \in (-t_{\alpha/2}, t_{\alpha/2}).$

Therefore,

$\displaystyle \mu_0 = \bar x - t \cdot \frac{s}{\sqrt n} \in \left( \bar x - t_{\alpha/2} \cdot \frac{s}{\sqrt n}, \ \bar x + t_{\alpha/2} \cdot \frac{s}{\sqrt n}\right).$

Remark 1. We call this calculated interval the $(1-\alpha)$ -confidence interval for $\mu_0$ . Denoting a specific sample $K := \{X_1,\dots, X_n\}$ , let $\bar X_K, S_K^2$ denote the corresponding computed unbiased estimators for $\mu, \sigma^2$ respectively. Then the computed corresponding confidence interval $I_K$ will equal

$I_K = \displaystyle \left( \bar X_K - t_{\alpha/2} \cdot \frac{S_K}{\sqrt n}, \bar X_K + t_{\alpha/2} \cdot \frac{S_K}{\sqrt n}\right).$

Hence, different samples would yield different confidence intervals. Since $K$ is random, so is $I_K$ . Furthermore, defining $T := (\bar X_K - \mu_0)/(S_K/\sqrt n) \sim t(n-1)$ , mimicking the computation above yields

$\mathbb P(\mu_0 \in I_K) = \mathbb P(- t_{\alpha/2} < T < t_{\alpha/2}) = 1-\alpha.$

Thus, we have the following interpretation of a $(1-\alpha)$ -confidence interval: the probability that a randomly chosen confidence interval will contain the (deterministic though unknown) population mean is $(1-\alpha)$ .

Problem 3 (Population Proportion). I went to a nearby café, and noticed that there were more women than men in the café. Out of 50 people present, 32 were women.

I suspect that it is true in general that there were more women than men in Starbucks on average. Test at the 5% significance level to determine whether my suspicion is justified.

(Click for Solution)

Solution. Let $\xi$ be a Bernoulli random variable that represents the gender of a person. Here $\xi = 0$ denotes that the person is a man and $\xi = 1$ denotes that the person is a woman. Denote $p := \mathbb E[\xi]$ , which yields the proportion of women in the café.

We first set up the null and alternative hypotheses:

$\mathrm H_0 : p = 0.5,\quad \mathrm H_1 : p > 0.5.$

Assume $\mathrm H_0$ holds, so that $p = 0.5$ . We next estimate $p$ using $\bar \xi_n$ :

$\displaystyle \bar \xi_n = \frac{32}{50} = 0.64.$

Since $n = 50 \geq 30$ and $np(1-p) = 12.5 \geq 10$ , by the central limit theorem,

$\displaystyle \frac{\bar \xi_n - p}{\sqrt{p(1-p)}/\sqrt n} \approx Z \sim \mathcal N(0, 1).$

Hence, our calculated test statistic, the $z$ -value, will be as follows:

$\displaystyle z := \frac{0.64 - 0.5}{ \sqrt{0.5 (1-0.5 )}/\sqrt{50} } \geq 1.97.$

Using a $z$ -table, $p := \mathbb P(Z > z) < 0.05 = \alpha \iff z > 1.645$ , which holds. Therefore, there is sufficient evidence to reject $\mathrm H_0$ and conclude that Joel’s suspicion is justified, i.e. there are more women than men on average.

Problem 4 (Goodness-of-Fit). A total of 750 students took an assessment worth $10$ marks. For each $k = 1, 2, \dots, 10$ , let $f(k)$ denote the number of students who scored $k$ marks out of 10. We have the following data:

Assuming that scores are continuous, determine at the 5% significance level if the scores can be well-approximated using a normal distribution.

(Click for Solution)

Solution. Let $X$ denote the score of a randomly chosen student with $\mu = \mathbb E[X]$ and $\sigma^2 = \mathrm{Var}(X)$ . We first set up the null and alternative hypotheses:

$\mathrm H_0 : X \sim \mathcal N(\mu, \sigma^2),\quad \mathrm H_1 : X \not \sim \mathcal N(\mu, \sigma^2).$

We first estimate $\mu$ and $\sigma^2$ using $\bar x$ and $s^2$ respectively. Denoting the scores by $x$ , the summary statistics are

$\Sigma x = 3600,\quad \Sigma x^2 = 21\, 600.$

Hence,

$\displaystyle \bar x = \frac{3650}{750} \approx 4.87,\quad s^2 = \frac 1{749} \left(21\, 600 - \frac{3600^2}{750}\right) \approx 4.12.$

Now we assume $\mathrm H_0$ holds, so that $X \sim \mathcal N(4.87, 4.12)$ . Denoting

$p_k := \mathbb P(k - 0.5 < X < k + 0.5),$

we will use the test statistic

$\displaystyle W = \sum_{k=1}^{10} \frac{ (f(k) - 750p_k)^2 }{ 750p_k } \sim \chi^2(9),$

which follows a $\chi^2$ -distribution with $10-1 = 9$ degrees of freedom. For a proof for why this distribution works, refer to this document. Using relevant $z$ -table look-up values (or a spreadsheet application), we obtain the following values for $E_k \equiv 750 p_k$ (rounded to the nearest integer for readability, but whose original value we use in the final computation):

$\begin{aligned} E_1 = 25 \quad E_2 = 55 \quad E_3 &= 96 \quad E_4 = 133 \quad E_5 = 146 \\ E_6 = 125 \quad E_7 = 85 \quad E_8 &= 45 \quad E_9 = 19 \quad E_{10} = 6 \end{aligned}$

Piecing all of the values together,

$\displaystyle w = \sum_{k=1}^{10} \frac{ (f(k) - 750 p_k)^2}{ 750 p_k } \approx 1.97.$

Using a $\chi^2$ -table, $p := \mathbb P(W > w) < 0.05 = \alpha \iff w > 18.3$ , which does not hold. Therefore, there is (woefully) insufficient evidence to reject $\mathrm H_0$ and we cannot conclude that $X$ does not follow a normal distribution.

Problem 5 (Population Variance). Using the data in Problem 4, and assuming that the scores are normally distributed, test at the 5% significance level to determine if the standard deviation of assessment scores is greater than 2.

(Click for Solution)

Solution. We first set up the null and alternative hypotheses:

$\mathrm H_0 : \sigma^2 = 4, \quad \mathrm H_1 : \sigma^2 > 4.$

We use the test statistic $W := (n-1) S^2 / \sigma^2 \sim \chi(n-1)$ :

$\displaystyle w = \frac{ (750 - 1) \cdot 4.12105 }{ 4 } \approx 772.$

Using a spreadsheet application, $\mathbb P(W > w) < 0.05 = \alpha \iff w > 686$ . Therefore, there is sufficient evidence to reject $\mathrm H_0$ and conclude that $\sigma^2 > 4$ , which implies $\sigma > 2$ .

—Joel Kindiak, 4 Dec 25, 1915H
February 13, 2026
The Exponential Family
Recall that if $\lambda > 0$ and $X_n \sim \mathrm{Geom}(\lambda/n)$ , for any $x \geq 0$ ,

$\displaystyle \lim_{n \to \infty} \mathbb P\left( \frac {X_n}n > x \right) = e^{-\lambda x}.$

Definition 1. A continuous random variable is said to follow an exponential distribution with rate parameter $\lambda > 0$ , denoted $X \sim \mathrm{Exp}(\lambda)$ , if

$\mathbb P(X > x) = e^{-\lambda x}.$

Suppose $X \sim \mathrm{Exp}(\lambda)$ .

Problem 1. Prove the following properties:
- $f_X(x) = \lambda e^{-\lambda x} \cdot \mathbb I_{[0,\infty)}(x)$ ,
- $\mathbb E[X] = 1/\lambda$ ,
- $\mathrm{Var}(X) = 1/\lambda^2$ ,
- $X$ satisfies the memoryless property.
(Click for Solution)

Solution. The c.d.f. $F_X$ of $X$ for $x > 0$ is given by

$F_X(x) = \mathbb P(X \leq x) = 1 - \mathbb P(X > x) = 1 - e^{-\lambda x}.$

Hence,

$\displaystyle f_X(x) = \frac{\mathrm d}{\mathrm dx}(F_X(x)) = \frac{\mathrm d}{\mathrm dx} (1- e^{-\lambda x}) = \lambda e^{-\lambda x}.$

For the second result, we use the tail-probability characterisation of the expectation, where the interchange of integrals is valid by Fubini’s theorem:

$\begin{aligned} \mathbb E[X] &= \int_{-\infty}^{\infty} x f_X(x)\, \mathrm dx \\ &= \int_{0}^{\infty} \int_0^x f_X(x)\, \mathrm dy\, \mathrm dx \\ &= \int_{0}^{\infty} \int_{y}^\infty f_X(x) \, \mathrm dx \, \mathrm dy \\ &= \int_{0}^{\infty} \mathbb P(X > y) \, \mathrm dy. \end{aligned}$

Hence, for $X \sim \mathrm{Exp}(\lambda)$ ,

$\begin{aligned} \mathbb E[X] &= \int_{0}^{\infty} e^{-\lambda y}\, \mathrm dy = \frac 1{\lambda} \cdot [-e^{-\lambda y}]_0^\infty = \frac 1{\lambda} \cdot (0 - (-1)) = \frac 1{\lambda}. \end{aligned}$

For the variance, we adopt a similar approach:

$\begin{aligned} \mathbb E[X^2] &= \int_{-\infty}^{\infty} 2y \cdot \mathbb P(X > y) \, \mathrm dy \\ &= \int_{0}^{\infty} 2y \cdot e^{-\lambda y} \, \mathrm dy \\ &= \frac 2{\lambda} \int_0^\infty y e^{-\lambda y}\, \mathrm dy \\ &= \frac 2{\lambda} \cdot \mathbb E[X] = \frac 2{\lambda^2}. \end{aligned}$

Therefore,

$\displaystyle \mathrm{Var}(X) = \mathbb E[X^2] - \mathbb E[X]^2 = \frac{2}{\lambda^2} - \frac 1{\lambda^2} = \frac{1}{\lambda^2}.$

For the memoryless property,

$\begin{aligned} \mathbb P(X > s + t \mid X > t) &= \frac{\mathbb P(X > s + t, X > t)}{\mathbb P(X > t)} \\ &= \frac{\mathbb P(X > s+t)}{\mathbb P(X > t)} = \frac{e^{-\lambda(s+t)}}{e^{-\lambda t}} \\ &= e^{-\lambda s} = \mathbb P(X > s).\end{aligned}$

Problem 2. Suppose $Y \sim \mathrm{Exp}(\mu)$ is independent to $X$ .
- Calculate the distribution of $\min\{X, Y\}$ .
- If $\lambda = \mu$ , evaluate the p.d.f. of $X + Y$ .
(Click for Solution)

Solution. Denoting $W := \min\{X, Y\}$ ,

$\begin{aligned} \mathbb P(W > w) &= \mathbb P(X > w, Y > w) \\ &= \mathbb P(X > w) \cdot \mathbb P(Y > w)\\&= e^{-\lambda w} \cdot e^{-\mu w} \\ &= e^{-(\lambda + \mu) w}. \end{aligned}$

Hence, $\min\{X, Y\} = W \sim \mathrm{Exp}(\lambda + \mu)$ . To evaluate the p.d.f. of $U:= X+ Y$ , we compute the convolution of their individual p.d.f.s:

$\begin{aligned} f_U(u) &= (f_X * f_Y)(u) \\ &= \int_0^u f_X(x) \cdot f_Y(u-x)\, \mathrm dx \\ &= \int_0^u \lambda e^{-\lambda x} \cdot \mu e^{-\mu(u - x)}\, \mathrm dx \\ &= \lambda \cdot \mu \cdot e^{-\mu u} \cdot \int_0^u e^{-(\lambda - \mu) x} \, \mathrm dx \\ &= \lambda^2 \cdot e^{-\lambda u} \cdot \int_0^u 1 \, \mathrm dx \\ &= \lambda^2 \cdot u \cdot e^{-u}. \end{aligned}$

Definition 2. A continuous random variable $X$ is said to follow a gamma distribution with shape parameter $\alpha > 0$ and rate parameter $\lambda > 0$ , denoted $X \sim \Gamma( \alpha, \lambda)$ if it has a p.d.f. given by

$\displaystyle f_X(x) = \frac{\lambda^\alpha}{\Gamma(\alpha)} \cdot x^{\alpha - 1} \cdot e^{-\lambda x}.$

Problem 3. Prove the following properties:
- if $X \sim \Gamma(\alpha, \lambda)$ , then $\mathbb E[X] = \alpha/\lambda$ , $\mathrm{Var}(X) = \alpha/\lambda^2$ ,
- if $X_i \sim \Gamma(\alpha_i, \lambda)$ are i.i.d., then $\sum_{i=1}^n X_i \sim \Gamma(\sum_{i=1}^n \alpha_i, \lambda)$ ,
- if $X \sim \Gamma(\alpha, \lambda)$ and $c > 0$ , then $cX \sim \Gamma(\alpha, \lambda/c)$ .
(Click for Solution)

Solution. Suppose $Y \sim \Gamma(\alpha + 1,\lambda)$ . By definition of the expectation,

$\begin{aligned} \mathbb E[X^n] &= \int_0^\infty x^n \cdot \frac{\lambda^\alpha}{\Gamma(\alpha)} \cdot x^{\alpha - 1} \cdot e^{-\lambda x}\, \mathrm dx \\ &= \frac {\alpha \cdot (\alpha+1) \cdot \cdots \cdot (\alpha +n-1)}{\lambda^n} \cdot \int_0^\infty \frac{\lambda^{\alpha+n}}{\Gamma(\alpha+1)} \cdot x^{(\alpha + n) - 1} \cdot e^{-\lambda x}\, \mathrm dx \\ &= \frac{\alpha \cdot (\alpha+1) \cdot \cdots \cdot (\alpha +n-1)}{\lambda^n} \cdot \int_0^\infty f_Y(x)\, \mathrm dx \\ &= \frac{\alpha \cdot (\alpha+1) \cdot \cdots \cdot (\alpha +n-1)}{\lambda^n} . \end{aligned}$

Hence, $\mathbb E[X] = \alpha/\lambda$ , and

$\begin{aligned} \mathrm{Var}(X) &= \mathbb E[X^2] - \mathbb E[X]^2 = \frac{\alpha \cdot (\alpha+1)}{\lambda^2} - \frac{\alpha^2}{\lambda^2} = \frac{\alpha}{\lambda^2}. \end{aligned}$

We prove the second result by induction. Suppose $X \sim \Gamma(\alpha, \lambda)$ and $Y \sim \Gamma (\beta, \lambda)$ are independent. To evaluate the p.d.f. of $U:= X+ Y$ , we compute the convolution of their individual p.d.f.s:

$\begin{aligned} f_U(u) &= (f_X * f_Y)(u) \\ &= \int_0^u f_X(x) \cdot f_Y(u-x)\, \mathrm dx \\ &= \int_0^u \frac{\lambda^\alpha}{\Gamma(\alpha)} \cdot x^{\alpha - 1} \cdot e^{-\lambda x} \cdot \frac{\lambda^\beta}{\Gamma(\beta)} \cdot (u-x)^{\beta - 1} \cdot e^{-\lambda (u-x)}\, \mathrm dx \\ &= \frac{\lambda^{\alpha + \beta}}{\Gamma(\alpha) \cdot \Gamma(\beta)} \cdot \int_0^u x^{\alpha - 1} \cdot (u-x)^{\beta - 1} \cdot e^{-\lambda u}\, \mathrm dx \\ &= \frac{\lambda^{\alpha + \beta}}{\Gamma(\alpha) \cdot \Gamma(\beta)} \cdot \int_0^1 (ut)^{\alpha - 1} \cdot (u-ut)^{\beta - 1} \cdot e^{-\lambda u}\cdot u\, \mathrm dt \\ &= \frac{\lambda^{\alpha + \beta}}{\Gamma(\alpha + \beta)} \cdot u^{(\alpha+\beta) -1}\cdot e^{-\lambda u} \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \cdot \Gamma(\beta)} \cdot \int_0^1 t^{\alpha - 1} \cdot (1-t)^{\beta - 1} \, \mathrm dt \\ &= \frac{\lambda^{\alpha + \beta}}{\Gamma(\alpha + \beta)} \cdot u^{(\alpha+\beta) -1}\cdot e^{-\lambda u}. \end{aligned}$

Therefore, $W \sim \Gamma(\alpha +\beta, \lambda)$ . Inductively, if $X_i \sim \Gamma(\alpha_i, \lambda)$ are i.i.d.,

$\displaystyle \sum_{i=1}^{k+1} X_i = \sum_{i=1}^{k} X_i + X_{k+1} \sim \Gamma \left( \sum_{i=1}^{k} \alpha_i + \alpha_{k+1}, \lambda \right) = \Gamma \left( \sum_{i=1}^{k+1} \alpha_i, \lambda \right).$

For the final property, denoting $V := cX$ ,

$\begin{aligned} f_{V}(v) = f_{cX}(v) &= \frac 1c \cdot f_X\left( \frac vc \right) \\ &= \frac 1c \cdot \frac{\lambda^\alpha}{\Gamma(\alpha)} \cdot \left( \frac vc \right)^{\alpha - 1} \cdot e^{-\lambda v/c} \\ &= \frac{(\lambda /c)^\alpha}{\Gamma(\alpha)} \cdot v^{\alpha - 1} \cdot e^{-(\lambda /c) v}. \end{aligned}$

Hence, $cX = V \sim \Gamma(\alpha, \lambda / c)$ .

Given probability distributions $\mathbb Q_1, \mathbb Q_2$ , write $\mathbb Q_1 =\mathbb Q_2$ if there exists a random variable $X$ such that $X \sim \mathbb Q_1$ and $X \sim \mathbb Q_2$ .

Problem 4. Prove the following properties:
- $\mathrm{Exp}(\lambda) = \Gamma(1, \lambda)$ ,
- $\Gamma(\nu/2, 1/2) = \chi^2(\nu)$ ,
- for i.i.d. $X_1,\dots, X_n \sim \mathrm{Exp}(\lambda)$ , $\sum_{i=1}^n X_i \sim \Gamma(n, \lambda),\bar X \sim \Gamma(n, \lambda/n)$ ,
- for any fixed $c > 0$ , if $W \sim \chi^2(\nu)$ , then $cW \sim \Gamma(\nu/2, 1/(2c))$ .
(Click for Solution)

Solution. We note that if $X \sim \Gamma( 1,\lambda)$ , since $\Gamma(1) = 0! = 1$ ,

$\displaystyle f_X(x) = \frac{\lambda}{\Gamma(1)} \cdot x^{1 - 1} \cdot e^{-\lambda x} = \lambda e^{-\lambda x},$

so that $X \sim \mathrm{Exp}(1, \lambda)$ . If $X \sim \Gamma( \nu/2, 1/2)$ , then

$\displaystyle f_X(x) = \frac{(1/2)^{\nu/2}}{\Gamma(\nu/2)} \cdot x^{\nu/2 - 1} \cdot e^{-x/2} = \frac{1}{2^{\nu/2} \cdot \Gamma(\nu/2)} \cdot x^{\nu/2 - 1} \cdot e^{-x/2}.$

The last two results are immediate corollaries of Problem 3.

These probability distributions are examples of the exponential family of probability distributions.

—Joel Kindiak, 4 Aug 25, 1356H
January 9, 2026
The Geometric Distribution

Definition 1. For $K \subseteq \mathbb R$ , a random variable $X : \Omega \to K$ satisfies the memoryless property if the following holds: for any $s, t \in K$ ,

$\displaystyle \mathbb P(X > s + t \mid X > t) = \mathbb P(X > s).$

Problem 1. If $K = \mathbb N^+$ and $X$ satisfies the memoryless property, compute an expression for $\mathbb P(X = x)$ in terms of $p := \mathbb P(X = 1)$ .

(Click for Solution)

Solution. Define the function $G$ by $G(x):= \mathbb P(X > x)$ . By the definition of conditional probability,

$\begin{aligned} G(s+t) = \mathbb P(X > s+t) &= \mathbb P(X > s +t, X>t) \\ &= \mathbb P(X > s +t \mid X>t) \cdot \mathbb P(X > t) \\ &= \mathbb P(X > s) \cdot \mathbb P(X > t) \\ &= G(s) \cdot G(t). \end{aligned}$

Therefore, $G(x) = a^x$ for some $a > 0$ . In particular,

$a = G(1) = \mathbb P(X > 1) = 1 - \mathbb P(X=1) = 1-p.$

Therefore, $G(x) = (1-p)^x$ , so that

$\begin{aligned} \mathbb P(X=x) &= \mathbb P(X > x-1) - \mathbb P(X > x) \\ &= (1-p)^{x-1} - (1-p)^x \\ &= (1-p)^{x-1} \cdot (1 - (1-p)) \\ &= (1-p)^{x-1} \cdot p. \end{aligned}$

Definition 2. A discrete random variable $X : \Omega \to \mathbb N^+$ is said to follow a geometric distribution with success probability $p$ , denoted $X \sim \mathrm{Geom}(p)$ , if

$\displaystyle \mathbb P(X = x) = (1-p)^{x-1} \cdot p,\quad x \in \mathbb N^+.$

Suppose $X \sim \mathrm{Geom}(p)$ .

Problem 2. Prove that $X$ satisfies the memoryless property.

(Click for Solution)

Solution. Using a geometric series,

$\begin{aligned} \mathbb P(X > x) &= \sum_{k=x+1}^\infty \mathbb P(X = k) \\ &= \sum_{k=x+1}^\infty (1-p)^{k-1} \cdot p \\ &= \frac{p\cdot (1-p)^x}{1 - (1-p)} = (1-p)^x. \end{aligned}$

By the definition of conditional probability,

$\begin{aligned} \mathbb P(X > s +t \mid X>t) &= \frac{\mathbb P(X > s +t, X>t)}{\mathbb P(X > t)} \\ &= \frac{\mathbb P(X > s +t)}{\mathbb P(X > t)} \\ &= \frac{(1-p)^{s+t}}{(1-p)^t} \\ &= (1-p)^s = \mathbb P(X > s). \end{aligned}$

Problem 3. Prove that $\displaystyle \mathbb E[X] = \sum_{x = 0}^\infty \mathbb P(X > x)$ . Hence, evaluate $\mathbb E[X]$ and $\mathrm{Var}(X)$ .

(Click for Solution)

Solution. By interchanging sums,

$\begin{aligned} \mathbb E[X] &= \sum_{x = 0}^\infty x \cdot \mathbb P(X = x) \\ &= \sum_{x=0}^\infty \sum_{y=0}^{x-1} \mathbb P(X = x) \\ &= \sum_{y=0}^\infty \sum_{x=y+1}^\infty \mathbb P(X = x) \\ &= \sum_{y=0}^\infty \mathbb P(X > y) = \sum_{x = 0}^\infty \mathbb P(X > x). \end{aligned}$

Hence, using the calculations in Problem 2,

$\displaystyle \mathbb E[X] = \sum_{x=0}^\infty (1-p)^x = \frac{1}{1-(1-p)} = \frac 1p.$

For the variance, we first compute $\mathbb E[X^2]$ . Observe that

$\displaystyle \sum_{y=0}^{x-1} (2y + 1) = x^2.$

Therefore, by interchanging sums,

$\begin{aligned} \mathbb E[X^2] &= \sum_{x = 0}^\infty x^2 \cdot \mathbb P(X = x) \\ &= \sum_{x=0}^\infty \sum_{y=0}^{x-1} (2y + 1) \cdot \mathbb P(X = x) \\ &= \sum_{y=0}^\infty \sum_{x=y+1}^\infty (2y + 1) \cdot \mathbb P(X = x) \\ &= \sum_{y=0}^\infty \left( (2y + 1) \cdot \sum_{x=y+1}^\infty (1-p)^{x-1} \cdot p \right) \\ &= \sum_{y=0}^\infty \left( (2y + 1) \cdot \frac{(1-p)^y}{1 - (1-p)} \cdot p \right) \\ &= \frac 1p \cdot \sum_{y=0}^\infty \left( (2y + 1) \cdot (1-p)^y \cdot p \right) \\ &= \frac 1p \cdot \left( p + (1-p) \cdot \left( 2 \cdot \sum_{y=1}^\infty y\cdot \mathbb P(X = y ) + \sum_{y=1}^\infty \mathbb P(X = y ) \right) \right) \\ &= \frac 1p \cdot \left( p + (1-p) \cdot \left( \frac 2p + 1 \right) \right) \\ &= 1 + \left( \frac 1p - 1 \right) \cdot \left( \frac 2p + 1 \right) \\ &= 1 + \frac 2{p^2} + \frac 1p - \frac 2p - 1 = \frac {2-p}{p^2}. \end{aligned}$

Therefore,

$\begin{aligned} \mathrm{Var}(X) &= \mathbb E[X^2] -\mathbb E[X]^2 = \frac {2-p}{p^2} - \frac 1{p^2} = \frac{1-p}{p^2}. \end{aligned}$

Problem 4. If $Y \sim \mathrm{Geom}(q)$ is independent of $X$ , compute the distribution of $\min\{X, Y\}$ .

(Click for Solution)

Solution. Denote $W := \min \{X, Y\}$ . Then

$\begin{aligned} \mathbb P(W > w) &= \mathbb P(\min\{X,Y\} > w) \\ &= \mathbb P(X > w, Y > w) \\ &= \mathbb P(X > w) \cdot \mathbb P(Y > w) \\ &= (1-p)^w \cdot (1-q)^w = ((1-p)(1-q))^w. \end{aligned}$

Therefore, $\min\{X, Y\} \sim \mathrm{Geom}(1-(1-p)(1-q))$ .

Problem 5. Fix $\lambda > 0$ . Suppose $X_n \sim \mathrm{Geom}(\lambda /n)$ . For any $x > 0$ , evaluate

$\displaystyle \lim_{n \to \infty} \mathbb P \left( \frac {X_n}n > x \right) .$

(Click for Solution)

Solution. Using the tail-probability,

$\begin{aligned} \lim_{n \to \infty} \mathbb P \left( \frac {X_n}n > x \right) &= \lim_{n \to \infty} \mathbb P(X_n > nx) \\ &= \lim_{n \to \infty} \left( 1 - \frac{\lambda}n \right)^{nx} \\ &= \left( \lim_{n \to \infty} \left( 1 - \frac{\lambda}n \right)^{n} \right)^x \\ &= (e^{-\lambda})^x = e^{-\lambda x}. \end{aligned}$

—Joel Kindiak, 3 Aug 25, 0004H

January 6, 2026