Previously, we have looked at three proposed solutions for the $K$-armed bandit problem: explore-then-commit (ETC), upper confidence bounds (UCB), and Thompson sampling (TS). The regret bounds for the first two algorithms work for general $1$-sub-Gaussian bandit instances (i.e. the reward distribution $\nu_i$ is $1$-sub-Gaussian for each arm $i \in \{1, \dots, K\}$), while the latter assumes a Gaussian bandit instance (i.e. $\nu_i = \mathcal{N}(\mu_i, 1)$ for each arm $i \in \{1, \dots, K\}$).
We will assume this stricter condition, since Gaussian distributions with variance $1$ are, in fact, $1$-sub-Gaussian. Given this information, and a fixed bandit instance $\nu = (\mathcal{N}(\mu_1, 1), \dots, \mathcal{N}(\mu_K, 1))$, which algorithm wins out? We first recall the regret upper bounds for ETC and UCB.
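For concreteness, here is one standard form of these bounds, following Chapters 6 and 7 of Lattimore and Szepesvari (2020) for $1$-sub-Gaussian rewards; the exact constants depend on which version of each algorithm one uses, so treat them as illustrative. Here $m$ is the number of times ETC pulls each arm during exploration, $T$ is the horizon, and $\Delta_i$ is the suboptimality gap of arm $i$:

$$ R_T^{\mathrm{ETC}} \;\le\; m \sum_{i=1}^{K} \Delta_i \;+\; (T - mK) \sum_{i=1}^{K} \Delta_i \exp\!\left(-\frac{m \Delta_i^2}{4}\right), \qquad R_T^{\mathrm{UCB}} \;\le\; 3 \sum_{i=1}^{K} \Delta_i \;+\; \sum_{i : \Delta_i > 0} \frac{16 \log T}{\Delta_i}. $$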
The ETC strategy, though simple, is a bit too simplistic. Disregarding the fact that our choice of exploration length $m$ directly impacts the regret bound, even if we assumed $m$ to be merely constant (and not growing in $T$), the right-hand side yields an upper bound that is linear in $T$, so that $R_T / \log T$ tends to $\infty$ as $T \to \infty$. The UCB algorithm, on the other hand, yields a long-run regret bound of $\limsup_{T \to \infty} R_T / \log T \le \sum_{i : \Delta_i > 0} \frac{16}{\Delta_i}$, which is a constant (in $T$) and hence ideal.
Round 1: UCB wins against ETC.
Where does Thompson sampling come into the picture? Let's compare their long-run regret bounds: for our Gaussian instances, Thompson sampling satisfies $\limsup_{T \to \infty} R_T / \log T \le \sum_{i : \Delta_i > 0} \frac{2}{\Delta_i}$. Trivially, $2 \le 16$, so that we get the following.
Round 2: Thompson sampling wins against UCB.
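To see how these rankings play out in practice, here is a minimal, self-contained Python sketch (my own illustration, not from the original post) that simulates ETC, UCB, and Gaussian Thompson sampling on a small unit-variance Gaussian instance; the instance, horizon, exploration length, and UCB bonus constant are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.5, 0.0])    # illustrative unit-variance Gaussian instance
K, T, m = len(mu), 20_000, 100    # horizon T; ETC explores each arm m times


def run(select):
    """Play T rounds; select(t, counts, sums) returns the arm to pull."""
    counts, sums, regret = np.zeros(K), np.zeros(K), 0.0
    for t in range(T):
        i = select(t, counts, sums)
        counts[i] += 1
        sums[i] += rng.normal(mu[i], 1.0)   # unit-variance Gaussian reward
        regret += mu.max() - mu[i]          # pseudo-regret increment
    return regret


def etc(t, counts, sums):
    if t < m * K:
        return t % K                        # round-robin exploration
    return int(np.argmax(sums / counts))    # then commit to the empirical best


def ucb(t, counts, sums):
    if t < K:
        return t                            # play each arm once
    return int(np.argmax(sums / counts + np.sqrt(4 * np.log(t) / counts)))


def ts(t, counts, sums):
    if t < K:
        return t
    # Gaussian TS with a flat prior: posterior for arm i is N(mean_i, 1 / counts_i)
    return int(np.argmax(rng.normal(sums / counts, 1.0 / np.sqrt(counts))))


for name, policy in [("ETC", etc), ("UCB", ucb), ("TS", ts)]:
    print(f"{name}: cumulative pseudo-regret ~ {run(policy):.1f}")
```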
Remark 1. The reality is, however, a lot more subtle. It turns out that a clever choice of $m$ that depends on the horizon $T$ can allow the (normalised) upper bound for ETC to converge to a constant as well. Furthermore, the UCB algorithm can be modified (Chapter 8 of Lattimore and Szepesvari (2020)) to recover the constant $\sum_{i : \Delta_i > 0} \frac{2}{\Delta_i}$, and is arguably more widely applicable than Thompson sampling, since the UCB algorithm works for a wider class of bandits (namely, $1$-sub-Gaussians rather than just Gaussians with variance $1$). The UCB algorithm is also more straightforward to generalise from a theoretical perspective, while TS analyses become highly technical, even in the simple case of Gaussians with variance $1$.
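For reference, here is roughly what the modification looks like: the asymptotically optimal UCB of Chapter 8 of Lattimore and Szepesvari (2020) slightly inflates the logarithmic term in the confidence radius, using an index of the following form, where $\hat{\mu}_i(t-1)$ is the empirical mean and $T_i(t-1)$ the play count of arm $i$ after round $t-1$ (the exact choice of $f$ may differ slightly from the book's):

$$ \mathrm{UCB}_i(t) \;=\; \hat{\mu}_i(t-1) + \sqrt{\frac{2 \log f(t)}{T_i(t-1)}}, \qquad f(t) = 1 + t \log^2(t), $$

which attains $\limsup_{T \to \infty} R_T / \log T \le \sum_{i : \Delta_i > 0} 2/\Delta_i$ for $1$-sub-Gaussian rewards.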
But here is the problem: how do we know that Thompson sampling, or the modified UCB in Remark 1, is really the best algorithm for a bandit instance of Gaussians with variance $1$? We need to develop the notion of asymptotic optimality. We take inspiration from Chapter 16 of Lattimore and Szepesvari (2020).
Denote the cumulative regret of a policy $\pi$ on a bandit instance $\nu$ after $T$ rounds by $R_T(\pi, \nu)$, and denote the environment of $K$-armed unit-variance Gaussian bandits by $\mathcal{E}_{\mathcal{N}}^K = \{ (\mathcal{N}(\mu_1, 1), \dots, \mathcal{N}(\mu_K, 1)) : \mu_1, \dots, \mu_K \in \mathbb{R} \}$.
Let $\rho$ denote a risk functional, and write $R_T^{\rho}(\pi, \nu)$ for the corresponding $\rho$-regret; when $\rho$ is the expectation, this is just the usual regret $R_T(\pi, \nu)$.
Definition 1. A policy $\pi$ (i.e. bandit algorithm) is $\rho$-consistent over a bandit environment $\mathcal{E}$ if for any $\nu \in \mathcal{E}$ and $p > 0$,
$$ \frac{R_T^{\rho}(\pi, \nu)}{T^p} \to 0 $$
as $T \to \infty$.
In particular, UCB and TS are $\mathbb{E}$-consistent over $\mathcal{E}_{\mathcal{N}}^K$, since for any $\nu \in \mathcal{E}_{\mathcal{N}}^K$ and $p > 0$ they have constants $C_\nu > 0$ (depending only on $\nu$) such that
$$ R_T(\pi, \nu) \le C_\nu (1 + \log T) \quad \text{for all } T, \qquad \text{and hence} \qquad \frac{R_T(\pi, \nu)}{T^p} \to 0 \ \text{ as } T \to \infty. $$
For any policy consistent over the $K$-armed bandit environment $\mathcal{E}$, it turns out that we have a very elegant regret lower bound, one that even extends to the generalised risk setting. Describing this lower bound, however, requires us to explore some baby information theory, and in particular the Kullback-Leibler divergence.
Definition 2 (KL-Divergence). Let $\lambda$ denote the Lebesgue measure on $\mathbb{R}$. For probability measures $P, Q \ll \lambda$, if $P \ll Q$, define the KL-divergence of $P$ relative to $Q$ by
$$ D(P, Q) = \int_{\mathbb{R}} p(x) \log \frac{p(x)}{q(x)} \, dx, $$
where $p = \frac{dP}{d\lambda}$ and $q = \frac{dQ}{d\lambda}$ denote the respective densities.
Example 1. Given the Gaussian distributions $P = \mathcal{N}(\mu_1, \sigma_1^2)$ and $Q = \mathcal{N}(\mu_2, \sigma_2^2)$,
$$ D(P, Q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}. $$
In the special case $\sigma_1 = \sigma_2 = \sigma$,
$$ D(P, Q) = \frac{(\mu_1 - \mu_2)^2}{2 \sigma^2}. $$
Furthermore, if $\sigma = 1$, then
$D(P, Q) = \frac{(\mu_1 - \mu_2)^2}{2}$.
Proof. See this page.
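In case the linked page is not handy, here is a quick sketch of the equal-variance case, which is all that Example 2 below needs. With $P = \mathcal{N}(\mu_1, \sigma^2)$ and $Q = \mathcal{N}(\mu_2, \sigma^2)$ having densities $p$ and $q$,
$$ \log \frac{p(x)}{q(x)} = \frac{(x - \mu_2)^2 - (x - \mu_1)^2}{2\sigma^2} = \frac{2x(\mu_1 - \mu_2) + \mu_2^2 - \mu_1^2}{2\sigma^2}, $$
and taking expectations under $P$ (so that $x$ is replaced by $\mathbb{E}_P[X] = \mu_1$) gives
$$ D(P, Q) = \frac{2\mu_1(\mu_1 - \mu_2) + \mu_2^2 - \mu_1^2}{2\sigma^2} = \frac{(\mu_1 - \mu_2)^2}{2\sigma^2}. $$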
Definition 3. Fix a class of probability distributions $\mathcal{M}$, a distribution $\nu \in \mathcal{M}$, and $\mu^* \in \mathbb{R}$. Define
$$ d_{\inf}(\nu, \mu^*, \mathcal{M}) = \inf_{\nu' \in \mathcal{M}} \left\{ D(\nu, \nu') : \mu(\nu') > \mu^* \right\}, $$
where $\mu(\nu')$ denotes the mean of $\nu'$.
Now let $\mathcal{E} = \mathcal{M}_1 \times \dots \times \mathcal{M}_K$ denote a bandit environment.
Example 2. For $\nu = \mathcal{N}(\mu, 1)$ and $\mu^* > \mu$,
$$ d_{\inf}(\nu, \mu^*, \mathcal{M}_{\mathcal{N}}) = \frac{(\mu^* - \mu)^2}{2}, $$
where $\mathcal{M}_{\mathcal{N}} = \{ \mathcal{N}(m, 1) : m \in \mathbb{R} \}$ denotes the class of unit-variance Gaussian distributions.
Proof. By definition,
$$ d_{\inf}(\nu, \mu^*, \mathcal{M}_{\mathcal{N}}) = \inf_{m \in \mathbb{R}} \left\{ D\big(\mathcal{N}(\mu, 1), \mathcal{N}(m, 1)\big) : m > \mu^* \right\}. $$
By Example 1, given $m$ such that $m > \mu^*$, we have $D\big(\mathcal{N}(\mu, 1), \mathcal{N}(m, 1)\big) = \frac{(m - \mu)^2}{2}$, so that
$$ d_{\inf}(\nu, \mu^*, \mathcal{M}_{\mathcal{N}}) = \inf_{m > \mu^*} \frac{(m - \mu)^2}{2}. $$
Taking $m \downarrow \mu^*$,
$$ d_{\inf}(\nu, \mu^*, \mathcal{M}_{\mathcal{N}}) = \frac{(\mu^* - \mu)^2}{2}. $$
Why do we care about the optimised KL-divergence? Because it plays a fundamental role in (asymptotically) lower-bounding the cumulative regret!
Theorem 1. Let $\pi$ be a consistent policy over the $K$-armed bandit environment $\mathcal{E} = \mathcal{M}_1 \times \dots \times \mathcal{M}_K$. Then for any $\nu = (\nu_1, \dots, \nu_K) \in \mathcal{E}$ with unique optimal arm $i^*$,
$$ \liminf_{T \to \infty} \frac{R_T(\pi, \nu)}{\log T} \ge \sum_{i : \Delta_i > 0} \frac{\Delta_i}{d_{\inf}(\nu_i, \mu^*, \mathcal{M}_i)}, $$
where $\mu^* = \mu(\nu_{i^*})$ and $\Delta_i = \mu^* - \mu(\nu_i)$.
Proof. We follow the proof of Theorem 16.2 in Lattimore and Szepesvari (2020). Denote the bandit instance by $\nu = (\nu_1, \dots, \nu_K)$. Fix a suboptimal arm $i$ (so that $\Delta_i > 0$) and $\varepsilon > 0$ arbitrary. Denote $d_i = d_{\inf}(\nu_i, \mu^*, \mathcal{M}_i)$ for brevity. By Definition 3, there exists $\nu_i' \in \mathcal{M}_i$ such that $\mu(\nu_i') > \mu^*$ and
$$ D(\nu_i, \nu_i') \le d_i + \varepsilon. $$
Define the alternate bandit instance $\nu' = (\nu_1, \dots, \nu_{i-1}, \nu_i', \nu_{i+1}, \dots, \nu_K)$, in which arm $i$ is now the unique optimal arm. We claim that
$$ D(\mathbb{P}_{\nu}, \mathbb{P}_{\nu'}) \le \mathbb{E}_{\nu}[T_i(T)] \, (d_i + \varepsilon), $$
where $\mathbb{P}_{\nu}$ and $\mathbb{P}_{\nu'}$ denote the laws of the $T$-round interaction between $\pi$ and the respective instances, and $T_i(T)$ is the number of plays of arm $i$. Indeed, by the divergence decomposition lemma in Lattimore and Szepesvari (2020) (Lemma 15.1),
$$ D(\mathbb{P}_{\nu}, \mathbb{P}_{\nu'}) = \mathbb{E}_{\nu}[T_i(T)] \, D(\nu_i, \nu_i'), $$
which implies the claim since $D(\nu_i, \nu_i') \le d_i + \varepsilon$.
Denote $R_T = R_T(\pi, \nu)$ and $R_T' = R_T(\pi, \nu')$. Then, similarly to the writeup, the Bretagnolle-Huber inequality applied to the event $\{T_i(T) > T/2\}$ gives
$$ R_T + R_T' \ge \frac{T}{4} \, c \, \exp\big(-D(\mathbb{P}_{\nu}, \mathbb{P}_{\nu'})\big), \qquad c = \min\{\Delta_i, \mu(\nu_i') - \mu^*\}. $$
Performing algebra,
$$ \frac{\mathbb{E}_{\nu}[T_i(T)]}{\log T} \ge \frac{D(\mathbb{P}_{\nu}, \mathbb{P}_{\nu'})}{(d_i + \varepsilon) \log T} \ge \frac{1}{d_i + \varepsilon} \left( 1 + \frac{\log(c/4) - \log(R_T + R_T')}{\log T} \right). $$
Since $\pi$ is consistent, $R_T = o(T^p)$ and $R_T' = o(T^p)$ for every $p > 0$ as $T \to \infty$, which yields $\frac{\log(R_T + R_T')}{\log T} \to 0$, so that
$$ \liminf_{T \to \infty} \frac{\mathbb{E}_{\nu}[T_i(T)]}{\log T} \ge \frac{1}{d_i + \varepsilon}, \qquad \text{and hence} \qquad \liminf_{T \to \infty} \frac{R_T(\pi, \nu)}{\log T} \ge \sum_{j : \Delta_j > 0} \frac{\Delta_j}{d_j + \varepsilon}. $$
Taking $\varepsilon \to 0$ yields the desired result.
Example 3. For any policy $\pi$ that is $\mathbb{E}$-consistent over $\mathcal{E}_{\mathcal{N}}^K$, given that arm $i^*$ is uniquely optimal, Theorem 1 and Example 2 yield
$$ \liminf_{T \to \infty} \frac{R_T(\pi, \nu)}{\log T} \ge \sum_{i : \Delta_i > 0} \frac{\Delta_i}{\Delta_i^2 / 2} = \sum_{i : \Delta_i > 0} \frac{2}{\Delta_i}. $$
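As a sanity check on this constant, here is a tiny Python helper (my own illustration, with a made-up instance) that evaluates the asymptotic lower-bound constant $\sum_{i : \Delta_i > 0} 2/\Delta_i$ for a unit-variance Gaussian instance:

```python
import numpy as np

def gaussian_lower_bound_constant(mu):
    """Asymptotic constant sum_{i: Delta_i > 0} 2 / Delta_i for unit-variance Gaussian arms."""
    mu = np.asarray(mu, dtype=float)
    gaps = mu.max() - mu                      # suboptimality gaps Delta_i
    return float(np.sum(2.0 / gaps[gaps > 0]))

# For the illustrative instance mu = (1.0, 0.5, 0.0):
# any consistent policy eventually incurs at least about 6 * log(T) regret.
print(gaussian_lower_bound_constant([1.0, 0.5, 0.0]))  # 2/0.5 + 2/1.0 = 6.0
```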
Definition 4. Using the notation in Theorem 1, we say that $\pi$ is asymptotically optimal over $\mathcal{E}$ if for any $\nu \in \mathcal{E}$,
$$ \lim_{T \to \infty} \frac{R_T(\pi, \nu)}{\log T} = \sum_{i : \Delta_i > 0} \frac{\Delta_i}{d_{\inf}(\nu_i, \mu^*, \mathcal{M}_i)}. $$
Corollary 1. Thompson sampling is asymptotically optimal over $\mathcal{E}_{\mathcal{N}}^K$.
Proof. By previous discussions, for any $\nu \in \mathcal{E}_{\mathcal{N}}^K$,
$$ \limsup_{T \to \infty} \frac{R_T(\mathrm{TS}, \nu)}{\log T} \le \sum_{i : \Delta_i > 0} \frac{2}{\Delta_i}, $$
so that Thompson sampling is $\mathbb{E}$-consistent over $\mathcal{E}_{\mathcal{N}}^K$. By Example 3,
$$ \liminf_{T \to \infty} \frac{R_T(\mathrm{TS}, \nu)}{\log T} \ge \sum_{i : \Delta_i > 0} \frac{2}{\Delta_i}. $$
By Definition 4, Thompson sampling is asymptotically optimal over $\mathcal{E}_{\mathcal{N}}^K$:
$$ \lim_{T \to \infty} \frac{R_T(\mathrm{TS}, \nu)}{\log T} = \sum_{i : \Delta_i > 0} \frac{2}{\Delta_i} = \sum_{i : \Delta_i > 0} \frac{\Delta_i}{d_{\inf}(\nu_i, \mu^*, \mathcal{M}_{\mathcal{N}})}. $$
In that sense, Thompson sampling is the ultimate winner! And of course, the modified UCB from Remark 1, which achieves the same regret bound, is also the ultimate winner over $\mathcal{E}_{\mathcal{N}}^K$. Yes, Definition 4 admits multiple ultimate winners.
And the discussion is far from over. My joint work with my research supervisor shows that $\rho$-TS is the ultimate winner over a suitable bandit environment for many types of risk functionals $\rho$, including but not limited to the following famous ones (a small numerical sketch follows the list):
- Expectation: $\rho(X) = \mathbb{E}[X]$,
- Variance: $\rho(X) = \mathbb{V}[X] = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big]$,
- Conditional value-at-risk: $\rho(X) = \mathrm{CVaR}_\alpha(X) = \frac{1}{\alpha} \int_0^\alpha F_X^{-1}(u) \, du$ for a level $\alpha \in (0, 1)$, where $F_X^{-1}$ is the quantile function,
- Mean-variance: $\rho(X) = \mathbb{E}[X] - \gamma \mathbb{V}[X]$ for a risk-aversion parameter $\gamma > 0$,
- Entropic risk: $\rho(X) = -\frac{1}{\gamma} \log \mathbb{E}\big[e^{-\gamma X}\big]$ for $\gamma > 0$.
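To make these definitions concrete, here is a short Python sketch (my own illustration, using one common sign convention for rewards, which may differ from the convention in our work) that estimates each functional from i.i.d. samples:

```python
import numpy as np

def expectation(x):
    return float(np.mean(x))

def variance(x):
    return float(np.var(x))

def cvar(x, alpha=0.05):
    """Lower-tail CVaR at level alpha: average of the worst alpha-fraction of rewards."""
    q = np.quantile(x, alpha)
    return float(np.mean(x[x <= q]))

def mean_variance(x, gamma=1.0):
    return float(np.mean(x) - gamma * np.var(x))

def entropic_risk(x, gamma=1.0):
    """Risk-averse entropic risk: -(1/gamma) * log E[exp(-gamma * X)]."""
    return float(-np.log(np.mean(np.exp(-gamma * x))) / gamma)

rng = np.random.default_rng(0)
samples = rng.normal(0.5, 1.0, size=100_000)   # illustrative reward distribution N(0.5, 1)
for rho in (expectation, variance, cvar, mean_variance, entropic_risk):
    print(f"{rho.__name__}: {rho(samples):.3f}")
```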
Denoting the collection of probability distributions with bounded support by $\mathcal{B}$, my ambition was to show that $\rho$-TS is the ultimate winner over the corresponding bandit environment $\mathcal{B}^K$. Sadly, it was too ambitious, and it is left as a highly non-trivial (possibly impossible?) exercise for the aspiring undergraduate. Perhaps it might work for various other possible classes $\mathcal{M}$:
- the collection of distributions in $\mathcal{B}$ with continuous p.d.f.s,
- the collection of unit-variance Gaussian distributions,
- the collection of $1$-sub-Gaussian distributions?
I honestly don't know, and I don't have enough funding, in time or money, to find out. But I have to say that the nostalgia of revisiting this topic four years later brought me much-needed closure from those research years.
For now, we are done.
Happy Valentine’s Day!
—Joel Kindiak, 14 Feb 26, 1602H