The Ultimate Winner

Previously, we looked at three proposed solutions to the K-armed bandit problem: explore-then-commit (ETC), upper confidence bounds (UCB), and Thompson sampling (TS). The regret bounds for the first two algorithms hold for general 1-sub-Gaussian bandit instances (i.e. \nu_i is 1-sub-Gaussian for each i), while the latter assumes a Gaussian bandit instance (i.e. \nu_i = \mathcal N(\mu_i, 1) for each i).

We will assume the stricter Gaussian condition, under which all three bounds apply, since Gaussian distributions with variance 1 are, in fact, 1-sub-Gaussian. Given this information, and the bandit instance \nu, which algorithm wins out? We first recall the following regret upper bounds for ETC and UCB:

\begin{aligned} \frac{\mathcal R_n(\text{ETC}, \nu) }{\log(n)} &\leq \sum_{i=1}^K \frac 1{\log(n)} \left( m  + (n-mK) e^{ -m\Delta_i^2/4 } \right)\Delta_i, \\ \frac{\mathcal R_n(\text{UCB}, \nu) }{\log(n)} &\leq \sum_{i : \Delta_i > 0}  \left( \frac 3{\log(n)} + \frac {16}{\Delta_i^2} \right)\Delta_i. \end{aligned}

The ETC strategy, though simple, is a bit too simplistic. Even disregarding the fact that the exploration length m directly impacts the regret bound, if we take m to be merely constant (not growing with n), the right-hand side grows like n/{\log(n)}, which tends to \infty as n \to \infty. The UCB algorithm, on the other hand, yields a long-run bound on \mathcal R_n/\log(n) of \sum_{i : \Delta_i > 0} 16/\Delta_i, which is constant in n and hence ideal.
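To make the contrast concrete, here is a quick numeric sketch of the two right-hand sides above. The two-armed instance (gap \Delta = 0.5) and the constant exploration length m = 50 are illustrative choices of mine, not values from the bounds themselves:

```python
# Evaluate the ETC and UCB upper bounds on R_n / log(n) from the display
# above, on an illustrative 2-armed instance (assumed values, not from the post).
import numpy as np

deltas = np.array([0.5])  # gaps of the suboptimal arms (the optimal arm contributes 0)
m, K = 50, 2              # ETC exploration length per arm; number of arms

for n in [10**3, 10**4, 10**5, 10**6]:
    etc = np.sum((m + (n - m * K) * np.exp(-m * deltas**2 / 4)) * deltas) / np.log(n)
    ucb = np.sum((3 / np.log(n) + 16 / deltas**2) * deltas)
    print(f"n = {n:>7}:  ETC bound ~ {etc:10.2f},  UCB bound ~ {ucb:6.2f}")
```

The ETC column grows like n/\log(n), while the UCB column settles near 16/\Delta = 32.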

Round 1: UCB wins against ETC.

Where does Thompson sampling come into the picture? Let’s compare their regret bounds:

\begin{aligned}  \frac{\mathcal R_n(\text{UCB}, \nu) }{\log(n)} &\leq \sum_{i : \Delta_i > 0}  \left( \frac 3{\log(n)} + \frac {16}{\Delta_i^2} \right)\Delta_i, \\  \frac{\mathcal R_n(\text{TS}, \nu) }{\log(n)} &\leq \sum_{i : \Delta_i > 0} \frac {2}{\Delta_i^2} \cdot \Delta_i. \end{aligned}

Trivially, 2/\Delta_i^2 \leq 16/\Delta_i^2 \leq 3/{\log(n)} + 16/\Delta_i^2, so that we get the following.

Round 2: Thompson sampling wins against UCB.

Remark 1. The reality is, however, a lot more subtle. It turns out that a clever choice of m that depends on n allows the upper bound for ETC to converge. Furthermore, the UCB algorithm can be modified (Chapter 8 of Lattimore and Szepesvari (2020)) to recover the constant 2/\Delta_i^2, and it is arguably more broadly applicable than Thompson sampling, since it works for a wider class of bandits (namely, 1-sub-Gaussian distributions rather than just Gaussians with variance 1). The UCB algorithm is also more straightforward to generalise from a theoretical perspective, while TS analyses become highly technical, even in the simple case of Gaussians with variance 1.
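For readers who want to experiment, below is a minimal sketch of Thompson sampling for a Gaussian bandit with unit variance. I assume an improper flat prior on each mean, so that after n_i pulls with empirical mean \hat\mu_i the posterior is \mathcal N(\hat\mu_i, 1/n_i); the arm means in the demo are my own illustrative values.

```python
# A minimal sketch of Thompson sampling for a Gaussian bandit with unit
# variance, assuming an improper flat prior on each arm mean.
import numpy as np

def thompson_sampling(mu, n, rng):
    K = len(mu)
    counts = np.zeros(K)           # n_i: number of pulls of arm i
    sums = np.zeros(K)             # running sum of rewards of arm i
    regret = 0.0
    for t in range(n):
        if t < K:                  # pull each arm once to initialise
            i = t
        else:                      # sample mu_i ~ N(mu_hat_i, 1/n_i), play the argmax
            theta = rng.normal(sums / counts, 1.0 / np.sqrt(counts))
            i = int(np.argmax(theta))
        reward = rng.normal(mu[i], 1.0)
        counts[i] += 1
        sums[i] += reward
        regret += max(mu) - mu[i]  # (pseudo-)regret increment
    return regret

rng = np.random.default_rng(0)
print(thompson_sampling(np.array([1.0, 0.5, 0.0]), 10_000, rng))
```

Swapping the posterior sample for an upper confidence index recovers UCB with the same bookkeeping.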

But here’s the problem: how do we know that Thompson sampling, or the modified UCB in Remark 1, is really the best algorithm for a bandit instance of Gaussians with variance 1? We need to develop the notion of asymptotic optimality. We take inspiration from Chapter 16 of Lattimore and Szepesvari (2020).

Denote \mathcal N(1) := \{ \mathcal N(\mu, 1) : \mu \in \mathbb R\} and denote the environment of K-armed 1-Gaussian bandits by \mathcal E_{\mathcal N}^K(1) := \mathcal N(1)^K.

Let \rho denote a risk functional, i.e. a map sending a distribution to a real number, such as the expectation \rho_{\mathbb E}(\nu) = \mathbb E_{X \sim \nu}[X].

Definition 1. A policy \pi (i.e. bandit algorithm) is \rho-consistent over a bandit environment \mathcal E if for any \nu \in \mathcal E and \alpha > 0, \mathcal R_n^{\rho}(\pi, \nu)/n^\alpha \to 0 as n \to \infty.

In particular, UCB and TS are \rho_{\mathbb E}-consistent over \mathcal E_{\mathcal N}^K(1), since for any \nu \in \mathcal E_{\mathcal N}^K(1) and \alpha > 0, there is a constant C (depending only on \nu) such that

\displaystyle \frac{ \mathcal R_n^{\rho_{\mathbb E}}(\pi, \nu)}{ n^\alpha  } = \frac{ \mathcal R_n^{\rho_{\mathbb E}}(\pi, \nu) }{ \log(n) } \cdot \frac{\log(n)}{n^\alpha} \leq C \cdot \frac{\log(n)}{n^\alpha} \to 0 \quad \text{as } n \to \infty.

For any \rho-consistent policy over the K-armed bandit environment

\mathcal E = \mathcal M_1 \times \cdots \times \mathcal M_K,

it turns out that we have a very elegant regret lower bound, one that even extends to the generalised risk setting. Describing this lower bound, however, requires us to explore some baby information theory, and in particular, the Kullback-Leibler divergence.

Definition 2 (KL-Divergence). Let \lambda denote the Lebesgue measure on (\mathbb R, \mathfrak{B}(\mathbb R)). For probability measures \nu \ll \lambda and \kappa \ll \lambda with \nu \ll \kappa, define the KL-divergence of \nu relative to \kappa by

\displaystyle \mathrm{KL}(\nu, \kappa) := \int_{\mathbb R} \frac{\mathrm d\nu}{\mathrm d\lambda}  \cdot \log \left( \frac{\mathrm d\nu}{\mathrm d\kappa} \right)\, \mathrm d\lambda = \mathbb E_{\nu}\left[\log\left( \frac{\mathrm d\nu}{\mathrm d\kappa} \right)\right],

where \displaystyle \mathbb E_{\nu}[g(X)] \equiv \int_{\mathbb R} g(x)\, \mathrm d \nu(x).

Example 1. Given the Gaussian distributions \nu_i = \mathcal N(\mu_i, \sigma_i^2),

\displaystyle \mathrm{ KL }(\nu_1, \nu_2) = \frac 12 \left( \frac{(\mu_1 - \mu_2)^2}{\sigma_2^2} + \frac{\sigma_1^2}{\sigma_2^2} - \log \frac{\sigma_1^2}{\sigma_2^2}  - 1\right).

In the special case \sigma_1 = \sigma_2 = \sigma,

\displaystyle \mathrm{ KL }(\nu_1, \nu_2) =  \frac{(\mu_1 - \mu_2)^2}{2\sigma^2} .

Furthermore, if \sigma = 1, then \mathrm{KL}(\nu_1, \nu_2) = (\mu_1 - \mu_2)^2/2.

Proof. See this page.
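Although we defer the proof, the closed form in Example 1 is easy to sanity-check numerically. The sketch below (with illustrative parameters of my choosing) integrates the KL integrand from Definition 2 and compares it with the formula:

```python
# Numerically verify Example 1: integrate the KL integrand for two
# illustrative Gaussians and compare with the closed form.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu1, s1, mu2, s2 = 0.3, 1.2, -0.5, 0.8   # assumed illustrative parameters

def integrand(x):
    logp = norm.logpdf(x, mu1, s1)        # log of dnu/dlambda
    logq = norm.logpdf(x, mu2, s2)        # log of dkappa/dlambda
    return np.exp(logp) * (logp - logq)

numeric, _ = quad(integrand, -15, 15)
closed = 0.5 * ((mu1 - mu2)**2 / s2**2 + s1**2 / s2**2 - np.log(s1**2 / s2**2) - 1)
print(numeric, closed)                    # the two values should agree closely
```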

Definition 3. Fix r \in \mathbb R. Define

\displaystyle \mathcal K_{\inf}^{\rho}(\nu, r, \mathcal M) = \inf_{ \kappa \in \mathcal M } \{ \mathrm{KL}(\nu, \kappa) : \rho(\kappa) > r\}.

Now let \mathcal E = \mathcal M_1 \times \cdots \times \mathcal M_K denote a bandit environment.

Example 2. For \nu_i = \mathcal N(\mu_i, 1) and \mu_1 > \mu_i,

\displaystyle \mathcal K_{\inf}^{\rho_{\mathbb E}}(\nu_i, \mu_1,  \mathcal N(1)) = \frac{ \Delta_i^2 }{ 2 },

where \Delta_i := \mu_1 - \mu_i.

Proof. By definition,

\displaystyle \mathcal K_{\inf}^{\rho_{\mathbb E}}(\nu_i, \mu_1,  \mathcal N(1)) = \inf_{ \kappa \in \mathcal N(1) } \{ \mathrm{KL}(\nu_i, \kappa) : \rho_{\mathbb E}(\kappa) > \mu_1\}.

By Example 1, given \kappa = \mathcal N(\mu_2, 1) \in \mathcal N(1) such that \rho_{\mathbb E}(\kappa) > \mu_1, we have \epsilon := \mu_2 - \mu_1 > 0, so that

\begin{aligned} \mathrm{ KL }(\nu_i, \kappa) &= \frac{(\mu_i - (\mu_1 + \epsilon))^2}{2} = \frac{ (\Delta_i+ \epsilon)^2 }{2}. \end{aligned}

Since (\Delta_i + \epsilon)^2/2 decreases to \Delta_i^2/2 as \epsilon \to 0^+, taking the infimum over all such \kappa yields

\displaystyle \mathcal K_{\inf}^{\rho_{\mathbb E}}(\nu_i, \mu_1,  \mathcal N(1)) = \frac{ \Delta_i^2 }{ 2 }.
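As a quick sanity check of Example 2 (not a proof), we can minimise \mathrm{KL}(\nu_i, \mathcal N(m, 1)) = (\mu_i - m)^2/2 over a fine grid of means m > \mu_1 and compare against \Delta_i^2/2. The means below are illustrative:

```python
# Approximate K_inf for the unit-variance Gaussian family by a grid search
# over the feasible means m > mu_1, and compare with Delta_i^2 / 2.
import numpy as np

mu_1, mu_i = 1.0, 0.2                         # assumed optimal / suboptimal means
delta = mu_1 - mu_i

ms = mu_1 + np.linspace(1e-6, 2.0, 200_001)   # means m with rho_E(kappa) > mu_1
kls = (mu_i - ms)**2 / 2                      # KL(N(mu_i,1), N(m,1)) from Example 1
print(kls.min(), delta**2 / 2)                # infimum approached as m -> mu_1^+
```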

Why do we care about the optimised KL-divergence? Because it plays a fundamental role in (asymptotically) lower-bounding the cumulative regret!

Theorem 1. Let \pi be a \rho-consistent policy over the K-armed bandit environment

\mathcal E = \mathcal M_1 \times \cdots \times \mathcal M_K.

Then for any \nu \in \mathcal E with unique \rho-optimal arm 1,

\displaystyle \liminf_{n \to \infty} \frac{\mathcal R_n^{\rho}(\pi, \nu)}{\log(n)} \geq \sum_{i : \Delta_i^{\rho} > 0} \frac{1}{ \mathcal K_{\inf}^{\rho} ( \nu_i, \rho(\nu_1) , \mathcal M_i ) } \cdot \Delta_i^{\rho}.

Proof. We follow the proof of Theorem 16.2 in Lattimore and Szepesvari (2020). Denote the bandit instance by \nu = (\nu_1,\dots, \nu_K) \in \mathcal E, fix an arm i with \Delta_i^{\rho} > 0, and fix \epsilon > 0 arbitrary. Denote K_i := \mathcal K_{\inf}^{\rho} ( \nu_i, \rho(\nu_1) , \mathcal M_i ) for brevity. By Definition 3, there exists \nu_i' \in \mathcal M_i such that \rho(\nu_i') > \rho(\nu_1) and

\mathrm{KL}(\nu_i, \nu_i') < K_i + \epsilon.

Define the alternate bandit instance \nu' \in \mathcal E by

\nu' = (\nu_1,\dots, \nu_{i-1}, \nu_i', \nu_{i+1}, \dots, \nu_K),

with \nu_i' as chosen above. We claim that \liminf_{n \to \infty} \mathbb E[T_i(n)]/\log(n) \geq 1/(K_i + \epsilon). By the divergence decomposition lemma in Lattimore and Szepesvari (2020) (Lemma 15.1),

\mathrm{KL}(\mathbb P_{\nu}, \mathbb P_{\nu'}) = \mathbb E[T_i(n)] \cdot \mathrm{KL}(\nu_i, \nu_i'),

which implies that

\displaystyle \mathbb E[T_i(n)] > \frac{ \mathrm{KL}(\mathbb P_{\nu}, \mathbb P_{\nu'}) }{ K_i + \epsilon }.

Denote R_n = \mathcal R_n^{\rho}(\pi, \nu) and R_n' = \mathcal R_n^{\rho}(\pi, \nu'). Then, by the Bretagnolle-Huber inequality combined with the bound on \mathrm{KL}(\mathbb P_{\nu}, \mathbb P_{\nu'}) above (similar to the earlier writeup),

\displaystyle R_n + R_n' \geq \frac n4 \cdot \min \{ \Delta_i^\rho, \rho(\nu_i') - \rho(\nu_1)\} \cdot \exp(- \mathbb E[T_i(n)] \cdot (K_i +\epsilon)).

Rearranging (taking logarithms and dividing by \log(n)),

\displaystyle \frac{ \mathbb E[T_i(n)] }{ \log(n) } \geq \frac 1{K_i +\epsilon} \cdot \left(1 - \frac{ \log\left( \frac{ 4 \cdot (R_n + R_n') }{ \min \{ \Delta_i^\rho, \rho(\nu_i') - \rho(\nu_1)\}  } \right) }{ \log(n) } \right).

Since \pi is \rho-consistent, R_n + R_n' = o(n^\alpha) for every \alpha > 0, so the logarithm in the numerator is eventually bounded by \alpha \log(n) (up to an additive constant) for every \alpha > 0. Hence the fraction tends to 0, which yields 1 - ( \cdots ) \to 1, so that

\displaystyle \liminf_{n \to \infty} \frac{ \mathcal R_n^{\rho}(\pi, \nu) }{ \log(n) } = \liminf_{n \to \infty} \sum_{i : \Delta_i^{\rho} > 0} \frac {\mathbb E[T_i(n)]}{\log(n)} \cdot \Delta_i^{\rho} \geq \sum_{i : \Delta_i^{\rho} > 0} \frac 1{K_i +\epsilon} \cdot \Delta_i^{\rho}.

Taking \epsilon \to 0^+ yields the desired result.

Example 3. For any policy \pi that is \rho_{\mathbb E}-consistent over \mathcal E_{\mathcal N}^K(1), given that \nu_1 = \mathcal N(\mu_1, 1) \in \mathcal N(1) is uniquely optimal, Theorem 1 together with Example 2 yields

\displaystyle \liminf_{n \to \infty} \frac{\mathcal R_n^{\rho_{\mathbb E}}(\pi, \nu)}{\log(n)} \geq \sum_{i : \Delta_i > 0} \frac{1}{ \mathcal K_{\inf}^{\rho_{\mathbb E}} ( \nu_i, \mu_1 , \mathcal N(1) ) } \cdot \Delta_i = \sum_{i : \Delta_i > 0} \frac{2}{ \Delta_i }.

Definition 4. Using the notation in Theorem 1, we say that \pi is asymptotically optimal over \mathcal E if for any \nu \in \mathcal E,

\displaystyle \limsup_{n \to \infty} \frac{\mathcal R_n^{\rho}(\pi, \nu)}{\log(n)} \leq \sum_{i : \Delta_i^{\rho} > 0} \frac{1}{ \mathcal K_{\inf}^{\rho} ( \nu_i, \rho(\nu_1) , \mathcal M_i ) } \cdot \Delta_i^{\rho}.

Corollary 1. Thompson sampling is asymptotically optimal over \mathcal E_{\mathcal N}^K(1).

Proof. By previous discussions,

\displaystyle \limsup_{n \to \infty} \frac{ \mathcal R_n(\mathrm{TS}, \nu) }{ \log(n) } \leq \sum_{i : \Delta_i > 0} \frac 2{\Delta_i},

so that Thompson sampling is \rho_{\mathbb E}-consistent over \mathcal E_{\mathcal N}^K(1). By Example 3,

\displaystyle \liminf_{n \to \infty} \frac{ \mathcal R_n(\mathrm{TS}, \nu) }{ \log(n) } \geq \sum_{i : \Delta_i > 0} \frac 2{\Delta_i}.

By Definition 4, Thompson sampling is asymptotically optimal over \mathcal E_{\mathcal N}^K(1):

\displaystyle \lim_{n \to \infty} \frac{ \mathcal R_n(\mathrm{TS}, \nu) }{ \log(n) } = \sum_{i : \Delta_i > 0} \frac 2{\Delta_i}.

In that sense, Thompson sampling is the ultimate winner! And of course, the modified UCB that achieves the same regret bound is also an ultimate winner over \mathcal E_{\mathcal N}^K(1). Yes, Definition 4 admits multiple ultimate winners.
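As a rough empirical illustration (not a proof), we can run the thompson_sampling sketch from earlier on an illustrative instance and compare the averaged \mathcal R_n/\log(n) against the constant \sum_{i : \Delta_i > 0} 2/\Delta_i. Convergence is slow, so finite-n values need not sit exactly on the constant:

```python
# Compare empirical R_n / log(n) for the Thompson sampling sketch above
# against the asymptotic constant from Corollary 1 (illustrative means).
import numpy as np

mu = np.array([1.0, 0.5, 0.0])
deltas = mu.max() - mu
asymptote = np.sum(2 / deltas[deltas > 0])    # = 2/0.5 + 2/1.0 = 6

rng = np.random.default_rng(1)
for n in [1_000, 10_000, 100_000]:
    regrets = [thompson_sampling(mu, n, rng) for _ in range(20)]
    print(f"n = {n:>7}: R_n/log(n) ~ {np.mean(regrets) / np.log(n):6.2f}"
          f"  (asymptotic constant: {asymptote:.2f})")
```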

And the discussion is far from over. Denoting \mathrm{Ber} \equiv \{\mathrm{Ber}(p) : p \in [0, 1]\}, my joint work with my research supervisor shows that \rho-TS is the ultimate winner over the bandit environment \mathrm{Ber}^K for many types of risk functionals, including but not limited to the following well-known ones (a small numerical sketch follows the list):

  • Expectation: \rho_{\mathbb{E}}(\nu) = \mathbb{E}[X], X \sim \nu,
  • Variance: \rho_{\mathbb{V}}(\nu) = \mathbb{V}(X), X \sim \nu,
  • Conditional value-at-risk: \rho_{\mathrm{CVaR}_\alpha}(\nu) = \mathrm{CVaR}_\alpha(X), X \sim \nu,
  • Mean-variance: \rho_{\mathrm{MV}_\alpha} := \alpha \rho_{\mathbb E} - \rho_{\mathbb V},
  • Entropic risk: \rho_{\mathrm{ER}_\theta}(\nu) := -\frac 1{\theta} \log(\mathbb E[\exp(-\theta X)]), X \sim \nu.
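For concreteness, here is a small sketch evaluating these risk functionals on a single Bernoulli distribution \mathrm{Ber}(p). Conventions for CVaR differ across papers; I assume the lower-tail Rockafellar-Uryasev form for rewards, \mathrm{CVaR}_\alpha(X) = \sup_{c} \{ c - \mathbb E[(c - X)_+]/\alpha \}, which is one common choice, and all parameter values below are illustrative:

```python
# Evaluate the listed risk functionals on Ber(p). The CVaR convention and
# all parameters (alpha, the mean-variance weight a, theta) are assumptions.
import numpy as np

def risks(p, alpha=0.1, a=1.0, theta=1.0):
    mean = p                                   # rho_E
    var = p * (1 - p)                          # rho_V
    # CVaR via Rockafellar-Uryasev: sup_c { c - E[(c - X)_+] / alpha },
    # searched over a grid of c in [0, 1] (the support of Ber(p)).
    cs = np.linspace(0, 1, 10_001)
    penal = (1 - p) * np.maximum(cs, 0) + p * np.maximum(cs - 1, 0)
    cvar = np.max(cs - penal / alpha)
    mv = a * mean - var                        # the post's mean-variance convention
    er = -np.log((1 - p) + p * np.exp(-theta)) / theta  # entropic risk, closed form
    return mean, var, cvar, mv, er

print(risks(0.7))   # expectation, variance, CVaR_0.1, MV_1, ER_1 for Ber(0.7)
```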

Denoting the collection of probability distributions supported in [0,1] by \mathcal M_{[0,1]}, my ambition was to show that \rho-TS is the ultimate winner over the bandit environment \mathcal M_{[0,1]}^K. Sadly, that was too ambitious, and it is left as a highly non-trivial (possibly impossible?) exercise for the aspiring undergraduate. Perhaps it might work for the various possible \mathcal M^K below?

  • \mathcal M := \mathcal C_{[0,1]}, i.e. the collection of distributions in \mathcal M_{[0,1]} with continuous p.d.f.s,
  • \mathcal M := \mathcal N(1), i.e. the collection of 1-Gaussian distributions,
  • \mathcal M := \mathrm{SG}(1), i.e. the collection of 1-sub-Gaussian distributions?

I honestly don’t know, and I don’t have enough funding, in time or money, to find out. But I have to say that the nostalgia of revisiting this topic four years later brought me much-needed closure from those research years.

For now, we are done.

Happy Valentine’s Day!

—Joel Kindiak, 14 Feb 26, 1602H
