The Thompson Sampling Template

Let’s finally discuss Thompson sampling.

Previously, we looked at the UCB algorithm, which incorporates the dynamic reward history into its exploration-exploitation tradeoff, yielding a regret upper bound of the form

\displaystyle A \sum_{i=1}^K \Delta_i + f(n) \sum_{i : \Delta_i > 0} \frac{1}{\Delta_i}.

After an initial round of pulls yielding the reward history (A_1,X_1,\dots, A_K, X_K), each subsequent arm is selected deterministically from said history:

\displaystyle A_{t+1} = \underset{i=1,\dots, K}{\arg \max} \left( \hat{\mu}_{i,T_i(t)} + \sqrt{ \frac{2 \sigma^2 \log(1/\delta) }{ T_i(t) } } \right),

where the expression being arg-maxed is motivated by the following tail concentration bound, satisfied for each s by an arm i with \sigma-sub-Gaussian reward distribution:

\displaystyle \mathbb P\left( \mu_i > \hat{\mu}_{i,s} + \sqrt{ \frac{2 \sigma^2 \log(1/\delta) }{ s } }\right) \leq \delta

Thompson sampling aims to randomise this selection. To simplify our discussion, we assume it is known that each \nu_i is Gaussian of the form \mathcal N(\mu_i, 1), and our goal is to learn \mu_i.

When we initialise the algorithm, we are completely ignorant of the distribution of any \nu_i. So we get to choose our initial guess. Let \mathbb P_{i, t} denote our guess of the distribution of \nu_i, so \mathbb P_{i, 0} denotes our initial guess. Intuitively, after t pulls of arm i, we want to update our guess \mathbb P_{i, t+1} given the previous guess \mathbb P_{i, t} and the reward X_{ t+1} \sim \nu_i obtained from said arm, i.e. we set

\mathbb P_{i, t+1}(\cdot \mid X_1,\dots, X_t, X_{t+1}) := \mathrm{Update}(\mathbb P_{i, t}(\cdot \mid X_1,\dots, X_t), X_{t+1})

where \mathrm{Update}(\cdot , \cdot) is carefully curated for our purposes.

The Thompson sampling algorithm then proceeds as follows. As an initial exploration phase, for each t = 1,\dots, K, pull the arm A_t = t, then update the arm distributions:

\mathbb P_{A_t, 1}(\cdot \mid X_{A_t}) := \mathrm{Update}(\mathbb P_{A_t, 0}, X_{A_t}).

For each t \geq K, select

A_{t+1} = \underset{i=1,\dots, K}{\arg \max}\ \theta_{i, t+1},

where \theta_{i, t+1} \sim \mathbb P_{i, T_i(t)} independently, and finally, collect the reward X_{A_{t+1}} \sim \nu_{A_{t+1}}. Then update the arms just like before:

\mathbb P_{A_{t+1}, T_{A_{t+1}}(t+1)} = \mathbb P_{A_{t+1}, T_{A_{t+1}}(t)+1} := \mathrm{Update}(\mathbb P_{A_{t+1}, T_{A_{t+1}}(t)}, X_{A_{t+1}}),

where the various probability measures are still conditioned on the reward history, but omitted for brevity.
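To make the template concrete, here is a minimal Python sketch of the loop above, specialised to the unit-variance Gaussian case we study below, with the posterior \mathcal N(\hat{\mu}_{i, T_i(t)}, 1/T_i(t)) standing in for \mathbb P_{i, T_i(t)}. The function and variable names are my own:

```python
import math
import random

def thompson_sampling(mus, n, rng):
    """Vanilla Thompson sampling for K unit-variance Gaussian arms.

    Posterior for arm i after T_i pulls: N(mu_hat_i, 1/T_i),
    the flat-prior form used throughout this post."""
    K = len(mus)
    sums = [0.0] * K    # running reward sums per arm
    counts = [0] * K    # pull counts T_i(t)
    for t in range(n):
        if t < K:
            i = t  # initial exploration phase: pull each arm once
        else:
            # sample theta_i ~ N(mu_hat_i, 1/T_i) and play the argmax
            theta = [rng.gauss(sums[j] / counts[j], 1.0 / math.sqrt(counts[j]))
                     for j in range(K)]
            i = max(range(K), key=theta.__getitem__)
        x = rng.gauss(mus[i], 1.0)  # reward X ~ N(mu_i, 1)
        sums[i] += x
        counts[i] += 1
    return counts
```

The initial exploration phase guarantees T_i(t) \geq 1 before the first posterior sample is drawn, so the division and square root are always well-defined.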

Our current discussion has brought us far too high into the stratosphere, and we need to return to earth. It stands to reason then that, in the absence of additional information, a meaningful “initial guess” would be

\mathbb P_{i, 0} = \mathcal N(0, 1) \equiv \mathcal N(\theta_{i,0}, \sigma_{i,0}^2),\quad \theta_{i,0} := 0,\quad \sigma_{i,0}^2 := 1.

After pulling arm i once and receiving a reward X_{i,1} \sim \nu_i, our next guess should, intuitively, take the form of \mathbb P_{i,1} = \mathcal N(\theta_{i,1}, \sigma_{i,1}^2), where (\theta_{i,1}, \sigma_{i,1}^2) should be intimately related to ( \theta_{i,0}, \sigma_{i,0}^2, X_{i,1} ) . We can accomplish that goal using conjugate priors, justified by a measure-theoretic application of Bayes’ theorem.

Lemma 1. Suppose for any s \geq 0,

\mathbb P_{i,s+1}( \cdot \mid X_{i,1}, \dots, X_{i,s+1}) = \mathcal N(\theta_{i,s+1}, \sigma_{i,s+1}^2).

Using conjugate priors,

\displaystyle \frac{ \theta_{i,s+1} }{ \sigma_{i,s+1}^2 } = \frac{ \theta_{i,s} }{ \sigma_{i,s}^2 } + X_{i,s+1},\quad \frac 1{\sigma_{i,s+1}^2} = \frac 1{\sigma_{i,s}^2}+1.

In particular, if we initialise with the (improper) flat prior, i.e. \theta_{i,0}/\sigma_{i,0}^2 = 0 and 1/\sigma_{i,0}^2 = 0, then

\displaystyle \theta_{i,s+1} = \frac{X_{i,1} + \cdots + X_{i,s+1}}{s+1} =: \hat{\mu}_{i,s+1},\quad \frac 1{\sigma_{i,s+1}^2} = s+1.

(The proper prior \mathcal N(0, 1) merely acts as one extra pseudo-observation of 0, replacing s+1 by s+2 above; we adopt the cleaner flat-prior formulas throughout.)

Proof. See Section 34.3 and Example 34.7 in Lattimore and Szepesvari (2020).

Therefore, we would use the \mathrm{Update}(\cdot,\cdot) procedure as follows:

\mathrm{Update}(\mathcal N(\hat{\mu}_{i,t}, 1/t), X_{i, t+1}) = \mathcal N(\hat{\mu}_{i,t+1}, 1/(t+1)),

with all relevant quantities defined in Lemma 1.
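In precision-weighted coordinates (\theta/\sigma^2, 1/\sigma^2), Lemma 1's update is purely additive, which makes the implementation a two-liner. A sketch (names are my own), starting from the degenerate flat prior \theta_{i,0}/\sigma_{i,0}^2 = 0, 1/\sigma_{i,0}^2 = 0, which reproduces the empirical-mean formulas above:

```python
def update(prec_mean, prec, x):
    """One conjugate-Gaussian update from Lemma 1, tracking the pair
    (theta/sigma^2, 1/sigma^2): observing x adds x to the first
    coordinate and 1 to the second."""
    return prec_mean + x, prec + 1.0

def posterior(rewards):
    """Fold the rewards through Update starting from the flat prior (0, 0);
    returns (theta_s, sigma_s^2) = (empirical mean, 1/s).
    Assumes at least one reward has been observed."""
    prec_mean, prec = 0.0, 0.0
    for x in rewards:
        prec_mean, prec = update(prec_mean, prec, x)
    return prec_mean / prec, 1.0 / prec
```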

Remark 1. Observe that \rho_{\mathbb E}(\mathcal N(\mu, \sigma^2)) = \mu. More generally, since \mathcal N(\mu, \sigma^2) is characterised by its mean and variance, for any risk functional \rho, there exists a unique function \tilde \rho : \mathbb R \times (0,\infty) \to \mathbb R such that \rho(\mathcal N(\mu, \sigma^2)) = \tilde \rho(\mu, \sigma^2).
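For a concrete instance of \tilde \rho beyond the mean: if \rho is the level-\alpha CVaR of the lower (reward) tail, the standard Gaussian closed form (stated here without derivation) reads

\displaystyle \rho_{\mathrm{CVaR}_\alpha}(\mathcal N(\mu, \sigma^2)) = \mu - \sigma \cdot \frac{\varphi(\Phi^{-1}(\alpha))}{\alpha} =: \tilde \rho(\mu, \sigma^2),

where \varphi and \Phi denote the standard normal p.d.f. and c.d.f. respectively; this is the \tilde \rho relevant to the CVaR problem mentioned later in this post.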

Remark 2. If all we know is that \nu_i = \mathcal N(\mu_i, \sigma_i^2), then we would require a stronger update procedure as outlined by Zhu and Tan (2020) (Figure 2). The upside of this approach is that for any t, we can sample

\theta_{i, t + 1} \sim \mathcal N( \hat{\mu}_{i, T_i(t)}, 1/T_i(t)),\quad \tau_{i,t+1} \sim \Gamma(\alpha_{i,T_i(t)}, \beta_{i,T_i(t)} )

to form the randomised distribution \tilde{\nu}_{i,t+1} = \mathcal N(\theta_{i, t + 1}, 1/\tau_{i, t+1}). Here, the parameters (\alpha_{i,t}, \beta_{i,t}) are updated according to the rules

\begin{aligned} \alpha_{i,t+1} &= \alpha_{i,t} +1/2, \\ \beta_{i,t+1} &= \beta_{i,t} + \frac{T_i(t)}{T_i(t)+1} \cdot \frac{(X_{i,T_i(t)+1} - \hat{\mu}_{i,T_i(t)} )^2}{2}, \end{aligned}

similar in spirit to Lemma 1. Denoting

\mathbb P_{i,t+1} := \mathcal N( \hat{\mu}_{i, T_i(t)}, 1/T_i(t)) \times \Gamma(\alpha_{i,T_i(t)}, \beta_{i,T_i(t)} ),

we sample \kappa_{i,t} = (\theta_{i, t + 1}, \tau_{i,t+1}) \sim \mathbb P_{i,t+1} and define \tilde{\nu}_{i,t+1} := \Pi(\kappa_{i,t}), where \Pi (x,y) = \mathcal N(x,1/y). The arm-selection criterion is then similarly modified to

A_{t+1} := \underset{i=1,\dots, K}{\arg \max}\ \rho(\tilde{\nu}_{i,t+1} ).
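A sketch of the Remark 2 sampling step in Python (function names are mine; I take \Gamma(\alpha, \beta) with \beta a rate parameter, whereas Python's random.gammavariate expects a scale, hence the 1/\beta):

```python
import random

def sample_kappa(mu_hat, t_i, alpha, beta, rng):
    """Draw kappa = (theta, tau) from N(mu_hat, 1/t_i) x Gamma(alpha, beta);
    Pi(kappa) is then the randomised arm distribution N(theta, 1/tau)."""
    theta = rng.gauss(mu_hat, (1.0 / t_i) ** 0.5)
    tau = rng.gammavariate(alpha, 1.0 / beta)  # rate beta -> scale 1/beta
    return theta, tau
```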

Now we ask the crucial question: given \nu = (\nu_1,\dots,\nu_K) with \nu_i = \mathcal N(\mu_i, \sigma_i^2), how do we upper-bound the following cumulative regret?

\displaystyle \mathcal R_n(\rho\text{-}\mathrm{TS},\nu) = \sum_{i : \Delta_i^{\rho} > 0} \mathbb E[T_i(n)] \Delta_i^{\rho}

My attempt at solving this problem for the risk functional \rho being the conditional value-at-risk (CVaR) was the substance of my undergraduate research, inspired by Zhu and Tan (2020) and Baudry et al (2021). In this post, we will ultimately only solve the simple case \sigma_i = 1 and \rho = \rho_{\mathbb E}, while we chart the way for future progress on the generalised problem.

Remark 3. There are many other distributions that \nu_i could take. In the case where the \nu_i are all Bernoulli, my final year project solves the problem in the affirmative for general risk measures \rho, adapted from seminal works by Riou and Honda (2020) and Baudry et al (2021). It even solves the problem if \nu_i is merely known to have finite support, albeit only for a stricter subset of (nonetheless useful) risk functionals. Unfortunately, it does not solve the general case of \nu_i with support [0, 1], or \nu_i Gaussian, or even \nu_i sub-Gaussian. I was tempted to restart this project, but shall leave it as an exercise. Writing these posts on bandits brought me a closure I never knew I needed.

Lemma 2. Assume that arm 1 is uniquely optimal. Fix \epsilon > 0. Define

E_{i,t} := \{ \rho( \tilde{\nu}_{i,t} ) \leq \rho(\nu_1) - \epsilon \}

and G_{i,s} := \mathbb P( E_{i,s+1}^c \mid X_{i,1}, \dots, X_{i,s}), where:

  • X_{i,t} \sim \nu_i independently,
  • \tilde{\nu}_{i,s+1} \sim \mathbb P_{i, s}(\cdot \mid X_{i,1}, \dots, X_{i,s}),
  • \mathbb P_{i, t+1}(\cdot \mid X_{i,1},\dots, X_{i,t}, X_{i,t+1}) = \mathrm{Update}(\mathbb P_{i,t}(\cdot \mid X_{i,1},\dots, X_{i,t}), X_{i,t+1}),
  • \rho is a risk functional (i.e. \rho_{\mathbb E}(\nu_i) = \mu_i).

Then

\displaystyle \mathbb E[T_i(n)] \leq \mathbb E\left[ \sum_{s=0}^{n-1} \left( \frac 1{G_{1,s}} - 1 \right) \right] + \mathbb E\left[ \sum_{s=0}^{n-1} \mathbb I\{ G_{i,s} > 1/n \} \right] + 1.

In effect, the lemma replaces the random pull counts T_i(n) with deterministic sums over s in our subsequent analyses.

Proof. The result and proof are a mild generalisation of Theorem 36.2 in Lattimore and Szepesvari (2020). By construction,

\begin{aligned} \mathbb P(E_{1,t}^c \mid \mathcal F_{t-1}) &= \mathbb P( E_{1,t}^c \mid X_{1,1},\dots, X_{1,T_1(t-1)}) = G_{1, T_1(t-1)} \end{aligned}

almost surely. Therefore, we now decompose \mathbb E[T_i(n)]:

\begin{aligned} \mathbb E[T_i(n)] &= \mathbb E\left[ \sum_{t=1}^n \mathbb I\{ A_t = i \} \right] \\ &= \underbrace{ \mathbb E\left[ \sum_{t=1}^n \mathbb I\{ A_t = i, E_{i,t} \} \right] }_{M_1} + \underbrace{ \mathbb E\left[ \sum_{t=1}^n \mathbb I\{ A_t = i, E_{i,t}^c \} \right] }_{M_2}. \end{aligned}

To upper-bound M_2, we refer to the solutions to Problem 36.5 of Lattimore and Szepesvari (2020). Define

\mathcal T := \{ t \in \{1,\dots, n\} : G_{i, T_i(t-1)} > 1/n\}

Using a union bound,

\begin{aligned} \mathbb E\left[ \sum_{t=1}^n \mathbb I \{ A_t = i, E_{i,t}^c \} \right] &= \mathbb E\left[ \sum_{t \in \mathcal T} \mathbb I \{ A_t = i, E_{i,t}^c \} \right] + \mathbb E\left[ \sum_{t \notin \mathcal T} \mathbb I \{ A_t = i, E_{i,t}^c \} \right] \\ &\leq \underbrace{\mathbb E\left[ \sum_{t \in \mathcal T} \mathbb I \{ A_t = i \} \right] }_{M_{2,1}} + \underbrace{ \mathbb E\left[ \sum_{t \notin \mathcal T} \mathbb I \{ E_{i,t}^c \} \right] }_{M_{2,2}}.\end{aligned}

Therefore, it suffices to upper-bound

\mathbb E[T_i(n)] \leq M_1 + M_{2,1} + M_{2,2}.

To upper-bound M_1, first denote the chosen non-optimal arm by B_t:

B_{t} = \underset{i\neq 1}{\arg \max}\ \rho(\tilde{\nu}_{i,t}).

Then

\begin{aligned} \mathbb P(A_t = 1, E_{i,t} \mid \mathcal F_{t-1}) &\geq \mathbb P(B_t = i, E_{i,t} , E_{1,t}^c \mid \mathcal F_{t-1}) \\ &= \mathbb P(E_{1,t}^c \mid \mathcal F_{t-1}) \cdot \mathbb P(B_t = i, E_{i,t} \mid \mathcal F_{t-1}), \end{aligned}

since E_{1,t}^c is conditionally independent of \{B_t = i\} \cap E_{i,t} given \mathcal F_{t-1}. Next, to upper-bound \mathbb P(A_t = i, E_{i,t} \mid \mathcal F_{t-1}), note that

\begin{aligned} \{A_t = i \} \cap E_{i,t} \subseteq (\{ B_t = i \} \cap E_{i, t}) \cap E_{1,t} \end{aligned}

and conditional independence given \mathcal F_{t-1} implies that

\begin{aligned} \mathbb P(A_t = i, E_{i,t} \mid \mathcal F_{t-1}) &\leq \mathbb P(B_t = i, E_{i,t} \mid \mathcal F_{t-1}) \cdot \mathbb P(E_{1,t} \mid \mathcal F_{t-1}) \\ &\leq \frac{\mathbb P(E_{1,t} \mid \mathcal F_{t-1})}{\mathbb P(E_{1,t}^c \mid \mathcal F_{t-1})}  \cdot \mathbb P(A_t = 1, E_{i,t} \mid \mathcal F_{t-1}) \\ &\leq \frac{\mathbb P(E_{1,t} \mid \mathcal F_{t-1})}{\mathbb P(E_{1,t}^c \mid \mathcal F_{t-1})}  \cdot \mathbb P(A_t = 1 \mid \mathcal F_{t-1}) \\ &= \left( \frac{1}{\mathbb P(E_{1,t}^c \mid \mathcal F_{t-1})} -1 \right) \cdot \mathbb P(A_t = 1 \mid \mathcal F_{t-1}) \\ &= \left( \frac{1}{G_{1,T_1(t-1)}} -1 \right) \cdot \mathbb P(A_t = 1 \mid \mathcal F_{t-1}). \end{aligned}

Therefore, we can upper-bound M_1 by

\begin{aligned} M_1 &= \mathbb E\left[ \sum_{t=1}^n \mathbb I\{ A_t = i, E_{i,t}\} \right] \\ &\leq \mathbb E \left[ \sum_{t=1}^n \mathbb P ( A_t = i, E_{i,t}  \mid \mathcal F_{t-1})\right] \\ &\leq \mathbb E \left[ \sum_{t=1}^n  \left( \frac{1}{G_{1,T_1(t-1)}} -1 \right) \cdot \mathbb P(A_t = 1 \mid \mathcal F_{t-1})\right] \\ &=\mathbb E \left[ \sum_{t=1}^n  \left( \frac{1}{G_{1,T_1(t-1)}} -1 \right) \cdot \mathbb I \{ A_t = 1 \} \right] \\ &\leq \mathbb E \left[ \sum_{s=0}^{n-1}  \left( \frac{1}{G_{1,s}} -1 \right) \right]. \end{aligned}

To upper-bound M_{2,1}, recall that

\mathcal T := \{ t \in \{1,\dots, n\} : G_{i,T_i(t-1)} > 1/n\}

Therefore,

\begin{aligned} \sum_{t \in \mathcal T} \mathbb I\{ A_t = i \} &\leq \sum_{t=1}^n \sum_{s=1}^n \mathbb I\{ T_i(t) = s, T_i(t-1) = s-1 ,  G_{i,T_i(t-1)} > 1/n  \} \\ &\leq \sum_{s=1}^n \sum_{t=1}^n \mathbb I\{ T_i(t) = s, T_i(t-1) = s-1 ,  G_{i,s-1} > 1/n \} \\ &\leq \sum_{s=1}^n \mathbb I\{ G_{i,s-1} > 1/n \} \underbrace{ \sum_{t=1}^n \mathbb I\{ T_i(t) = s, T_i(t-1) = s-1 \} }_{\leq 1} \\ &\leq \sum_{s=1}^n \mathbb I\{ G_{i,s-1} > 1/n \} \\ &= \sum_{s=0}^{n-1} \mathbb I\{ G_{i,s} > 1/n \}.  \end{aligned}

Taking expectations,

\begin{aligned} M_{2,1} = \mathbb E\left[ \sum_{t \in \mathcal T} \mathbb I\{ A_t = i \} \right] &\leq \mathbb E\left[ \sum_{s=0}^{n-1} \mathbb I\{ G_{i,s} > 1/n \} \right]. \end{aligned}

To upper-bound M_{2,2}, note that t \notin \mathcal T if and only if G_{i, T_i(t-1)} \leq 1/n:

\begin{aligned} M_{2,2} &= \mathbb E\left[ \sum_{t \notin \mathcal T} \mathbb I\{E_{i,t}^c\} \right] \\ &= \sum_{t=1}^n \mathbb E[ \mathbb I\{E_{i,t}^c, G_{i, T_i(t-1)} \leq 1/n\} ] \\ &= \sum_{t=1}^n \mathbb E[ \mathbb E[\mathbb I\{E_{i,t}^c, G_{i, T_i(t-1)} \leq 1/n\}\mid  \mathcal F_{t-1}]  ] \\ &= \sum_{t=1}^n \mathbb E[ \mathbb I\{ G_{i, T_i(t-1)} \leq 1/n \} \cdot \mathbb P(E_{i,t}^c \mid \mathcal F_{t-1}) ] \\ &= \sum_{t=1}^n \mathbb E[ \mathbb I\{ G_{i, T_i(t-1)} \leq 1/n \} \cdot G_{i,T_i(t-1)} ] \\ &\leq \sum_{t=1}^n \mathbb E[ \mathbb I\{ G_{i, T_i(t-1)} \leq 1/n \} \cdot 1/n ] \\ &= \frac 1n \sum_{t=1}^n \mathbb P(G_{i, T_i(t-1)} \leq 1/n ) \leq \frac 1n \sum_{t=1}^n 1 = 1. \end{aligned}

Therefore,

\begin{aligned} \mathbb E[T_i(n)] &\leq M_1 + M_{2,1} + M_{2,2} \\ &\leq \mathbb E\left[ \sum_{s=0}^{n-1} \left(\frac 1{G_{1,s}} - 1 \right) \right] + \mathbb E\left[ \sum_{s=1}^n \mathbb I\{G_{i,s-1} > 1/n\} \right] + 1. \end{aligned}

Lemma 2 gives us the general recipe to evaluate the regret bound for \rho-TS, but we don’t get any concrete result until we consider specific bandit instances. For simplicity, let’s suppose \nu_i = \mathcal N(\mu_i, 1) for each i. Following Lemma 1, we maintain the belief distribution \mathbb P_{i,s} = \mathcal N(\hat{\mu}_{i,s}, 1/s) of arm i after observing s = T_i(t) rewards X_{i,1},\dots, X_{i,T_i(t)} \sim \nu_i independently.

How do we upper-bound the cumulative regret \mathcal R_n (\rho_{\mathbb E}\text{-}\mathrm{TS}, \nu)?

Fix \epsilon > 0 and consider the regret bound recipe in Lemma 2:

\displaystyle \mathbb E[T_i(n)] \leq \underbrace{ \mathbb E\left[ \sum_{s=0}^{n-1} \left( \frac 1{G_{1,s}} - 1 \right) \right] }_{C_1(n,\epsilon)} + \underbrace{ \mathbb E\left[ \sum_{s=0}^{n-1} \mathbb I\{ G_{i,s} > 1/n \} \right] }_{C_2(n,\epsilon)} + 1,

where G_{i,s} = \mathbb P(E_{i,s+1}^c \mid X_{i,1}, \dots, X_{i,s}) and

E_{i,t} = \{\rho_{\mathbb E}(\tilde{\nu}_{i,t}) \leq \rho_{\mathbb E}(\nu_1) - \epsilon\} = \{ \theta_{i,t} \leq \mu_1 - \epsilon\},

where \theta_{i,s+1} \sim \mathbb P_{i,s}(\cdot \mid X_{i,1},\dots, X_{i,s}) = \mathcal N(\hat{\mu}_{i,s} , 1/s) independently.

We aim to upper-bound C_1(n,\epsilon) and C_2(n,\epsilon) then eventually take \epsilon \to 0^+.

Lemma 3. C_1(n,\epsilon)/{\log(n)} \to 0 as n \to \infty.

Proof. To upper-bound C_1(n,\epsilon), we first use linearity:

\displaystyle C_1(n,\epsilon) = \sum_{s=0}^{n-1} \mathbb E\left[ \frac{1}{ G_{1,s} } - 1 \right].

The key idea is to note that the value of

G_{1,s} = \mathbb P(\theta_{1,s+1} > \mu_1 - \epsilon \mid X_{1,1},\dots, X_{1,s})

is deterministic when conditioned on (X_{1,1},\dots, X_{1,s}), since

\theta_{1,s+1} \sim \mathcal N(\hat{\mu}_{1,s}, 1/s),

and the distribution of \hat{\mu}_{1,s}, in turn, is determined by the reward history:

\displaystyle \hat{\mu}_{1,s} = \frac{X_{1,1} + \cdots + X_{1,s}}{s} \sim \mathcal N(\mu_1, 1/s).

Therefore, we let f denote the p.d.f. of \mathcal N(0, 1/s):

\displaystyle f(y) = \sqrt{ \frac { s }{ 2\pi } } \exp\left(-\frac{s y^2}{2} \right),

and F denote its c.d.f.:

\displaystyle F(y) = \int_{-\infty}^y f(x)\, \mathrm dx.
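For the computations that follow, both f and F are available in closed form through the error function; a small sketch, used purely as a numerical sanity check on the notation (names are mine):

```python
import math

def f(y, s):
    """p.d.f. of N(0, 1/s), exactly as displayed above."""
    return math.sqrt(s / (2 * math.pi)) * math.exp(-s * y * y / 2)

def F(y, s):
    """c.d.f. of N(0, 1/s), via the error function."""
    return 0.5 * (1.0 + math.erf(y * math.sqrt(s / 2)))
```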

Conditioned on \hat{\mu}_{1,s} := y,

\theta_{1,s+1} \sim \mathcal N(\hat{\mu}_{1,s}, 1/s) \quad \Rightarrow \quad \theta_{1,s+1} - \hat{\mu}_{1,s} \sim \mathcal N(0, 1/s)

so that

\begin{aligned} G_{1,s} &= \mathbb P(\theta_{1,s+1} > \mu_1 - \epsilon \mid \hat{\mu}_{1,s} = y)\\ &= 1 - \mathbb P(\theta_{1,s+1} \leq \mu_1 - \epsilon \mid \hat{\mu}_{1,s} = y)\\ &= 1 - \mathbb P(\theta_{1,s+1} - \hat{\mu}_{1,s} \leq \mu_1 - \hat{\mu}_{1,s} - \epsilon \mid \hat{\mu}_{1,s} = y)\\ &= 1 - \mathbb P(\theta_{1,s+1} - y \leq \mu_1 - y - \epsilon)\\  &= 1 - F(\mu_1 - y - \epsilon) \end{aligned}

almost surely. Therefore, by taking advantage of the marginal distribution and using the change of variables y \to \mu_1 -y - \epsilon,

\begin{aligned}\mathbb E\left[ \frac{1}{ G_{1,s} } - 1 \right] &= \int_{\mathbb R} \left( \frac 1{1-F( \mu_1 - y - \epsilon)} - 1 \right) f(\mu_1-y)\, \mathrm dy \\  &= \int_{\mathbb R} \left( \frac 1{1-F( y )} - 1 \right) f(y + \epsilon )\, \mathrm dy \\ &= \int_{\mathbb R} \frac {F(y)}{1 - F(y)} \cdot f( y + \epsilon)\, \mathrm dy \\ &= \left( \int_{-\infty}^0 + \int_0^\infty \right) \frac {F(y)}{1 - F(y)} \cdot f( y + \epsilon)\, \mathrm dy. \end{aligned}

Therefore, we need to upper-bound two integrals. For the first integral, y \leq 0 gives 1-F(y) \geq 1/2, while the Gaussian tail bound for W \sim \mathcal N(0, 1/s) gives

\displaystyle F(y) = \mathbb P(W \leq y) \leq \exp(-sy^2/2),\quad y \leq 0.

Completing the square via y^2 + (y+\epsilon)^2 = 2(y+\epsilon/2)^2 + \epsilon^2/2, we have

\begin{aligned} \int_{-\infty}^0 \frac {F(y)}{1 - F(y)} \cdot f( y + \epsilon)\, \mathrm dy &\leq 2 \int_{-\infty}^0 F(y) \cdot f( y + \epsilon)\, \mathrm dy \\ &\leq 2 \sqrt{ \frac{s}{2\pi} } \int_{-\infty}^0 \exp(-s(y^2 + (y+\epsilon)^2)/2)\, \mathrm dy \\ &= 2 \sqrt{ \frac{s}{2\pi} } \exp(-s\epsilon^2/4) \int_{-\infty}^0 \exp(-s(y+\epsilon/2)^2)\, \mathrm dy \\ &\leq 2 \sqrt{ \frac{s}{2\pi} } \exp(-s\epsilon^2/4) \sqrt{ \frac{\pi}{s} } \\ &= \sqrt 2 \exp(-s\epsilon^2/4) \leq 2\exp(-s\epsilon^2/4). \end{aligned}

For the second integral, F(y) \leq 1, and a tail lower bound derived from Abramowitz and Stegun (1964, 7.1.13),

\displaystyle 1 - F(y) \geq \sqrt{ \frac 2\pi } \cdot \frac{ \exp(-sy^2/2) }{ y\sqrt s + \sqrt{sy^2 + 4} },

followed by \sqrt{sy^2 + 4} \leq y\sqrt s + 2 and direct integration, yields

\begin{aligned} &\int_0^\infty \frac {F(y)}{1 - F(y)} \cdot f( y + \epsilon)\, \mathrm dy \\ &\leq \sqrt{ \frac \pi 2 } \int_0^\infty \exp(sy^2/2) \cdot (y\sqrt s + \sqrt{sy^2 + 4}) \cdot \sqrt{ \frac{s}{2\pi} } \exp(-s(y+\epsilon)^2/2)\, \mathrm dy \\ &\leq \frac{\sqrt s}{2} \int_0^\infty (2y\sqrt s + 2) \cdot \exp(s(y^2-(y+\epsilon)^2)/2)\, \mathrm dy \\ &= \sqrt s \exp(-s\epsilon^2/2) \int_0^\infty (y\sqrt s + 1) \cdot \exp(-s y \epsilon)\, \mathrm dy \\ &= \sqrt s \exp(-s\epsilon^2/2) \cdot \left( \frac{\sqrt s}{s^2\epsilon^2} + \frac 1{s\epsilon} \right) \\ &= \frac{1+\epsilon \sqrt{s}}{\epsilon^2 s} \cdot \exp(-s\epsilon^2/2). \end{aligned}

Combining both results,

\begin{aligned} C_1(n,\epsilon) &= \sum_{s=0}^{n-1} \mathbb E\left[ \frac 1{G_{1,s}} - 1 \right] \\ &\leq 1 + \sum_{s=1}^{\infty} \left( 2\exp(-s\epsilon^2/4) + \frac{1+\epsilon \sqrt{s}}{\epsilon^2 s} \cdot \exp(-s\epsilon^2/2) \right) \\ &\leq 1 + \left(2 + \frac{1+\epsilon }{\epsilon^2}\right) \cdot \sum_{s=1}^{\infty} \exp(-s\epsilon^2/4) \\ &\leq 1 + \left(2 + \frac{1+\epsilon }{\epsilon^2}\right) \cdot \frac{1}{1 - \exp(-\epsilon^2/4)}. \end{aligned}

In particular, C_1(n,\epsilon)/{\log(n)} \to 0 as n \to \infty.

Remark 4. In the general setting, we would also aim to prove that C_1(n, \epsilon) \leq C_1'(\epsilon), so that C_1(n, \epsilon)/{\log(n)} \to 0 as n \to \infty. In the proof of Lemma 3, we have the dependency

\mathbf X_{i,s} := (X_{i,1},\dots, X_{i,s}) \to \hat{\mu}_{i,s} \to G_{i,s},

and the intermediate dependency has a relatively characterisable distribution conditioned on the reward history. More generally, at any time s, we would obtain the distribution

\mathbf X_{i,s-1}\to \mathbb P_{i,s-1}(\cdot \mid \mathbf X_{i,s-1}) \to \mathbb P(\rho(\tilde{\nu}_{i,s}) > \rho(\nu_1) - \epsilon \mid \mathbf X_{i,s-1}) =: G_{i,s},

where \tilde{\nu}_{i,s} = \Pi(\kappa_{i,s}) for some deterministic function \Pi, and \kappa_{i,s} \sim \mathbb P_{i,s-1}(\cdot \mid \mathbf X_{i,s-1}) . In the vanilla case:

  • \mathbb P_{i,s-1}(\cdot \mid \mathbf X_{i,s-1}) = \mathcal N(\hat{\mu}_{i,s-1}, 1/(s-1)),
  • \kappa_{i,s} = \theta_{i,s} \sim \mathbb P_{i,s-1}(\cdot \mid \mathbf X_{i,s-1}),
  • \Pi(x) = \mathcal N(x, 1),
  • \tilde{\nu}_{i,s} = \Pi(\kappa_{i,s}) = \mathcal N(\kappa_{i,s}, 1),
  • \rho_{\mathbb E}(\tilde{\nu}_{i,s}) = \kappa_{i,s} = \theta_{i,s}.

If we adopt the same strategy as in Lemma 3, then we would like \mathbb P_{i,s} to be computationally tractable, with a Radon-Nikodým derivative f with respect to a reference measure \mu. Taking the integral over the appropriate measure space (\Omega, \mathcal F, \mu) of possible realisations of \kappa_{i,s}, conditional expectation should, once again, yield

\begin{aligned} \mathbb E\left[ \frac 1{G_{1,s}} - 1 \right] &= \int_{\Omega} \frac{1 - G_{1,s}(\omega)}{G_{1,s}(\omega)} f(\omega)\, \mathrm d\mu(\omega). \end{aligned}

Hence, we would require tail lower bounds on G_{1,s}(\omega) of the form

\displaystyle G_{1,s}(\omega) \geq \frac 1{\psi_1(s,\epsilon) + 1} , \quad G_{1,s}(\omega) \geq \frac{f(\omega)}{\psi_2(s,\epsilon)},

for non-negative functions \psi_1,\psi_2 such that

\displaystyle \sum_{s=1}^\infty \psi_1(s,\epsilon) =: C_{1,1}(\epsilon) < \infty,\quad \sum_{s=1}^\infty \psi_2(s,\epsilon) =: C_{1,2}(\epsilon) < \infty

and

\displaystyle \Omega = \underbrace{ \left\{\omega : \frac 1{G_{1,s}(\omega)} - 1 \leq \psi_1(s,\epsilon)\right\} }_{\Omega_1} \cup \underbrace{ \left\{\omega : \frac{f(\omega)}{G_{1,s}(\omega)} \leq \psi_2(s,\epsilon)\right\} }_{\Omega_2},

so that

\begin{aligned} \mathbb E\left[ \frac 1{G_{1,s}} - 1 \right] &= \left( \int_{\Omega_1} + \int_{\Omega_2}\right) \frac{1 - G_{1,s}(\omega)}{G_{1,s}(\omega)} f(\omega)\, \mathrm d\mu(\omega) \\ &\leq \psi_1(s,\epsilon) + \psi_2(s,\epsilon), \end{aligned}

and C_1(n,\epsilon) \leq C_{1,1}(\epsilon) + C_{1,2}(\epsilon) yields C_1(n,\epsilon)/{\log(n)} \to 0. It suffices to say that the lion’s share of research in this field would go toward discovering and deriving useful tail lower bounds for

G_{i,s}(\omega) = \mathbb P( \rho( \tilde{\nu}_{i,s} ) > \rho(\nu_1) - \epsilon \mid \mathbf X_{i,s-1}),

where we are reminded that

\tilde{\nu}_{i,s} = \Pi(\kappa_{i,s}),\quad \kappa_{i,s} \sim \mathbb P_{i,s-1}( \cdot \mid \mathbf X_{i,s-1}).

Lemma 4. C_2(n,\epsilon)/{\log(n)} \to 2/(\Delta_i - \epsilon)^2 as n \to \infty.

Proof. To upper-bound C_2(n,\epsilon), we cheat. We first note that

\begin{aligned} \{G_{i,s} > 1/n\} &= \left\{\mathbb P(\theta_{i,s+1} > \mu_1 - \epsilon \mid X_{i,1},\dots, X_{i,s}) > 1/n\right\} \\ &= \left\{\mathbb P(\theta_{i,s+1} - \hat{\mu}_{i,s} > \mu_1 - \hat{\mu}_{i,s} - \epsilon \mid X_{i,1},\dots, X_{i,s}) > 1/n\right\} \\ &\subseteq \left\{ \hat{\mu}_{i,s} > \mu_1 - \epsilon \right\} \cup \left\{\exp(-s(\mu_1 - \hat{\mu}_{i,s} - \epsilon)^2/2) > 1/n\right\} \\ &\subseteq \left\{ \hat{\mu}_{i,s} > \mu_1 - \epsilon - \sqrt{\frac{2\log(n)}{s}}\right\} \\ &=\left\{ \hat{\mu}_{i,s} -\mu_i > \Delta_i - \epsilon - \sqrt{\frac{2\log(n)}{s}} \right\}. \end{aligned}

Define u \in \mathbb N^+ by

\displaystyle  u := \left\lceil \frac{2\log(n)}{(\Delta_i -  \epsilon)^2} \right\rceil \quad \Longrightarrow \quad  \Delta_i -  \epsilon - \sqrt{\frac{2\log(n)}{s}} > 0 \text{ for all } s > u.

Using the Gaussian tail bound again,

\begin{aligned} C_2(n,\epsilon) &= \sum_{s=0}^{n-1} \mathbb P(G_{i,s} > 1/n) \\&\leq u + \sum_{s=u+1}^\infty \mathbb P\left( \hat{\mu}_{i,s} -\mu_i > \Delta_i -  \epsilon - \sqrt{\frac{2\log(n)}{s}} \right) \\ &\leq 1 + \frac{2\log(n)}{(\Delta_i -  \epsilon)^2}   + \sum_{s=u+1}^\infty \exp\left( - \frac{s\left(\Delta_i -  \epsilon - \sqrt{\frac{2\log(n)}{s}}\right)^2}{2} \right) \\ &\leq 1 + \frac{2\log(n)}{(\Delta_i -  \epsilon)^2}  + \frac{2}{(\Delta_i -  \epsilon)^2}(\sqrt{\pi \log(n)} + 1), \end{aligned}

where the last line follows from the proof of Theorem 8.1 in Lattimore and Szepesvari (2020). In particular,

\displaystyle \lim_{n \to \infty} \frac{C_2(n,\epsilon)}{\log(n)} = \frac {2}{(\Delta_i - \epsilon)^2}.

Remark 5. For the general case, we would like to obtain a concentration bound of the form

\begin{aligned} \mathbb P(G_{i,s} > 1/n) \leq \psi_3(\epsilon) \cdot \exp(-s \cdot \psi_4(\epsilon)), \end{aligned}

where \psi_3 is a non-negative function and 0 < \psi_4(\epsilon) < K_i for some desired constant K_i such that \psi_4(\epsilon) \to K_i^- as \epsilon \to 0^+. Then for judiciously chosen u,

\begin{aligned} C_2(n,\epsilon) &\leq u + \sum_{s=u+1}^{n-1} \mathbb P(G_{i,s} > 1/n) \\ &\leq u + \psi_3(\epsilon) \cdot n\exp(-u \cdot \psi_4(\epsilon)). \end{aligned}

In particular, setting

\displaystyle u \geq \frac{ \log (n) }{ \psi_4(\epsilon) } \quad \iff \quad n\exp(-u \cdot \psi_4(\epsilon) ) \leq 1

yields

\begin{aligned} C_2(n,\epsilon) &\leq u + \sum_{s=u+1}^{n-1} \mathbb P(G_{i,s} > 1/n) \\ &\leq 1 + \frac{ \log (n )}{\psi_4(\epsilon) } + \psi_3(\epsilon), \end{aligned}

so that C_2(n,\epsilon)/{\log(n)} \to 1/\psi_4(\epsilon) as n \to \infty.

Theorem 1. Given that \nu_i = \mathcal N(\mu_i, 1) for each i, the regret of \rho_{\mathbb E}-TS satisfies

\displaystyle \limsup_{n \to \infty} \frac{\mathcal R_n (\rho_{\mathbb E}\text{-}\mathrm{TS}, \nu) }{\log(n)} \leq \sum_{i : \Delta_i > 0} \frac{2}{\Delta_i}.

Proof. Combining the displays in Lemmas 2, 3 and 4,

\begin{aligned} \limsup_{n \to \infty} \frac{\mathbb E[T_i(n)]}{\log(n)} &\leq \limsup_{n \to \infty} \frac{C_1(n,\epsilon) + C_2(n,\epsilon) + 1}{\log(n)} \\ &\leq \lim_{n \to \infty} \frac{C_1(n,\epsilon) }{\log(n)} + \lim_{n \to \infty} \frac{C_2(n,\epsilon) }{\log(n)} + \lim_{n \to \infty} \frac{1 }{\log(n)} \\ &= 0 + \frac {2}{(\Delta_i - \epsilon)^2} + 0 = \frac {2}{(\Delta_i - \epsilon)^2} . \end{aligned}

Plugging into the regret decomposition,

\displaystyle \limsup_{n \to \infty} \frac{\mathcal R_n (\rho_{\mathbb E}\text{-}\mathrm{TS}, \nu) }{\log(n)} \leq \sum_{i : \Delta_i > 0} \frac{2}{(\Delta_i - \epsilon)^2} \cdot \Delta_i.

Taking \epsilon \to 0^+,

\displaystyle \limsup_{n \to \infty} \frac{\mathcal R_n (\rho_{\mathbb E}\text{-}\mathrm{TS}, \nu) }{\log(n)} \leq \sum_{i : \Delta_i > 0} \frac{2}{\Delta_i^2} \cdot \Delta_i = \sum_{i : \Delta_i > 0} \frac{2}{\Delta_i}.
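As an empirical sanity check of Theorem 1 (a quick simulation, not part of the proof; all names are mine): on a two-armed instance with gap \Delta, vanilla TS should pull the suboptimal arm roughly 2\log(n)/\Delta^2 times, a vanishingly small fraction of the horizon n.

```python
import math
import random

def suboptimal_pulls(mu1, mu2, n, rng):
    """Vanilla Thompson sampling on two unit-variance Gaussian arms
    (mu1 > mu2); returns how often the suboptimal arm was pulled."""
    sums, counts = [0.0, 0.0], [0, 0]
    for t in range(n):
        if t < 2:
            i = t  # pull each arm once to initialise
        else:
            # sample theta_j ~ N(mu_hat_j, 1/T_j) and play the argmax
            theta = [rng.gauss(sums[j] / counts[j], 1.0 / math.sqrt(counts[j]))
                     for j in range(2)]
            i = 0 if theta[0] >= theta[1] else 1
        sums[i] += rng.gauss([mu1, mu2][i], 1.0)
        counts[i] += 1
    return counts[1]
```

With \Delta = 0.5 and n = 5000, the theorem predicts on the order of 2\log(5000)/0.25 \approx 68 suboptimal pulls out of 5000.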

Remark 6. Assuming the presence of sufficiently nice concentration bounds in Remarks 4 and 5, we can recover the generalised regret bound

\displaystyle \limsup_{n \to \infty} \frac{ \mathcal R_n(\rho\text{-}\mathrm{TS}, \nu) }{ \log(n) } \leq \sum_{i : \Delta_i^{\rho} > 0} \frac{ \Delta_i^{\rho} }{ \psi_{i, 4}(\epsilon)} .

Taking \epsilon \to 0^+,

\displaystyle \limsup_{n \to \infty} \frac{ \mathcal R_n(\rho\text{-}\mathrm{TS}, \nu) }{ \log(n) } \leq \sum_{i : \Delta_i^{\rho} > 0} \frac{ \Delta_i^{\rho} }{K_i} .

That was…intense. I took at least six hours to type this out, and several more days to include the generalising remarks. And this was supposedly the warm-up for my undergraduate research projects. I have to say…I probably would have progressed much further if I had taken the time then to write this proof out as I did today.

Before we can finally lay to rest our short detour into multi-armed bandits, we must ask the question: is this Thompson sampling algorithm, in some sense, the best? We will answer this question in the affirmative next time.

—Joel Kindiak, 13 Feb 26, 2117H
