Let’s finally discuss Thompson sampling.
Previously, we looked at the UCB algorithm, which incorporates the dynamic reward history into its exploration-exploitation tradeoff and yields an upper bound on its regret of the form
After an initial pull of each arm that yields the reward history, subsequent arm pulls are computed deterministically from that history:
where the expression being arg-maxed is motivated by a tail concentration bound that, for a given confidence level δ, is satisfied by any arm with a σ-sub-Gaussian reward distribution:
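(For readers who prefer to see things in code: here is a minimal Python sketch of a UCB1-style selection rule in the spirit of the display above. The exploration constant, the confidence parameter delta and the assumption of 1-sub-Gaussian rewards are illustrative choices of mine, not necessarily those of the earlier post.)

import numpy as np

def ucb_select(rewards_by_arm, delta=0.01):
    # Pick the arm maximising empirical mean plus a sub-Gaussian exploration bonus.
    # rewards_by_arm[i] holds the rewards observed so far from arm i; each arm is
    # assumed to have been pulled at least once.
    indices = []
    for rewards in rewards_by_arm:
        rewards = np.asarray(rewards, dtype=float)
        pulls = rewards.size
        bonus = np.sqrt(2.0 * np.log(1.0 / delta) / pulls)  # 1-sub-Gaussian bonus
        indices.append(rewards.mean() + bonus)
    return int(np.argmax(indices))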
Thompson sampling aims to randomise this selection. To simplify our discussion, we will be given knowledge that each arm's reward distribution belongs to a known family, and our goal is to learn the unknown parameter that pins it down.
When we initialise the algorithm, we are completely ignorant of the distribution of any arm. So we get to choose our initial guess. Each arm carries a guess of its distribution, and the guess before any pulls is our initial guess. Intuitively, after a number of pulls of an arm, we want to update our guess given the previous guess and the reward obtained from said arm, i.e. we set
where the update rule is carefully curated for our purposes.
The Thompson sampling algorithm then proceeds as follows. As an initial exploration phase, pull each arm once, then update the arm distributions:
For each subsequent round, draw one sample independently from each arm's current belief distribution, select the arm whose sample is largest, and finally collect the reward from that arm. Then update the arms just like before:
where the various probability measures are still conditioned on the reward history, but omitted for brevity.
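To keep ourselves honest, here is a minimal Python sketch of this loop. The belief representation and function names are mine: sample_value draws one value from an arm's current belief, update performs the posterior-style update, and pull draws a reward from the chosen arm.

def thompson_sampling(beliefs, sample_value, update, pull, horizon):
    # Generic Thompson sampling: explore each arm once, then repeatedly sample one
    # value per arm from its belief, play the argmax, and update only that arm.
    k = len(beliefs)
    history = []
    for arm in range(k):                      # initial exploration phase
        reward = pull(arm)
        beliefs[arm] = update(beliefs[arm], reward)
        history.append((arm, reward))
    for _ in range(horizon - k):              # main loop
        samples = [sample_value(belief) for belief in beliefs]
        arm = max(range(k), key=lambda i: samples[i])
        reward = pull(arm)
        beliefs[arm] = update(beliefs[arm], reward)
        history.append((arm, reward))
    return history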
Our current discussion has brought us far too high into the stratosphere, and we need to return to earth. It stands to reason then that, in the absence of additional information, a meaningful “initial guess” would be
After pulling an arm once and receiving a reward, our next guess should, intuitively, take the form of an updated distribution whose parameters are intimately related to the reward we just observed. We can accomplish that goal using conjugate priors, justified by a measure-theoretic application of Bayes’ theorem.
Lemma 1. Suppose that, for any arm,
Using conjugate priors,
In particular, if the rewards are Gaussian with known variance, then
Proof. See Section 34.3 and Example 34.7 in Lattimore and Szepesvari (2020).
Therefore, we would use the procedure as follows:
with all relevant quantities defined in Lemma 1.
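To make this concrete in the Gaussian case we will use later: if an arm's rewards are Gaussian with unit variance and we take a standard normal initial guess (an illustrative assumption of mine; any conjugate normal prior behaves analogously), then after observing rewards x_1, …, x_m from that arm, the updated guess is

\[ \mathcal{N}\left( \frac{x_1 + \cdots + x_m}{m+1},\ \frac{1}{m+1} \right), \]

i.e. a Gaussian whose mean shrinks the empirical average slightly toward the prior mean of zero and whose variance decays like one over the number of observations.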
Remark 1. Observe that the mean of a Gaussian is just its first parameter. More generally, since a Gaussian distribution is characterised by its mean and variance, for any risk functional ρ, there exists a unique function of the mean and variance that evaluates ρ on that Gaussian.
Remark 2. If all we know is that each arm's distribution lies in a more general family, then we would require a stronger update procedure, as outlined by Zhu and Tan (2020) (Figure 2). The upside of this approach is that, for any arm, we can sample the relevant parameters to form the randomised distribution. Here, the parameters are updated according to rules
similar in spirit to Lemma 1. Denoting the sampled quantities accordingly, we sample them independently for each arm and define the corresponding index as the risk functional evaluated at the randomised distribution. Then the arm-selection criterion would be similarly modified to
Now we ask the crucial question: given a bandit instance with a fixed horizon, how do we upper-bound the following cumulative regret?
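For reference, in the simple mean-based case the cumulative regret being asked about is the standard quantity below, written in my own notation with k arms of means μ_1, …, μ_k, chosen arm A_t at time t, and T_i(n) pulls of arm i up to the horizon n; in the risk-functional setting one would replace each mean by the corresponding risk value:

\[ R_n = n \max_{1 \le i \le k} \mu_i - \mathbb{E}\left[ \sum_{t=1}^{n} \mu_{A_t} \right] = \sum_{i=1}^{k} \Delta_i \, \mathbb{E}[T_i(n)], \qquad \Delta_i := \max_{j} \mu_j - \mu_i. \]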
My attempt at solving this problem for the risk functional being the conditional value-at-risk (CVaR) was the substance of my undergraduate research, inspired by Zhu and Tan (2020) and Baudry et al. (2021). In this post, we will ultimately only solve the simple case where the risk functional is the mean and each arm's reward distribution is Gaussian with unit variance, while we chart the way for future progress on the generalised problem.
Remark 3. There are many other distributions that the arms could take. In the case where the arms are all Bernoulli bandits, my final year project solves the problem in the affirmative for general risk measures, adapted from seminal works by Riou and Honda (2020) and Baudry et al. (2021). It even solves the problem if each arm is simply known to have finite support, albeit only guaranteed for a stricter subset of (nonetheless useful) risk functionals. Unfortunately, it has not solved the general case of arms with arbitrary bounded support, or arms being Gaussian, or even arms being sub-Gaussian. I was tempted to restart this project, but shall relegate it as an exercise. Doing these posts on bandits brought me a closure I never knew I needed.
Lemma 2. Assume that arm 1 is uniquely optimal. Fix a suboptimal arm and a tolerance. Define the two quantities of interest
and
where:
the per-arm samples are drawn independently,
the remaining quantities are determined by the reward history, and
ρ is a risk functional (i.e. a map from probability distributions to the real numbers).
Then
Hence, we remove the randomness of the arm-selection process in our subsequent analyses.
Proof. The result and its proof are a mild generalisation of Theorem 36.2 in Lattimore and Szepesvari (2020). By construction, the stated relation holds almost surely. Therefore, we now decompose the expectation of interest:
To upper-bound the first of these quantities, we refer to the solutions to Problem 36.5 of Lattimore and Szepesvari (2020). Define
Using a union bound,
Therefore, it suffices to upper-bound
To upper-bound the next quantity, first introduce notation for the chosen non-optimal arm:
Then
since the sampled value is conditionally independent of the chosen arm given the reward history. To lower-bound the second term further,
and conditional independence given the reward history implies that
Therefore, we can upper-bound this quantity by
To upper-bound the next quantity, recall that
Therefore,
Taking expectations,
To upper-bound the final quantity, note that the defining event occurs if and only if the following holds:
Therefore,
Lemma 2 gives us the general recipe to evaluate the regret bound for ρ-TS, but we don’t get any concrete result until we consider specific bandit instances. For simplicity, let’s suppose that each arm's reward distribution is Gaussian with unit variance. We initialise the beliefs as discussed above and use Lemma 1 to update our belief distribution of arm i at time t after observing that arm's independently drawn rewards.
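In code, the specific instance just described amounts to the following drop-in pieces for the generic loop sketched earlier, where a belief is stored as the pair (sum of that arm's rewards, number of its pulls) and the standard normal initial guess is again my illustrative assumption:

import random

def sample_value(belief):
    # Draw from the conjugate posterior N(s / (n + 1), 1 / (n + 1))
    # (standard normal prior, unit-variance rewards).
    s, n = belief
    return random.gauss(s / (n + 1), (1.0 / (n + 1)) ** 0.5)

def update(belief, reward):
    s, n = belief
    return (s + reward, n + 1)

# These plug into the earlier loop with, say, beliefs = [(0.0, 0)] * k and
# pull = lambda arm: random.gauss(true_means[arm], 1.0).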
How do we upper-bound the cumulative regret in this instance?
Fix a suboptimal arm and consider the regret bound recipe in Lemma 2:
where the relevant quantities are defined by
and
where the samples involved are drawn independently. We aim to upper-bound both terms and then eventually make a suitable choice of the free parameter.
Lemma 3. The first term in this recipe remains bounded by a constant as the horizon tends to infinity.
Proof. To upper-bound this term, we first use linearity:
The key idea is to note that the value of each summand is deterministic when conditioned on the reward history, since
and the distribution of the sampled value, in turn, is determined by the reward history:
Therefore, for flexibility, we introduce notation for the p.d.f. of this conditional distribution:
and for its c.d.f.:
Conditioned on the reward history,
so that
almost surely. Therefore, by taking advantage of the marginal distribution and using a change of variables,
Therefore, we need to upper-bound two integrals. For the first integral, given
and the Gaussian tail bound
we have
For the second integral, using a tail lower bound by Abramowitz and Stegun (1964),
followed by algebra and integration by parts,
Combining both results,
In particular, the sum remains bounded as the horizon tends to infinity.
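Since the whole argument leans on standard normal tail estimates, here is a quick numerical check of the two bounds I have in mind here, namely the sub-Gaussian upper bound exp(-t^2 / 2) and a Mills-ratio-style lower bound, t / (t^2 + 1) times the standard normal density, in the spirit of Abramowitz and Stegun; whether these are exactly the constants used in the displays above is my assumption.

import math
from statistics import NormalDist

Z = NormalDist()
for t in (0.5, 1.0, 2.0, 3.0, 4.0):
    tail = 1.0 - Z.cdf(t)                                      # P(Z >= t)
    upper = math.exp(-t * t / 2.0)                             # sub-Gaussian upper bound
    density = math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
    lower = t * density / (t * t + 1.0)                        # Abramowitz-Stegun-style lower bound
    assert lower <= tail <= upper
    print(f"t={t}: {lower:.3e} <= {tail:.3e} <= {upper:.3e}")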
Remark 4. In the general setting, we would also aim to prove an analogous bound, so that the corresponding sum remains controlled as the horizon grows. In the proof of Lemma 3, we have the dependency of the sampled value on the belief distribution, and that intermediate dependency has a relatively characterisable distribution conditioned on the reward history. More generally, at any time, we would obtain the belief distribution as the image of the reward history under some deterministic function. In the vanilla case, each of these objects specialises to the corresponding quantity in the proof of Lemma 3.
If we adopt the same strategy as in Lemma 3, then we would like to be sufficiently computationally familiar with the relevant Radon-Nikodým derivative with respect to a suitable reference measure. Taking the integral over the appropriate measure space that describes the possible realisations of the reward history, the conditional expectation should, once again, yield
Hence, we would require tail lower bounds of the form
for non-negative functions chosen so that the resulting bound is summable and yields a constant-order term. It suffices to say that the lion’s share of research in this field would go toward discovering and deriving useful tail lower bounds for these belief distributions, where we are reminded that
Lemma 4. The second term in the recipe grows at most logarithmically in the horizon.
Proof. To upper-bound this term, we cheat. We first note that
Define the auxiliary quantity by
Using the Gaussian tail bound again,
where the last line follows from the proof of Theorem 8.1 in Lattimore and Szepesvari (2020). In particular,
Remark 5. For the general case, we would like to obtain a concentration bound of the form
where the bounding function is non-negative and, for some desired constant, decays appropriately as its argument grows. Then, for a judiciously chosen threshold,
In particular, a suitable setting of that threshold yields
so that this term again grows at most logarithmically in the horizon.
Theorem 1. Given that each arm's reward distribution is Gaussian with unit variance, the regret bound of ρ-TS is given by
Proof. Combining the displays in Lemmas 2, 3 and 4,
Plugging into the regret decomposition,
Taking a suitable choice of the free parameter then yields the claimed bound.
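As an empirical sanity check of the flavour of Theorem 1 (illustrative only: the three-armed instance, the standard normal initial guesses and the repetition count below are my choices), the cumulative regret of the Gaussian sampler grows markedly sublinearly in the horizon:

import random

def gaussian_ts_regret(true_means, horizon):
    # Cumulative pseudo-regret of Gaussian Thompson sampling with unit-variance
    # rewards and N(0, 1) initial guesses on each mean.
    k, best = len(true_means), max(true_means)
    sums, counts, regret = [0.0] * k, [0] * k, 0.0
    for _ in range(horizon):
        samples = [random.gauss(sums[i] / (counts[i] + 1), (1.0 / (counts[i] + 1)) ** 0.5)
                   for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        sums[arm] += random.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        regret += best - true_means[arm]
    return regret

for n in (1_000, 4_000, 16_000):
    average = sum(gaussian_ts_regret([0.0, 0.25, 0.5], n) for _ in range(5)) / 5
    print(n, round(average, 1))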
Remark 6. Assuming the presence of sufficiently nice concentration bounds in Remarks 4 and 5, we can recover the generalised regret bound
Taking the analogous choice of parameter then yields the corresponding bound.
That was…intense. I took at least six hours to type this out. And several more days to include the generalising remarks. And this was supposedly the warm-up for my undergraduate research projects. I have to say…I probably would have progressed much further if I had taken the time back then to write this proof out as I did today.
Before we can finally lay to rest our short detour into multi-armed bandits, we must ask the question: is this Thompson sampling algorithm, in a sense, the best? We will answer this question in the affirmative next time.
—Joel Kindiak, 13 Feb 26, 2117H