Previously, we looked at the explore-then-commit algorithm, and asked how we can “automate” the trade-off between exploring and exploiting arms.
Lemma 1. Let $X_1, \dots, X_n$ be $\sigma$-sub-Gaussian with mean $\mu$ and sampled independently. Define $\hat{\mu} := \frac{1}{n} \sum_{t=1}^{n} X_t$. Then for any $\delta \in (0, 1)$,

$$\mathbb{P}\left( \mu \geq \hat{\mu} + \sigma \sqrt{\frac{2 \log(1/\delta)}{n}} \right) \leq \delta.$$

Proof. Since each $X_t$ is $\sigma$-sub-Gaussian, $\mu - \hat{\mu}$ has zero mean and is also $(\sigma/\sqrt{n})$-sub-Gaussian. Using the tail concentration for sub-Gaussian random variables, for any $\varepsilon \geq 0$,

$$\mathbb{P}(\mu - \hat{\mu} \geq \varepsilon) \leq \exp\left( -\frac{n \varepsilon^2}{2 \sigma^2} \right).$$

Setting the right-hand side equal to $\delta$ and solving for $\varepsilon$,

$$\varepsilon = \sigma \sqrt{\frac{2 \log(1/\delta)}{n}}.$$

Therefore,

$$\mathbb{P}\left( \mu \geq \hat{\mu} + \sigma \sqrt{\frac{2 \log(1/\delta)}{n}} \right) \leq \delta. \qquad \blacksquare$$
We want to use Lemma 1 to design our algorithm. What Lemma 1 communicates to us is that for small $\delta$ and large $n$,

$$\mu < \hat{\mu} + \sigma \sqrt{\frac{2 \log(1/\delta)}{n}} \quad \text{with probability at least } 1 - \delta.$$

That is, with probability close to $1$, $\mu$ can be estimated using $\hat{\mu}$ and some upper confidence bound (UCB). We can use this observation to design our UCB algorithm.
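To see Lemma 1 in action, here is a quick Monte Carlo check; it is a minimal sketch assuming Gaussian rewards (which are $1$-sub-Gaussian), and all variable names are my own illustrative choices.

```python
import numpy as np

# Monte Carlo sanity check of Lemma 1 for N(mu, 1) rewards, which are
# 1-sub-Gaussian (sigma = 1). The parameter values below are illustrative.
rng = np.random.default_rng(0)
mu, n, delta, trials = 0.5, 100, 0.05, 100_000

samples = rng.normal(mu, 1.0, size=(trials, n))
mu_hat = samples.mean(axis=1)
ucb = mu_hat + np.sqrt(2 * np.log(1 / delta) / n)

# Fraction of trials where the UCB fails to cover the true mean;
# Lemma 1 says this should be at most delta = 0.05.
print("failure rate:", np.mean(mu >= ucb))
```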
Let $\nu = (\nu_1, \dots, \nu_K)$ denote a $K$-armed bandit with $\nu_1$ having the uniquely largest mean $\mu_1$, and suppose for simplicity that each $\nu_i$ is known to be $1$-sub-Gaussian. Recall that for any policy $\pi$, we can upper-bound the cumulative regret by

$$R_T \leq \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[T_i(T)],$$

where $T$ denotes the time horizon, $T_i(T)$ denotes the number of pulls of arm $i$ up to time $T$, and $\Delta_i := \mu_1 - \mu_i$ denotes the sub-optimality gap. Before designing the UCB algorithm, we introduce several fixed-time notions for a given arm $i$ with distribution $\nu_i$:
- For any $s \in \mathbb{N}$, define $\hat{\mu}_{i,s} := \frac{1}{s} \sum_{j=1}^{s} X_{i,j}$, where $X_{i,j} \sim \nu_i$ independently.
- For any $s \in \mathbb{N}$ and $\delta \in (0, 1)$, define $\mathrm{UCB}_{i,s}(\delta) := \hat{\mu}_{i,s} + \sqrt{\frac{2 \log(1/\delta)}{s}}$.

Hence, given the number of pulls $T_i(t)$ of arm $i$ up to and including time $t$, define

$$\mathrm{UCB}_i(t, \delta) := \mathrm{UCB}_{i, T_i(t)}(\delta) = \hat{\mu}_{i, T_i(t)} + \sqrt{\frac{2 \log(1/\delta)}{T_i(t)}}.$$
We shall design the UCB algorithm as follows. Pull each arm once, so that $T_i(K) = 1$ for every $i$. For each time $t > K$, pull the arm $A_t$ that maximises the upper confidence bound at time $t - 1$, that is,

$$A_t := \arg\max_{i \in [K]} \mathrm{UCB}_i(t - 1, \delta).$$
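Before analysing the algorithm, here is a minimal sketch of it in Python, assuming unit-variance Gaussian rewards and a fixed $\delta$; the function name `ucb_run` and its parameters are my own illustrative choices, not from any library.

```python
import numpy as np

def ucb_run(means, horizon, delta, seed=0):
    """Run the UCB algorithm on a Gaussian bandit with unit variance."""
    rng = np.random.default_rng(seed)
    k = len(means)
    pulls = np.zeros(k, dtype=int)      # T_i(t)
    sums = np.zeros(k)                  # running reward sums
    regret = 0.0
    best = max(means)

    for t in range(horizon):
        if t < k:
            arm = t                     # pull each arm once
        else:
            mu_hat = sums / pulls
            width = np.sqrt(2 * np.log(1 / delta) / pulls)
            arm = int(np.argmax(mu_hat + width))
        reward = rng.normal(means[arm], 1.0)
        pulls[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]     # pseudo-regret via the gaps
    return regret, pulls

# Example: two arms with gap 0.5, and delta = 1/T^2 as in Theorem 1 below.
T = 10_000
regret, pulls = ucb_run([0.5, 0.0], T, delta=1.0 / T**2)
print(f"regret = {regret:.1f}, pulls = {pulls}")
```

Note that $\delta$ is fixed in advance; Theorem 1 below picks $\delta = 1/T^2$ given the horizon.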
Lemma 2. For any event $G_i$, the number of pulls of arm $i$ can be upper-bounded by

$$\mathbb{E}[T_i(T)] \leq \mathbb{E}[\mathbf{1}\{G_i\} \, T_i(T)] + T \, \mathbb{P}(G_i^c).$$

Proof. By the law of total expectation, since $T_i(T) \leq T$ is always true,

$$\mathbb{E}[T_i(T)] = \mathbb{E}[\mathbf{1}\{G_i\} \, T_i(T)] + \mathbb{E}[\mathbf{1}\{G_i^c\} \, T_i(T)] \leq \mathbb{E}[\mathbf{1}\{G_i\} \, T_i(T)] + T \, \mathbb{P}(G_i^c). \qquad \blacksquare$$
For each suboptimal arm $i$, define the ‘good event’ $G_i$ that satisfies the following two conditions:

- For any $s \in [T]$, $\mu_1 < \mathrm{UCB}_{1,s}(\delta)$.
- We have $\mathrm{UCB}_{i,u_i}(\delta) < \mu_1$, for some number of pulls $u_i \in [T]$ to be chosen later.

The first condition requires $\mathrm{UCB}_{1,s}(\delta)$ to always be ‘optimistic’ relative to $\mu_1$, and the second condition requires that after $u_i$ pulls of arm $i$, $\mathrm{UCB}_{i,u_i}(\delta)$ is less optimistic than $\mu_1$. Expressed in terms of events,

$$G_i := \left\{ \mu_1 < \min_{s \in [T]} \mathrm{UCB}_{1,s}(\delta) \right\} \cap \left\{ \mathrm{UCB}_{i,u_i}(\delta) < \mu_1 \right\}$$

occurs. By Lemma 2,

$$\mathbb{E}[T_i(T)] \leq \mathbb{E}[\mathbf{1}\{G_i\} \, T_i(T)] + T \, \mathbb{P}(G_i^c).$$

It remains to upper-bound $T_i(T)$ under $G_i$, as well as $\mathbb{P}(G_i^c)$.
Lemma 3. Under the good event $G_i$, $T_i(T) \leq u_i$.

Proof. Suppose $G_i$ happens, and assume for a contradiction that $T_i(T) > u_i$. Then there exists some $t \in \{K + 1, \dots, T\}$ such that $T_i(t - 1) = u_i$ and $A_t = i$. At this point in time,

$$\mathrm{UCB}_i(t - 1, \delta) = \mathrm{UCB}_{i,u_i}(\delta) < \mu_1 < \mathrm{UCB}_{1, T_1(t-1)}(\delta) = \mathrm{UCB}_1(t - 1, \delta).$$

By the definition of $A_t$,

$$A_t = \arg\max_{j \in [K]} \mathrm{UCB}_j(t - 1, \delta) \neq i,$$

a contradiction. $\blacksquare$
Lemma 4. There exists $u_i$ depending on $\delta$ and $\Delta_i$ such that

$$\mathbb{P}(G_i^c) \leq T \delta + \exp\left( -\frac{u_i \Delta_i^2}{8} \right).$$

Proof. Taking the complement of $G_i$,

$$G_i^c = \left\{ \mu_1 \geq \min_{s \in [T]} \mathrm{UCB}_{1,s}(\delta) \right\} \cup \left\{ \mathrm{UCB}_{i,u_i}(\delta) \geq \mu_1 \right\}.$$

Taking a union bound,

$$\mathbb{P}(G_i^c) \leq \mathbb{P}\left( \mu_1 \geq \min_{s \in [T]} \mathrm{UCB}_{1,s}(\delta) \right) + \mathbb{P}\left( \mathrm{UCB}_{i,u_i}(\delta) \geq \mu_1 \right).$$

We can upper-bound the first probability by containing the first event:

$$\left\{ \mu_1 \geq \min_{s \in [T]} \mathrm{UCB}_{1,s}(\delta) \right\} \subseteq \bigcup_{s=1}^{T} \left\{ \mu_1 \geq \mathrm{UCB}_{1,s}(\delta) \right\}.$$

By taking a union bound and applying Lemma 1,

$$\mathbb{P}\left( \mu_1 \geq \min_{s \in [T]} \mathrm{UCB}_{1,s}(\delta) \right) \leq \sum_{s=1}^{T} \mathbb{P}\left( \mu_1 \geq \mathrm{UCB}_{1,s}(\delta) \right) \leq T \delta.$$

To upper-bound the second probability, we unravel its definition:

$$\left\{ \mathrm{UCB}_{i,u_i}(\delta) \geq \mu_1 \right\} = \left\{ \hat{\mu}_{i,u_i} - \mu_i \geq \Delta_i - \sqrt{\frac{2 \log(1/\delta)}{u_i}} \right\} \subseteq \left\{ \hat{\mu}_{i,u_i} - \mu_i \geq \frac{\Delta_i}{2} \right\},$$

where the final inclusion is obtained by a choice of $u_i$ such that

$$\sqrt{\frac{2 \log(1/\delta)}{u_i}} \leq \frac{\Delta_i}{2}.$$

Using the same tail concentration as in Lemma 1,

$$\mathbb{P}\left( \hat{\mu}_{i,u_i} - \mu_i \geq \frac{\Delta_i}{2} \right) \leq \exp\left( -\frac{u_i \Delta_i^2}{8} \right).$$

Therefore,

$$\mathbb{P}(G_i^c) \leq T \delta + \exp\left( -\frac{u_i \Delta_i^2}{8} \right). \qquad \blacksquare$$
Theorem 1. Given the horizon $T$, define the error threshold $\delta := 1/T^2$. Then the cumulative regret of the UCB algorithm can be upper-bounded by

$$R_T \leq \sum_{i : \Delta_i > 0} \left( 3 \Delta_i + \frac{16 \log T}{\Delta_i} \right).$$

If furthermore $\Delta_i \leq 1$ for all $i \in [K]$, then we recover the following more visually-friendly regret bound:

$$R_T \leq \sqrt{K T (3 + 16 \log T)} = O\left( \sqrt{K T \log T} \right).$$
Proof. Combining Lemmas 3 and 4, there exists $u_i$ such that

$$\mathbb{E}[T_i(T)] \leq u_i + T \, \mathbb{P}(G_i^c) \leq u_i + T^2 \delta + T \exp\left( -\frac{u_i \Delta_i^2}{8} \right).$$

As per Lemma 4, we required

$$\sqrt{\frac{2 \log(1/\delta)}{u_i}} \leq \frac{\Delta_i}{2}.$$

For the arms such that $\Delta_i > 0$, solving for $u_i$,

$$u_i \geq \frac{8 \log(1/\delta)}{\Delta_i^2}.$$

This lower bound allows us to upper-bound the exponential term as

$$\exp\left( -\frac{u_i \Delta_i^2}{8} \right) \leq \exp\left( -\log(1/\delta) \right) = \delta.$$

Hence, we set $u_i$ to be the smallest integer such that

$$u_i \geq \frac{8 \log(1/\delta)}{\Delta_i^2}, \quad \text{so that} \quad u_i \leq 1 + \frac{8 \log(1/\delta)}{\Delta_i^2}.$$

Therefore,

$$\mathbb{E}[T_i(T)] \leq 1 + \frac{8 \log(1/\delta)}{\Delta_i^2} + T^2 \delta + T \delta.$$

For convenience, we choose $\delta$ so that the last two terms are bounded by constants:

$$T^2 \delta \leq 1 \quad \text{and} \quad T \delta \leq 1.$$

Setting $\delta = 1/T^2$, we get $T^2 \delta = 1$, $T \delta = 1/T \leq 1$,

and

$$\log(1/\delta) = 2 \log T, \quad \text{hence} \quad \mathbb{E}[T_i(T)] \leq 3 + \frac{16 \log T}{\Delta_i^2}.$$

Therefore,

$$R_T \leq \sum_{i : \Delta_i > 0} \Delta_i \, \mathbb{E}[T_i(T)] \leq \sum_{i : \Delta_i > 0} \left( 3 \Delta_i + \frac{16 \log T}{\Delta_i} \right).$$

For the visually-friendly bound, the Cauchy–Schwarz inequality together with $\sum_{i=1}^{K} \mathbb{E}[T_i(T)] = T$ yields

$$R_T = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[T_i(T)] \leq \sqrt{T \sum_{i=1}^{K} \Delta_i^2 \, \mathbb{E}[T_i(T)]},$$

and since $\Delta_i \leq 1$ implies $\Delta_i^2 \, \mathbb{E}[T_i(T)] \leq 3 \Delta_i^2 + 16 \log T \leq 3 + 16 \log T$, we conclude

$$R_T \leq \sqrt{K T (3 + 16 \log T)}. \qquad \blacksquare$$
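To get a feel for the constants, here is a short computation of both bounds on a hypothetical instance; the gaps and horizon below are my own illustrative choices, not values from the analysis above.

```python
import numpy as np

# Evaluate the two Theorem 1 bounds for an assumed instance.
T = 10_000
gaps = np.array([0.0, 0.1, 0.5])      # Delta_1 = 0 for the optimal arm
K = len(gaps)

pos = gaps[gaps > 0]
gap_dependent = np.sum(3 * pos + 16 * np.log(T) / pos)
gap_free = np.sqrt(K * T * (3 + 16 * np.log(T)))

print(f"gap-dependent bound: {gap_dependent:.0f}")
print(f"gap-free bound:      {gap_free:.0f}")
```

As the smallest gap shrinks, the gap-dependent bound grows without limit, while the gap-free bound stays at $O(\sqrt{K T \log T})$.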
My undergraduate research professor and his collaborators proved a generalised version for risk measures $\rho(\nu_i)$ in place of the mean $\mu_i$, where $\rho$ denotes a sufficiently well-behaved risk functional.
Theorem 2 (Tan et al., 2022). By modifying the UCB algorithm to account for general risk measures $\rho$, called $\rho$-LCB, the cumulative regret can be upper-bounded by

$$R_T^\rho \leq \sum_{i : \Delta_i^\rho > 0} \left( c_1 \Delta_i^\rho + \frac{c_2 \log T}{g(\Delta_i^\rho)} \right),$$

where $c_1, c_2$ are pre-defined universal constants and $g$ is a function, both given in the aforementioned paper. Here, $\Delta_i^\rho := \rho(\nu_1) - \rho(\nu_i)$, and we assume $\rho(\nu_1)$ to be the maximum.
Proof. See Theorem 51 in the paper here. The technical portion of this proof is deriving a concentration bound analogous to that of Lemma 1, and the rest of the proof follows the same basic strategy as that of Theorem 1.
Though more convoluted due to the generality of $\rho$, the regret bound in Theorem 2 matches Theorem 1 in spirit, taking the form

$$R_T^\rho = O\left( \sum_{i : \Delta_i^\rho > 0} \left( \Delta_i^\rho + \frac{\log T}{g(\Delta_i^\rho)} \right) \right)$$

for the computable constants $c_1, c_2$ and function $g$.
The next algorithm is a modification of the UCB algorithm and is the one closest to my heart—Thompson sampling. In the UCB algorithm, we compute our “optimism” of arm $i$ using its upper confidence bound $\mathrm{UCB}_i(t, \delta)$. Conditioned on the number of pulls $T_i(t)$ and the reward history $X_{i,1}, \dots, X_{i,T_i(t)}$, $\mathrm{UCB}_i(t, \delta)$ is a deterministic value.
The idea for Thompson sampling is to randomise this process in a mathematically meaningful manner: Conditioned on the number of pulls $T_i(t)$ and the reward history $X_{i,1}, \dots, X_{i,T_i(t)}$, we sample a random distribution $\hat{\nu}_i(t)$ for each arm from the posterior given that history, and pull the arm that maximises the sampled mean $\mu(\hat{\nu}_i(t))$. In the vanilla case $\nu_i = \mathcal{N}(\mu_i, 1)$ with a flat prior, we have

$$\mu(\hat{\nu}_i(t)) \sim \mathcal{N}\left( \hat{\mu}_{i, T_i(t)}, \frac{1}{T_i(t)} \right)$$

for each $i \in [K]$.
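Here is a matching sketch of this vanilla Gaussian case; again my own minimal implementation, assuming a flat prior and unit observation noise, with illustrative names throughout.

```python
import numpy as np

def thompson_run(means, horizon, seed=0):
    """Gaussian Thompson sampling with a flat prior and unit noise.

    After T_i(t) pulls, the posterior for arm i's mean is
    N(empirical mean, 1 / T_i(t)); we sample once from each posterior
    and pull the argmax of the samples.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    pulls = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    regret = 0.0
    best = max(means)

    for t in range(horizon):
        if t < k:
            arm = t                      # initialise each posterior
        else:
            theta = rng.normal(sums / pulls, 1.0 / np.sqrt(pulls))
            arm = int(np.argmax(theta))
        reward = rng.normal(means[arm], 1.0)
        pulls[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret, pulls

regret, pulls = thompson_run([0.5, 0.0], horizon=10_000)
print(f"regret = {regret:.1f}, pulls = {pulls}")
```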
My work aimed to solve this modified $K$-armed bandit problem (with $\rho$ in place of the mean) by using Thompson sampling. Spanning sufficiently many commonly used risk functionals $\rho$, the algorithm does work for Bernoulli distributions. For a non-singular subset of these functionals, it even works for any distribution with finite support. However, just as the heart of Theorem 2 lies in the vanilla case of Theorem 1, the heart of the risk-averse Thompson sampling approach lies in the vanilla expectation approach.
We will explore this approach next time, and also eventually address what we mean for an algorithm to be optimal.
—Joel Kindiak, 11 Feb 26, 1735H