Automated Confidence

Previously, we looked at the explore-then-commit algorithm, and asked how we can “automate” the trade-off between exploring and exploiting arms.

Lemma 1. Let \nu be \sigma-sub-Gaussian with mean \mu, and let X_1,\dots, X_n \sim \nu be independent samples. Define \hat{\mu} := (X_1 + \dots + X_n)/n. Then for any \delta > 0,

\displaystyle \mathbb P\left( \mu \geq \hat{\mu} + \sqrt{ \frac{2\sigma^2\log(1/\delta)}{n} } \right) \leq \delta.

Proof. Since \nu is \sigma-sub-Gaussian, ( \mu - \hat{\mu} ) has zero mean and is also (\sigma/\sqrt{n})-sub-Gaussian. Using the tail concentration for sub-Gaussian random variables, for any \epsilon > 0,

\displaystyle \mathbb P\left( \mu - \hat{\mu} \geq \epsilon \right) \leq \exp\left( -\frac{\epsilon^2}{2(\sigma/\sqrt n)^2} \right) = \exp \left( -\frac{n\epsilon^2}{2\sigma^2} \right).

Setting the right-hand side equal to \delta and solving for \epsilon,

\displaystyle \exp\left( -\frac{n\epsilon^2}{2\sigma^2} \right) = \delta \quad \iff \quad \epsilon = \sqrt{ \frac{2\sigma^2\log(1/\delta)}{n} }.

Therefore,

\displaystyle \mathbb P\left( \mu - \hat{\mu} \geq \sqrt{ \frac{2\sigma^2\log(1/\delta)}{n} } \right) \leq \delta.

We want to use Lemma 1 to design our algorithm. What Lemma 1 communicates to us is that, for any \delta > 0,

\displaystyle \mathbb P\left( \mu  \leq \hat{\mu} + \sqrt{ \frac{2\sigma^2\log(1/\delta)}{n} } \right) \geq 1 - \delta.

That is, with probability close to 1, \mu can be estimated using \hat{\mu} and some upper confidence bound (UCB). We can use this observation to design our UCB algorithm.
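As a sanity check on Lemma 1, we can estimate the coverage of this bound by simulation. The sketch below assumes Gaussian rewards (which are 1-sub-Gaussian); the particular values of \mu, n, \delta are illustrative, not from the text:

```python
import math
import random

def ucb_width(sigma, n, delta):
    # Half-width of the one-sided confidence interval from Lemma 1.
    return math.sqrt(2 * sigma**2 * math.log(1 / delta) / n)

random.seed(0)
mu, sigma, n, delta = 0.3, 1.0, 50, 0.05
trials = 10_000
covered = sum(
    mu <= sum(random.gauss(mu, sigma) for _ in range(n)) / n + ucb_width(sigma, n, delta)
    for _ in range(trials)
)
print(covered / trials)  # Lemma 1 guarantees this is at least 1 - delta = 0.95
```

Since the Gaussian tail is strictly tighter than the generic sub-Gaussian bound, the empirical coverage typically exceeds 1 - \delta comfortably.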

Let \nu = (\nu_1,\dots, \nu_K) denote a K-armed bandit with \nu_1 having the uniquely largest mean, and suppose for simplicity that \nu_i is known to be \sigma_i-sub-Gaussian. Recall that for any policy \pi, we can upper-bound the cumulative regret by

\displaystyle \mathcal R_n(\pi, \nu) \leq \sum_{i=1}^K \mathbb E[T_i(n)] \cdot \Delta_i,

where n denotes the time horizon, and \Delta_i := \mu_1 - \mu_i denotes the sub-optimality gap. Before designing the UCB algorithm, we introduce some fixed-time notation for a given arm i with distribution \nu_i:

  • For any u \geq 1, define \hat{\mu}_{i,u} = \displaystyle \frac 1u \sum_{s=1}^u X_{i,s}, where X_{i,s} \sim \nu_i independently.
  • For any u \geq 1, define \displaystyle \mathrm{UCB}_{i, u}(\delta) := \hat{\mu}_{i,u} + \sqrt{ \frac{ 2 \sigma_i^2 \log(1/\delta) }{u} }.

Hence, given the number of pulls T_i(t) \geq 1 of arm i up to and including time t, define

\hat{\mu}_{i}(t):= \hat{\mu}_{i, T_i(t)},\quad \mathrm{UCB}_{i}(t, \delta) := \mathrm{UCB}_{i, T_i(t)}(\delta).

We shall design the UCB algorithm as follows. Pull each arm once, so that T_i(K) = 1. For each time t = K,\dots, n-1, pull the arm A_{t+1} that maximises \mathrm{UCB}_i(t, \delta), that is,

\displaystyle A_{t+1} := \underset{i=1,\dots, K}{\arg \max}\ \mathrm{UCB}_i(t, \delta).
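A minimal sketch of this procedure, assuming Gaussian arms; the means, variances, and horizon below are illustrative choices, not from the text:

```python
import math
import random

def ucb(means, sigmas, n, delta):
    """Run UCB on a Gaussian K-armed bandit; return the pull counts T_i(n)."""
    K = len(means)
    counts = [0] * K   # T_i(t)
    sums = [0.0] * K   # running reward sums, so mu_hat_i = sums[i] / counts[i]
    for t in range(n):
        if t < K:
            arm = t    # initialisation: pull each arm once
        else:
            # A_{t+1} = argmax_i UCB_i(t, delta)
            arm = max(
                range(K),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2 * sigmas[i] ** 2 * math.log(1 / delta) / counts[i]),
            )
        reward = random.gauss(means[arm], sigmas[arm])
        counts[arm] += 1
        sums[arm] += reward
    return counts

random.seed(1)
n = 2000
counts = ucb(means=[0.9, 0.5, 0.1], sigmas=[1.0, 1.0, 1.0], n=n, delta=1 / n**2)
print(counts)  # the optimal arm should receive the bulk of the pulls
```

The choice \delta = 1/n^2 anticipates Theorem 1 below.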

Lemma 2. For any event G, the number of pulls of arm i can be upper-bounded by

\begin{aligned} \mathbb E[T_i(n)] &\leq \mathbb E[T_i(n) \mid G] + n \cdot \mathbb P(G^c). \end{aligned}

Proof. By the law of total expectation, since T_i(n) \leq n is always true,

\begin{aligned} \mathbb E[T_i(n)] &= \mathbb E[T_i(n) \mid G] \cdot \underbrace{ \mathbb P(G) }_{\leq 1} + \underbrace{ \mathbb E[T_i(n) \mid G^c] }_{\leq n} \cdot \mathbb P(G^c) \\ &\leq \mathbb E[T_i(n) \mid G] + n \cdot \mathbb P(G^c).\end{aligned}

For each i, fix some integer u_i \geq 1 (to be chosen later) and define the ‘good event’ G_i through the following two conditions:

  • For any t, \mu_1 < \mathrm{UCB}_1(t, \delta).
  • We have \mathrm{UCB}_{i, u_i}(\delta) < \mu_1.

The first condition requires \mathrm{UCB}_1(t, \delta) to always be ‘optimistic’ relative to \mu_1, and the second requires that after u_i pulls, \mathrm{UCB}_{i,u_i}(\delta) is less optimistic than \mu_1. Expressed as an event,

\displaystyle G_i = \left\{ \mu_1 < \min_{t=1,\dots, n} \mathrm{UCB}_1(t, \delta) \right\} \cap \{ \mathrm{UCB}_{i,u_i}(\delta) < \mu_1 \}.

By Lemma 2,

\begin{aligned} \mathbb E[T_i(n)] &\leq \mathbb E[T_i(n) \mid G_i] + n \cdot \mathbb P(G_i^c).\end{aligned}

It remains to upper-bound \mathbb E[T_i(n) \mid G_i] and \mathbb P(G_i^c).

Lemma 3. Under the good event G_i, T_i(n) \leq u_i.

Proof. Suppose G_i happens, and assume for a contradiction that T_i(n) > u_i. Then there exists some t such that T_i(t) = u_i and A_{t+1} = i. At this point in time,

\displaystyle \mathrm{UCB}_i(t, \delta) \geq \mathrm{UCB}_1(t, \delta).

By the definition of G_i,

\begin{aligned} \mathrm{UCB}_i(t, \delta) &= \mathrm{UCB}_{i, u_i}(\delta) < \mu_1 < \mathrm{UCB}_1(t,\delta),\end{aligned}

a contradiction.

Lemma 4. Suppose u_i is large enough that \Delta_i > \sqrt{2\sigma_i^2 \log(1/\delta)/u_i}. Then there exists c \in (0, 1), depending on u_i, \sigma_i^2, \Delta_i, \delta, such that

\displaystyle \mathbb P(G_i^c) \leq n\delta + \exp\left( -\frac{u_i c^2 \Delta_i^2}{2\sigma_i^2} \right).

Proof. Taking the complement of G_i,

\displaystyle G_i^c = \left\{ \mu_1 \geq \min_{t=1,\dots, n} \mathrm{UCB}_1(t, \delta) \right\} \cup \{ \mathrm{UCB}_{i,u_i}(\delta) \geq \mu_1 \}.

Taking a union bound,

\begin{aligned} \mathbb P(G_i^c) &\leq \underbrace{ \mathbb P\left( \mu_1 \geq \min_{t=1,\dots, n} \mathrm{UCB}_1(t, \delta) \right) }_{p_1} + \underbrace{ \mathbb P ( \mathrm{UCB}_{i,u_i}(\delta) \geq \mu_1 ) }_{p_2}. \end{aligned}

We can upper-bound p_1 by containing the first event:

\begin{aligned} \left\{ \mu_1 \geq \min_{t=1,\dots, n} \mathrm{UCB}_1(t, \delta) \right\} &\subseteq \left\{ \mu_1 \geq \min_{s=1,\dots, n} \mathrm{UCB}_{1,s}(\delta) \right\} \\ &\subseteq \bigcup_{s=1}^n \left\{ \mu_1 \geq \mathrm{UCB}_{1,s}(\delta)  \right\} \\ &= \bigcup_{s=1}^n \left\{ \mu_1 \geq \hat{\mu}_{1,s} + \sqrt{\frac{2 \sigma_1^2 \log(1/\delta)}{s}}  \right\}. \end{aligned}

By taking a union bound and applying Lemma 1,

\begin{aligned} p_1 &= \mathbb P\left( \mu_1 \geq \min_{t=1,\dots, n} \mathrm{UCB}_1(t, \delta) \right) \\ &\leq \sum_{s=1}^n \mathbb P \left( \mu_1 \geq \hat{\mu}_{1,s} + \sqrt{\frac{2 \sigma_1^2 \log(1/\delta)}{s}}  \right) \\ &\leq \sum_{s=1}^n \delta = n\delta. \end{aligned}

To upper-bound p_2, we unravel its definition:

\begin{aligned} \{ \mathrm{UCB}_{i,u_i}(\delta) \geq \mu_1 \} &= \left\{ \hat{\mu}_{i,u_i} + \sqrt{\frac{ 2 \sigma_i^2 \log(1/\delta) }{u_i}} \geq \mu_1 \right\} \\ &= \left\{ \hat{\mu}_{i,u_i} -\mu_i + \sqrt{\frac{ 2 \sigma_i^2 \log(1/\delta) }{u_i}} \geq \mu_1 - \mu_i \right\} \\ &= \left\{ \hat{\mu}_{i,u_i} -\mu_i \geq \Delta_i - \sqrt{\frac{ 2 \sigma_i^2 \log(1/\delta) }{u_i}} \right\} \\ &\subseteq \{ \hat{\mu}_{i,u_i} -\mu_i \geq c\Delta_i \}, \end{aligned}

where the final inclusion is obtained by a choice of c \in (0, 1) such that

\displaystyle \Delta_i - \sqrt{\frac{ 2 \sigma_i^2 \log(1/\delta) }{u_i}} \geq c \Delta_i.

Using the same tail concentration as in Lemma 1,

\begin{aligned} \mathbb P( \mathrm{UCB}_{i,u_i}(\delta) \geq \mu_1 ) &\leq \mathbb P( \hat{\mu}_{i,u_i} -\mu_i \geq c\Delta_i ) \leq \exp\left( -\frac{u_i c^2 \Delta_i^2}{2\sigma_i^2} \right). \end{aligned}

Therefore,

\displaystyle \mathbb P(G_i^c) \leq p_1 + p_2 \leq n\delta + \exp\left( -\frac{u_i c^2 \Delta_i^2}{2\sigma_i^2} \right).

Theorem 1. Given the horizon n, define the error threshold \delta := 1/n^2. Then the cumulative regret can be upper-bounded by

\displaystyle \mathcal R_n(\mathrm{UCB}, \nu) \leq \left( 2 + \frac 1n\right) \sum_{i=1}^K\Delta_i + 16 \log(n) \sum_{i : \Delta_i > 0}  \frac{ \sigma_i^2 }{  \Delta_i }.

If, furthermore, \sigma_i = 1 for all i, then since 2 + 1/n \leq 3, we recover the following more visually friendly regret bound:

\displaystyle \mathcal R_n(\mathrm{UCB}, \nu) \leq 3\sum_{i=1}^K\Delta_i + 16 \log(n) \sum_{i : \Delta_i > 0}  \frac{ 1 }{  \Delta_i }.

Proof. Combining Lemmas 3 and 4, there exists c \in (0, 1) such that

\displaystyle \mathbb E[T_i(n)] \leq u_i + n \left[ n\delta + \exp\left( -\frac{u_i c^2 \Delta_i^2}{2\sigma_i^2} \right)\right].

As per Lemma 4, we required

\displaystyle \Delta_i - \sqrt{ \frac{ 2 \sigma_i^2 \log(1/\delta) }{u_i} } \geq c \Delta_i.

For the arms i such that \Delta_i > 0, solving for u_i,

\displaystyle u_i \geq \frac{ 2 \sigma_i^2 \log(1/\delta) }{ (1-c)^2 \Delta_i^2 } \quad \Rightarrow \quad \frac{u_ic^2\Delta_i^2}{2 \sigma_i^2} \geq \frac{ c^2 \log(1/\delta) }{ (1-c)^2  }.

This lower bound allows us to upper-bound the exponential term as

\displaystyle \exp\left( -\frac{u_i c^2 \Delta_i^2}{2\sigma_i^2} \right) \leq  \delta^{c^2/(1-c)^2}.

Hence, we set u_i to be the smallest integer satisfying this bound:

\displaystyle u_i = \left\lceil \frac{ 2 \sigma_i^2 \log(1/\delta) }{ (1-c)^2 \Delta_i^2 } \right\rceil \leq \frac{ 2 \sigma_i^2 \log(1/\delta) }{ (1-c)^2 \Delta_i^2 }  + 1 .

Therefore,

\displaystyle \mathbb E[T_i(n)] \leq \frac{ 2 \sigma_i^2 \log(1/\delta) }{ (1-c)^2 \Delta_i^2 }  + 1 + n^2 \delta + n\delta^{c^2/(1-c)^2}.

For convenience, we choose c = 1/2 so that c/(1-c) = 1:

\displaystyle \mathbb E[T_i(n)] \leq \frac{ 8 \sigma_i^2 \log(1/\delta) }{  \Delta_i^2 }  + 1 + n^2 \delta + n\delta.

Setting \delta = 1/n^2, we have \log(1/\delta) = 2 \log(n) and

\displaystyle 1 + n^2 \delta + n \delta = 2 + \frac 1n.
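This last identity can be checked exactly, using rationals to avoid floating-point noise (the horizon n = 100 is an arbitrary choice):

```python
from fractions import Fraction

n = 100  # any horizon works
delta = Fraction(1, n**2)
assert 1 + n**2 * delta + n * delta == 2 + Fraction(1, n)
```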

Therefore,

\displaystyle \mathcal R_n(\mathrm{UCB}, \nu) \leq \left( 2 + \frac 1n\right) \sum_{i=1}^K\Delta_i + 16 \log(n) \sum_{i : \Delta_i > 0}  \frac{ \sigma_i^2 }{  \Delta_i }.

My undergraduate research professor and his collaborators proved a generalised version for gaps \Delta_i^{\rho}, where \rho denotes a sufficiently well-behaved risk functional.

Theorem 2 (Tan et al., 2022). For a modification of the UCB algorithm accounting for a general risk functional \rho, called \rho-LCB,

\displaystyle \mathcal R_n(\rho\text{-}\mathrm{LCB}, \nu) \leq 5 K \sum_{i = 1}^K \Delta_i^{\rho} + 4L^2 \sigma^2 (32 \sqrt{e \log(n)} + 256 \sqrt 2)^2 \sum_{i : \Delta_i^{\rho} > 0} \frac{ 1}{ \Delta_i^{\rho} },

where K, L, \sigma are constants defined in the aforementioned paper. Here, \Delta_i^{\rho} := \rho(\nu_1) - \rho(\nu_i), and we assume \rho(\nu_1) to be the maximum.

Proof. See Theorem 51 in their paper. The technical portion of this proof is deriving a concentration bound analogous to that of Lemma 1; the rest of the proof follows the same basic strategy as that of Theorem 1.

Though more convoluted due to the generality of \rho, the regret bound in Theorem 2 matches Theorem 1 in spirit, taking the form

\displaystyle \mathcal R_n(\pi, \nu) \leq A \sum_{i : \Delta_i^{\rho} > 0} \Delta_i^{\rho} + f(n) \sum_{i : \Delta_i^{\rho} > 0} \frac{ 1}{ \Delta_i^{\rho} }

for a computable constant A and function f(n).

The next algorithm is a modification of the UCB algorithm, and it is the one closest to my heart: Thompson sampling. In the UCB algorithm, we quantify our “optimism” about arm i through its upper confidence bound \mathrm{UCB}_i(t, \delta). Conditioned on the number of pulls T_i(t) and the reward history (X_{i,1},\dots, X_{i,T_i(t)}), \mathrm{UCB}_i(t, \delta) is a deterministic value.

The idea for Thompson sampling is to randomise this process in a mathematically meaningful manner: Conditioned on the number of pulls T_i(t) and the reward history (X_{i,1},\dots, X_{i,T_i(t)}), we sample a random distribution \tilde{\nu}_i, and pull the arm that maximises \rho(\tilde{\nu}_i). In the vanilla case \rho = \rho_{\mathbb E}, we have \rho(\tilde{\nu}_i) = \mathbb E[Y] for Y \sim \tilde{\nu}_i.
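In the vanilla Bernoulli case, sampling \tilde{\nu}_i amounts to sampling the posterior over each arm’s mean; with Beta(1, 1) priors, this gives the classical scheme sketched below (the arm means and horizon are illustrative assumptions):

```python
import random

def thompson_bernoulli(means, n):
    """Vanilla Thompson sampling: Beta(1, 1) priors over Bernoulli arm means."""
    K = len(means)
    alpha = [1] * K  # 1 + number of observed successes for each arm
    beta = [1] * K   # 1 + number of observed failures for each arm
    counts = [0] * K
    for _ in range(n):
        # Sample a posterior mean for each arm, then pull the argmax.
        arm = max(range(K), key=lambda i: random.betavariate(alpha[i], beta[i]))
        reward = 1 if random.random() < means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        counts[arm] += 1
    return counts

random.seed(2)
counts = thompson_bernoulli(means=[0.7, 0.5, 0.2], n=2000)
print(counts)  # the best arm should again dominate the pulls
```

The randomness of the posterior samples plays the role that the confidence width plays in UCB: poorly explored arms have wide posteriors and thus occasionally produce large samples.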

My work aimed to solve this modified K-armed bandit problem using Thompson sampling. For a wide range of commonly used risk functionals \rho, the algorithm works for Bernoulli distributions; for a nontrivial subset of these functionals, it even works for any distribution with finite support. However, just as the heart of Theorem 2 lies in the vanilla case of Theorem 1, the heart of the risk-averse Thompson sampling approach lies in the vanilla expectation approach.

We will explore this approach next time, and eventually address what it means for an algorithm to be optimal.

—Joel Kindiak, 11 Feb 26, 1735H
