What would be a zero-th order solution to the $k$-armed bandit problem? Given the $k$-armed bandit $\nu = (\nu_1, \dots, \nu_k)$ and a time horizon $n$, our goal is to minimise the quantity

$$R_n = \sum_{i=1}^k \Delta_i \, \mathbb{E}[T_i(n)],$$

where:

- $A_1, \dots, A_n$ denotes our choice of arms, i.e., our solution,
- $T_i(n) = \sum_{t=1}^n \mathbb{1}\{A_t = i\}$ denotes the total number of pulls of arm $i$ as of time $n$,
- and $\Delta_i = \mu_1 - \mu_i$ denotes the suboptimality gap, where $\mu_i$ is the mean of $\nu_i$.

Here, we assumed that $\mu_1$ is the unique maximum of $\mu_1, \dots, \mu_k$, so that $\Delta_1 = 0$ and $\Delta_i > 0$ for $i \neq 1$.
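To make the regret concrete, here is a minimal sketch (with made-up numbers for a hypothetical 3-armed instance) that evaluates $R_n$ for a fixed allocation of pulls:

```python
import numpy as np

# Hypothetical 3-armed instance: mu[i] is the mean of arm i's reward distribution.
mu = np.array([1.0, 0.7, 0.5])   # arm 1 (index 0) is the unique maximum
delta = mu.max() - mu            # suboptimality gaps: [0.0, 0.3, 0.5]

# Suppose a policy pulled the arms this many times over a horizon n = 100.
T = np.array([80, 15, 5])

# Regret: sum over arms of the gap times the (expected) pull count.
print(np.dot(delta, T))          # 0.3 * 15 + 0.5 * 5 = 7.0
```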
As a first pass, let’s explore, then commit.
- We fix an exploration time $m \in \{1, \dots, \lfloor n/k \rfloor\}$, and we explore each arm $m$ times first. During this exploration phase, we would obtain the quantity $\hat{\mu}_i = \frac{1}{m} \sum_{j=1}^m X_{i,j}$, where each $X_{i,j} \sim \nu_i$ independently.
- For the remaining $n - mk$ turns we pull the arm $\hat{i}$ that corresponds to $\hat{\mu}_{\hat{i}} = \max_{1 \le i \le k} \hat{\mu}_i$.
This algorithm is called the explore-then-commit algorithm.
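As a concrete reference, here is a minimal sketch of ETC in Python, assuming unit-variance Gaussian rewards (the means below are made up):

```python
import numpy as np

def explore_then_commit(mu, n, m, rng):
    """Run ETC once on a Gaussian bandit with means `mu`; return the total reward."""
    k = len(mu)
    # Exploration phase: m pulls per arm, unit-variance Gaussian rewards.
    samples = rng.normal(loc=np.repeat(mu, m), scale=1.0)
    mu_hat = samples.reshape(k, m).mean(axis=1)
    # Commit phase: pull the empirically best arm for the remaining n - m*k turns.
    best = int(np.argmax(mu_hat))
    return samples.sum() + rng.normal(loc=mu[best], scale=1.0, size=n - m * k).sum()

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.8])                # hypothetical means; arm 0 is optimal
n, m = 1000, 50
rewards = [explore_then_commit(mu, n, m, rng) for _ in range(200)]
print(n * mu.max() - np.mean(rewards))   # Monte Carlo estimate of the regret R_n
```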
Lemma 1. For each arm $i$,

$$\mathbb{E}[T_i(n)] \le m + (n - mk)\,\mathbb{P}(\hat{\mu}_i \ge \hat{\mu}_1).$$

Proof. For any arm $i$, after the initial exploration phase, the algorithm commits to arm $i$ only if $\hat{\mu}_i \ge \hat{\mu}_1$, so

$$T_i(n) = m + (n - mk)\,\mathbb{1}\{\hat{i} = i\} \le m + (n - mk)\,\mathbb{1}\{\hat{\mu}_i \ge \hat{\mu}_1\}.$$

Therefore, for each arm $i$,

$$\mathbb{E}[T_i(n)] \le m + (n - mk)\,\mathbb{P}(\hat{\mu}_i \ge \hat{\mu}_1).$$
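Lemma 1 is easy to check by simulation; a sketch under a unit-variance Gaussian assumption (means again made up), using the fact that $\hat{\mu}_i \sim \mathcal{N}(\mu_i, 1/m)$ after $m$ pulls:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 0.8, 0.6])     # hypothetical means; arm 0 is optimal
n, m, trials = 1000, 50, 100_000
k = len(mu)

# Draw the empirical means directly: mu_hat_i ~ N(mu_i, 1/m) after m pulls.
mu_hat = rng.normal(mu, 1 / np.sqrt(m), size=(trials, k))
committed = mu_hat.argmax(axis=1)

for i in range(1, k):
    e_T = m + (n - m * k) * (committed == i).mean()                  # E[T_i(n)]
    bound = m + (n - m * k) * (mu_hat[:, i] >= mu_hat[:, 0]).mean()  # Lemma 1
    print(i, e_T, bound)           # the estimate should not exceed the bound
```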
Therefore, the efficacy of the explore-then-commit algorithm depends on our ability to bound the term

$$\mathbb{P}(\hat{\mu}_i \ge \hat{\mu}_1),$$

which itself depends on the underlying distributions $\nu_1, \dots, \nu_k$. There are many probability distributions that we could consider, such as the Bernoulli distribution or the Gaussian distribution. The Gaussian distribution, in particular, yields especially pleasant concentration results.
Lemma 2. For any $\sigma > 0$, if $X \sim \mathcal{N}(0, \sigma^2)$, then for any $\lambda \in \mathbb{R}$,

$$\mathbb{E}[e^{\lambda X}] = e^{\lambda^2 \sigma^2 / 2}.$$

Proof. Fix $\lambda \in \mathbb{R}$ and $Z \sim \mathcal{N}(0, 1)$. By definition,

$$\mathbb{E}[e^{\lambda Z}] = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{\lambda z - z^2/2}\,dz = e^{\lambda^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-(z - \lambda)^2/2}\,dz = e^{\lambda^2/2}.$$

Therefore, $\mathcal{N}(0, 1)$ is $1$-sub-Gaussian.

For the general case, write $X = \sigma Z$. Then

$$\mathbb{E}[e^{\lambda X}] = \mathbb{E}[e^{(\lambda \sigma) Z}] = e^{(\lambda \sigma)^2/2} = e^{\lambda^2 \sigma^2/2}.$$

Therefore, $\mathcal{N}(0, \sigma^2)$ is $\sigma$-sub-Gaussian.
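This identity is easy to sanity-check by Monte Carlo; a quick sketch with arbitrarily chosen parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, lam = 2.0, 0.7                    # arbitrary choices of sigma and lambda
x = rng.normal(0.0, sigma, size=10**6)

print(np.exp(lam * x).mean(),            # Monte Carlo estimate of E[exp(lam * X)]
      np.exp(lam**2 * sigma**2 / 2))     # exact value from Lemma 2
```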
Hence, we generalise this property: any distribution whose moment generating function satisfies such a bound shall henceforth be called sub-Gaussian.
Definition 1. Let $\nu$ denote a probability distribution. We say that $\nu$ is $\sigma$-sub-Gaussian if for any $\lambda \in \mathbb{R}$ and $X \sim \nu$,

$$\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2 \sigma^2 / 2}.$$
Unsurprisingly, $\mathcal{N}(0, \sigma^2)$ is $\sigma$-sub-Gaussian. Intuitively, the tails of a sub-Gaussian distribution decay at least as fast as those of a Gaussian.
Theorem 1. Let $\nu$ be $\sigma$-sub-Gaussian. Then for any $\varepsilon \ge 0$ and $X \sim \nu$,

$$\mathbb{P}(X \ge \varepsilon) \le \exp\left(-\frac{\varepsilon^2}{2\sigma^2}\right).$$

Proof. We use the Cramér-Chernoff method by taking exponentials followed by applying Markov’s inequality: for any $\lambda > 0$,

$$\mathbb{P}(X \ge \varepsilon) = \mathbb{P}(e^{\lambda X} \ge e^{\lambda \varepsilon}) \le \frac{\mathbb{E}[e^{\lambda X}]}{e^{\lambda \varepsilon}} \le \exp\left(\frac{\lambda^2 \sigma^2}{2} - \lambda \varepsilon\right).$$

In particular, the exponent $\frac{\lambda^2 \sigma^2}{2} - \lambda \varepsilon$ is minimised by $\lambda = \varepsilon/\sigma^2$, so that

$$\min_{\lambda > 0}\left(\frac{\lambda^2 \sigma^2}{2} - \lambda \varepsilon\right) = -\frac{\varepsilon^2}{2\sigma^2}.$$

Therefore, particularising to $\lambda = \varepsilon/\sigma^2$,

$$\mathbb{P}(X \ge \varepsilon) \le \exp\left(-\frac{\varepsilon^2}{2\sigma^2}\right).$$
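A quick numerical check of the tail bound against a Gaussian, with arbitrary parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, eps = 1.0, 2.0                          # arbitrary parameters
x = rng.normal(0.0, sigma, size=10**7)

print((x >= eps).mean(),                       # empirical P(X >= eps), ~0.023
      np.exp(-eps**2 / (2 * sigma**2)))        # Theorem 1's bound, ~0.135
```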
Remark 1. Given $X \sim \nu$ with zero mean, we say that $X$ is $\sigma$-sub-Gaussian if $\nu$ is $\sigma$-sub-Gaussian.
Sub-Gaussian distributions can be combined to produce new sub-Gaussian distributions.
Lemma 3. Let $X$ be $\sigma$-sub-Gaussian, $X_1$ be $\sigma_1$-sub-Gaussian, and $X_2$ be $\sigma_2$-sub-Gaussian, all independent of each other. Then the following hold:

- For any $c \in \mathbb{R}$, $cX$ is $|c|\sigma$-sub-Gaussian.
- $X_1 + X_2$ is $\sqrt{\sigma_1^2 + \sigma_2^2}$-sub-Gaussian.
Proof. By a similar argument to that in Lemma 2, for any $\lambda \in \mathbb{R}$,

$$\mathbb{E}[e^{\lambda (cX)}] = \mathbb{E}[e^{(\lambda c) X}] \le e^{(\lambda c)^2 \sigma^2/2} = e^{\lambda^2 (|c|\sigma)^2/2},$$

as required. The second result follows similarly by denoting $\sigma' = \sqrt{\sigma_1^2 + \sigma_2^2}$ and using the independence of the random variables:

$$\mathbb{E}[e^{\lambda (X_1 + X_2)}] = \mathbb{E}[e^{\lambda X_1}]\,\mathbb{E}[e^{\lambda X_2}] \le e^{\lambda^2 \sigma_1^2/2}\, e^{\lambda^2 \sigma_2^2/2} = e^{\lambda^2 (\sigma')^2/2},$$

as required.
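A sketch verifying both closure properties empirically for Gaussians, with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(4)
s1, s2, c, lam = 1.0, 2.0, -3.0, 0.4     # arbitrary parameters
x1 = rng.normal(0.0, s1, size=10**6)
x2 = rng.normal(0.0, s2, size=10**6)

# c * X1 should be |c| * s1-sub-Gaussian: empirical MGF vs. the bound.
print(np.exp(lam * c * x1).mean(), np.exp(lam**2 * (abs(c) * s1)**2 / 2))

# X1 + X2 should be sqrt(s1^2 + s2^2)-sub-Gaussian: likewise.
print(np.exp(lam * (x1 + x2)).mean(), np.exp(lam**2 * (s1**2 + s2**2) / 2))
```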
Definition 2. We say that a probability distribution $\nu$ has zero mean if for any $X \sim \nu$, $\mathbb{E}[X] = 0$.
Corollary 1. Suppose for simplicity that each $\nu_i$ is $1$-sub-Gaussian. Then the explore-then-commit (i.e. ETC) algorithm yields a regret upper bound of

$$R_n \le \sum_{i=1}^k \Delta_i \left(m + (n - mk)\, \exp\left(-\frac{m \Delta_i^2}{4}\right)\right).$$

Proof. Suppose each $\nu_i$ is $1$-sub-Gaussian, i.e. each noise $X_{i,j} - \mu_i$ is $1$-sub-Gaussian (in the sense of Remark 2 below). By Lemma 3, the random variable $\hat{\mu}_i - \mu_i$ has zero mean and is $\frac{1}{\sqrt{m}}$-sub-Gaussian. Likewise, $(\hat{\mu}_i - \mu_i) - (\hat{\mu}_1 - \mu_1)$ has zero mean and is $\sqrt{2/m}$-sub-Gaussian. By Theorem 1,

$$\mathbb{P}(\hat{\mu}_i \ge \hat{\mu}_1) = \mathbb{P}\big((\hat{\mu}_i - \mu_i) - (\hat{\mu}_1 - \mu_1) \ge \Delta_i\big) \le \exp\left(-\frac{m \Delta_i^2}{4}\right).$$

Substituting into Lemma 1 and the definition of $R_n$, we have

$$R_n = \sum_{i=1}^k \Delta_i\, \mathbb{E}[T_i(n)] \le \sum_{i=1}^k \Delta_i \left(m + (n - mk)\, \exp\left(-\frac{m \Delta_i^2}{4}\right)\right).$$
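The bound is easy to evaluate; a small sketch with hypothetical instance parameters:

```python
import numpy as np

def etc_regret_bound(delta, n, m):
    """Corollary 1: sum_i Delta_i * (m + (n - m*k) * exp(-m * Delta_i^2 / 4))."""
    k = len(delta)
    return float(np.sum(delta * (m + (n - m * k) * np.exp(-m * delta**2 / 4))))

delta = np.array([0.0, 0.3, 0.5])        # hypothetical gaps for a 3-armed bandit
for m in (10, 50, 100):
    print(m, etc_regret_bound(delta, n=10_000, m=m))
```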
Having a result that works for basically any sub-Gaussian distribution $\nu_i$, we must ask: what other distributions are sub-Gaussian? Lemma 2 shows us that the Gaussian is sub-Gaussian par excellence. Bounded random variables ought not be too far off either.
Example 1. Given a probability distribution $\nu$ with zero mean, suppose that there exists $B \ge 0$ such that for any $X \sim \nu$, $|X| \le B$ almost surely. Then $\nu$ is $B$-sub-Gaussian.
Proof. We follow this post. Denote $\psi(\lambda) = \mathbb{E}[e^{\lambda X}]$. By differentiating under the integral sign, we have

$$\psi^{(n)}(\lambda) = \mathbb{E}[X^n e^{\lambda X}]$$

for any natural number $n$. Denote $\varphi(\lambda) = \log \psi(\lambda)$, where the logarithm is always taken with base $e$. Differentiating twice,

$$\varphi''(\lambda) = \frac{\psi''(\lambda)}{\psi(\lambda)} - \left(\frac{\psi'(\lambda)}{\psi(\lambda)}\right)^2,$$

and hence,

$$\varphi''(\lambda) = \frac{\mathbb{E}[X^2 e^{\lambda X}]}{\psi} - \left(\frac{\mathbb{E}[X e^{\lambda X}]}{\psi}\right)^2,$$

where we denoted $\psi = \psi(\lambda)$ for brevity. For each $\lambda \in \mathbb{R}$, define the probability measure $\mathbb{P}_\lambda$ by

$$d\mathbb{P}_\lambda = \frac{e^{\lambda X}}{\psi(\lambda)}\, d\mathbb{P},$$

so that $\mathbb{P}_\lambda$ is indeed a probability measure. For any random variable $Y$,

$$\mathbb{E}_\lambda[Y] = \frac{\mathbb{E}[Y e^{\lambda X}]}{\psi(\lambda)}.$$

Therefore,

$$\varphi''(\lambda) = \mathbb{E}_\lambda[X^2] - \mathbb{E}_\lambda[X]^2 = \operatorname{Var}_\lambda(X) \le \mathbb{E}_\lambda[X^2] \le B^2.$$

Now fix $\lambda \in \mathbb{R}$. Use Taylor’s theorem to construct $\xi$ between $0$ and $\lambda$ such that

$$\varphi(\lambda) = \varphi(0) + \varphi'(0)\,\lambda + \frac{\varphi''(\xi)}{2}\,\lambda^2 = \frac{\varphi''(\xi)}{2}\,\lambda^2 \le \frac{\lambda^2 B^2}{2},$$

using $\varphi(0) = 0$ and $\varphi'(0) = \mathbb{E}[X] = 0$. Therefore, $\psi(\lambda) \le e^{\lambda^2 B^2/2}$, which implies that $\nu$ is $B$-sub-Gaussian.
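A quick empirical check of Example 1’s bound for an arbitrary bounded, zero-mean example (uniform on $[-B, B]$):

```python
import numpy as np

rng = np.random.default_rng(5)
B, lam = 1.5, 0.8                        # arbitrary bound and lambda
x = rng.uniform(-B, B, size=10**6)       # zero mean, |X| <= B almost surely

print(np.exp(lam * x).mean(),            # empirical MGF E[exp(lam * X)]
      np.exp(lam**2 * B**2 / 2))         # Example 1's bound
```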
Remark 2. In the case where $X \sim \nu$ has nonzero finite mean $\mu$, we say that $X$ is $\sigma$-sub-Gaussian if its noise $X - \mu$ is.
Example 2. Given a probability distribution $\nu$, suppose that there exist $a \le b$ such that for any $X \sim \nu$, $a \le X \le b$ almost surely. Then $\nu$ is $(b - a)$-sub-Gaussian.
Proof. The noise $X - \mu$ has zero mean and, since $\mu \in [a, b]$, satisfies $|X - \mu| \le b - a$ almost surely; by Example 1, it is sub-Gaussian with parameter $b - a$.
Example 3. For any $p \in [0, 1]$, the distribution $\operatorname{Ber}(p)$ is $1$-sub-Gaussian.

Proof. Given $X \sim \operatorname{Ber}(p)$, $0 \le X \le 1$ almost surely. By Example 2, it is $1$-sub-Gaussian.
Returning to Corollary 1, we notice that the exploration time $m$ could plausibly be optimised to minimise the right-hand side. The smaller $m$ is, the less we explore, yielding a higher probability of committing to a suboptimal arm. On the other hand, the larger $m$ is, the more we explore, potentially compromising on the number of turns left for exploitation. Could we balance this trade-off more organically?
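One way to see the trade-off concretely is to sweep $m$ and minimise the bound from Corollary 1 numerically; a sketch with the same hypothetical gaps as before:

```python
import numpy as np

def etc_regret_bound(delta, n, m):
    # Corollary 1's bound on R_n for exploration time m.
    k = len(delta)
    return float(np.sum(delta * (m + (n - m * k) * np.exp(-m * delta**2 / 4))))

delta = np.array([0.0, 0.3, 0.5])        # hypothetical gaps
n, k = 10_000, len(delta)

# Sweep every feasible exploration time and pick the minimiser of the bound.
ms = np.arange(1, n // k + 1)
bounds = [etc_regret_bound(delta, n, m) for m in ms]
best = int(np.argmin(bounds))
print(ms[best], bounds[best])            # the m that best balances the two terms
```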
The upper confidence bound (UCB) algorithm aims to do just that. We turn to it next time.
—Joel Kindiak, 10 Feb 26, 2229H