Return to Bandits

This set of posts hits close to my heart, as its subject matter was the focus of my undergraduate research. Most of what I write references the “Bandit Bible” by Tor Lattimore and Csaba Szepesvári. In a sense, the goal of these posts is to communicate my research and, hopefully, fulfil its lofty potential (something that did not happen while I was an undergraduate).

Suppose you enter a casino with 1000 one-dollar coins and find 5 slot machines in front of you. Each turn, you insert one coin into a machine of your choice. Your goal is to maximise your total winnings from the machines. How would you allocate your coins?

Definition 1. A stochastic K-armed bandit is a collection of K probability distributions \nu := ( \nu_1,\dots, \nu_K ), called arms.

Suppose you pull the arms for a total of n turns, called the time horizon of your experiment. At each time t = 1,2,\dots, n, denote the arm pulled by A_t \in \{1,\dots, K\}. Having pulled arm A_t, you receive a reward X_t \sim \nu_{A_t}. Your goal is to maximise your total reward:

\displaystyle \sum_{t=1}^n X_t.
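
To make the interaction protocol concrete, here is a minimal simulation sketch. The Gaussian arms, their specific means, the horizon, and the uniform-random way of choosing A_t are all illustrative assumptions of mine, not part of the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative assumption: K = 5 Gaussian arms with unit variance.
mus = np.array([0.9, 0.8, 0.5, 0.3, 0.1])   # arm means (unknown to the player)
K, n = len(mus), 1000                        # number of arms, time horizon

total_reward = 0.0
for t in range(n):
    A_t = rng.integers(K)                    # placeholder rule: pick an arm uniformly at random
    X_t = rng.normal(mus[A_t], 1.0)          # reward X_t ~ nu_{A_t}
    total_reward += X_t

print(f"total reward over n = {n} turns: {total_reward:.1f}")
print(f"best possible expected reward:   {n * mus.max():.1f}")
```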

This task, as stated, is rather ambitious: the total reward is a random quantity, and we cannot maximise a random quantity directly. We can, and should, simplify our objective. Instead of maximising the total reward itself, let's maximise the total expected reward:

\displaystyle \mathbb E\left[ \sum_{t=1}^n X_t \right].

For this goal to be meaningful, one of the arms should have the highest mean. Let’s formalise this intuition a bit more:

  • For each i = 1, \dots, K, assume that the mean of the i-th arm \mu_i := \mathbb E[Y_i], where Y_i \sim \nu_i, is finite.
  • In this case, the quantity \mu_{\max} := \max\{ \mu_1,\dots, \mu_K\} exists. Suppose \mu_{\max} = \mu_1 without loss of generality.
  • Suppose furthermore that for i \neq 1, \mu_i < \mu_1. Define the suboptimality gap by \Delta_i := \mu_1 - \mu_i.

Since the best we could hope for in expectation is to pull arm 1 every turn, earning n\mu_1, maximising the total expected reward is the same as minimising the shortfall from that benchmark. Hence, we would like to choose our arms so as to minimise our cumulative regret:

\displaystyle \mathcal R_n(\nu) = n \mu_1 - \mathbb E\left[ \sum_{t=1}^n X_t \right].

We call our rule for choosing arms our policy, denoted \pi. More formally, at each time t+1, we have access to the arms we have pulled so far and the rewards we have collected from them. Due to our ignorance of arm 1’s optimality, we can, and should, select arms probabilistically:

A_{t+1} \sim \pi_{t+1}( \cdot \mid A_1, X_1, \dots, A_t, X_t).

Different policies \pi lead to different probability kernels \pi_t(\cdot \mid \cdot), and some yield better (or worse) results than others. Hence, our regret should really be denoted

\displaystyle \mathcal R_n(\pi, \nu) = n \mu_1 - \mathbb E\left[ \sum_{t=1}^n X_t \right] \geq 0,

to emphasise that the rewards X_t obtained depend on the policy \pi. Note that the regret itself is not a random quantity: the expectation averages over the randomness in both the rewards and our arm choices.
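
One way to picture a policy is as code: \pi_t is just a rule that maps the history (A_1, X_1, \dots, A_t, X_t) to a probability vector over the K arms, and the next arm is sampled from it. The sketch below shows the shape of this interface; the uniform kernel, the Bernoulli arms, and the function name uniform_kernel are placeholders of mine, not a policy we will actually advocate.

```python
import numpy as np

rng = np.random.default_rng(1)

def uniform_kernel(history, K):
    """A placeholder probability kernel pi_{t+1}(. | history): ignore the
    history entirely and put mass 1/K on every arm."""
    return np.full(K, 1.0 / K)

# Illustrative assumption: 3 Bernoulli arms.
mus = np.array([0.7, 0.5, 0.2])
K, n = len(mus), 10

history = []                                  # list of (A_t, X_t) pairs
for t in range(n):
    probs = uniform_kernel(history, K)        # pi_{t+1}(. | A_1, X_1, ..., A_t, X_t)
    A = rng.choice(K, p=probs)                # sample the next arm
    X = float(rng.random() < mus[A])          # Bernoulli reward from arm A
    history.append((A, X))

print(history)
```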

Theorem 1. Using our formalisations, we can decompose the regret as follows:

\displaystyle \mathcal R_n(\pi, \nu) = \sum_{i=1}^K \mathbb E[T_n(i)] \Delta_i ,\quad T_n(i) := \sum_{t=1}^n \mathbb I\{A_t = i\}.

Proof. At each t, since exactly one of the indicators \mathbb I\{A_t = i\}, i = 1, \dots, K, equals 1,

\displaystyle X_t = \sum_{i=1}^K X_t \cdot \mathbb I\{A_t = i\}.

Summing over t = 1, \dots, n on both sides and interchanging the finite sums,

\displaystyle \sum_{t=1}^n X_t = \sum_{i=1}^K \sum_{t=1}^n X_t \cdot \mathbb I\{A_t = i\}.

Taking expectations,

\begin{aligned} \mathbb E\left[ \sum_{t=1}^n X_t \right] &= \sum_{i=1}^K \sum_{t=1}^n \mathbb E[ X_t \cdot \mathbb I\{A_t = i\} ]. \end{aligned}

Using the tower rule, and noting that \mu_{A_t} = \mu_i on the event \{A_t = i\},

\begin{aligned} \mathbb E[ X_t \cdot \mathbb I\{A_t = i\} ] &= \mathbb E[ \mathbb E[ X_t \cdot \mathbb I\{A_t = i\} \mid A_t]] \\ &= \mathbb E[ \mathbb I\{A_t = i\} \cdot \mathbb E[ X_t  \mid A_t]] \\ &= \mathbb E[ \mathbb I\{A_t = i\} \cdot \mu_{A_t}] \\ &= \mathbb E[ \mathbb I\{A_t = i\} \cdot \mu_i].\end{aligned}

Using the definition of T_n(i), and writing n\mu_1 = \sum_{i=1}^K \sum_{t=1}^n \mathbb E[\mathbb I\{A_t = i\}]\,\mu_1 (again because the indicators sum to 1),

\begin{aligned}  \mathcal R_n(\pi, \nu) &= n \mu_1 - \mathbb E\left[ \sum_{t=1}^n X_t \right] \\ &= n \mu_1  - \sum_{i=1}^K \sum_{t=1}^n \mathbb E[ X_t \cdot \mathbb I\{A_t = i\} ] \\ &= n \mu_1  - \sum_{i=1}^K \sum_{t=1}^n \mathbb E[ \mathbb I\{A_t = i\} \cdot \mu_{i}] \\ &= \sum_{i=1}^K \sum_{t=1}^n \mathbb E[ \mathbb I\{A_t = i\} \cdot (\mu_1 - \mu_i)] \\ &= \sum_{i=1}^K \sum_{t=1}^n \mathbb E[ \mathbb I\{A_t = i\} \cdot \Delta_i] \\ &= \sum_{i=1}^K\mathbb E\left[  \sum_{t=1}^n \mathbb I\{A_t = i\} \right]\Delta_i \\ &= \sum_{i=1}^K\mathbb E[T_n(i)] \Delta_i. \end{aligned}
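
As a quick sanity check of Theorem 1, we can estimate both sides of the decomposition by simulation and watch them agree up to Monte Carlo error. The Bernoulli arms, their means, and the uniform-random policy below are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

mus = np.array([0.9, 0.6, 0.4])   # assumed Bernoulli arm means; arm 1 (index 0) is optimal
deltas = mus[0] - mus             # suboptimality gaps Delta_i
K, n, runs = len(mus), 500, 2000

lhs, pull_counts = [], []
for _ in range(runs):
    A = rng.integers(K, size=n)                       # uniform-random policy
    X = (rng.random(n) < mus[A]).astype(float)        # X_t ~ nu_{A_t}
    lhs.append(n * mus[0] - X.sum())                  # n*mu_1 - sum_t X_t for this run
    pull_counts.append(np.bincount(A, minlength=K))   # T_n(i) for this run

lhs = np.mean(lhs)                                    # estimate of n*mu_1 - E[sum_t X_t]
rhs = np.mean(pull_counts, axis=0) @ deltas           # estimate of sum_i E[T_n(i)] * Delta_i

print(f"regret (definition):    {lhs:.2f}")
print(f"regret (decomposition): {rhs:.2f}")
```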

My work focused on risk-averse bandits, which basically asks for the same cumulative regret minimisation goal, except that the vanilla suboptimality gap \Delta_i gets spiced up just a bit:

\displaystyle \mathcal R_n^{\rho}(\pi, \nu) = \sum_{i=1}^K \mathbb E[T_n(i)] \Delta _i^\rho.

Here, \rho is not a number. Rather, it is a risk functional. Let me elaborate on that. Given X \sim \nu_i, we have \mathbb E[X] = \mu_i. Furthermore, given any two X, Y \sim \nu_i, we have \mathbb E[X] = \mu_i = \mathbb E[Y]: the expectation depends only on the distribution, not on the particular random variable drawn from it. By denoting \rho_{\mathbb E}(\nu_i) := \mathbb E[X] where X \sim \nu_i, we have

\Delta_i = \mu_1 - \mu_i = \rho_{\mathbb E}(\nu_1) - \rho_{\mathbb E}(\nu_i).

Therefore, \rho_{\mathbb E} takes a probability measure as input and returns a real number as output. Any such map, quantifying how attractive (or risky) an arm is, is called a risk functional. For instance, \rho_{\mathbb{V}} could correspond to the variance risk:

\rho_{\mathbb{V}}(\nu_i) := \mathrm{Var}(X) \equiv \mathbb E[(X-\mathbb E[X])^2],\quad X \sim \nu_i.

A popular risk measure in modern portfolio theory is the mean-variance with parameter r > 0:

\rho_{\mathrm{MV}} := r \cdot \rho_{\mathbb E} - \rho_{\mathbb{V}}.

The risk measure that I was initially interested in was the conditional value-at-risk (CVaR).
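
All of these functionals admit natural plug-in estimates: replace \nu_i by the empirical distribution of samples drawn from it. The sketch below is one such set of conventions; in particular, it treats CVaR at level \alpha as the average of the worst \alpha-fraction of rewards (the lower tail), which is one common convention but not the only one. The function names, parameter values, and the Gaussian samples are illustrative assumptions of mine.

```python
import numpy as np

def rho_mean(x):
    """Plug-in estimate of rho_E(nu): the sample mean."""
    return np.mean(x)

def rho_var(x):
    """Plug-in estimate of rho_V(nu): the sample variance."""
    return np.var(x)

def rho_mean_variance(x, r=1.0):
    """Mean-variance functional r * rho_E - rho_V (with an assumed r > 0)."""
    return r * rho_mean(x) - rho_var(x)

def rho_cvar(x, alpha=0.05):
    """CVaR of a reward distribution under one common convention:
    the average of the worst alpha-fraction of outcomes."""
    x = np.sort(x)
    k = max(1, int(np.ceil(alpha * len(x))))
    return np.mean(x[:k])

# Illustrative assumption: samples from a single Gaussian arm.
rng = np.random.default_rng(3)
samples = rng.normal(loc=0.5, scale=1.0, size=100_000)

for name, rho in [("mean", rho_mean), ("variance", rho_var),
                  ("mean-variance", rho_mean_variance), ("CVaR(5%)", rho_cvar)]:
    print(f"{name:>14}: {rho(samples):.3f}")
```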

In any case, given a risk functional \rho, and assuming that \rho(\nu_1) is the unique maximum among \rho(\nu_1), \dots, \rho(\nu_K), we can define the \rho-suboptimality gap by

\Delta_i^\rho := \rho(\nu_1) - \rho(\nu_i).

Hence, our goal is to design a policy that minimises the cumulative \rho-regret:

\displaystyle \mathcal R_n^{\rho}(\pi, \nu) = \sum_{i=1}^K \mathbb E[T_n(i)] \Delta _i^\rho.

We will first work with the usual case \rho = \rho_{\mathbb E}, i.e. \rho_{\mathbb E}(\nu_i) = \mathbb E[X] where X \sim \nu_i, so that the cumulative \rho-regret is exactly the cumulative regret of Theorem 1. We will explore the following three algorithms:

  • Explore-then-Commit
  • Upper Confidence Bound
  • Thompson Sampling

It is this last algorithm that consumed my waking and sleeping hours in my undergraduate years.
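
As a small preview of the first of the three, here is a rough sketch of Explore-then-Commit: pull every arm m times, then commit to the arm with the highest empirical mean for the rest of the horizon. The Gaussian arms, their means, and the exploration budget m here are illustrative assumptions; choosing m well is precisely what the later post will discuss.

```python
import numpy as np

rng = np.random.default_rng(4)

mus = np.array([0.9, 0.7, 0.5, 0.3, 0.1])   # assumed Gaussian arm means, unit variance
K, n, m = len(mus), 1000, 20                # horizon n, exploration budget m per arm

total_reward = 0.0
emp_means = np.zeros(K)

# Exploration phase: pull each arm m times and record its empirical mean.
for i in range(K):
    samples = rng.normal(mus[i], 1.0, size=m)
    emp_means[i] = samples.mean()
    total_reward += samples.sum()

# Commit phase: play the empirically best arm for the remaining n - K*m turns.
best = int(np.argmax(emp_means))
total_reward += rng.normal(mus[best], 1.0, size=n - K * m).sum()

print(f"committed to arm {best}; total reward {total_reward:.1f} (oracle expects {n * mus.max():.1f})")
```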

—Joel Kindiak, 9 Feb 26, 1306H
