Previously, we looked at the explore-then-commit algorithm, and asked how we can “automate” the trade-off between exploring and exploiting arms.
Lemma 1. Let $X_1, \dots, X_n$ be $\sigma$-sub-Gaussian with mean $\mu$ and sampled independently. Define $\hat{\mu} := \frac{1}{n} \sum_{t=1}^{n} X_t$. Then for any $\delta \in (0, 1)$,

$$\mathbb{P}\left( \mu \geq \hat{\mu} + \sigma \sqrt{\frac{2 \log(1/\delta)}{n}} \right) \leq \delta.$$

Proof. Since each $X_t$ is $\sigma$-sub-Gaussian, $\mu - \hat{\mu}$ has zero mean and is also $(\sigma/\sqrt{n})$-sub-Gaussian. Using the tail concentration for sub-Gaussian random variables, for any $\varepsilon \geq 0$,

$$\mathbb{P}(\mu - \hat{\mu} \geq \varepsilon) \leq \exp\left( -\frac{n \varepsilon^2}{2 \sigma^2} \right).$$

Setting the right-hand side equal to $\delta$ and solving for $\varepsilon$,

$$\varepsilon = \sigma \sqrt{\frac{2 \log(1/\delta)}{n}}.$$

Therefore,

$$\mathbb{P}\left( \mu \geq \hat{\mu} + \sigma \sqrt{\frac{2 \log(1/\delta)}{n}} \right) \leq \delta. \qquad \blacksquare$$
We want to use Lemma 1 to design our algorithm. What Lemma 1 communicates to us is that for small $\delta$ and large $n$,

$$\mu < \hat{\mu} + \sigma \sqrt{\frac{2 \log(1/\delta)}{n}} \quad \text{with probability at least } 1 - \delta.$$

That is, with probability close to $1$, $\mu$ can be estimated using $\hat{\mu}$ and some upper confidence bound (UCB). We can use this observation to design our UCB algorithm.
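To see Lemma 1 in action, here is a quick Monte Carlo check; it is a minimal sketch assuming Gaussian rewards (which are $1$-sub-Gaussian), and all variable names are my own illustrative choices.

```python
import numpy as np

# Monte Carlo sanity check of Lemma 1 for N(mu, 1) rewards, which are
# 1-sub-Gaussian (sigma = 1). The parameter values below are illustrative.
rng = np.random.default_rng(0)
mu, n, delta, trials = 0.5, 100, 0.05, 100_000

samples = rng.normal(mu, 1.0, size=(trials, n))
mu_hat = samples.mean(axis=1)
ucb = mu_hat + np.sqrt(2 * np.log(1 / delta) / n)

# Fraction of trials where the UCB fails to cover the true mean;
# Lemma 1 says this should be at most delta = 0.05.
print("failure rate:", np.mean(mu >= ucb))
```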
Let $\nu = (\nu_1, \dots, \nu_K)$ denote a $K$-armed bandit with $\nu_1$ having the uniquely largest mean $\mu_1$, and suppose for simplicity that each $\nu_i$ is known to be $1$-sub-Gaussian. Recall that for any policy $\pi$, we can upper-bound the cumulative regret by

$$R_T \leq \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[T_i(T)],$$

where $T$ denotes the time horizon, $T_i(T)$ denotes the number of pulls of arm $i$ up to time $T$, and $\Delta_i := \mu_1 - \mu_i$ denotes the sub-optimality gap. Before designing the UCB algorithm, we introduce several fixed-time notions for a given arm $i$ with distribution $\nu_i$:
- For any $s \in \mathbb{N}$, define $\hat{\mu}_{i,s} := \frac{1}{s} \sum_{j=1}^{s} X_{i,j}$, where $X_{i,j} \sim \nu_i$ independently.
- For any $s \in \mathbb{N}$ and $\delta \in (0, 1)$, define $\mathrm{UCB}_{i,s}(\delta) := \hat{\mu}_{i,s} + \sqrt{\frac{2 \log(1/\delta)}{s}}$.

Hence, given the number of pulls $T_i(t)$ of arm $i$ up to and including time $t$, define

$$\mathrm{UCB}_i(t, \delta) := \mathrm{UCB}_{i, T_i(t)}(\delta) = \hat{\mu}_{i, T_i(t)} + \sqrt{\frac{2 \log(1/\delta)}{T_i(t)}}.$$
We shall design the UCB algorithm as follows. Pull each arm once, so that $T_i(K) = 1$ for every $i$. For each time $t > K$, pull the arm $A_t$ that maximises the upper confidence bound at time $t - 1$, that is,

$$A_t := \arg\max_{i \in [K]} \mathrm{UCB}_i(t - 1, \delta).$$
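Before analysing the algorithm, here is a minimal sketch of it in Python, assuming unit-variance Gaussian rewards and a fixed $\delta$; the function name `ucb_run` and its parameters are my own illustrative choices, not from any library.

```python
import numpy as np

def ucb_run(means, horizon, delta, seed=0):
    """Run the UCB algorithm on a Gaussian bandit with unit variance."""
    rng = np.random.default_rng(seed)
    k = len(means)
    pulls = np.zeros(k, dtype=int)      # T_i(t)
    sums = np.zeros(k)                  # running reward sums
    regret = 0.0
    best = max(means)

    for t in range(horizon):
        if t < k:
            arm = t                     # pull each arm once
        else:
            mu_hat = sums / pulls
            width = np.sqrt(2 * np.log(1 / delta) / pulls)
            arm = int(np.argmax(mu_hat + width))
        reward = rng.normal(means[arm], 1.0)
        pulls[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]     # pseudo-regret via the gaps
    return regret, pulls

# Example: two arms with gap 0.5, and delta = 1/T^2 as in Theorem 1 below.
T = 10_000
regret, pulls = ucb_run([0.5, 0.0], T, delta=1.0 / T**2)
print(f"regret = {regret:.1f}, pulls = {pulls}")
```

Note that $\delta$ is fixed in advance; Theorem 1 below picks $\delta = 1/T^2$ given the horizon.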
Lemma 2. For any event $G_i$, the number of pulls of arm $i$ can be upper-bounded by

$$\mathbb{E}[T_i(T)] \leq \mathbb{E}[\mathbf{1}\{G_i\} \, T_i(T)] + T \, \mathbb{P}(G_i^c).$$

Proof. By the law of total expectation, since $T_i(T) \leq T$ is always true,

$$\mathbb{E}[T_i(T)] = \mathbb{E}[\mathbf{1}\{G_i\} \, T_i(T)] + \mathbb{E}[\mathbf{1}\{G_i^c\} \, T_i(T)] \leq \mathbb{E}[\mathbf{1}\{G_i\} \, T_i(T)] + T \, \mathbb{P}(G_i^c). \qquad \blacksquare$$
For each suboptimal arm $i$, define the ‘good event’ $G_i$ that satisfies the following two conditions:

- For any $s \in [T]$, $\mu_1 < \mathrm{UCB}_{1,s}(\delta)$.
- We have $\mathrm{UCB}_{i,u_i}(\delta) < \mu_1$, for some number of pulls $u_i \in [T]$ to be chosen later.

The first condition requires $\mathrm{UCB}_{1,s}(\delta)$ to always be ‘optimistic’ relative to $\mu_1$, and the second condition requires that after $u_i$ pulls of arm $i$, $\mathrm{UCB}_{i,u_i}(\delta)$ is less optimistic than $\mu_1$. Expressed in terms of events,

$$G_i := \left\{ \mu_1 < \min_{s \in [T]} \mathrm{UCB}_{1,s}(\delta) \right\} \cap \left\{ \mathrm{UCB}_{i,u_i}(\delta) < \mu_1 \right\}$$

occurs. By Lemma 2,

$$\mathbb{E}[T_i(T)] \leq \mathbb{E}[\mathbf{1}\{G_i\} \, T_i(T)] + T \, \mathbb{P}(G_i^c).$$

It remains to upper-bound $T_i(T)$ under $G_i$, as well as $\mathbb{P}(G_i^c)$.
Lemma 3. Under the good event $G_i$, $T_i(T) \leq u_i$.

Proof. Suppose $G_i$ happens, and assume for a contradiction that $T_i(T) > u_i$. Then there exists some $t \in \{K + 1, \dots, T\}$ such that $T_i(t - 1) = u_i$ and $A_t = i$. At this point in time,

$$\mathrm{UCB}_i(t - 1, \delta) = \mathrm{UCB}_{i,u_i}(\delta) < \mu_1 < \mathrm{UCB}_{1, T_1(t-1)}(\delta) = \mathrm{UCB}_1(t - 1, \delta).$$

By the definition of $A_t$,

$$A_t = \arg\max_{j \in [K]} \mathrm{UCB}_j(t - 1, \delta) \neq i,$$

a contradiction. $\blacksquare$
Lemma 4. There exists $u_i$ depending on $\delta$ and $\Delta_i$ such that

$$\mathbb{P}(G_i^c) \leq T \delta + \exp\left( -\frac{u_i \Delta_i^2}{8} \right).$$

Proof. Taking the complement of $G_i$,

$$G_i^c = \left\{ \mu_1 \geq \min_{s \in [T]} \mathrm{UCB}_{1,s}(\delta) \right\} \cup \left\{ \mathrm{UCB}_{i,u_i}(\delta) \geq \mu_1 \right\}.$$

Taking a union bound,

$$\mathbb{P}(G_i^c) \leq \mathbb{P}\left( \mu_1 \geq \min_{s \in [T]} \mathrm{UCB}_{1,s}(\delta) \right) + \mathbb{P}\left( \mathrm{UCB}_{i,u_i}(\delta) \geq \mu_1 \right).$$

We can upper-bound the first probability by containing the first event:

$$\left\{ \mu_1 \geq \min_{s \in [T]} \mathrm{UCB}_{1,s}(\delta) \right\} \subseteq \bigcup_{s=1}^{T} \left\{ \mu_1 \geq \mathrm{UCB}_{1,s}(\delta) \right\}.$$

By taking a union bound and applying Lemma 1,

$$\mathbb{P}\left( \mu_1 \geq \min_{s \in [T]} \mathrm{UCB}_{1,s}(\delta) \right) \leq \sum_{s=1}^{T} \mathbb{P}\left( \mu_1 \geq \mathrm{UCB}_{1,s}(\delta) \right) \leq T \delta.$$

To upper-bound the second probability, we unravel its definition:

$$\left\{ \mathrm{UCB}_{i,u_i}(\delta) \geq \mu_1 \right\} = \left\{ \hat{\mu}_{i,u_i} - \mu_i \geq \Delta_i - \sqrt{\frac{2 \log(1/\delta)}{u_i}} \right\} \subseteq \left\{ \hat{\mu}_{i,u_i} - \mu_i \geq \frac{\Delta_i}{2} \right\},$$

where the final inclusion is obtained by a choice of $u_i$ such that

$$\sqrt{\frac{2 \log(1/\delta)}{u_i}} \leq \frac{\Delta_i}{2}.$$

Using the same tail concentration as in Lemma 1,

$$\mathbb{P}\left( \hat{\mu}_{i,u_i} - \mu_i \geq \frac{\Delta_i}{2} \right) \leq \exp\left( -\frac{u_i \Delta_i^2}{8} \right).$$

Therefore,

$$\mathbb{P}(G_i^c) \leq T \delta + \exp\left( -\frac{u_i \Delta_i^2}{8} \right). \qquad \blacksquare$$
Theorem 1. Given the horizon $T$, define the error threshold $\delta := 1/T^2$. Then the cumulative regret of the UCB algorithm can be upper-bounded by

$$R_T \leq \sum_{i : \Delta_i > 0} \left( 3 \Delta_i + \frac{16 \log T}{\Delta_i} \right).$$

If furthermore $\Delta_i \leq 1$ for all $i \in [K]$, then we recover the following more visually-friendly regret bound:

$$R_T \leq \sqrt{K T (3 + 16 \log T)} = O\left( \sqrt{K T \log T} \right).$$
Proof. Combining Lemmas 3 and 4, there exists $u_i$ such that

$$\mathbb{E}[T_i(T)] \leq u_i + T \, \mathbb{P}(G_i^c) \leq u_i + T^2 \delta + T \exp\left( -\frac{u_i \Delta_i^2}{8} \right).$$

As per Lemma 4, we required

$$\sqrt{\frac{2 \log(1/\delta)}{u_i}} \leq \frac{\Delta_i}{2}.$$

For the arms such that $\Delta_i > 0$, solving for $u_i$,

$$u_i \geq \frac{8 \log(1/\delta)}{\Delta_i^2}.$$

This lower bound allows us to upper-bound the exponential term as

$$\exp\left( -\frac{u_i \Delta_i^2}{8} \right) \leq \exp\left( -\log(1/\delta) \right) = \delta.$$

Hence, we set $u_i$ to be the smallest integer such that

$$u_i \geq \frac{8 \log(1/\delta)}{\Delta_i^2}, \quad \text{so that} \quad u_i \leq 1 + \frac{8 \log(1/\delta)}{\Delta_i^2}.$$

Therefore,

$$\mathbb{E}[T_i(T)] \leq 1 + \frac{8 \log(1/\delta)}{\Delta_i^2} + T^2 \delta + T \delta.$$

For convenience, we choose $\delta$ so that the last two terms are bounded by constants:

$$T^2 \delta \leq 1 \quad \text{and} \quad T \delta \leq 1.$$

Setting $\delta = 1/T^2$, we get $T^2 \delta = 1$, $T \delta = 1/T \leq 1$,

and

$$\log(1/\delta) = 2 \log T, \quad \text{hence} \quad \mathbb{E}[T_i(T)] \leq 3 + \frac{16 \log T}{\Delta_i^2}.$$

Therefore,

$$R_T \leq \sum_{i : \Delta_i > 0} \Delta_i \, \mathbb{E}[T_i(T)] \leq \sum_{i : \Delta_i > 0} \left( 3 \Delta_i + \frac{16 \log T}{\Delta_i} \right).$$

For the visually-friendly bound, the Cauchy–Schwarz inequality together with $\sum_{i=1}^{K} \mathbb{E}[T_i(T)] = T$ yields

$$R_T = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[T_i(T)] \leq \sqrt{T \sum_{i=1}^{K} \Delta_i^2 \, \mathbb{E}[T_i(T)]},$$

and since $\Delta_i \leq 1$ implies $\Delta_i^2 \, \mathbb{E}[T_i(T)] \leq 3 \Delta_i^2 + 16 \log T \leq 3 + 16 \log T$, we conclude

$$R_T \leq \sqrt{K T (3 + 16 \log T)}. \qquad \blacksquare$$
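To get a feel for the constants, here is a short computation of both bounds on a hypothetical instance; the gaps and horizon below are my own illustrative choices, not values from the analysis above.

```python
import numpy as np

# Evaluate the two Theorem 1 bounds for an assumed instance.
T = 10_000
gaps = np.array([0.0, 0.1, 0.5])      # Delta_1 = 0 for the optimal arm
K = len(gaps)

pos = gaps[gaps > 0]
gap_dependent = np.sum(3 * pos + 16 * np.log(T) / pos)
gap_free = np.sqrt(K * T * (3 + 16 * np.log(T)))

print(f"gap-dependent bound: {gap_dependent:.0f}")
print(f"gap-free bound:      {gap_free:.0f}")
```

As the smallest gap shrinks, the gap-dependent bound grows without limit, while the gap-free bound stays at $O(\sqrt{K T \log T})$.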
My undergraduate research professor and his collaborators proved a generalised version for risk measures $\rho(\nu_i)$ in place of the mean $\mu_i$, where $\rho$ denotes a sufficiently well-behaved risk functional.
Theorem 2 (Tan et al., 2022). By modifying the UCB algorithm to account for general risk measures $\rho$, called $\rho$-LCB, the cumulative regret can be upper-bounded by

$$R_T^\rho \leq \sum_{i : \Delta_i^\rho > 0} \left( c_1 \Delta_i^\rho + \frac{c_2 \log T}{g(\Delta_i^\rho)} \right),$$

where $c_1, c_2$ are pre-defined universal constants and $g$ is a function, both given in the aforementioned paper. Here, $\Delta_i^\rho := \rho(\nu_1) - \rho(\nu_i)$, and we assume $\rho(\nu_1)$ to be the maximum.
Proof. See Theorem 51 in the paper here. The technical portion of this proof is deriving a concentration bound analogous to that of Lemma 1, and the rest of the proof follows the same basic strategy as that of Theorem 1.
Though more convoluted due to the generality of $\rho$, the regret bound in Theorem 2 matches Theorem 1 in spirit, taking the form

$$R_T^\rho = O\left( \sum_{i : \Delta_i^\rho > 0} \left( \Delta_i^\rho + \frac{\log T}{g(\Delta_i^\rho)} \right) \right)$$

for the computable constants $c_1, c_2$ and function $g$.
The next algorithm is a modification of the UCB algorithm and is the one closest to my heart—Thompson sampling. In the UCB algorithm, we compute our “optimism” of arm $i$ using its upper confidence bound $\mathrm{UCB}_i(t, \delta)$. Conditioned on the number of pulls $T_i(t)$ and the reward history $X_{i,1}, \dots, X_{i,T_i(t)}$, $\mathrm{UCB}_i(t, \delta)$ is a deterministic value.
The idea for Thompson sampling is to randomise this process in a mathematically meaningful manner: Conditioned on the number of pulls $T_i(t)$ and the reward history $X_{i,1}, \dots, X_{i,T_i(t)}$, we sample a random distribution $\hat{\nu}_i(t)$ for each arm from the posterior given that history, and pull the arm that maximises the sampled mean $\mu(\hat{\nu}_i(t))$. In the vanilla case $\nu_i = \mathcal{N}(\mu_i, 1)$ with a flat prior, we have

$$\mu(\hat{\nu}_i(t)) \sim \mathcal{N}\left( \hat{\mu}_{i, T_i(t)}, \frac{1}{T_i(t)} \right)$$

for each $i \in [K]$.
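Here is a matching sketch of this vanilla Gaussian case; again my own minimal implementation, assuming a flat prior and unit observation noise, with illustrative names throughout.

```python
import numpy as np

def thompson_run(means, horizon, seed=0):
    """Gaussian Thompson sampling with a flat prior and unit noise.

    After T_i(t) pulls, the posterior for arm i's mean is
    N(empirical mean, 1 / T_i(t)); we sample once from each posterior
    and pull the argmax of the samples.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    pulls = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    regret = 0.0
    best = max(means)

    for t in range(horizon):
        if t < k:
            arm = t                      # initialise each posterior
        else:
            theta = rng.normal(sums / pulls, 1.0 / np.sqrt(pulls))
            arm = int(np.argmax(theta))
        reward = rng.normal(means[arm], 1.0)
        pulls[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret, pulls

regret, pulls = thompson_run([0.5, 0.0], horizon=10_000)
print(f"regret = {regret:.1f}, pulls = {pulls}")
```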
My work aimed to solve this modified $K$-armed bandit problem (with $\rho$ in place of the mean) by using Thompson sampling. Spanning sufficiently many commonly used risk functionals $\rho$, the algorithm does work for Bernoulli distributions. For a non-singular subset of these functionals, it even works for any distribution with finite support. However, just as the heart of Theorem 2 lies in the vanilla case of Theorem 1, the heart of the risk-averse Thompson sampling approach lies in the vanilla expectation approach.
We will explore this approach next time, and also eventually address what we mean for an algorithm to be optimal.
—Joel Kindiak, 11 Feb 26, 1735H