What would be a zero-th order solution to the $k$-armed bandit problem? Given the $k$-armed bandit $\nu = (\nu_1, \dots, \nu_k)$ and a time horizon $n$, our goal is to minimise the quantity

$$R_n = \sum_{i=1}^k \Delta_i \, \mathbb{E}[T_i(n)],$$

where:

- $A_1, \dots, A_n$ denotes our choice of arms, i.e., our solution,
- $T_i(n) = \sum_{t=1}^n \mathbb{1}\{A_t = i\}$ denotes the total number of pulls of arm $i$ as of time $n$,
- and $\Delta_i = \mu_1 - \mu_i$ denotes the suboptimality gap, where $\mu_i$ is the mean of $\nu_i$.

Here, we assumed that $\mu_1$ is the unique maximum of $\mu_1, \dots, \mu_k$, so that $\Delta_1 = 0$ and $\Delta_i > 0$ for $i \neq 1$.
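To make the regret concrete, here is a minimal sketch (with made-up numbers for a hypothetical 3-armed instance) that evaluates $R_n$ for a fixed allocation of pulls:

```python
import numpy as np

# Hypothetical 3-armed instance: mu[i] is the mean of arm i's reward distribution.
mu = np.array([1.0, 0.7, 0.5])   # arm 1 (index 0) is the unique maximum
delta = mu.max() - mu            # suboptimality gaps: [0.0, 0.3, 0.5]

# Suppose a policy pulled the arms this many times over a horizon n = 100.
T = np.array([80, 15, 5])

# Regret: sum over arms of the gap times the (expected) pull count.
print(np.dot(delta, T))          # 0.3 * 15 + 0.5 * 5 = 7.0
```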
As a first pass, let’s explore, then commit.
- We fix an exploration time $m \in \{1, \dots, \lfloor n/k \rfloor\}$, and we explore each arm $m$ times first. During this exploration phase, we would obtain the quantity $\hat{\mu}_i = \frac{1}{m} \sum_{j=1}^m X_{i,j}$, where each $X_{i,j} \sim \nu_i$ independently.
- For the remaining $n - mk$ turns we pull the arm $\hat{i}$ that corresponds to $\hat{\mu}_{\hat{i}} = \max_{1 \le i \le k} \hat{\mu}_i$.
This algorithm is called the explore-then-commit algorithm.
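As a concrete reference, here is a minimal sketch of ETC in Python, assuming unit-variance Gaussian rewards (the means below are made up):

```python
import numpy as np

def explore_then_commit(mu, n, m, rng):
    """Run ETC once on a Gaussian bandit with means `mu`; return the total reward."""
    k = len(mu)
    # Exploration phase: m pulls per arm, unit-variance Gaussian rewards.
    samples = rng.normal(loc=np.repeat(mu, m), scale=1.0)
    mu_hat = samples.reshape(k, m).mean(axis=1)
    # Commit phase: pull the empirically best arm for the remaining n - m*k turns.
    best = int(np.argmax(mu_hat))
    return samples.sum() + rng.normal(loc=mu[best], scale=1.0, size=n - m * k).sum()

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.8])                # hypothetical means; arm 0 is optimal
n, m = 1000, 50
rewards = [explore_then_commit(mu, n, m, rng) for _ in range(200)]
print(n * mu.max() - np.mean(rewards))   # Monte Carlo estimate of the regret R_n
```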
Lemma 1. For each arm $i$,

$$\mathbb{E}[T_i(n)] \le m + (n - mk)\,\mathbb{P}(\hat{\mu}_i \ge \hat{\mu}_1).$$

Proof. For any arm $i$, after the initial exploration phase, the algorithm commits to arm $i$ only if $\hat{\mu}_i \ge \hat{\mu}_1$, so

$$T_i(n) = m + (n - mk)\,\mathbb{1}\{\hat{i} = i\} \le m + (n - mk)\,\mathbb{1}\{\hat{\mu}_i \ge \hat{\mu}_1\}.$$

Therefore, for each arm $i$,

$$\mathbb{E}[T_i(n)] \le m + (n - mk)\,\mathbb{P}(\hat{\mu}_i \ge \hat{\mu}_1).$$
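Lemma 1 is easy to check by simulation; a sketch under a unit-variance Gaussian assumption (means again made up), using the fact that $\hat{\mu}_i \sim \mathcal{N}(\mu_i, 1/m)$ after $m$ pulls:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 0.8, 0.6])     # hypothetical means; arm 0 is optimal
n, m, trials = 1000, 50, 100_000
k = len(mu)

# Draw the empirical means directly: mu_hat_i ~ N(mu_i, 1/m) after m pulls.
mu_hat = rng.normal(mu, 1 / np.sqrt(m), size=(trials, k))
committed = mu_hat.argmax(axis=1)

for i in range(1, k):
    e_T = m + (n - m * k) * (committed == i).mean()                  # E[T_i(n)]
    bound = m + (n - m * k) * (mu_hat[:, i] >= mu_hat[:, 0]).mean()  # Lemma 1
    print(i, e_T, bound)           # the estimate should not exceed the bound
```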
Therefore, the efficacy of the explore-then-commit algorithm depends on our ability to bound the term

$$\mathbb{P}(\hat{\mu}_i \ge \hat{\mu}_1),$$

which itself depends on the underlying distributions $\nu_1, \dots, \nu_k$. There are many probability distributions that we could consider, such as the Bernoulli distribution or the Gaussian distribution. The Gaussian distribution, in particular, yields especially pleasant concentration results.
Lemma 2. For any $\sigma > 0$, if $X \sim \mathcal{N}(0, \sigma^2)$, then for any $\lambda \in \mathbb{R}$,

$$\mathbb{E}[e^{\lambda X}] = e^{\lambda^2 \sigma^2 / 2}.$$

Proof. Fix $\lambda \in \mathbb{R}$ and $Z \sim \mathcal{N}(0, 1)$. By definition,

$$\mathbb{E}[e^{\lambda Z}] = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{\lambda z - z^2/2}\,dz = e^{\lambda^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-(z - \lambda)^2/2}\,dz = e^{\lambda^2/2}.$$

Therefore, $\mathcal{N}(0, 1)$ is $1$-sub-Gaussian.

For the general case, write $X = \sigma Z$. Then

$$\mathbb{E}[e^{\lambda X}] = \mathbb{E}[e^{(\lambda \sigma) Z}] = e^{(\lambda \sigma)^2/2} = e^{\lambda^2 \sigma^2/2}.$$

Therefore, $\mathcal{N}(0, \sigma^2)$ is $\sigma$-sub-Gaussian.
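This identity is easy to sanity-check by Monte Carlo; a quick sketch with arbitrarily chosen parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, lam = 2.0, 0.7                    # arbitrary choices of sigma and lambda
x = rng.normal(0.0, sigma, size=10**6)

print(np.exp(lam * x).mean(),            # Monte Carlo estimate of E[exp(lam * X)]
      np.exp(lam**2 * sigma**2 / 2))     # exact value from Lemma 2
```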
Hence, we generalise this property: any distribution whose moment generating function satisfies such a bound shall henceforth be called sub-Gaussian.
Definition 1. Let $\nu$ denote a probability distribution. We say that $\nu$ is $\sigma$-sub-Gaussian if for any $\lambda \in \mathbb{R}$ and $X \sim \nu$,

$$\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2 \sigma^2 / 2}.$$
Unsurprisingly, $\mathcal{N}(0, \sigma^2)$ is $\sigma$-sub-Gaussian. Intuitively, the tails of a sub-Gaussian distribution decay at least as fast as those of a Gaussian.
Theorem 1. Let $\nu$ be $\sigma$-sub-Gaussian. Then for any $\varepsilon \ge 0$ and $X \sim \nu$,

$$\mathbb{P}(X \ge \varepsilon) \le \exp\left(-\frac{\varepsilon^2}{2\sigma^2}\right).$$

Proof. We use the Cramér-Chernoff method by taking exponentials followed by applying Markov’s inequality: for any $\lambda > 0$,

$$\mathbb{P}(X \ge \varepsilon) = \mathbb{P}(e^{\lambda X} \ge e^{\lambda \varepsilon}) \le \frac{\mathbb{E}[e^{\lambda X}]}{e^{\lambda \varepsilon}} \le \exp\left(\frac{\lambda^2 \sigma^2}{2} - \lambda \varepsilon\right).$$

In particular, the exponent $\frac{\lambda^2 \sigma^2}{2} - \lambda \varepsilon$ is minimised by $\lambda = \varepsilon/\sigma^2$, so that

$$\min_{\lambda > 0}\left(\frac{\lambda^2 \sigma^2}{2} - \lambda \varepsilon\right) = -\frac{\varepsilon^2}{2\sigma^2}.$$

Therefore, particularising to $\lambda = \varepsilon/\sigma^2$,

$$\mathbb{P}(X \ge \varepsilon) \le \exp\left(-\frac{\varepsilon^2}{2\sigma^2}\right).$$
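A quick numerical check of the tail bound against a Gaussian, with arbitrary parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, eps = 1.0, 2.0                          # arbitrary parameters
x = rng.normal(0.0, sigma, size=10**7)

print((x >= eps).mean(),                       # empirical P(X >= eps), ~0.023
      np.exp(-eps**2 / (2 * sigma**2)))        # Theorem 1's bound, ~0.135
```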
Remark 1. Given $X \sim \nu$ with zero mean, we say that $X$ is $\sigma$-sub-Gaussian if $\nu$ is $\sigma$-sub-Gaussian.
Sub-Gaussian distributions can be combined to produce new sub-Gaussian distributions.
Lemma 3. Let $X$ be $\sigma$-sub-Gaussian, $X_1$ be $\sigma_1$-sub-Gaussian, and $X_2$ be $\sigma_2$-sub-Gaussian, all independent of each other. Then the following hold:

- For any $c \in \mathbb{R}$, $cX$ is $|c|\sigma$-sub-Gaussian.
- $X_1 + X_2$ is $\sqrt{\sigma_1^2 + \sigma_2^2}$-sub-Gaussian.
Proof. By a similar argument to that in Lemma 2, for any $\lambda \in \mathbb{R}$,

$$\mathbb{E}[e^{\lambda (cX)}] = \mathbb{E}[e^{(\lambda c) X}] \le e^{(\lambda c)^2 \sigma^2/2} = e^{\lambda^2 (|c|\sigma)^2/2},$$

as required. The second result follows similarly by denoting $\sigma' = \sqrt{\sigma_1^2 + \sigma_2^2}$ and using the independence of the random variables:

$$\mathbb{E}[e^{\lambda (X_1 + X_2)}] = \mathbb{E}[e^{\lambda X_1}]\,\mathbb{E}[e^{\lambda X_2}] \le e^{\lambda^2 \sigma_1^2/2}\, e^{\lambda^2 \sigma_2^2/2} = e^{\lambda^2 (\sigma')^2/2},$$

as required.
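A sketch verifying both closure properties empirically for Gaussians, with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(4)
s1, s2, c, lam = 1.0, 2.0, -3.0, 0.4     # arbitrary parameters
x1 = rng.normal(0.0, s1, size=10**6)
x2 = rng.normal(0.0, s2, size=10**6)

# c * X1 should be |c| * s1-sub-Gaussian: empirical MGF vs. the bound.
print(np.exp(lam * c * x1).mean(), np.exp(lam**2 * (abs(c) * s1)**2 / 2))

# X1 + X2 should be sqrt(s1^2 + s2^2)-sub-Gaussian: likewise.
print(np.exp(lam * (x1 + x2)).mean(), np.exp(lam**2 * (s1**2 + s2**2) / 2))
```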
Definition 2. We say that a probability distribution $\nu$ has zero mean if for any $X \sim \nu$, $\mathbb{E}[X] = 0$.
Corollary 1. Suppose for simplicity that each $\nu_i$ is $1$-sub-Gaussian. Then the explore-then-commit (i.e. ETC) algorithm yields a regret upper bound of

$$R_n \le \sum_{i=1}^k \Delta_i \left(m + (n - mk)\, \exp\left(-\frac{m \Delta_i^2}{4}\right)\right).$$

Proof. Suppose each $\nu_i$ is $1$-sub-Gaussian, i.e. each noise $X_{i,j} - \mu_i$ is $1$-sub-Gaussian (in the sense of Remark 2 below). By Lemma 3, the random variable $\hat{\mu}_i - \mu_i$ has zero mean and is $\frac{1}{\sqrt{m}}$-sub-Gaussian. Likewise, $(\hat{\mu}_i - \mu_i) - (\hat{\mu}_1 - \mu_1)$ has zero mean and is $\sqrt{2/m}$-sub-Gaussian. By Theorem 1,

$$\mathbb{P}(\hat{\mu}_i \ge \hat{\mu}_1) = \mathbb{P}\big((\hat{\mu}_i - \mu_i) - (\hat{\mu}_1 - \mu_1) \ge \Delta_i\big) \le \exp\left(-\frac{m \Delta_i^2}{4}\right).$$

Substituting into Lemma 1 and the definition of $R_n$, we have

$$R_n = \sum_{i=1}^k \Delta_i\, \mathbb{E}[T_i(n)] \le \sum_{i=1}^k \Delta_i \left(m + (n - mk)\, \exp\left(-\frac{m \Delta_i^2}{4}\right)\right).$$
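The bound is easy to evaluate; a small sketch with hypothetical instance parameters:

```python
import numpy as np

def etc_regret_bound(delta, n, m):
    """Corollary 1: sum_i Delta_i * (m + (n - m*k) * exp(-m * Delta_i^2 / 4))."""
    k = len(delta)
    return float(np.sum(delta * (m + (n - m * k) * np.exp(-m * delta**2 / 4))))

delta = np.array([0.0, 0.3, 0.5])        # hypothetical gaps for a 3-armed bandit
for m in (10, 50, 100):
    print(m, etc_regret_bound(delta, n=10_000, m=m))
```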
Having a result that works for basically any sub-Gaussian distribution $\nu_i$, we must ask: what other distributions are sub-Gaussian? Lemma 2 shows us that the Gaussian is sub-Gaussian par excellence. Bounded random variables ought not be too far off either.
Example 1. Given a probability distribution $\nu$ with zero mean, suppose that there exists $B \ge 0$ such that for any $X \sim \nu$, $|X| \le B$ almost surely. Then $\nu$ is $B$-sub-Gaussian.
Proof. We follow this post. Denote $\psi(\lambda) = \mathbb{E}[e^{\lambda X}]$. By differentiating under the integral sign, we have

$$\psi^{(n)}(\lambda) = \mathbb{E}[X^n e^{\lambda X}]$$

for any natural number $n$. Denote $\varphi(\lambda) = \log \psi(\lambda)$, where the logarithm is always taken with base $e$. Differentiating twice,

$$\varphi''(\lambda) = \frac{\psi''(\lambda)}{\psi(\lambda)} - \left(\frac{\psi'(\lambda)}{\psi(\lambda)}\right)^2,$$

and hence,

$$\varphi''(\lambda) = \frac{\mathbb{E}[X^2 e^{\lambda X}]}{\psi} - \left(\frac{\mathbb{E}[X e^{\lambda X}]}{\psi}\right)^2,$$

where we denoted $\psi = \psi(\lambda)$ for brevity. For each $\lambda \in \mathbb{R}$, define the probability measure $\mathbb{P}_\lambda$ by

$$d\mathbb{P}_\lambda = \frac{e^{\lambda X}}{\psi(\lambda)}\, d\mathbb{P},$$

so that $\mathbb{P}_\lambda$ is indeed a probability measure. For any random variable $Y$,

$$\mathbb{E}_\lambda[Y] = \frac{\mathbb{E}[Y e^{\lambda X}]}{\psi(\lambda)}.$$

Therefore,

$$\varphi''(\lambda) = \mathbb{E}_\lambda[X^2] - \mathbb{E}_\lambda[X]^2 = \operatorname{Var}_\lambda(X) \le \mathbb{E}_\lambda[X^2] \le B^2.$$

Now fix $\lambda \in \mathbb{R}$. Use Taylor’s theorem to construct $\xi$ between $0$ and $\lambda$ such that

$$\varphi(\lambda) = \varphi(0) + \varphi'(0)\,\lambda + \frac{\varphi''(\xi)}{2}\,\lambda^2 = \frac{\varphi''(\xi)}{2}\,\lambda^2 \le \frac{\lambda^2 B^2}{2},$$

using $\varphi(0) = 0$ and $\varphi'(0) = \mathbb{E}[X] = 0$. Therefore, $\psi(\lambda) \le e^{\lambda^2 B^2/2}$, which implies that $\nu$ is $B$-sub-Gaussian.
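A quick empirical check of Example 1’s bound for an arbitrary bounded, zero-mean example (uniform on $[-B, B]$):

```python
import numpy as np

rng = np.random.default_rng(5)
B, lam = 1.5, 0.8                        # arbitrary bound and lambda
x = rng.uniform(-B, B, size=10**6)       # zero mean, |X| <= B almost surely

print(np.exp(lam * x).mean(),            # empirical MGF E[exp(lam * X)]
      np.exp(lam**2 * B**2 / 2))         # Example 1's bound
```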
Remark 2. In the case where $X \sim \nu$ has nonzero finite mean $\mu$, we say that $X$ is $\sigma$-sub-Gaussian if its noise $X - \mu$ is.
Example 2. Given a probability distribution $\nu$, suppose that there exist $a \le b$ such that for any $X \sim \nu$, $a \le X \le b$ almost surely. Then $\nu$ is $(b - a)$-sub-Gaussian.
Proof. The noise $X - \mu$ has zero mean and, since $\mu \in [a, b]$, satisfies $|X - \mu| \le b - a$ almost surely; by Example 1, it is sub-Gaussian with parameter $b - a$.
Example 3. For any $p \in [0, 1]$, the distribution $\operatorname{Ber}(p)$ is $1$-sub-Gaussian.

Proof. Given $X \sim \operatorname{Ber}(p)$, $0 \le X \le 1$ almost surely. By Example 2, it is $1$-sub-Gaussian.
Returning to Corollary 1, we notice that the exploration time $m$ could plausibly be optimised to minimise the right-hand side. The smaller $m$ is, the less we explore, yielding a higher probability of committing to a suboptimal arm. On the other hand, the larger $m$ is, the more we explore, potentially compromising on the number of turns left for exploitation. Could we balance this trade-off more organically?
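One way to see the trade-off concretely is to sweep $m$ and minimise the bound from Corollary 1 numerically; a sketch with the same hypothetical gaps as before:

```python
import numpy as np

def etc_regret_bound(delta, n, m):
    # Corollary 1's bound on R_n for exploration time m.
    k = len(delta)
    return float(np.sum(delta * (m + (n - m * k) * np.exp(-m * delta**2 / 4))))

delta = np.array([0.0, 0.3, 0.5])        # hypothetical gaps
n, k = 10_000, len(delta)

# Sweep every feasible exploration time and pick the minimiser of the bound.
ms = np.arange(1, n // k + 1)
bounds = [etc_regret_bound(delta, n, m) for m in ms]
best = int(np.argmin(bounds))
print(ms[best], bounds[best])            # the m that best balances the two terms
```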
The upper confidence bound (UCB) algorithm aims to do just that. We turn to it next time.
—Joel Kindiak, 10 Feb 26, 2229H