Let’s finally discuss Thompson sampling.
Previously, we looked at the UCB algorithm, which incorporates the dynamic reward history into its exploration-exploitation tradeoff and yields an upper bound on its regret of the form
After an initial pull of each arm that yields the reward history, subsequent arm pulls are computed deterministically from that history:
where the expression being arg-maxed is motivated by a tail concentration bound that, for a given confidence level δ, is satisfied by any arm with a σ-sub-Gaussian reward distribution:
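(For readers who prefer to see things in code: here is a minimal Python sketch of a UCB1-style selection rule in the spirit of the display above. The exploration constant, the confidence parameter delta and the assumption of 1-sub-Gaussian rewards are illustrative choices of mine, not necessarily those of the earlier post.)

import numpy as np

def ucb_select(rewards_by_arm, delta=0.01):
    # Pick the arm maximising empirical mean plus a sub-Gaussian exploration bonus.
    # rewards_by_arm[i] holds the rewards observed so far from arm i; each arm is
    # assumed to have been pulled at least once.
    indices = []
    for rewards in rewards_by_arm:
        rewards = np.asarray(rewards, dtype=float)
        pulls = rewards.size
        bonus = np.sqrt(2.0 * np.log(1.0 / delta) / pulls)  # 1-sub-Gaussian bonus
        indices.append(rewards.mean() + bonus)
    return int(np.argmax(indices))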
Thompson sampling aims to randomise this selection. To simplify our discussion, we will be given knowledge that each arm's reward distribution belongs to a known family, and our goal is to learn the unknown parameter that pins it down.
When we initialise the algorithm, we are completely ignorant of the distribution of any arm. So we get to choose our initial guess. Each arm carries a guess of its distribution, and the guess before any pulls is our initial guess. Intuitively, after a number of pulls of an arm, we want to update our guess given the previous guess and the reward obtained from said arm, i.e. we set
where the update rule is carefully curated for our purposes.
The Thompson sampling algorithm then proceeds as follows. As an initial exploration phase, pull each arm once, then update the arm distributions:
For each subsequent round, draw one sample independently from each arm's current belief distribution, select the arm whose sample is largest, and finally collect the reward from that arm. Then update the arms just like before:
where the various probability measures are still conditioned on the reward history, but omitted for brevity.
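To keep ourselves honest, here is a minimal Python sketch of this loop. The belief representation and function names are mine: sample_value draws one value from an arm's current belief, update performs the posterior-style update, and pull draws a reward from the chosen arm.

def thompson_sampling(beliefs, sample_value, update, pull, horizon):
    # Generic Thompson sampling: explore each arm once, then repeatedly sample one
    # value per arm from its belief, play the argmax, and update only that arm.
    k = len(beliefs)
    history = []
    for arm in range(k):                      # initial exploration phase
        reward = pull(arm)
        beliefs[arm] = update(beliefs[arm], reward)
        history.append((arm, reward))
    for _ in range(horizon - k):              # main loop
        samples = [sample_value(belief) for belief in beliefs]
        arm = max(range(k), key=lambda i: samples[i])
        reward = pull(arm)
        beliefs[arm] = update(beliefs[arm], reward)
        history.append((arm, reward))
    return history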
Our current discussion has brought us far too high into the stratosphere, and we need to return to earth. It stands to reason then that, in the absence of additional information, a meaningful “initial guess” would be
After pulling an arm once and receiving a reward, our next guess should, intuitively, take the form of an updated distribution whose parameters are intimately related to the reward we just observed. We can accomplish that goal using conjugate priors, justified by a measure-theoretic application of Bayes’ theorem.
Lemma 1. Suppose that, for any arm,
Using conjugate priors,
In particular, if the rewards are Gaussian with known variance, then
Proof. See Section 34.3 and Example 34.7 in Lattimore and Szepesvari (2020).
Therefore, we would use the procedure as follows:
with all relevant quantities defined in Lemma 1.
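To make this concrete in the Gaussian case we will use later: if an arm's rewards are Gaussian with unit variance and we take a standard normal initial guess (an illustrative assumption of mine; any conjugate normal prior behaves analogously), then after observing rewards x_1, …, x_m from that arm, the updated guess is

\[ \mathcal{N}\left( \frac{x_1 + \cdots + x_m}{m+1},\ \frac{1}{m+1} \right), \]

i.e. a Gaussian whose mean shrinks the empirical average slightly toward the prior mean of zero and whose variance decays like one over the number of observations.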
Remark 1. Observe that the mean of a Gaussian is just its first parameter. More generally, since a Gaussian distribution is characterised by its mean and variance, for any risk functional ρ, there exists a unique function of the mean and variance that evaluates ρ on that Gaussian.
Remark 2. If all we know is that each arm's distribution lies in a more general family, then we would require a stronger update procedure, as outlined by Zhu and Tan (2020) (Figure 2). The upside of this approach is that, for any arm, we can sample the relevant parameters to form the randomised distribution. Here, the parameters are updated according to rules
similar in spirit to Lemma 1. Denoting the sampled quantities accordingly, we sample them independently for each arm and define the corresponding index as the risk functional evaluated at the randomised distribution. Then the arm-selection criterion would be similarly modified to
Now we ask the crucial question: given a bandit instance with a fixed horizon, how do we upper-bound the following cumulative regret?
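For reference, in the simple mean-based case the cumulative regret being asked about is the standard quantity below, written in my own notation with k arms of means μ_1, …, μ_k, chosen arm A_t at time t, and T_i(n) pulls of arm i up to the horizon n; in the risk-functional setting one would replace each mean by the corresponding risk value:

\[ R_n = n \max_{1 \le i \le k} \mu_i - \mathbb{E}\left[ \sum_{t=1}^{n} \mu_{A_t} \right] = \sum_{i=1}^{k} \Delta_i \, \mathbb{E}[T_i(n)], \qquad \Delta_i := \max_{j} \mu_j - \mu_i. \]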
My attempt at solving this problem for the risk functional being the conditional value-at-risk (CVaR) was the substance of my undergraduate research, inspired by Zhu and Tan (2020) and Baudry et al. (2021). In this post, we will ultimately only solve the simple case where the risk functional is the mean and each arm's reward distribution is Gaussian with unit variance, while we chart the way for future progress on the generalised problem.
Remark 3. There are many other distributions that the arms could take. In the case where the arms are all Bernoulli bandits, my final year project solves the problem in the affirmative for general risk measures, adapted from seminal works by Riou and Honda (2020) and Baudry et al. (2021). It even solves the problem if each arm is simply known to have finite support, albeit only guaranteed for a stricter subset of (nonetheless useful) risk functionals. Unfortunately, it has not solved the general case of arms with arbitrary bounded support, or arms being Gaussian, or even arms being sub-Gaussian. I was tempted to restart this project, but shall relegate it as an exercise. Doing these posts on bandits brought me a closure I never knew I needed.
Lemma 2. Assume that arm 1 is uniquely optimal. Fix a suboptimal arm and a tolerance. Define the two quantities of interest
and
where:
the per-arm samples are drawn independently,
the remaining quantities are determined by the reward history, and
ρ is a risk functional (i.e. a map from probability distributions to the real numbers).
Then
Hence, we remove the randomness of the arm-selection process in our subsequent analyses.
Proof. The result and its proof are a mild generalisation of Theorem 36.2 in Lattimore and Szepesvari (2020). By construction, the stated relation holds almost surely. Therefore, we now decompose the expectation of interest:
To upper-bound the first of these quantities, we refer to the solutions to Problem 36.5 of Lattimore and Szepesvari (2020). Define
Using a union bound,
Therefore, it suffices to upper-bound
To upper-bound the next quantity, first introduce notation for the chosen non-optimal arm:
Then
since the sampled value is conditionally independent of the chosen arm given the reward history. To lower-bound the second term further,
and conditional independence given the reward history implies that
Therefore, we can upper-bound this quantity by
To upper-bound the next quantity, recall that
Therefore,
Taking expectations,
To upper-bound the final quantity, note that the defining event occurs if and only if the following holds:
Therefore,
Lemma 2 gives us the general recipe to evaluate the regret bound for ρ-TS, but we don’t get any concrete result until we consider specific bandit instances. For simplicity, let’s suppose that each arm's reward distribution is Gaussian with unit variance. We initialise the beliefs as discussed above and use Lemma 1 to update our belief distribution of arm i at time t after observing that arm's independently drawn rewards.
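In code, the specific instance just described amounts to the following drop-in pieces for the generic loop sketched earlier, where a belief is stored as the pair (sum of that arm's rewards, number of its pulls) and the standard normal initial guess is again my illustrative assumption:

import random

def sample_value(belief):
    # Draw from the conjugate posterior N(s / (n + 1), 1 / (n + 1))
    # (standard normal prior, unit-variance rewards).
    s, n = belief
    return random.gauss(s / (n + 1), (1.0 / (n + 1)) ** 0.5)

def update(belief, reward):
    s, n = belief
    return (s + reward, n + 1)

# These plug into the earlier loop with, say, beliefs = [(0.0, 0)] * k and
# pull = lambda arm: random.gauss(true_means[arm], 1.0).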
How do we upper-bound the cumulative regret in this instance?
Fix a suboptimal arm and consider the regret bound recipe in Lemma 2:
where the relevant quantities are defined by
and
where the samples involved are drawn independently. We aim to upper-bound both terms and then eventually make a suitable choice of the free parameter.
Lemma 3. The first term in this recipe remains bounded by a constant as the horizon tends to infinity.
Proof. To upper-bound this term, we first use linearity:
The key idea is to note that the value of each summand is deterministic when conditioned on the reward history, since
and the distribution of the sampled value, in turn, is determined by the reward history:
Therefore, for flexibility, we introduce notation for the p.d.f. of this conditional distribution:
and for its c.d.f.:
Conditioned on the reward history,
so that
almost surely. Therefore, by taking advantage of the marginal distribution and using a change of variables,
Therefore, we need to upper-bound two integrals. For the first integral, given
and the Gaussian tail bound
we have
For the second integral, using a tail lower bound by Abramowitz and Stegun (1964),
followed by algebra and integration by parts,
Combining both results,
In particular, the sum remains bounded as the horizon tends to infinity.
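Since the whole argument leans on standard normal tail estimates, here is a quick numerical check of the two bounds I have in mind here, namely the sub-Gaussian upper bound exp(-t^2 / 2) and a Mills-ratio-style lower bound, t / (t^2 + 1) times the standard normal density, in the spirit of Abramowitz and Stegun; whether these are exactly the constants used in the displays above is my assumption.

import math
from statistics import NormalDist

Z = NormalDist()
for t in (0.5, 1.0, 2.0, 3.0, 4.0):
    tail = 1.0 - Z.cdf(t)                                      # P(Z >= t)
    upper = math.exp(-t * t / 2.0)                             # sub-Gaussian upper bound
    density = math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
    lower = t * density / (t * t + 1.0)                        # Abramowitz-Stegun-style lower bound
    assert lower <= tail <= upper
    print(f"t={t}: {lower:.3e} <= {tail:.3e} <= {upper:.3e}")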
Remark 4. In the general setting, we would also aim to prove an analogous bound, so that the corresponding sum remains controlled as the horizon grows. In the proof of Lemma 3, we have the dependency of the sampled value on the belief distribution, and that intermediate dependency has a relatively characterisable distribution conditioned on the reward history. More generally, at any time, we would obtain the belief distribution as the image of the reward history under some deterministic function. In the vanilla case, each of these objects specialises to the corresponding quantity in the proof of Lemma 3.
If we adopt the same strategy as in Lemma 3, then we would like to be sufficiently computationally familiar with the relevant Radon-Nikodým derivative with respect to a suitable reference measure. Taking the integral over the appropriate measure space that describes the possible realisations of the reward history, the conditional expectation should, once again, yield
Hence, we would require tail lower bounds of the form
for non-negative functions chosen so that the resulting bound is summable and yields a constant-order term. It suffices to say that the lion’s share of research in this field would go toward discovering and deriving useful tail lower bounds for these belief distributions, where we are reminded that
Lemma 4. The second term in the recipe grows at most logarithmically in the horizon.
Proof. To upper-bound this term, we cheat. We first note that
Define the auxiliary quantity by
Using the Gaussian tail bound again,
where the last line follows from the proof of Theorem 8.1 in Lattimore and Szepesvari (2020). In particular,
Remark 5. For the general case, we would like to obtain a concentration bound of the form
where the bounding function is non-negative and, for some desired constant, decays appropriately as its argument grows. Then, for a judiciously chosen threshold,
In particular, a suitable setting of that threshold yields
so that this term again grows at most logarithmically in the horizon.
Theorem 1. Given that each arm's reward distribution is Gaussian with unit variance, the regret bound of ρ-TS is given by
Proof. Combining the displays in Lemmas 2, 3 and 4,
Plugging into the regret decomposition,
Taking a suitable choice of the free parameter then yields the claimed bound.
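As an empirical sanity check of the flavour of Theorem 1 (illustrative only: the three-armed instance, the standard normal initial guesses and the repetition count below are my choices), the cumulative regret of the Gaussian sampler grows markedly sublinearly in the horizon:

import random

def gaussian_ts_regret(true_means, horizon):
    # Cumulative pseudo-regret of Gaussian Thompson sampling with unit-variance
    # rewards and N(0, 1) initial guesses on each mean.
    k, best = len(true_means), max(true_means)
    sums, counts, regret = [0.0] * k, [0] * k, 0.0
    for _ in range(horizon):
        samples = [random.gauss(sums[i] / (counts[i] + 1), (1.0 / (counts[i] + 1)) ** 0.5)
                   for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        sums[arm] += random.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        regret += best - true_means[arm]
    return regret

for n in (1_000, 4_000, 16_000):
    average = sum(gaussian_ts_regret([0.0, 0.25, 0.5], n) for _ in range(5)) / 5
    print(n, round(average, 1))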
Remark 6. Assuming the presence of sufficiently nice concentration bounds in Remarks 4 and 5, we can recover the generalised regret bound
Taking the analogous choice of parameter then yields the corresponding bound.
That was…intense. I took at least six hours to type this out. And several more days to include the generalising remarks. And this was supposedly the warm-up for my undergraduate research projects. I have to say…I probably would have progressed much further if I had taken the time back then to write this proof out as I did today.
Before we can finally lay to rest our short detour into multi-armed bandits, we must ask the question: is this Thompson sampling algorithm, in a sense, the best? We will answer this question in the affirmative next time.
—Joel Kindiak, 13 Feb 26, 2117H