Risky Bernoulli Bandits

In this appendix to the multi-armed bandit writeups, I thought I’d revisit my final year project in a relatively readable manner, demonstrating that \rho-Thompson Sampling is asymptotically optimal for Bernoulli bandits (that is, \nu_i = \mathrm{Ber}(p_i) for each i). This is the instantiation M = 1 of that project.

As a set-up, we initialise \mathbb P_{i,0} = \mathrm{Beta}(\alpha, \beta) with p.d.f.

\displaystyle f_{\alpha , \beta}(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \cdot x^{\alpha - 1} \cdot (1-x)^{\beta - 1},

since the conjugate prior of a Bernoulli distribution is the Beta distribution. For the update function, we use

\mathrm{Update}(\mathrm{Beta}(\alpha, \beta), x) = \mathrm{Beta}(\alpha + x, \beta + 1 - x)

for an observed reward x \in \{0, 1\}, and set \mathbb P_{i,t+1} = \mathrm{Update}(\mathbb P_{i,t}, X_{t+1}) for the pulled arm (leaving the other posteriors unchanged). At each time t, we sample \theta_{i,t+1} \sim \mathbb P_{i,t} and pull the arm

\displaystyle A_{t+1} = \underset{i=1,\dots, K}{\arg \max}\ \rho(\mathrm{Ber}( \theta_{i,t+1} )).
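To make the loop concrete, here is a minimal Python sketch of \rho-Thompson Sampling as described above. The names (`rho_ts`, `rho_tilde`, `ps`) and the \mathrm{Beta}(1,1) prior are placeholder choices of mine, not the project’s actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

def rho_ts(ps, rho_tilde, horizon):
    """Sketch of rho-Thompson Sampling on a Bernoulli bandit.

    ps        : true success probabilities of the K arms (unknown to the learner)
    rho_tilde : map p -> rho(Ber(p)), applied to each posterior sample
    """
    K = len(ps)
    alpha = np.ones(K)  # Beta(1, 1) priors
    beta = np.ones(K)
    pulls = np.zeros(K, dtype=int)
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)             # theta_{i,t+1} ~ P_{i,t}
        arm = int(np.argmax([rho_tilde(t) for t in theta]))
        reward = rng.binomial(1, ps[arm])         # X ~ Ber(p_arm)
        alpha[arm] += reward                      # Beta posterior update
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# The risk-neutral choice rho_tilde(p) = p recovers vanilla Thompson Sampling
pulls = rho_ts([0.3, 0.7], lambda p: p, horizon=2000)
```

As expected, the better arm quickly dominates the pull counts.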

Observe that the map p \mapsto \mathrm{Ber}(p) \mapsto \rho(\mathrm{Ber}(p)) induces a unique map \tilde{\rho} : [0, 1] \to \mathbb R given by \tilde{\rho}(p) = \rho(\mathrm{Ber}(p)). Following the Thompson sampling strategy, we need to obtain useful tail concentration bounds. We will impose some technical conditions on the risk functional \rho to make this happen.

Definition 1. Call a risk functional \rho continuous if \tilde{\rho} : [0,1] \to \mathbb R is continuous.

If \rho is continuous, then \tilde{\rho}([0,1]) is compact, and by the extreme value theorem there exist p_1, p_2 \in [0,1] such that for any p \in [0, 1],

\tilde{\rho}(p_1) \leq \tilde{\rho}(p) \leq \tilde{\rho}(p_2).

Lemma 1. If \rho is continuous, then for any p \in [0,1] and r \in \mathbb R, there exists p_* \in [0, 1] such that

\mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(p), r) \equiv \mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(p), r, \mathrm{Ber}) = \mathrm{KL}(\mathrm{Ber}(p), \mathrm{Ber}(p_*)).

Proof. By definition,

\displaystyle \mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(p), r, \mathrm{Ber}) = \inf_{q : \tilde{\rho}(q) > r} \underbrace{ \mathrm{KL}(\mathrm{Ber}(p), \mathrm{Ber}(q)) }_{f(q)}.

By a direct computation,

\displaystyle \mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(p), r, \mathrm{Ber}) = \inf_{q : \tilde{\rho}(q) > r}  \underbrace{ p \log \frac pq + (1-p) \log \frac{1-p}{1-q} }_{f(q)},

so that f is continuous on (0, 1).

If r < \tilde{\rho}(p_1), then \tilde{\rho}(p) > r, so q = p is feasible and we may simply choose p_* = p, which attains \mathrm{KL} = 0. If r \geq \tilde{\rho}(p_2), then the constraint set is empty and the left-hand side is infinite, so we choose p_* = 0 if p \neq 0 and p_* = 1 otherwise.

Suppose \tilde{\rho}(p_1) \leq r < \tilde{\rho}(p_2). If p = 0 or p = 1, then f(q) = 0 identically, so that we can choose p_* = p_2.

Suppose instead that p \in (0, 1). Fix \epsilon \in (0, \tilde{\rho}(p_2) - r). Then [r + \epsilon, \tilde{\rho}(p_2)] is compact, so that

\tilde{\rho}^{-1}( [r + \epsilon, \infty) ) = \tilde{\rho}^{-1}( [r+ \epsilon, \tilde{\rho}(p_2) ] )

is closed (being the preimage of a closed set under the continuous map \tilde{\rho}) and bounded, and thus compact by the Heine-Borel theorem. Since the sets

\begin{aligned} C_1(\epsilon) &:= \tilde{\rho}^{-1}( [r+ \epsilon, \infty) ) \cap [0, p],\\ C_2 (\epsilon)&:= \tilde{\rho}^{-1}( [r+ \epsilon, \infty) ) \cap [p, 1], \end{aligned}

are also compact, we can define their extrema

q_1(\epsilon) := \max C_1(\epsilon), \quad q_2(\epsilon) := \min C_2(\epsilon).

At least one of these sets will always be non-empty, since p_2 \in [0, 1] and \tilde{\rho}(p_2) > r+\epsilon. If C_1(\epsilon) is empty for every \epsilon, then we can omit its discussion later on. However, if C_1(\epsilon_0) \neq \emptyset for some \epsilon_0, then C_1(\epsilon) \neq \emptyset for any \epsilon \in (0, \epsilon_0). Without loss of generality then, we assume both sets are non-empty for all sufficiently small \epsilon > 0.

Since f(q) = \mathrm{KL}(\mathrm{Ber}(p), \mathrm{Ber}(q)) is non-increasing in q on [0, p] and non-decreasing on [p, 1], we have f(q_j(\epsilon)) \leq f(q) for any q \in C_j(\epsilon). Therefore,

\displaystyle \inf_{q \in C_j(\epsilon)}  f(q) =  f(q_j(\epsilon)).

Now it is clear that C_j(\epsilon_1) \subseteq C_j(\epsilon_2) whenever \epsilon_1 \geq \epsilon_2. By the monotone convergence theorem, there exist q_j \in [0,1] such that q_j(1/n) \to q_j and

\displaystyle \inf_{q \in C_j(0)}  f(q) = f(q_j).

Define C_j'(\epsilon) := C_j(\epsilon) \backslash \tilde{\rho}^{-1}(r). Since smaller sets yield larger infima, for sufficiently large n,

\displaystyle f(q_j) = \inf_{q \in C_j(0)}  f(q) \leq \inf_{q \in C_j'(0)}  f(q) \leq \inf_{q \in C_j(1/n)}  f(q) \leq f(q_j(1/n)).

Taking n \to \infty, by the squeeze theorem,

\displaystyle \inf_{q \in C_j'(0)}  f(q) = f(q_j).

Finally, choose

\displaystyle p_* = \underset{q \in \{q_1, q_2\}}{\arg \min}\ f(q),

to deduce that

\mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(p), r, \mathrm{Ber}) = f(p_*) = \mathrm{KL}(\mathrm{Ber}(p), \mathrm{Ber}(p_*)).
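Before moving on, Lemma 1 can be sanity-checked numerically: a brute-force grid search over the constraint set \{q : \tilde{\rho}(q) > r\} recovers an approximate minimiser p_*. This is an illustrative sketch with my own function names, shown here for the risk-neutral choice \tilde{\rho}(p) = p.

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL(Ber(p), Ber(q)) for scalar p and an array of q values."""
    eps = 1e-12
    q = np.clip(q, eps, 1 - eps)
    out = np.zeros_like(q)
    if p > 0:
        out = out + p * np.log(p / q)
    if p < 1:
        out = out + (1 - p) * np.log((1 - p) / (1 - q))
    return out

def k_inf(p, r, rho_tilde, grid_size=100_001):
    """Approximate K_inf^rho(Ber(p), r) = inf over {q : rho_tilde(q) > r}."""
    q = np.linspace(0.0, 1.0, grid_size)
    feasible = rho_tilde(q) > r
    if not feasible.any():
        return np.inf, None           # empty constraint set: infimum is infinite
    kl = kl_bernoulli(p, q[feasible])
    j = int(np.argmin(kl))
    return float(kl[j]), float(q[feasible][j])

# Risk-neutral rho_tilde(p) = p: for p = 0.3 < r = 0.5, the minimiser sits
# at the boundary of the constraint set, so p_* is close to 0.5 and the
# infimum is close to KL(0.3, 0.5)
val, p_star = k_inf(0.3, 0.5, lambda q: q)
```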

Remark 1. Most arguments in Lemma 1 boil down to the compactness of [0, 1], and by extension, the space \mathrm{Ber} of Bernoulli distributions, as well as the relative continuity properties of \tilde{\rho} and \mathrm{KL}(\cdot, \cdot). Here are some proposed steps for further exploration:

  • Generalise the risk functional to \rho : \mathcal C \to \mathbb R, where \mathcal C is a space of probability distributions that is compact under a suitable metric or topology.
  • We would probably need to partition \mathcal C into compact sets M_1,\dots, M_m, then define the compact sets C_k(\epsilon) = \tilde{\rho}^{-1}([r+\epsilon, \infty)) \cap M_k.
  • During the sequential argument, we could use sequential compactness to concoct a convergent subsequence q_j(1/n) \to q in place of a by-default convergent subsequence. That way, we might still be able to infimise over C_j(0) via C_j(1/n).
  • Since we only care about infimising, we do not need the strength of full continuity for \mathrm{KL}(\cdot,\cdot); rather we would only require its lower semi-continuity property, which could be conceived as the “lower half” of vanilla continuity.

Thanks to the continuity of \rho, we can obtain the pleasant tail upper bounds below.

Theorem 1. Fix r \in \mathbb R and natural numbers \alpha, \beta, and set n := \alpha + \beta. Then there exists a universal constant C_1 such that for any Beta-distributed random variable X \sim \mathrm{Beta}(\alpha, \beta),

\begin{aligned} \mathbb P(\tilde{\rho}(X) \geq r) &\leq C_1\sqrt{n} \cdot \exp( -n \cdot \mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(\alpha/n), r) ), \\ \mathbb P(\tilde{\rho}(X) \leq r) &\leq C_1\sqrt{n} \cdot \exp( -n \cdot \mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(\alpha/n), r) ). \end{aligned}

Proof. Denote the closed (and thus, compact) sets S_1 := \tilde{\rho}^{-1}([r,\infty)) and S_2 := \tilde{\rho}^{-1}((-\infty, r]). Use Lemma 1 to construct p_* \in [0, 1] such that

\mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(\alpha/n), r) = \mathrm{KL}(\mathrm{Ber}(\alpha/n), \mathrm{Ber}(p_*)) \equiv \mathrm{KL}(\alpha/n, p_*)

for brevity. Using the conjugate-prior connection between Beta distributions and Bernoulli distributions, since the sum of i.i.d. Bernoullis yields a Binomial,

\begin{aligned} f_{Y\mid X}(y\mid x) &= \mathbb P(Y = y \mid X = x) \\ &= {n \choose y} \cdot x^{y} (1-x)^{n-y}, \end{aligned}

where Y = \xi_1 + \dots + \xi_n \sim \mathrm{Bin}(n,x) for i.i.d. \xi_i \sim \mathrm{Ber}(x). By Bayes’ rule and the law of total probability, using the uniform distribution prior \mathrm{Beta}(1, 1) with p.d.f. 1:

\begin{aligned} f_{X|Y}(x \mid y) &= \frac{ f_{Y\mid X}(y \mid x) \cdot f_X(x) }{ f_Y(y) } \\ &= \frac{ f_{Y\mid X}(y \mid x) \cdot f_X(x) }{ \int_{[0,1]}  f_{Y \mid X}(y \mid x) \cdot f_X(x) \, \mathrm dx} \\ &= \frac{ {n \choose y} \cdot x^{y} (1-x)^{n-y} \cdot 1 }{ \int_{[0,1]}  {n \choose y}  \cdot x^{y} (1-x)^{n-y} \cdot  1 \, \mathrm dx} \\ &= \frac{ x^{y} (1-x)^{n-y}  }{ \int_{[0,1]}  x^{y} (1-x)^{n-y} \, \mathrm dx}. \end{aligned}

In particular,

\begin{aligned} \mathbb P(X \in S \mid Y = \alpha) &= \int_{S} f_{X\mid Y}(x \mid \alpha)\, \mathrm dx \\ &= \int_{S} \frac{ x^{\alpha} (1-x)^{n-\alpha}  }{ \int_{[0,1]}  x^{\alpha} (1-x)^{n-\alpha} \, \mathrm dx}\, \mathrm dx \\ &= \frac{ \int_{S}  x^{\alpha} (1-x)^{n-\alpha} \, \mathrm dx }{ \int_{[0,1]}  x^{\alpha} (1-x)^{n-\alpha} \, \mathrm dx} . \end{aligned}
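As a quick numerical sanity check of the posterior just derived (with hypothetical values n = 10 and y = 3 of my own choosing), normalising x^y (1-x)^{n-y} does produce the \mathrm{Beta}(y+1, n-y+1) density, whose mean is (y+1)/(n+2):

```python
import numpy as np

# Hypothetical values: n trials, y observed successes
n, y = 10, 3
x = np.linspace(0.0, 1.0, 100_001)
dx = x[1] - x[0]
unnorm = x ** y * (1 - x) ** (n - y)
posterior = unnorm / (unnorm.sum() * dx)   # Riemann-sum normalisation
mean = (x * posterior).sum() * dx          # should be (y + 1) / (n + 2) = 1/3
```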

In what follows, set M = 1 and follow the proof of Lemma 13 in Riou and Honda (2020) to obtain the upper bound

\begin{aligned} \mathbb P(\tilde{\rho}(X) \geq r) = \mathbb P(X \in S_1) &\leq C\sqrt{n} \cdot \exp(  -n \cdot \mathrm{KL}(\alpha/n, p_*)), \\ \mathbb P(\tilde{\rho}(X) \leq r) = \mathbb P(X \in S_2) &\leq C\sqrt{n} \cdot \exp(  -n \cdot \mathrm{KL}(\alpha/n, p_*)), \end{aligned}

where C = e^{1/12}/2\pi using Stirling’s approximation. We remark that in the live algorithm, at time t+1,

\displaystyle \alpha = \sum_{s=1}^{T_i(t)} \mathbb I\{X_s = 1\}

is itself a random variable (the number of successes observed from arm i so far), and the tail concentration bounds above are applied conditionally on \alpha.
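The upper bound can be eyeballed with a quick Monte Carlo experiment. Here I take the risk-neutral \tilde{\rho}(p) = p (an illustrative assumption of mine), for which \mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(\alpha/n), r) reduces to \mathrm{KL}(\alpha/n, r) whenever \alpha/n < r.

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    """KL(Ber(p), Ber(q)) for p, q in (0, 1)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Monte Carlo check of the upper tail bound with rho_tilde(p) = p,
# so K_inf(Ber(alpha/n), r) = KL(alpha/n, r) since alpha/n = 0.2 < r = 0.4
a, b = 10, 40
n = a + b
r = 0.4
X = rng.beta(a, b, size=200_000)
empirical = (X >= r).mean()                     # estimates P(rho_tilde(X) >= r)
bound = np.sqrt(n) * np.exp(-n * kl(a / n, r))  # taking C_1 = 1 for illustration
```

For this particular instance, even C_1 = 1 leaves plenty of slack between the empirical tail and the bound.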

Remark 2. The challenge in generalising this result is that it requires concrete distributions to work with, and so our general version would need to be “simplified” into, or expressed in terms of, a more computationally tractable option. One possible area for exploration would be the compact set of parameter-bounded Gaussian distributions

\mathcal N_K := \{ \mathcal N(\mu, \sigma^2) : (\mu, \sigma) \in K \},

where K \subseteq \mathbb R \times (0, \infty) is compact, then applying meaningful tail-bounds of the Gaussian to derive the risk version of said concentration bounds.

These upper bounds are, with more technical bookkeeping, ultimately responsible for the asymptotically optimal regret bounds. And since continuity is a relatively benign condition, many risk functionals enjoy the tail upper bound of Theorem 1 and, potentially, the asymptotically optimal regret bound for \rho-Thompson Sampling.

Example 1. Given continuous risk functionals \rho_1, \rho_2 and constants c_1, c_2, the linear combination \rho = c_1 \rho_1 + c_2 \rho_2 is also a continuous risk functional. See Examples 2 and 3 for a myriad of continuous risk functionals that can be combined.

However, a proper proof of the Thompson sampling algorithm requires tail lower bounds. To achieve that goal, we introduce the notion of a dominant risk functional.

Definition 2. For any p \in [0, 1], define

V(p, 0) = [p, 1],\quad  V(p, 1) = [0,p].

We say that a risk functional \rho is dominant if for any p \in [0,1] and r \in \mathbb R, there exists q \in [0,1] and j \in \{0, 1\} such that

\mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(p), r, \mathrm{Ber}) = \mathrm{KL}(\mathrm{Ber}(p), \mathrm{Ber}(q))

and \tilde{\rho}(V(q,j)) \subseteq [ \tilde{\rho}(q), \infty). We remark that by Lemma 1, if \rho is continuous, then a q attaining the optimised KL-divergence is guaranteed to exist; dominance additionally demands the inclusion condition above.

In the original version, it was this concept that I dreamt of while struggling to solve the bandit problem. I prayed long and hard, and solved the problem in my dream thrice. I was shocked, and said to myself, “I must be dreaming. I will wake up and write down my solution.” And so at 4.30am sometime in February 2022, I did just that, and after a sanity check at 8.30am that morning, concluded that the solution was correct.

In any case, the dominant risk functional property guarantees for us a much-needed tail lower bound.

Theorem 2. Fix r \in \mathbb R and natural numbers \alpha, \beta, and set n := \alpha + \beta. If \rho is dominant, then there exists another universal constant C_2 such that for any Beta-distributed random variable X \sim \mathrm{Beta}(\alpha, \beta),

\begin{aligned} \mathbb P(\tilde{\rho}(X) \geq r) &\geq \frac{C_2}{n} \cdot \exp( -n \cdot \mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(\alpha/n), r) ). \end{aligned}

Proof. Since \rho is dominant, applied with p = \alpha/n, there exist q \in [0,1] and j \in \{0, 1\} such that

\mathcal{K}_{\inf}^{\rho}(\mathrm{Ber}(\alpha/n), r, \mathrm{Ber}) = \mathrm{KL}(\mathrm{Ber}(\alpha/n), \mathrm{Ber}(q))

and \tilde{\rho}(V(q,j)) \subseteq [ \tilde{\rho}(q), \infty) \subseteq [r, \infty), where the last inclusion holds by \tilde{\rho}(q) \geq r. Assume j = 0 for simplicity. Taking probabilities,

\begin{aligned} \mathbb P(\tilde{\rho}(X) \in [r, \infty)) &\geq \mathbb P( X \in V(q, 0) ) \\ &= \mathbb P(X \in [q, 1]) \\ &= \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\int_q^1 x^{\alpha-1} (1-x)^{\beta-1}\, \mathrm dx \\ &\geq \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot q^{\alpha-1} \int_q^1 (1-x)^{\beta-1}\, \mathrm dx \\ &=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot q^{\alpha-1} \cdot \frac{(1-q)^{\beta}}{\beta} \\ &=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \underbrace{ \frac{1}{q\beta} }_{\geq 1/\beta} \cdot \underbrace{ \frac{q^{\alpha}}{(\alpha/n)^{\alpha}} \cdot \frac{(1-q)^{\beta}}{(\beta/n)^{\beta}} }_{\exp(-n \cdot \mathrm{KL}( \alpha/n, q ) )} \cdot \frac{\alpha^{\alpha} \cdot \beta^{\beta}}{n^n} \\ &\geq \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta+1)} \cdot \frac{\alpha^{\alpha} \cdot \beta^{\beta}}{n^n} \cdot \exp(-n \cdot \mathrm{KL}( \alpha/n, q ) ) , \end{aligned}

where the first inequality holds since X \in [q, 1] implies \tilde{\rho}(X) \in \tilde{\rho}([q,1]) \subseteq [r, \infty).

The rest of the calculation follows from the proof of Lemma 2 in Baudry et al (2021) by using Stirling’s approximation, and yields the desired lower bound

\displaystyle \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta+1)} \cdot \frac{\alpha^{\alpha} \cdot \beta^{\beta}}{n^n} \geq \underbrace{ \sqrt{\frac{2\pi}{2.13}} }_{C_2} \cdot \frac 1n .
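The lower bound can also be eyeballed numerically. As before, I take the risk-neutral \tilde{\rho}(p) = p as an illustrative assumption (it is dominant with q \approx r and j = 0), and use the constant C_2 = \sqrt{2\pi/2.13} from the proof.

```python
import numpy as np

rng = np.random.default_rng(2)

def kl(p, q):
    """KL(Ber(p), Ber(q)) for p, q in (0, 1)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Monte Carlo check of the lower tail bound with rho_tilde(p) = p,
# so K_inf(Ber(alpha/n), r) = KL(alpha/n, r) since alpha/n = 0.2 < r = 0.4
a, b = 10, 40
n = a + b
r = 0.4
C2 = np.sqrt(2 * np.pi / 2.13)
X = rng.beta(a, b, size=500_000)
empirical = (X >= r).mean()                 # estimates P(rho_tilde(X) >= r)
bound = (C2 / n) * np.exp(-n * kl(a / n, r))
```

On this instance the empirical tail exceeds the theoretical lower bound by a comfortable factor.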

Remark 3. This tail bound eventually bounds all other terms by constants, once the exponentials cancel other exponentials out in subsequent calculations. The dream dealt with the slightly more general case p = (p_0,p_1,p_2), where p_0+p_1+p_2 = 1. I couldn’t dream of the higher-dimensional scenarios. Nor did I need to, to be very honest. See Remark 2 for the computational challenge of generalising Theorem 2 to more general forms for the theoretical use of \rho-Thompson Sampling.

Remark 4. The bounds in Theorems 1 and 2 work precisely for a special class of distributions, namely Bernoulli bandits (or slightly more generally, multinomial bandits). For other common classes of distributions, like Gaussians for example, we would need different tail lower bounds. Moreover, we would need to work with specific conjugate pairs of distributions, and approximate non-parametric bandits using parametric ones, which leads to messier approximation-controlling calculations when evaluating the regret bound.

But you might wonder—what functionals could pass the dominant risk functional criteria?

Lemma 2. Let \rho be continuous and c \in [0, 1] be a constant pivot point. Suppose \tilde{\rho}|_{[0,c]} is non-increasing and \tilde{\rho}|_{[c,1]} is non-decreasing. Then \rho is dominant.

Proof. Fix p \in [0, 1] and r \in \mathbb R. Use Lemma 1 to produce p_* \in [0, 1] such that

\mathcal K_{\inf}^{\rho}(\mathrm{Ber}(p), r) = \mathrm{KL}(\mathrm{Ber}(p), \mathrm{Ber}(p_*)).

Set q = p_*. First suppose c \in (0, 1). Either q \in [0, c] or q \in [c, 1]. In the former, \tilde{\rho}(s) \geq \tilde{\rho}(q) for s \in [0, q], which implies

\tilde{\rho}(V(q,1)) = \tilde{\rho}([0, q]) \subseteq [\tilde{\rho}(q), \infty),

as required. In the latter, \tilde{\rho}(s) \geq \tilde{\rho}(q) for s \in [q,1], which implies

\tilde{\rho}(V(q,0)) = \tilde{\rho}([q, 1]) \subseteq [\tilde{\rho}(q), \infty),

as required. If c = 0, then \tilde{\rho} is non-decreasing, and the second case holds. Finally, if c = 1, then \tilde{\rho} is non-increasing, and the previous argument holds.

Remark 5. For two risk functionals \rho_1, \rho_2 satisfying the hypotheses of Lemma 2 with the same pivot point c and non-negative constants \alpha_1, \alpha_2 such that \alpha_1 + \alpha_2 = 1, \rho := \alpha_1 \cdot \rho_1 + \alpha_2 \cdot \rho_2 is dominant.

Example 2. Given \nu_p = \mathrm{Ber}(p), denote \tilde{\rho}(p) \equiv \rho(\nu_p). The following risk functionals are continuous and satisfy the hypotheses of Lemma 2 (here, let X \sim \nu_p, \phi : [0, 1] \to [0,\infty) have total integral 1, and \Phi denote the c.d.f. of the standard Gaussian):

  • Expected value: \rho_{\mathbb E}(\nu_p) = \mathbb E[X] = p,
  • Conditional value-at-risk: \rho_\alpha(\nu_p) = \min\{p/(1-\alpha), 1\}
  • Proportional hazard: \rho_\alpha(\nu_p) = p^{\alpha}
  • Lookback: \rho_q(\nu_p) = p^q(1-q\log p)
  • Spectral risk: \rho_\phi(\nu_p) = \int_0^p \phi(t)\, \mathrm dt
  • Entropic risk: \rho_{\mathrm{ER},\theta}(\nu_p) = -\frac 1{\theta} \log((1-p) + pe^{-\theta})
  • Dual power distortion: \rho_{\mathrm{DP},\gamma}(\nu_p) = 1 - (1-p)^{\gamma}
  • Wang transform: \rho_{\mathrm{W},\sigma}(\nu_p) = \Phi(\Phi^{-1}(p) + \sigma)
  • Logarithmic distortion: \rho_{\mathrm{L}, \beta}(\nu_p) = \log(1+\beta p)/{ \log(1+\beta) }
  • \delta-Sharpe ratio: \displaystyle \rho_{\mathrm{SR}, \delta}(\nu_p) = \rho_{\mathrm{SR}}(\nu_{(1-\delta)p}) = \sqrt{\frac{ (1-\delta)p }{ 1 - (1-\delta)p}}
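Several of these \tilde{\rho} maps can be checked numerically to be non-decreasing on [0, 1], i.e. to satisfy the hypotheses of Lemma 2 with pivot point c = 0. The parameter values below are illustrative choices of mine.

```python
import numpy as np

# A few of the rho_tilde maps listed above, with illustrative parameters
rho_examples = {
    "expected value": lambda p: p,
    "CVaR, alpha=0.05": lambda p: np.minimum(p / (1 - 0.05), 1.0),
    "proportional hazard, alpha=0.5": lambda p: p ** 0.5,
    "entropic risk, theta=1": lambda p: -np.log((1 - p) + p * np.exp(-1.0)),
    "dual power, gamma=2": lambda p: 1 - (1 - p) ** 2.0,
}

# Each map should be non-decreasing on [0, 1] (Lemma 2 with pivot c = 0)
p = np.linspace(0.0, 1.0, 10_001)
monotone = {
    name: bool(np.all(np.diff(rho(p)) >= -1e-12))
    for name, rho in rho_examples.items()
}
```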

Remark 6. Since the Sharpe ratio is not well-defined at \mathrm{Ber}(1), we do not have the nice compactness property of Lemma 1 for the vanilla Sharpe ratio. The dilation by (1-\delta) allows the Sharpe ratio to be defined for [0, (1-\delta)^{-1}) \supseteq [0,1]. Given a K-armed Bernoulli bandit with maximum probability p_{\max} \in (0,1),

\rho_{\mathrm{SR}} (\nu_p) = \rho_{\mathrm{SR}, 1-p_{\max}} (\nu_{p/p_{\max}})

satisfies the requirements of Lemma 1, and we recover its useful tail bounds.

Lemma 3. If \tilde{\rho} is differentiable on [0,1] with non-decreasing derivative \tilde{\rho}', then \rho is dominant.

Proof. We claim that no matter what, \tilde{\rho} will satisfy the hypotheses of Lemma 2.

  • If \tilde{\rho}'(0) \geq 0, then \tilde{\rho}' \geq 0 throughout, so \tilde{\rho} is non-decreasing.
  • If \tilde{\rho}'(1) \leq 0, then \tilde{\rho}' \leq 0 throughout, so \tilde{\rho} is non-increasing.
  • Otherwise, \tilde{\rho}'(0) < 0 < \tilde{\rho}'(1), and the intermediate value property of derivatives yields c \in (0, 1) such that \tilde{\rho}'(c) = 0. Since \tilde{\rho}' is non-decreasing, \tilde{\rho}|_{[0, c]} is non-increasing and \tilde{\rho}|_{[c, 1]} is non-decreasing.

In all three cases, \rho satisfies the hypotheses of Lemma 2, and is thus dominant.

Remark 7. More generally, if \tilde{\rho} is convex (which generalises Lemma 3), i.e. for any t \in [0, 1] and x, y \in [0, 1],

\tilde{\rho}(t \cdot x + (1-t) \cdot y) \leq t \cdot \tilde{\rho}(x) + (1-t) \cdot \tilde{\rho}(y),

the risk functional \rho will be dominant. Furthermore, for any two risk functionals \rho_1, \rho_2 satisfying the hypotheses of Lemma 3 and non-negative constants \alpha_1, \alpha_2 such that \alpha_1 + \alpha_2 = 1, \rho := \alpha_1 \cdot \rho_1 + \alpha_2 \cdot \rho_2 is dominant.

Example 3. Given \nu_p = \mathrm{Ber}(p), denote \tilde{\rho}(p) \equiv \rho(\nu_p). The following risk functionals satisfy the hypotheses of Lemma 3 (here, let X \sim \nu_p):

  • Second moment: \rho_{\text{SM}}(\nu_p) = \mathbb E[X^2] = p
  • Negative variance: \rho_{-\mathbb V}(\nu_p) = -p(1-p)
  • Mean-variance: \rho_{\mathrm{MV}_\gamma} = \gamma \rho_{\mathbb E} + \rho_{-\mathbb V}
  • Target semi-variance: \rho_{\mathrm{TSV}_r}(\nu) = -\mathbb E[\min\{ X-r ,  0 \}^2 ]
  • Exponential tilt: \rho_\lambda(\nu_p) = \exp(p\lambda)
  • Quadratic utility: \rho_a(\nu_p) = ap^2 + bp + c (with a \geq 0)

Furthermore, for two risk functionals \rho_1, \rho_2 satisfying the hypotheses of Lemma 3 and non-negative constants c_1, c_2, \rho := c_1 \rho_1 + c_2 \rho_2 is dominant.
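As with Example 2, these \tilde{\rho} maps can be checked numerically, this time for convexity (equivalently, a non-decreasing derivative); the parameter values are again illustrative choices of mine.

```python
import numpy as np

# rho_tilde maps from Example 3 with illustrative parameters; each should
# have a non-decreasing derivative, i.e. be convex on [0, 1]
convex_examples = {
    "second moment": lambda p: p,
    "negative variance": lambda p: -p * (1 - p),
    "exponential tilt, lambda=2": lambda p: np.exp(2.0 * p),
    "quadratic utility, a=1, b=0, c=0": lambda p: p ** 2,
}

p = np.linspace(0.0, 1.0, 10_001)
convex = {
    # discrete second differences of a convex function are non-negative
    name: bool(np.all(np.diff(rho(p), n=2) >= -1e-10))
    for name, rho in convex_examples.items()
}
```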

Therefore, most of the risk functionals that people care about, as listed in Examples 1, 2, and 3, are in fact continuous and dominant, and therefore, by passing the relevant arguments through much book-keeping, enjoy the asymptotically optimal regret bound for \rho-TS on the Bernoulli bandit environment.

Theorem 3. The regret bound of \rho-TS over a Bernoulli bandit environment is asymptotically optimal, and given by

\displaystyle \lim_{n \to \infty} \frac{ \mathcal R_n(\mathrm{Ber}^K, \rho\text{-}\mathrm{TS}) }{\log(n)} = \sum_{i : \Delta_i^{\rho} > 0} \frac{\Delta_i^{\rho}}{\mathcal K_{\inf}^{\rho} (\nu_i, \rho(\nu_1))}.

Proof. Follow the proof of Theorem 1 in Chang and Tan (2022), and apply Theorems 1 and 2 in the analysis. Left as an exercise (effectively) in algebra and mildly clever calculus.
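To make the limit concrete, the asymptotic constant on the right-hand side can be computed for a toy instance. Here I take the risk-neutral \tilde{\rho}(p) = p as an illustrative assumption, with risk gap \Delta_i^{\rho} = \rho(\nu_1) - \rho(\nu_i), so that \mathcal K_{\inf}^{\rho}(\nu_i, \rho(\nu_1)) reduces to a plain KL divergence.

```python
import numpy as np

def kl_bern(p, q):
    """KL(Ber(p), Ber(q)) for p, q in (0, 1)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Toy 3-armed bandit; with rho_tilde(p) = p, the risk gap is
# Delta_i = p_1 - p_i and K_inf(nu_i, rho(nu_1)) = KL(p_i, p_1)
ps = [0.5, 0.4, 0.3]        # arm 1 (p = 0.5) is optimal
p1 = max(ps)
constant = sum((p1 - p) / kl_bern(p, p1) for p in ps if p < p1)
# Theorem 3 then predicts R_n / log(n) -> constant as n -> infinity
```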

Oh, by the way, with the help of ChatGPT, here’s a Jupyter writeup of the implemented algorithm and some pretty pictures!

The red curve indicates the theoretical asymptotic lower bound, and each diagram reflects the algorithm running for a fixed 10-armed Bernoulli bandit, with different risk functionals, even combinations of them:

  • \rho_1 = 0.5 \cdot \mathrm{CVaR}_{0.05} + 0.5 \cdot \mathrm{ER}_{0.5}
  • \rho_2(\nu_p) = \rho_{\mathrm{MV}_{2}}(\nu_p) + \exp(2p)

And for them all, the asymptotic lower bound lies happily in their 1-sigma bands. (Yes, each algorithm was averaged across 100 runs!)

And with that, we are truly done. Happy lunar new year!

—Joel Kindiak, 16 Feb 26, 1131H
