# 2. Explore-Then-Commit algorithm

$\newcommand{\argmin}{\mathop{\mathrm{argmin}}\limits}$ $\newcommand{\argmax}{\mathop{\mathrm{argmax}}\limits}$

Here, we continue to describe the multi-armed bandit problem in detail. The notion of regret will be introduced, and then our first bandit algorithm, explore-then-commit (ETC), will be described.

## Multi-armed bandit (continued)

Suppose we have a set of possible actions (arms) $\mathcal{A}$ and a stochastic bandit $\nu$, a collection of reward distributions indexed by the arms, where

$\nu := (P_{\theta_i})_{i \in \mathcal{A}}.$

The algorithm interacts with the bandit over time $t=1,\cdots,T.$ At each time $t$ the algorithm chooses an arm $a(t) \in \mathcal{A}.$ Then the reward $r_{a(t)}(t) \sim P_{\theta_{a(t)}}$ is generated and revealed to the algorithm. Such interactions (action-reward pairs) between the algorithm and the environment are logged into the history $\mathcal{H}$, where

$\mathcal{H}_{t-1} := \{\left(a(\tau), r_{a(\tau)}(\tau)\right): \tau=1,\cdots,t-1\}.$

The algorithm can be characterized by a sequence of probability distributions (a policy) $(\pi_t)_{t=1,\cdots,T}.$ Arms are chosen according to the conditional distribution given the history, $\pi_t = \pi(\cdot|\mathcal{H}_{t-1}).$ Our goal is to find $(\pi_t)$ that chooses the best action at each $t.$
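
The interaction protocol above can be sketched in a few lines of code. The Gaussian reward distributions, the specific means, and the uniformly random policy below are placeholders, not part of the formal setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# A stochastic bandit nu: one reward distribution per arm (Gaussian as a placeholder).
means = np.array([0.5, 0.7, 0.3])     # unknown to the algorithm
n_arms, T = len(means), 1000

history = []                          # H_{t-1}: past (action, reward) pairs
for t in range(1, T + 1):
    # pi_t(. | H_{t-1}): here a placeholder policy that ignores the history
    # and picks an arm uniformly at random.
    arm = int(rng.integers(n_arms))
    reward = rng.normal(means[arm], 1.0)   # r_{a(t)}(t) ~ P_{theta_{a(t)}}
    history.append((arm, reward))
```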

Usually such a problem is posed not as maximizing the cumulative reward but as minimizing the regret function $R$. As the name implies, $R$ represents the total loss incurred by our choices relative to always playing the best arm. To formally define the regret, let $\mu_i$ be the (unknown) expected reward from the $i$th arm, and $\mu_{i^*}$ be the expected reward from the (also unknown) best arm $i^*$.

$\mu_i := E\, r_i(t), \qquad \mu_{i^*} \ge \mu_j,~ \forall j \ne i^*. \tag{1}$

Let $\Delta_j$ be the difference between the mean reward from the best choice and the choice $j$.

$\Delta_j := \mu_{i^*} - \mu_j.$

The regret $R$ is defined as follows:

\begin{aligned} R(T) &:= E \sum_{t=1}^T \left( \mu_{i^*} - r_{a(t)}(t) \right) \\ &= E \sum_{t=1}^T \Delta_{a(t)}, \end{aligned} where the expectation in the second line is over the (possibly random) choices $a(t).$
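
As a toy illustration of this formula (the means and pull counts below are made up): for a fixed allocation of pulls, the realized regret is just a gap-weighted count.

```python
import numpy as np

means = np.array([0.5, 0.7, 0.3])   # hypothetical mean rewards; arm 1 is optimal
gaps = means.max() - means          # Delta_j = mu_{i*} - mu_j -> [0.2, 0.0, 0.4]

pulls = np.array([10, 80, 10])      # times each arm was played over T = 100 rounds
regret = float(pulls @ gaps)        # sum_t Delta_{a(t)} = 10*0.2 + 80*0.0 + 10*0.4
print(regret)                       # 6.0
```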

What makes the bandit problem so challenging is that only a small portion of rewards is revealed: we do not know the underlying $P_{\theta_i}$’s unless we choose the corresponding arms often enough to collect samples. This tension is called the exploration-exploitation trade-off. Simply put, we need to find and use the best arm in $\nu$ within a limited amount of time $T.$

• Exploration: Try different arms to get information about $P_{\theta_i}.$
• Exploitation: Pick the best arm according to estimates of rewards based on current information about $P_{\theta_i}.$

One thing to note is that for exploitation we need not only the estimate itself, but also its variability.

## Theoretical tools

The following are some of the tools that will be used in regret analysis.

### Markov's and Chebyshev's inequalities

For a random variable $X \ge 0$ and $\epsilon > 0,$ $$P(X \ge \epsilon) \le \frac{EX}{\epsilon}.$$
For a random variable $X$ and $\epsilon > 0,$ $$P(|X - EX| \ge \epsilon) \le \frac{\text{Var}(X)}{\epsilon^2}.$$
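
A quick numerical sanity check of both inequalities, using an exponential distribution purely as an example:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # X >= 0 with EX = 1 and Var(X) = 1
eps = 3.0

print(np.mean(x >= eps), x.mean() / eps)                       # empirical tail vs Markov bound
print(np.mean(np.abs(x - x.mean()) >= eps), x.var() / eps**2)  # centered tail vs Chebyshev bound
```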

### Subgaussianity

A random variable $X$ is $\sigma$-subgaussian for some $\sigma>0$ if $$E e^{\lambda X} \le e^{\lambda^2\sigma^2/2} \quad \text{for all } \lambda \in \mathbb{R}.$$ That is, the MGF of $X$ is bounded above by the MGF of a $\mathcal{N}(0,\sigma^2)$ random variable.

A $\sigma$-subgaussian $X$ can be viewed as a random variable whose tail probabilities are no heavier than those of $\mathcal{N}(0, \sigma^2).$ To be specific, a bound on the right tail probability can be obtained by the Chernoff method. Note that the resulting bound is not tight even for normal random variables.

For a $\sigma$-subgaussian random variable $X,$ $$P(X \ge \epsilon) \le e^{-\frac{\epsilon^2}{2\sigma^2}}.$$

Let $\lambda = \epsilon / \sigma^2,$ then \begin{aligned} P(X \ge \epsilon) &= P(e^{\lambda X} \ge e^{\lambda \epsilon}) \\ &\le Ee^{\lambda X}/e^{\lambda \epsilon} \\ &\le e^{\lambda^2\sigma^2/2}/e^{\lambda\epsilon} \\ &= e^{-\frac{\epsilon^2}{2\sigma^2}}. \end{aligned} The first inequality follows from Markov's inequality and the second follows from subgaussianity.

Similarly, the same inequality holds for the lower tail; thus we get the two-sided bound

$P(|X| \ge \epsilon) \le 2 e^{-\frac{\epsilon^2}{2\sigma^2}},$

which is (as we set $\delta = 2e^{-\frac{\epsilon^2}{2\sigma^2}}$) equivalent to

$P\left(|X| \ge \sqrt{2\sigma^2 \log\left(\frac{2}{\delta}\right)}\right) \le \delta.$
Let $X$ be $\sigma$-subgaussian, and let $X_1, X_2$ be independent $\sigma_1$- and $\sigma_2$-subgaussian random variables, respectively. Then
(i) $EX=0$ and $\text{Var}(X) \le \sigma^2.$
(ii) $cX$ is $|c|\sigma$-subgaussian for all $c \in \mathbb{R}.$
(iii) $X_1+X_2$ is $\sqrt{\sigma_1^2 + \sigma_2^2}$-subgaussian.

With the help of the lemma, applying the Chernoff bound to the sample mean $\hat \mu=\frac{1}{T}\sum_{t=1}^T X_t$ of $T$ independent samples with mean $\mu,$ where each $X_t - \mu$ is $\sigma$-subgaussian, yields

$P(|\hat \mu - \mu| \ge \epsilon) \le 2 e^{-T\frac{\epsilon^2}{2\sigma^2}}.$
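
This bound can be checked by simulation; below, Rademacher ($\pm 1$) samples are used since they are $1$-subgaussian with mean zero (the sample size, threshold, and number of trials are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
T, sigma, eps, n_trials = 100, 1.0, 0.3, 100_000

# Rademacher samples: 1-subgaussian with mu = 0.
samples = rng.choice([-1.0, 1.0], size=(n_trials, T))
mu_hat = samples.mean(axis=1)

print(np.mean(np.abs(mu_hat) >= eps))              # empirical P(|mu_hat - mu| >= eps)
print(2 * np.exp(-T * eps**2 / (2 * sigma**2)))    # bound 2 exp(-T eps^2 / (2 sigma^2)) ~ 0.022
```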

### Bernstein inequality

If $P(|X| \le c) = 1$ and $EX=0,$ $\text{Var}(X)=\sigma^2,$ then for any $t>0,$ $$Ee^{tX} \le \exp\left( t^2\sigma^2\frac{e^{tc} - 1 - tc}{(tc)^2} \right).$$

\begin{aligned} Ee^{tX} &= E\left(1 + tX + \sum_{r=2}^\infty \frac{t^rX^r}{r!} \right) \\ &= 1 + t^2\sigma^2F \\ &\le e^{t^2\sigma^2F}, \end{aligned} where $F=\sum_{r=2}^\infty \frac{t^{r-2}}{\sigma^2} \frac{EX^r}{r!}.$ Since $EX^r \le E|X|^r \le c^{r-2}\sigma^2$ for $r\ge2,$ we have $F \le \sum_{r=2}^\infty \frac{(tc)^{r-2}}{r!} = \frac{e^{tc}-1-tc}{(tc)^2},$ which gives the desired result.
As a corollary, let $X_t$'s be independent random variables with $P(|X_t| \le c) = 1$ and $EX_t=0,$ and let $\sigma^2 = \frac{1}{T}\sum_{t=1}^T \text{Var}(X_t)$ and $\hat \mu = \frac{1}{T}\sum_{t=1}^T X_t.$ Then for any $\epsilon>0,$ $$P(|\hat \mu| \ge \epsilon) \le 2 \exp\left( -\frac{T\epsilon^2}{2\sigma^2 + 2c\epsilon/3} \right).$$
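
The point of Bernstein's inequality is that it exploits a small variance, whereas the subgaussian (Hoeffding-type) bound only uses the range: a variable bounded in $[-c, c]$ with mean zero is $c$-subgaussian by Hoeffding's lemma. A quick numerical comparison under assumed values $c = 1,$ per-sample variance $0.01,$ $T = 1000,$ $\epsilon = 0.05$:

```python
import numpy as np

T, c, var, eps = 1000, 1.0, 0.01, 0.05

hoeffding = 2 * np.exp(-T * eps**2 / (2 * c**2))                  # treats X_t as c-subgaussian
bernstein = 2 * np.exp(-T * eps**2 / (2 * var + 2 * c * eps / 3))

print(hoeffding, bernstein)   # ~0.57 vs ~9e-21: Bernstein is far sharper here
```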

## Explore-then-Commit (ETC)

### The algorithm

The first and probably the simplest bandit algorithm is explore-then-commit (ETC). As the name suggests, ETC divides the horizon $T$ into an exploration phase and an exploitation phase.

1. Exploration phase. During rounds $1,\cdots,mN,$ the algorithm plays each of the $N$ arms $m$ times.

2. Exploitation phase. The algorithm selects the arm with the highest empirical mean reward

$a = \argmax_{i=1,\cdots,N} \hat \mu_i (mN)$

where

$\hat \mu_i(t) := \frac{\sum_{\tau=1}^t r_i(\tau) \mathbf{1}(i=a(\tau))}{\sum_{\tau=1}^t \mathbf{1}(i=a(\tau))}.$

During the remaining rounds $mN+1,\cdots,T,$ the algorithm keeps choosing that $a$; that is, $a(t) = a$ for all $t=mN+1,\cdots,T.$ A minimal implementation sketch is given below.
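
This sketch assumes Gaussian reward distributions as a placeholder environment; the means, $m$, and $T$ in the usage line are illustrative.

```python
import numpy as np

def etc(means, m, T, rng):
    """Explore-then-commit on a Gaussian bandit with the given (unknown) means."""
    n_arms = len(means)
    pulls = np.zeros(n_arms)            # times each arm has been played
    sums = np.zeros(n_arms)             # running sum of rewards per arm
    gaps = np.max(means) - np.asarray(means)
    regret = 0.0

    committed = None
    for t in range(T):
        if t < m * n_arms:
            arm = t % n_arms            # exploration: round-robin, each arm m times
        else:
            if committed is None:
                committed = int(np.argmax(sums / pulls))  # highest empirical mean
            arm = committed             # exploitation: keep playing the chosen arm
        reward = rng.normal(means[arm], 1.0)
        pulls[arm] += 1
        sums[arm] += reward
        regret += gaps[arm]             # accumulate Delta_{a(t)}

    return regret

rng = np.random.default_rng(0)
print(etc(means=[0.5, 0.7, 0.3], m=20, T=1000, rng=rng))
```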

Although the algorithm is simple, it is (quite predictably) not a great performer. The optimal choice of the exploration parameter $m$ depends on knowledge of the gaps $\Delta_i,$ which is not available. In addition, its regret is suboptimal.

### Regret analysis

[To be appended]

## References

• Lattimore, Szepesvari. 2020. Bandit Algorithms. Cambridge University Press.
• Seminar in Recent Development of Applied Statistics (Spring, 2021) @ Seoul National University, Republic of Korea (instructor: Prof. Myunghee Cho Paik).