1 O notation
Throughout this chapter and the rest of the book, we will describe the asymptotic behavior of a function using $O(\cdot)$ notation.

For two functions $f(t)$ and $g(t)$, we say that $f(t) = O(g(t))$ if $f$ is asymptotically upper bounded by $g$. Formally, this means that there exists some constant $C > 0$ such that $f(t) \le C \cdot g(t)$ for all $t$ past some point $t_0$. We say $f(t) = \Omega(g(t))$ if $g(t) = O(f(t))$, and $f(t) = \Theta(g(t))$ if both hold. Finally, we write $f(t) = \tilde O(g(t))$ if $f(t) = O(g(t) \cdot \log^k(t))$ for some $k$, i.e. the bound holds up to logarithmic factors.

Occasionally, we will also use $O(f(t))$ (or one of the other symbols) as shorthand to manipulate function classes. For example, we might write $O(f(t)) + O(g(t)) = O(f(t) + g(t))$ to mean that the sum of two functions in $O(f(t))$ and $O(g(t))$ is in $O(f(t) + g(t))$.
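For example (an illustrative instance added here, not from the original text):

$$3t^2 + 10t + 5 \;\le\; 4t^2 \quad \text{for all } t \ge 11, \qquad \text{hence} \qquad 3t^2 + 10t + 5 = O(t^2) \ \text{ with } C = 4,\ t_0 = 11.$$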
2 Python
3.1 Introduction
The multi-armed bandits (MAB) setting is a simple setting for studying the basic challenges of sequential decision-making. In this setting, an agent repeatedly chooses from a fixed set of actions, called arms, each of which has an associated reward distribution. The agent's goal is to maximize the total reward it receives over some time period.
In particular, we’ll spend a lot of time discussing the Exploration-Exploitation Tradeoff: should the agent choose new actions to learn more about the environment, or should it choose actions that it already knows to be good?
In this chapter, we will introduce the multi-armed bandits setting, and discuss some of the challenges that arise when trying to solve problems in this setting. We will also introduce some of the key concepts that we will use throughout the book, such as regret and exploration-exploitation tradeoffs.
from jaxtyping import Float, Array
import numpy as np
import latexify
import matplotlib.pyplot as plt  # used by plot_strategy below
import solutions  # assumed: the book's companion module providing the reference implementations used below
from typing import Callable, Union
latex = latexify.algorithmic(
    # render selected functions below as LaTeX pseudocode
identifiers={"arm": "a_t", "reward": "r", "means": "mu"},
use_math_symbols=True,
escape_underscores=False,
)
Let $K$ denote the number of arms. We'll label them $0, \dots, K-1$ and use superscripts to indicate the arm index; since we seldom need to raise a number to a power, this won't cause much confusion. In this chapter, we'll consider the Bernoulli bandit setting from the examples above, where arm $k$ either returns reward 1 with probability $\mu^k$ or 0 otherwise. The agent gets to pull an arm $T$ times in total. We can formalize the Bernoulli bandit in the following Python code:
class MAB:
"""
The Bernoulli multi-armed bandit environment.
@@ -58,8 +58,8 @@
def pull(self, k: int) -> int:
"""Pull the `k`-th arm and sample from its (Bernoulli) reward distribution."""
reward = np.random.rand() < self.means[k].item()
        return +reward
mab = MAB(means=np.array([0.1, 0.8, 0.4]), T=100)
In pseudocode, the agent's interaction with the MAB environment can be described by the following process:
@latex
def mab_loop(mab: MAB, agent: "Agent") -> int:
for t in range(mab.T):
arm = agent.choose_arm() # in 0, ..., K-1
        reward = mab.pull(arm)  # sample a reward from the chosen arm's distribution
agent.update_history(arm, reward)
mab_loop

The Agent class stores the pull history and uses it to decide which arm to pull next. Since we are working with Bernoulli bandits, we can summarize the pull history concisely in a $K \times 2$ array of counts.
class Agent:
def __init__(self, K: int, T: int):
"""The MAB agent that decides how to choose an arm given the past history."""
self.K = K
        self.T = T
        self.rewards = []  # list of observed rewards, one per timestep
        self.choices = []  # list of chosen arms, one per timestep
        self.history = np.zeros((K, 2), dtype=int)  # history[k, r] counts pulls of arm k that returned reward r

    def choose_arm(self) -> int:
        """Choose an arm to pull. Implemented by each specific algorithm below."""
        ...
def update_history(self, arm: int, reward: int):
self.rewards.append(reward)
self.choices.append(arm)
        self.history[arm, reward] += 1
What's the optimal strategy for the agent, i.e. the one that achieves the highest expected reward? Convince yourself that the agent should try to always pull the arm with the highest true mean, $k^\star := \arg\max_k \mu^k$.

The goal, then, can be rephrased as minimizing the regret, defined below:

$$\text{Regret}_T := \sum_{t=0}^{T-1} \left( \mu^{k^\star} - \mu^{a_t} \right),$$

where $a_t$ denotes the arm pulled at time $t$.
def regret_per_step(mab: MAB, agent: Agent):
"""Get the difference from the average reward of the optimal arm. The sum of these is the regret."""
    return [mab.means[mab.best_arm] - mab.means[arm] for arm in agent.choices]
Note that this depends on the true means of the pulled arms, not the actual observed rewards. We typically think of this as a random variable where the randomness comes from the agent's strategy (i.e. the sequence of actions it chooses) as well as the randomness of the observed rewards. We would like to bound the regret of our algorithms in two different senses:

- Upper bound the expected regret, i.e. show $\mathbb{E}[\text{Regret}_T] \le M_T$ for some bound $M_T$.
- Find a high-probability upper bound on the regret, i.e. show $\mathbb{P}(\text{Regret}_T \le M_{T, \delta}) \ge 1 - \delta$.

Note that these two different approaches say very different things about the regret. The first approach says that the average regret is at most $M_T$. However, the agent might still achieve higher regret on many runs. The second approach says that, with high probability, the agent will achieve regret at most $M_{T, \delta}$. However, it doesn't say anything about the regret in the remaining δ fraction of runs, which might be arbitrarily high.

We'd like to achieve sublinear regret in expectation, i.e. $\mathbb{E}[\text{Regret}_T] = o(T)$. That is, as we learn more about the environment, we'd like to be able to exploit that knowledge to take the optimal arm as often as possible.
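To make the in-expectation notion concrete, here is a minimal sketch (not from the original text) of how the expected regret of any agent could be estimated empirically, by averaging the summed per-step regret over repeated runs; `make_agent` is a hypothetical factory that returns a fresh agent for each run.

def estimate_expected_regret(make_agent: Callable[[], Agent], mab: MAB, n_runs: int = 100) -> float:
    """Average the total regret over `n_runs` independent interactions with the environment."""
    total = 0.0
    for _ in range(n_runs):
        agent = make_agent()  # fresh agent (and fresh history) for each run
        mab_loop(mab, agent)
        total += sum(regret_per_step(mab, agent))
    return total / n_runs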
The rest of the chapter comprises a series of increasingly sophisticated MAB algorithms.
def plot_strategy(mab: MAB, agent: Agent):
plt.figure(figsize=(10, 6))
# plot reward and cumulative regret
    plt.plot(np.arange(mab.T), np.cumsum(agent.rewards), label="reward")
    cum_regret = np.cumsum(regret_per_step(mab, agent))
    plt.plot(np.arange(mab.T), cum_regret, label="cumulative regret")

    # plot the arm chosen at each timestep
    plt.scatter(np.arange(mab.T), agent.choices, color="red", s=4, label="arm")
plt.xlabel("timestep")
plt.legend()
plt.title(f"{agent.__class__.__name__} reward and regret")
    plt.show()
3.2 Pure exploration (random guessing)

A trivial strategy is to always choose arms at random (i.e. "pure exploration").
class PureExploration(Agent):
def choose_arm(self):
"""Choose an arm uniformly at random."""
        return solutions.pure_exploration_choose_arm(self)
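The one-line implementation is left to the book's solutions module; a minimal sketch of what such a rule could look like (our own illustration, not the book's code):

def pure_exploration_choose_arm_sketch(agent: Agent) -> int:
    # Ignore the history entirely and pick an arm uniformly at random.
    return int(np.random.randint(agent.K))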
Note that

$$\mathbb{E}[\mu^{a_t}] = \bar\mu := \frac{1}{K} \sum_{k=0}^{K-1} \mu^k,$$

so the expected regret is simply

$$\mathbb{E}[\text{Regret}_T] = \sum_{t=0}^{T-1} \left( \mu^{k^\star} - \mathbb{E}[\mu^{a_t}] \right) = T \left( \mu^{k^\star} - \bar\mu \right).$$
This scales as $\Theta(T)$, i.e. linear in the number of timesteps $T$. There's no learning here: the agent doesn't use any information about the environment to improve its strategy. You can see that the distribution over its arm choices always appears "(uniformly) random".
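As a quick numeric sanity check (a sketch, not from the original text), for the example `mab` defined above the per-step gap and the implied expected regret over the full horizon are:

gap = mab.means[mab.best_arm] - mab.means.mean()  # 0.8 - mean(0.1, 0.8, 0.4) ≈ 0.367
print(gap, mab.T * gap)  # expected regret grows linearly: ≈ 36.7 over T = 100 steps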
agent = PureExploration(mab.K, mab.T)
mab_loop(mab, agent)
plot_strategy(mab, agent)

3.3 Pure greedy
How might we improve on pure exploration? Instead, we could try each arm once, and then commit to the one with the highest observed reward. We'll call this the pure greedy strategy.
class PureGreedy(Agent):
def choose_arm(self):
"""Choose the arm with the highest observed reward on its first pull."""
        return solutions.pure_greedy_choose_arm(self)
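Again the implementation lives in the solutions module; a minimal sketch consistent with the description above (an assumption, not the book's code) might be:

def pure_greedy_choose_arm_sketch(agent: Agent) -> int:
    t = len(agent.rewards)  # number of pulls made so far
    if t < agent.K:
        return t  # first pull each arm once, in order
    # afterwards, commit to the arm whose single observed reward was highest
    return int(np.argmax(agent.rewards[: agent.K]))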
Note we've used superscripts $r^k$ during the exploration phase to indicate that we observe exactly one reward for each arm $k$. Then we use subscripts $r_t$ during the exploitation phase to indicate that we observe a sequence of rewards from the chosen greedy arm $\hat k$.

How does the expected regret of this strategy compare to that of pure exploration? For intuition, suppose there are just $K = 2$ arms with Bernoulli reward distributions with means $\mu^0 > \mu^1$.

Let's let $r^0$ be the random reward from the first arm and $r^1$ be the random reward from the second. If $r^0 > r^1$, then we achieve zero regret. Otherwise, we achieve regret $T(\mu^0 - \mu^1)$. Thus, the expected regret is simply:

$$\mathbb{E}[\text{Regret}_T] = \mathbb{P}(r^0 \le r^1) \cdot T (\mu^0 - \mu^1),$$

which is still $\Theta(T)$: the agent commits to the worse arm with constant probability (for instance $\mathbb{P}(r^0 = 0, r^1 = 1) = (1 - \mu^0)\mu^1 > 0$), and this probability does not shrink as $T$ grows.
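Plugging in concrete numbers (a quick check, not from the original text), with $\mu^0 = 0.8$, $\mu^1 = 0.4$, and $T = 100$:

mu0, mu1 = 0.8, 0.4
p_wrong = (1 - mu0) * mu1  # P(r^0 = 0, r^1 = 1) = 0.08: commit to the worse arm
print(p_wrong * 100 * (mu0 - mu1))  # at least 3.2 expected regret, growing linearly with T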
agent = PureGreedy(mab.K, mab.T)
mab_loop(mab, agent)
plot_strategy(mab, agent)
The cumulative regret is a straight line because the regret only depends on the arms chosen and not the actual reward observed. In fact, if the greedy algorithm happens to get lucky on the first set of pulls, it may act entirely optimally for that episode! But its average regret is what measures its effectiveness.
3.4 Explore-then-commit

We can improve the pure greedy algorithm as follows: let's reduce the variance of the reward estimates by pulling each arm $N_{\text{explore}} > 1$ times before committing. This is called the explore-then-commit (ETC) strategy.
class ExploreThenCommit(Agent):
def __init__(self, K: int, T: int, N_explore: int):
super().__init__(K, T)
self.N_explore = N_explore
def choose_arm(self):
        return solutions.etc_choose_arm(self)
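As before, the actual rule is in the solutions module; a minimal sketch of an explore-then-commit rule (an assumption, not the book's code) could be:

def etc_choose_arm_sketch(agent: ExploreThenCommit) -> int:
    t = len(agent.rewards)
    if t < agent.K * agent.N_explore:
        return t % agent.K  # exploration phase: cycle through the arms
    # exploitation phase: commit to the arm with the highest sample mean
    counts = agent.history.sum(axis=1)
    sample_means = agent.history[:, 1] / counts
    return int(np.argmax(sample_means))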
agent = ExploreThenCommit(mab.K, mab.T, mab.T // 15)
mab_loop(mab, agent)
plot_strategy(mab, agent)
Notice that now, the graphs are much more consistent, and the algorithm finds the true optimal arm and sticks with it much more frequently. We would expect ETC to then have a better (i.e. lower) average regret. Can we prove this?
3.4.1 ETC regret analysis
Let’s analyze the expected regret of the explore-then-commit strategy by splitting it up into the exploration and exploitation phases.
3.4.1.1 Exploration phase.

This phase takes $N_{\text{explore}} K$ timesteps. Since the means lie in $[0, 1]$, we incur at most 1 regret at each step, so the total regret incurred during this phase is at most $N_{\text{explore}} K$.
3.4.1.2 Exploitation phase.

This will take a bit more effort. We'll prove that for any total time $T$, we can choose the length of the exploration phase so that, with arbitrarily high probability, the regret is sublinear.

Let $\hat k$ denote the arm we commit to, i.e. the arm with the highest sample mean after the exploration phase. The regret incurred during the exploitation phase is at most $T (\mu^{k^\star} - \mu^{\hat k})$. So we'd like to bound the gap $\mu^{k^\star} - \mu^{\hat k}$ with high probability.

Let's define $\hat\mu^k$ to be the sample mean of the $N_{\text{explore}}$ rewards observed from arm $k$ during the exploration phase. To control how far these sample means can stray from the true means, we use Hoeffding's inequality: if $X_1, \dots, X_n$ are i.i.d. random variables bounded in $[0, 1]$ with sample mean $\bar X$, then

$$\mathbb{P}\left( |\bar X - \mathbb{E}[X_1]| > \sqrt{\frac{\ln(2/\delta)}{2n}} \right) \le \delta.$$

The proof of this inequality is beyond the scope of this book. See Vershynin (2018) Chapter 2.2.

We can apply this directly to the rewards for a given arm $k$: with probability at least $1 - \delta$,

$$|\hat\mu^k - \mu^k| \le \sqrt{\frac{\ln(2/\delta)}{2 N_{\text{explore}}}}.$$

Then to apply this bound to all $K$ arms simultaneously, we take a union bound: with probability at least $1 - \delta$, for every arm $k$,

$$|\hat\mu^k - \mu^k| \le \sqrt{\frac{\ln(2K/\delta)}{2 N_{\text{explore}}}} =: \Delta,$$

where we've set the per-arm failure probability to $\delta / K$ so that the failure probabilities sum to at most $\delta$.

Note that it suffices for this event to hold in order to bound the gap of the arm we commit to: since $\hat k$ has the highest sample mean, $\hat\mu^{\hat k} \ge \hat\mu^{k^\star}$, and therefore

$$\mu^{k^\star} - \mu^{\hat k} \le \left( \hat\mu^{k^\star} + \Delta \right) - \left( \hat\mu^{\hat k} - \Delta \right) \le 2\Delta.$$

Plugging this into the expression for the regret, we have (still with probability $1 - \delta$)

$$\text{Regret}_T \le N_{\text{explore}} K + T \cdot 2 \sqrt{\frac{\ln(2K/\delta)}{2 N_{\text{explore}}}}.$$

Choosing $N_{\text{explore}}$ to balance the two terms, i.e. $N_{\text{explore}} = \tilde\Theta(T^{2/3} K^{-2/3})$, gives a total regret of $\tilde O(T^{2/3} K^{1/3})$, which is indeed sublinear in $T$.
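To see the exploration-length tradeoff empirically, here is a small sketch (not from the original text) that averages the total regret over a few runs for several choices of `N_explore`:

for N_explore in [1, 3, 10, 25]:
    n_runs, total = 20, 0.0
    for _ in range(n_runs):
        agent = ExploreThenCommit(mab.K, mab.T, N_explore)
        mab_loop(mab, agent)
        total += sum(regret_per_step(mab, agent))
    print(N_explore, total / n_runs)  # too little exploration commits badly; too much wastes pulls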
The ETC algorithm is rather "abrupt" in that it switches from exploration to exploitation after a fixed number of timesteps. In practice, it's often better to use a more gradual transition, which brings us to the epsilon-greedy algorithm.
3.5 Epsilon-greedy
Instead of doing all of the exploration and then all of the exploitation separately – which additionally requires knowing the time horizon beforehand – we can instead interleave exploration and exploitation by, at each timestep, choosing a random action with some probability. We call this the epsilon-greedy algorithm.
class EpsilonGreedy(Agent):
def __init__(
self,
K: int,
        T: int,
        ε_array: Float[Array, " T"],
    ):
        super().__init__(K, T)
self.ε_array = ε_array
def choose_arm(self):
        return solutions.epsilon_greedy_choose_arm(self)
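A minimal sketch of the rule (an assumption, not the book's solutions code):

def epsilon_greedy_choose_arm_sketch(agent: EpsilonGreedy) -> int:
    t = len(agent.rewards)
    if np.random.rand() < agent.ε_array[t]:
        return int(np.random.randint(agent.K))  # explore
    counts = np.maximum(agent.history.sum(axis=1), 1)  # avoid dividing by zero for unseen arms
    sample_means = agent.history[:, 1] / counts
    return int(np.argmax(sample_means))  # exploit the best sample mean so far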
agent = EpsilonGreedy(mab.K, mab.T, np.full(mab.T, 0.1))
mab_loop(mab, agent)
plot_strategy(mab, agent)
Note that we let ε vary over time. In particular, we might want to gradually decrease ε as we learn more about the reward distributions and no longer need to spend time exploring.
It turns out that setting $\epsilon_t = \sqrt[3]{K \ln(t) / t}$ also achieves a regret of $\tilde O(t^{2/3} K^{1/3})$ (ignoring the logarithmic factors).

In ETC, we had to set $N_{\text{explore}}$ based on the total number of timesteps $T$. But the epsilon-greedy algorithm handles the exploration automatically: the guarantee above holds at every timestep $t$, so it doesn't depend on knowing the horizon $T$ in advance.
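For instance, a sketch of such a decaying schedule (assuming the cube-root form above; not from the original text):

t_range = np.arange(1, mab.T + 1)
eps_schedule = np.clip(np.cbrt(mab.K * np.log(t_range + 1) / t_range), 0.0, 1.0)  # ε_t = (K ln t / t)^(1/3), clipped to a valid probability
agent = EpsilonGreedy(mab.K, mab.T, eps_schedule)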
But the way these algorithms explore is rather naive: we’ve been exploring uniformly across all the arms. But what if we could be smarter about it, and explore more for arms that we’re less certain about?
3.6 Upper Confidence Bound (UCB)
To quantify how certain we are about the mean of each arm, we'll compute confidence intervals for our estimators, and then choose the arm with the highest upper confidence bound. This operates on the principle of the benefit of the doubt (i.e. optimism in the face of uncertainty): we act as if each arm's mean is as high as is plausibly consistent with the data we've seen so far. The subtlety is that the number of times we've pulled a given arm is itself random (it depends on the agent's past choices and observations), so the concentration bound we use needs to hold uniformly across all timesteps and arms. Let's introduce some notation to discuss this.

Let $N^k_t$ denote the number of times we've pulled arm $k$ within the first $t$ timesteps, and let $\hat\mu^k_t$ denote the sample mean of the corresponding rewards. If $N^k_t$ were a fixed sample size, Hoeffding's inequality would tell us that, with probability at least $1 - \delta$,

$$|\hat\mu^k_t - \mu^k| \le \sqrt{\frac{\ln(2/\delta)}{2 N^k_t}}.$$

But $N^k_t$ is random, so we can't apply Hoeffding's inequality directly. To achieve the "fixed sample size" assumption, we'll need to shift our index from time to number of samples from each arm. In particular, we'll define $\tilde\mu^k_n$ to be the sample mean of the first $n$ pulls of arm $k$; this is an average of a fixed number $n$ of i.i.d. rewards, so Hoeffding's inequality applies to it for each fixed $n$, and we can take a union bound over the possible values of $N^k_t$. In particular, since $N^k_t \le t$, replacing $\delta$ with $\delta'/t$ gives that, with probability at least $1 - \delta'$,

$$|\hat\mu^k_t - \mu^k| \le \sqrt{\frac{\ln(2t/\delta')}{2 N^k_t}}.$$

This bound would then suffice for applying the UCB algorithm! That is, the upper confidence bound for arm $k$ at time $t$ is

$$\hat\mu^k_t + \sqrt{\frac{\ln(2t/\delta')}{2 N^k_t}},$$

and at each timestep the algorithm pulls the arm whose upper confidence bound is largest, where we can choose $\delta'$ depending on how much we want to explore:

- A smaller $\delta'$ would give us a larger and higher-confidence interval, emphasizing the exploration term.
- A larger $\delta'$ would give a tighter and lower-confidence interval, prioritizing the current sample averages.

We can now use this to define the UCB algorithm.
class UCB(Agent):
def __init__(self, K: int, T: int, delta: float):
super().__init__(K, T)
self.delta = delta
def choose_arm(self):
        return solutions.ucb_choose_arm(self)
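A minimal sketch of the rule using the Agent's `history` counts (an assumption, not the book's solutions code):

def ucb_choose_arm_sketch(agent: UCB) -> int:
    t = len(agent.rewards) + 1  # 1-indexed timestep so that ln(2t/δ') is well-defined
    counts = agent.history.sum(axis=1)  # N^k_t for each arm
    if (counts == 0).any():
        return int(np.argmin(counts))  # pull each arm at least once first
    sample_means = agent.history[:, 1] / counts
    bonus = np.sqrt(np.log(2 * t / agent.delta) / (2 * counts))
    return int(np.argmax(sample_means + bonus))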
Intuitively, UCB prioritizes arms where:
- $\hat\mu^k_t$ is large, i.e. the arm has a high sample average, and we'd choose it for exploitation, and
- $\sqrt{\frac{\ln(2t/\delta')}{2N^k_t}}$ is large, i.e. we're still uncertain about the arm, and we'd choose it for exploration.
As desired, this explores in a smarter, adaptive way compared to the previous algorithms. Does it achieve lower regret?
agent = UCB(mab.K, mab.T, 0.9)
mab_loop(mab, agent)
plot_strategy(mab, agent)
3.6.1 UCB regret analysis
First we’ll bound the regret incurred at each timestep. Then we’ll bound the total regret across timesteps.
For the sake of analysis, we'll use a slightly looser bound that applies across the whole time horizon and across all arms. We'll omit the derivation since it's very similar to the above (walk through it yourself for practice): with probability at least $1 - \delta$, for all arms $k$ and all timesteps $t$,

$$|\hat\mu^k_t - \mu^k| \le \sqrt{\frac{\ln(2TK/\delta)}{2 N^k_t}} =: B^k_t.$$

Intuitively, on this event the regret we incur at each timestep is at most twice the width of the chosen arm's confidence interval: since $a_t$ has the highest upper confidence bound,

$$\mu^{k^\star} \le \hat\mu^{k^\star}_t + B^{k^\star}_t \le \hat\mu^{a_t}_t + B^{a_t}_t \le \mu^{a_t} + 2 B^{a_t}_t.$$

Summing this across timesteps gives

$$\text{Regret}_T \le \sum_{t=0}^{T-1} 2 B^{a_t}_t = \sqrt{2 \ln(2TK/\delta)} \sum_{t=0}^{T-1} \frac{1}{\sqrt{N^{a_t}_t}} \le \sqrt{2 \ln(2TK/\delta)} \sum_{k=0}^{K-1} 2\sqrt{N^k_T} \le \sqrt{2 \ln(2TK/\delta)} \cdot 2\sqrt{KT},$$

where we grouped the terms by arm, used $\sum_{n=1}^{N} 1/\sqrt{n} \le 2\sqrt{N}$, and in the last step applied the Cauchy-Schwarz inequality together with $\sum_k N^k_T = T$.

Putting everything together gives, with probability at least $1 - \delta$,

$$\text{Regret}_T = O\left( \sqrt{TK \ln(TK/\delta)} \right).$$

In fact, we can do a more sophisticated analysis to trim off a logarithmic factor: a well-tuned variant of UCB achieves regret $O(\sqrt{TK})$, matching the lower bound discussed next.
3.6.2 Lower bound on regret (intuition)

Is it possible to do better than $\Omega(\sqrt{TK})$ in general? In fact, no: any algorithm must incur $\Omega(\sqrt{TK})$ regret in the worst case. We won't prove this rigorously here, but the intuition is as follows. With roughly $T/K$ pulls per arm, the agent can only estimate each mean to within about $\sqrt{K/T}$, so an adversary can make the best and second-best arms differ by about that much; the agent then pulls a suboptimal arm a constant fraction of the time, incurring regret on the order of $T \cdot \sqrt{K/T} = \sqrt{TK}$.
3.7 Thompson sampling and Bayesian bandits

So far, we've treated the parameters $\mu^0, \dots, \mu^{K-1}$ of the reward distributions as fixed but unknown quantities. Another approach, the Bayesian approach, is to treat them as random variables: we place a prior distribution over the means, and as the agent observes rewards, it updates this prior to a posterior distribution using Bayes' rule.
From this Bayesian perspective, the Thompson sampling algorithm follows naturally: just sample from the distribution of the optimal arm, given the observations!
class Distribution:
def sample(self) -> Float[Array, " K"]:
"""Sample a vector of means for the K arms."""
...
def update(self, arm: int, reward: float):
"""Condition on obtaining `reward` from the given arm."""
        ...
class ThompsonSampling(Agent):
def __init__(self, K: int, T: int, prior: Distribution):
super().__init__(K, T)
self.distribution = prior
    def choose_arm(self):
        """Sample a mean for each arm from the current posterior and pick the best arm under that sample."""
        sample = self.distribution.sample()
        return int(np.argmax(sample))
def update_history(self, arm: int, reward: int):
super().update_history(arm, reward)
        self.distribution.update(arm, reward)
In other words, we sample each arm proportionally to how likely we think it is to be optimal, given the observations so far. This strikes a good exploration-exploitation tradeoff: we explore more for arms that we’re less certain about, and exploit more for arms that we’re more certain about. Thompson sampling is a simple yet powerful algorithm that -achieves state-of-the-art performance in many settings.
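For Bernoulli rewards, a natural choice of prior is a Beta distribution over each arm's mean, since it remains a Beta distribution after conditioning on a 0/1 reward. Here is a minimal sketch of such a `Distribution` (an illustration consistent with the interface above, not code from the book), together with a usage example:

class BetaPrior(Distribution):
    """Independent Beta(α, β) posteriors over the mean of each Bernoulli arm."""

    def __init__(self, K: int):
        self.alpha = np.ones(K)  # 1 + number of observed 1-rewards per arm
        self.beta = np.ones(K)  # 1 + number of observed 0-rewards per arm

    def sample(self) -> Float[Array, " K"]:
        return np.random.beta(self.alpha, self.beta)

    def update(self, arm: int, reward: float):
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

agent = ThompsonSampling(mab.K, mab.T, BetaPrior(mab.K))
mab_loop(mab, agent)
plot_strategy(mab, agent)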
It turns out that asymptotically, Thompson sampling is optimal in the following sense. Lai & Robbins (1985) prove an instance-dependent lower bound that says for any bandit algorithm,

$$\liminf_{T \to \infty} \frac{\mathbb{E}[N^k_T]}{\ln T} \ge \frac{1}{\mathrm{KL}(\mu^k \parallel \mu^{k^\star})} \quad \text{for every suboptimal arm } k,$$

where

$$\mathrm{KL}(\mu^k \parallel \mu^{k^\star}) = \mu^k \ln \frac{\mu^k}{\mu^{k^\star}} + (1 - \mu^k) \ln \frac{1 - \mu^k}{1 - \mu^{k^\star}}$$

measures the Kullback-Leibler divergence from the Bernoulli distribution with mean $\mu^k$ to the Bernoulli distribution with mean $\mu^{k^\star}$. Thompson sampling matches this lower bound, which means that not only is its asymptotic regret rate optimal, but the constant factor is optimal as well.
3.8 Contextual bandits
This content is advanced material taught at the end of the course.
In the above MAB environment, the reward distributions of the arms remain constant. However, in many real-world settings, we might receive additional information that affects these distributions. For example, in the online advertising case where each arm corresponds to an ad we could show the user, the user's profile and browsing history might affect which ads are relevant to them. This additional information is called the context. At each timestep, the agent first gets to observe the context, and choose an action in response; the reward distribution also depends on the context.
Assuming our context is discrete, we can just perform the same algorithms, treating each context-arm pair as its own arm. This gives us an enlarged MAB of $K |\mathcal{X}|$ arms, where $\mathcal{X}$ denotes the (finite) set of contexts.

Write down the UCB algorithm for this enlarged MAB. That is, write an expression for the upper confidence bound of a given context-arm pair.

Recall that running UCB for $T$ timesteps on an MAB with $K$ arms achieves a regret bound of $\tilde O(\sqrt{TK})$. Applied to the enlarged MAB, this gives $\tilde O(\sqrt{TK|\mathcal{X}|})$, which scales poorly when there are many possible contexts and treats each context independently, ignoring any structure shared across contexts.
3.8.1 Linear contextual bandits
We want to model the mean reward of arm $k$ as a linear function of the context $x_t \in \mathbb{R}^D$:

$$\mu^k(x) = x^\top \theta^k \quad \text{for some unknown parameter } \theta^k \in \mathbb{R}^D.$$

A natural way to estimate $\theta^k$ is by least squares on the data collected so far for that arm, i.e. the timesteps where arm $k$ was chosen:

$$\hat\theta^k_t := \arg\min_{\theta \in \mathbb{R}^D} \sum_{i \in \mathcal{I}^k_t} \left( r_i - x_i^\top \theta \right)^2, \qquad \mathcal{I}^k_t := \{ i < t : a_i = k \}.$$

This has the closed-form solution known as the ordinary least squares (OLS) estimator:

$$\hat\theta^k_t = (A^k_t)^{-1} \sum_{i \in \mathcal{I}^k_t} x_i r_i, \qquad A^k_t := \sum_{i \in \mathcal{I}^k_t} x_i x_i^\top.$$

We can also construct an upper confidence bound for the predicted reward, in the same spirit as UCB. For a random variable $Y$ with mean $\mu$ and variance $\sigma^2$, Chebyshev's inequality states that $|Y - \mu| \le \beta \sigma$ with probability at least $1 - 1/\beta^2$. Since the OLS estimator is known to be unbiased (try proving this yourself), we can apply Chebyshev's inequality to the predicted reward $x_t^\top \hat\theta^k_t$: with probability at least $1 - 1/\beta^2$,

$$x_t^\top \theta^k \le x_t^\top \hat\theta^k_t + \beta \sqrt{x_t^\top (A^k_t)^{-1} x_t}.$$

We haven't explained why $x_t^\top (A^k_t)^{-1} x_t$ is the right expression for the variance of $x_t^\top \hat\theta^k_t$; it follows from a direct calculation of the covariance of the OLS estimator.

The first term is exactly our predicted reward $\hat\mu^k_t(x_t) := x_t^\top \hat\theta^k_t$. To interpret the second term, note that

$$x_t^\top (A^k_t)^{-1} x_t = \frac{1}{N^k_t} \, x_t^\top (\Sigma^k_t)^{-1} x_t,$$

where $N^k_t = |\mathcal{I}^k_t|$ is the number of times arm $k$ has been pulled and $\Sigma^k_t = \frac{1}{N^k_t} A^k_t$ is the empirical covariance matrix of the contexts (assuming that the context has mean zero). That is, the learner is encouraged to choose arms when $x_t$ is not well covered by the contexts it has already seen for that arm, i.e. when its estimate of $\theta^k$ in the direction of $x_t$ is still uncertain.

We can now substitute these quantities into UCB to get the LinUCB algorithm:
class LinUCBPseudocode(Agent):
def __init__(
self, K: int, T: int, D: int, lam: float, get_c: Callable[[int], float]
):
        super().__init__(K, T)
        self.lam = lam  # regularization strength for the ridge estimate
        self.get_c = get_c  # c_t: scales the width of the confidence interval at time t
        self.D = D
        self.A = np.repeat(lam * np.eye(D)[np.newaxis, :, :], K, axis=0)  # regularized A^k matrices, one per arm
        self.targets = np.zeros((K, D))  # running sums of context * reward per arm
        self.w = np.zeros((K, D))  # current (ridge) estimates of theta^k
def update_history(self, context: Float[Array, " D"], arm: int, reward: int):
self.A[arm] += np.outer(context, context)
self.targets[arm] += context * reward
        self.w[arm] = np.linalg.solve(self.A[arm], self.targets[arm])
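The arm-selection step is omitted above; a sketch of what it could look like, pairing the ridge estimates with the confidence width derived earlier (an assumption, not the book's code; `t` is the current timestep):

def linucb_choose_arm_sketch(agent: LinUCBPseudocode, context: np.ndarray, t: int) -> int:
    ucbs = []
    for k in range(agent.K):
        A_inv = np.linalg.inv(agent.A[k])
        mean = agent.w[k] @ context  # predicted reward x^T w^k
        width = agent.get_c(t) * np.sqrt(context @ A_inv @ context)  # exploration bonus
        ucbs.append(mean + width)
    return int(np.argmax(ucbs))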
Note that the matrix $A^k_t$ above might not be invertible if arm $k$ hasn't yet seen enough (or diverse enough) contexts. This is why the pseudocode initializes each $A^k$ to $\lambda I$ for some $\lambda > 0$, which corresponds to replacing OLS with ridge regression.

Using similar tools as for UCB, we can also prove an $\tilde O(\sqrt{T})$ regret bound for LinUCB; see Agarwal et al. (2022) for a full analysis.
3.9 Summary

In this chapter, we explored the multi-armed bandit setting for analyzing sequential decision-making in an unknown environment.
- Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.
- Lai, T. L., & Robbins, H. (1985). Asymptotically Efficient Adaptive Allocation Rules. Advances in Applied Mathematics, 6(1), 4–22. 10.1016/0196-8858(85)90002-8
- Agarwal, A., Jiang, N., Kakade, S. M., & Sun, W. (2022). Reinforcement Learning: Theory and Algorithms.