
Commit

first draft of imitation learning chapter
adzcai committed Aug 8, 2024
1 parent 0d25179 commit 7af8b08
Showing 7 changed files with 592 additions and 32 deletions.
3 changes: 2 additions & 1 deletion book/_toc.yml
@@ -4,7 +4,7 @@
format: jb-book
root: index.md
options:
numbered: 2
numbered: true
chapters:
- file: intro.md
- file: bandits.md
@@ -13,6 +13,7 @@ chapters:
- file: control.md
- file: pg.md
- file: exploration.md
- file: imitation_learning.md
# - file: challenges
# - file: appendix
- file: bibliography.md
19 changes: 12 additions & 7 deletions book/bandits.md
@@ -484,6 +484,10 @@ plot_strategy(mab, agent)

Note that we let $\epsilon$ vary over time. In particular, we might want to gradually *decrease* $\epsilon$ as we learn more about the reward distributions and no longer need to spend time exploring.

:::{attention}
What is the expected regret of the algorithm if we set $\epsilon$ to be a constant?
:::

It turns out that setting $\epsilon_t = \sqrt[3]{K \ln(t)/t}$ also achieves a regret of $\tilde O(t^{2/3} K^{1/3})$, where the $\tilde O(\cdot)$ notation hides logarithmic factors. (We will not prove this here.) TODO ADD PROOF CITATION

In ETC, we had to set $N_{\text{explore}}$ based on the total number of timesteps $T$. But the epsilon-greedy algorithm actually handles the exploration *automatically*: the regret rate holds for *any* $t$, and doesn’t depend on the final horizon $T$.
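
As a quick illustration, here is one way the decaying schedule above might be written; this is only a sketch (it assumes NumPy, clips the probability to $[0, 1]$, and special-cases the first steps to avoid dividing by zero), not part of the chapter's agent code:

```python
import numpy as np

def decaying_epsilon(t: int, K: int) -> float:
    """Exploration probability ε_t = (K ln t / t)^(1/3), clipped to [0, 1]."""
    t = max(t, 2)  # avoid division by zero and ln(1) = 0 on the first steps
    return min(1.0, (K * np.log(t) / t) ** (1 / 3))
```
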
@@ -852,10 +856,11 @@ This has the closed-form solution known as the *ordinary least squares*

:::{math}
:label: ols_bandit

\begin{aligned}
\hat \theta_t^k & = (A_t^k)^{-1} \sum_{\{ i \in [t] : a_i = k \}} x_i r_i \\
\text{where} \quad A_t^k & = \sum_{\{ i \in [t] : a_i = k \}} x_i x_i^\top.
\end{aligned}
:::
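
For concreteness, here is a minimal NumPy sketch of the estimator in {eq}`ols_bandit`; the small `lam` ridge term is an added assumption to keep $A_t^k$ invertible (the LinUCB code below likewise takes a `lam` parameter):

```python
import numpy as np

def ols_estimate(X_k: np.ndarray, r_k: np.ndarray, lam: float = 1e-6) -> np.ndarray:
    """OLS estimate of θ^k from the contexts X_k (n × d) and rewards r_k (n,)
    observed on the timesteps where arm k was pulled."""
    A_k = X_k.T @ X_k + lam * np.eye(X_k.shape[1])  # ridge term keeps A_k invertible
    return np.linalg.solve(A_k, X_k.T @ r_k)
```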

We can now apply the UCB algorithm in this environment in order to
@@ -877,7 +882,9 @@ $$|Y| \le \beta \sigma \quad \text{with probability} \ge 1 - \frac{1}{\beta^2}$$

Since the OLS estimator is known to be unbiased (try proving this
yourself), we can apply Chebyshev's inequality to
$x_t^\top (\hat \theta_t^k - \theta^k)$:

$$\begin{aligned}
x_t^\top \theta^k \le x_t^\top \hat \theta_t^k + \beta \sqrt{x_t^\top (A_t^k)^{-1} x_t} \quad \text{with probability} \ge 1 - \frac{1}{\beta^2}
\end{aligned}$$
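
In code, this optimistic estimate for a single arm might be computed as follows (a sketch assuming NumPy arrays for the context, the OLS estimate, and the matrix $A_t^k$):

```python
import numpy as np

def linucb_upper_bound(x: np.ndarray, theta_hat: np.ndarray, A_k: np.ndarray, beta: float) -> float:
    """Optimistic value x^T θ̂_k + β sqrt(x^T (A_k)^{-1} x) for arm k at context x."""
    width = np.sqrt(x @ np.linalg.solve(A_k, x))  # solve instead of explicitly inverting A_k
    return x @ theta_hat + beta * width
```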

@@ -901,7 +908,7 @@ We can now substitute these quantities into UCB to get the **LinUCB**
algorithm:

```{code-cell}
class LinUCBPseudocode(Agent):
def __init__(
self, K: int, T: int, D: int, lam: float, get_c: Callable[[int], float]
):
Expand Down Expand Up @@ -944,6 +951,4 @@ regret bound. The full details of the analysis can be found in Section 3 of {cit

## Summary

21 changes: 17 additions & 4 deletions book/exploration.md
@@ -22,7 +22,11 @@ In the multi-armed bandits chapter {ref}`bandits`, where the state never changes
:::{prf:definition} Per-episode regret
:label: per_episode_regret

To quantify the performance of a learning algorithm, we will consider its per-episode regret over $T$ timesteps/episodes:

$$\text{Regret}_T = \E\left[ \sum_{t=0}^{T-1} V^\star_0(s_0) - V^{\pi^t}_0(s_0) \right]$$

where $\pi^t$ is the policy generated by the algorithm at the $t$th iteration.
:::

### Sparse reward
@@ -57,6 +61,10 @@
The shortest path computation can be implemented using DP. We leave this as an exercise.
::::

```{code-cell}
def explore_then_exploit(mdp: MDP):
    ...  # left as an exercise: reach unexplored state-action pairs via shortest paths over the known transitions, then exploit
```

:::{prf:theorem} Performance of explore-then-exploit
:label: explore_then_exploit_performance

As long as every state can be reached from $s_0$ within a single episode, i.e. $|\mathcal{S}| \le \hor$, this algorithm will eventually explore all $|\mathcal{S}| |\mathcal{A}|$ state-action pairs, adding one new transition per episode. We know it will take at most $|\mathcal{S}| |\mathcal{A}|$ iterations to explore the entire MDP, after which $\pi^t = \pi^\star$, incurring no additional regret. For each $\pi^t$ up until then, corresponding to the shortest-path policies $\tilde \pi$, the value of policy $\pi^t$ will differ from that of $\pi^\star$ by at most $\hor$, since the policies will differ by at most $1$ reward at each timestep. So,

$$\sum_{t=0}^{T-1} V^\star_0 - V_0^{\pi^t} \le |\mathcal{S}||\mathcal{A}| \hor.$$

(Note that this MDP and algorithm are deterministic, so the regret is not random.)
:::
@@ -208,11 +216,16 @@ A polynomial dependency on $|\mathcal{S}|$ and $|\mathcal{A}|$ is manageable whe
:::{prf:definition} Linear MDP
:label: linear_mdp

We assume that the transition probabilities and rewards are *linear* in some feature vector $\phi(s, a) \in \mathbb{R}^d$:

$$\begin{aligned}
P_\hi(s' \mid s, a) & = \phi(s, a)^\top \mu^\star_\hi(s') \\
r_\hi(s, a) & = \phi(s, a)^\top \theta_\hi^\star
\end{aligned}$$

Note that we can also write $P_\hi(\cdot \mid s, a) = \mu^\star_\hi \phi(s, a)$, where we think of $\mu_\hi^\star$ as an $|\mathcal{S}| \times d$ matrix and of $\mu^\star_\hi(s')$ as its $s'$-th row (treated as a column vector in the formula above). Thinking of $V^\star_{\hi+1}$ as an $|\mathcal{S}|$-dimensional vector, this allows us to write

$$\E_{s' \sim P_\hi(\cdot \mid s, a)}[V^\star_{\hi+1}(s')] = (\mu^\star_\hi \phi(s, a))^\top V^\star_{\hi+1}.$$

The $\phi$ feature mapping can be designed to capture interactions between the state $s$ and action $a$. In this book, we'll assume that the feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ and the reward function (described by $\theta_\hi^\star$) are known to the learner.
:::
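
To make the matrix picture concrete, here is a toy NumPy check of the identity above. The numbers are arbitrary and the rows of $\mu^\star_\hi$ are not normalized, so this only illustrates the algebra (the $|\mathcal{S}|$-dimensional expectation collapsing to a $d$-dimensional inner product), not a fully valid linear MDP:

```python
import numpy as np

rng = np.random.default_rng(0)
S, d = 6, 3
mu_star = rng.random((S, d))      # μ*_h as an |S| × d matrix
phi_sa = rng.random(d)            # φ(s, a) for a single state-action pair
V_next = rng.random(S)            # V*_{h+1} as an |S|-dimensional vector

P_sa = mu_star @ phi_sa           # "P_h(· | s, a)" via the linear structure
lhs = P_sa @ V_next               # Σ_{s'} P_h(s' | s, a) V*_{h+1}(s')
rhs = phi_sa @ (mu_star.T @ V_next)  # the same quantity in feature space
assert np.isclose(lhs, rhs)
```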

### Planning in a linear MDP
141 changes: 141 additions & 0 deletions book/imitation_learning.md
@@ -0,0 +1,141 @@
---
jupytext:
text_representation:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.16.2
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---

# Imitation Learning

Imagine you are tasked with learning how to drive. How do, or did, you go about it?
At first, this task might seem insurmountable: there is a vast array of controls, and the cost of making a single mistake could be extremely high, making it hard to explore by trial and error.
Luckily, there are already people in the world who know how to drive who can get you started.
In this and many other examples, we all "stand on the shoulders of giants" and learn skills from experts who have already mastered them.

Now, in machine learning, much of the time we are trying to teach machines to accomplish tasks that we humans are already proficient at.
In such cases, the machine learning algorithm is the one learning the new skill, and humans are the "experts" that can demonstrate how to perform the task.
**Imitation learning** is a direct application of this idea to machine learning for interactive tasks.
We'll see that the most naive form of imitation learning, called **behavioural cloning**, is really an application of supervised learning to interactive tasks.
We'll then explore **dataset aggregation** (DAgger) as a way to query an expert and learn even more effectively.

## Behavioural cloning

This notion of "learning from human-provided data" may remind you of the basic premise of {ref}`supervised_learning`,
in which there is some mapping from _inputs_ to _outputs_ that we humans can implicitly compute, such as seeing a photo and being able to recognize its constituents.
To teach a machine to calculate this mapping, we first collect a large _training dataset_ by getting people to label a lot of inputs,
and then use some optimization algorithm to produce a predictor that maps from the inputs to the outputs as closely as possible.
How does this relate to interactive tasks?
Here, the input is the observation seen by the agent and the output is the action it selects, so the mapping is the agent's policy.
What's stopping us from applying supervised learning techniques?
In practice, nothing! This is called **behavioural cloning.**

:::{prf:algorithm} Behavioural cloning
1. Collect a training dataset of trajectories generated by an expert policy $\pi_\text{data}$. Here, we treat each state-action pair as independent, resulting in a dataset $\mathcal{D} = (s^n, a^n)_{n=1}^{N}$. (For concreteness, if there are $M$ trajectories with a horizon $H$, then $N = M \times H$; one way to flatten trajectories into such a dataset is sketched just after this algorithm.)
- Note that this is an inaccurate approximation! A key property of interactive tasks is that the agent's output -- the action that it takes -- may influence its next observation.
2. Use a SL algorithm $\texttt{fit} : \mathcal{D} \mapsto \tilde \pi$ to extract a policy $\tilde \pi$ that approximates the expert policy.
:::
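
As referenced in step 1, here is one possible way to flatten expert trajectories into the dataset $\mathcal{D}$. This is a sketch: the trajectory container with `.states` and `.actions` attributes is an assumed interface, not something defined in this chapter.

```python
import numpy as np

def flatten_trajectories(trajectories):
    """Turn M expert trajectories of horizon H into N = M × H (state, action) pairs,
    treating each pair as an independent sample."""
    states, actions = [], []
    for τ in trajectories:
        for s, a in zip(τ.states, τ.actions):
            states.append(s)
            actions.append(a)
    return np.array(states), np.array(actions)
```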

Typically, this second task can be framed as **empirical loss minimization**:

:::{math}
\tilde \pi = \arg\min_{\pi \in \Pi} \sum_{n=0}^{N-1} \text{loss}(\pi(s^n), a^n)
:::

where $\Pi$ is some class of possible policies, $\text{loss}$ is the loss function to measure how far off the policy's prediction is, and the SL algorithm tells us how to compute this $\arg\min$.
If we are training a deterministic policy (just a function from inputs to outputs, with no randomness), we might try to minimize the **mean squared error**.
More generally, though, we often choose the **negative log likelihood** as our loss function, so that the optimization is equivalent to **maximum likelihood estimation**:
out of the space of all possible mappings, we search for the one according to which the training dataset is the most likely.

:::{math}
\tilde \pi = \arg\max_{\pi \in \Pi} \Pr_{a^n \sim \pi(s^n)}(a^{0:N} \mid s^{0:N})
:::
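
As a concrete (hypothetical) instance of the `fit` step, the sketch below treats behavioural cloning with a discrete action space as multiclass classification. The choice of logistic regression as the supervised learner, and the scikit-learn API, are assumptions made for illustration rather than part of the chapter:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def behavioural_cloning_fit(states: np.ndarray, actions: np.ndarray):
    """Fit π̃ by multiclass classification on expert (state, action) pairs.

    Minimizing the logistic (cross-entropy) loss is exactly the negative
    log likelihood objective described above.
    """
    classifier = LogisticRegression(max_iter=1000).fit(states, actions)
    return lambda s: classifier.predict(np.asarray(s)[None, :])[0]  # π̃ : state -> action
```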

Can we quantify how well this algorithm works?
For simplicity, let's consider the case where the action space is discrete and both the data and trained policy are deterministic.
(This corresponds to a classification task in SL.)
Suppose the SL algorithm obtains $\varepsilon$ classification error.
That is, for trajectories drawn from the expert policy,
the learned policy chooses a different action on at most an $\varepsilon$ fraction of timesteps:

:::{math}
\mathbb{E}_{\tau \sim \rho_{\pi_{\text{data}}}} \left[ \frac 1 \hor \sum_{\hi=0}^{\hor-1} \ind{ \tilde \pi(s_\hi) \ne \pi_{\text{data}} (s_\hi) } \right] \le \varepsilon
:::

Then, their value functions differ by

:::{math}
| V^{\pi_{\text{data}}} - V^{\tilde \pi} | \le H^2 \varepsilon
:::

where $H$ is the horizon.

:::{prf:theorem} Performance of behavioural cloning

Recall that the {prf:ref}`pdl` allows us to express the difference between $\pi_{\text{data}}$ and $\tilde \pi$ as

$$
V_0^{\pi_{\text{data}}}(s) - V_0^{\tilde \pi} (s) = \E_{\tau \sim \rho^{\pi_{\text{data}}} \mid s_0 = s} \left[ \sum_{\hi=0}^{\hor-1} A_\hi^{\tilde \pi} (s_\hi, a_\hi) \right].
$$

Now since the data policy is deterministic, we can substitute $a_\hi = \pi_{\text{data}}(s_\hi)$.
This allows us to make a further simplification:
for any deterministic policy $\pi$, the advantage of its own action is zero, since $Q^{\pi}(s, \pi(s)) = V^{\pi}(s)$:

$$
A^{\pi}(s, \pi(s)) = Q^{\pi}(s, \pi(s)) - V^{\pi}(s) = 0.
$$

Now we can use the assumption that the SL algorithm obtains $\varepsilon$ classification error. By the above, $A_\hi^{\tilde \pi}(s_\hi, \pi_{\text{data}}(s_\hi)) = 0$ when $\pi_{\text{data}}(s_\hi) = \tilde \pi(s_\hi)$. In the case where the two policies differ on $s_\hi$, which occurs with probability at most $\varepsilon$ (averaged over the timesteps), the advantage is naively upper bounded by $H$ (assuming rewards are bounded between $0$ and $1$). Summing this over the $H$ timesteps of an episode then gives the desired $H^2 \varepsilon$ bound.
:::

<!-- TODO ADD DISTRIBUTION SHIFT EXAMPLE FROM SLIDES -->

## Distribution shift

Let us return to the driving analogy. Suppose you have taken some driving lessons and now feel comfortable in your neighbourhood. But today you have to travel to an area you haven't visited before, such as a highway, where it would be dangerous to try to apply the techniques you've already learned.
This is the issue of _distribution shift_: a policy learned under some distribution of states may not perform well if this distribution changes.

This is already a common issue in supervised learning, where the training dataset for a model might not resemble the environment where it gets deployed. In interactive environments, this issue is further exacerbated by the dependency between the observations and the agent's behaviour; if you take a wrong turn early on, it may be difficult or impossible to recover in that trajectory.

How could you learn a strategy for these new settings?
In the driving example, you might decide to install a dashcam to record the car's surroundings. That way, once you make it back to safety, you can show the recording to an expert, who can provide feedback at each step of the way.
Then the next time you go for a drive, you can remember the expert's advice, and take a safer route.
You could then repeat this training as many times as desired, thereby collecting the expert's feedback over a diverse range of locations.
This is the key idea behind _dataset aggregation_.

## Dataset aggregation (DAgger)

The DAgger algorithm is due to {cite}`ross_reduction_2010`. It alternates between collecting trajectories with the current policy, querying the expert for the correct action in each visited state, aggregating these expert-labelled pairs into the training dataset, and refitting the policy on the aggregated dataset. In pseudocode:

```python
def dagger_pseudocode(
    env: MDP,
    π_init: Policy,
    π_expert: Policy,
    n_dagger_iterations: int,
    n_trajectories_per_iteration: int
):
    π = π_init
    dataset = set()

    for _ in range(n_dagger_iterations):
        for __ in range(n_trajectories_per_iteration):
            # roll out the *current* policy so we visit the states it actually reaches
            τ = collect_trajectory(π, env)
            # relabel each visited state with the expert's action
            for step in range(env.H):
                obs = τ.state[step]
                τ.action[step] = π_expert(obs)
            dataset.add(τ)

        # refit the policy on all expert-labelled data collected so far
        π = fit(dataset)

    return π
```



