Showing 41 changed files with 1,166 additions and 921 deletions.
2 changes: 1 addition & 1 deletion
build/_assets/app-TARM6IJU.css → build/_assets/app-H3NBUYVS.css

78 changes: 39 additions & 39 deletions
build/_shared/chunk-P4DJOY6Q.js → build/_shared/chunk-JLDGA2DL.js

2 changes: 1 addition & 1 deletion
build/_shared/chunk-AC25E3GK.js → build/_shared/chunk-N544LW6X.js

Binary file not shown.
215 changes: 215 additions & 0 deletions
build/imitation_learning-bf09ff59ddcdb66b7ab3f1189910eb31.md

---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
  jupytext_version: 1.16.2
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
numbering:
  enumerator: 7.%s
---

# 7 Imitation Learning

## Introduction

Imagine you are tasked with learning how to drive. How do, or did, you go about it?
At first, this task might seem insurmountable: there is a vast array of controls, and the cost of making a single mistake could be extremely high, making it hard to explore by trial and error.
Luckily, there are already people in the world who know how to drive who can get you started.
In almost every challenge we face,
we "stand on the shoulders of giants" and learn skills from experts who have already mastered them.

![a robot imitating the pose of a young child (Photo by Pavel Danilyuk: https://www.pexels.com/photo/a-robot-imitating-a-girl-s-movement-8294811/)](./shared/robot-imitation-learning.jpg)

In machine learning,
we are often trying to teach machines to accomplish tasks that humans are already proficient at.
In such cases, the machine learning algorithm is the one learning the new skill, and humans are the "experts" who can demonstrate how to perform the task.
**Imitation learning** is a strategy for getting the learner to perform at least as well as the expert.
We'll see that the most naive form of imitation learning, called **behavioral cloning**, is really an application of supervised learning to interactive tasks.
We'll then explore **dataset aggregation** (DAgger) as a way to query an expert and learn even more effectively.

## Behavioral cloning

This notion of "learning from human-provided data" may remind you of the basic premise of [](./supervised_learning.md).
In supervised learning,
there is some mapping from _inputs_ to _outputs_,
such as the task of assigning the correct label to an image,
that humans can implicitly compute.
To teach a machine to calculate this mapping,
we first collect a large _training dataset_ by getting people to label a lot of inputs,
and then use some optimization algorithm to produce a predictor that maps from the inputs to the outputs as closely as possible.

How does this relate to interactive tasks?
Here, the input is the observation seen by the agent and the output is the action it selects,
so the mapping is the agent's _policy_.
What's stopping us from applying supervised learning techniques to mimic the expert's policy?
In principle, nothing!
This is called **behavioral cloning**.

:::{prf:definition} Behavioral cloning
:label: behavioral_cloning

1. Collect a training dataset of trajectories $\mathcal{D} = (s^n, a^n)_{n=1}^{N}$ generated by an **expert policy** $\pi_\text{expert}$. (For example, if the dataset contains $M$ trajectories, each with a finite horizon $H$, then $N = M \times H$.)
2. Use an SL algorithm $\texttt{fit} : \mathcal{D} \mapsto \widetilde{\pi}$ to extract a policy $\widetilde{\pi}$ that approximates the expert policy.
:::

Typically, this second step can be framed as **empirical loss minimization**:

:::{math}
\widetilde{\pi} = \arg\min_{\pi \in \Pi} \sum_{n=1}^{N} \text{loss}(\pi(s^n), a^n)
:::

where $\Pi$ is some class of possible policies, $\text{loss}$ is a loss function that measures how different the policy's prediction is from the true observed action,
and the SL algorithm itself, also known as the **fitting method**, tells us how to compute this $\arg\min$.
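
To make this concrete, here is a minimal sketch of behavioral cloning as empirical loss minimization. It assumes continuous actions, a linear policy class, and the squared-error loss, and the synthetic "expert" data is purely illustrative; none of these choices are prescribed by the text above.

```python
import numpy as np

def fit_bc_least_squares(states: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Fit a linear policy pi(s) = s @ W by minimizing the summed squared error
    over the expert dataset D = {(s^n, a^n)}."""
    # np.linalg.lstsq solves argmin_W sum_n || states[n] @ W - actions[n] ||^2
    W, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return W

# Synthetic "expert" data: N = 1000 state-action pairs with
# 4-dimensional states and 2-dimensional continuous actions.
rng = np.random.default_rng(0)
states = rng.normal(size=(1000, 4))
true_W = rng.normal(size=(4, 2))
actions = states @ true_W + 0.01 * rng.normal(size=(1000, 2))

W_hat = fit_bc_least_squares(states, actions)

def policy(s: np.ndarray) -> np.ndarray:
    """The cloned policy pi-tilde."""
    return s @ W_hat
```

Swapping in a different policy class or loss function only changes the `fit` step; the overall recipe of "collect expert data, then minimize empirical loss" stays the same.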

How should we choose the loss function?
In supervised learning, we saw that the **mean squared error** is a good choice for continuous outputs.
But how should we measure the difference between two actions in a _discrete_ action space?
In this setting, the policy acts more like a _classifier_ that picks the best action in a given state.
Rather than considering a deterministic policy that just outputs a single action,
we'll consider a stochastic policy $\pi$ that outputs a _distribution_ over actions.
This allows us to assign a _likelihood_ to observing the entire dataset $\mathcal{D}$ under the policy $\pi$,
assuming the state-action pairs are independent:

$$
\pr_\pi (\mathcal{D}) = \prod_{n=1}^{N} \pi(a_n \mid s_n)
$$

Note that the states and actions are _not_, however, actually independent! A key property of interactive tasks is that the agent's output -- the action that it takes -- may influence its next observation.
We want to find the policy under which the training dataset $\mathcal{D}$ is most likely.
This is called the **maximum likelihood estimate** of the policy that generated the dataset:

:::{math}
\widetilde{\pi} = \arg\max_{\pi \in \Pi} \pr_{\pi}(\mathcal{D})
:::

Equivalently, we can frame this as empirical loss minimization with the **negative log likelihood** as the loss function:

:::{math}
\begin{align*}
\widetilde{\pi} &= \arg\min_{\pi \in \Pi} - \log \pr_\pi(\mathcal{D}) \\
&= \arg\min_{\pi \in \Pi} \sum_{n=1}^N - \log \pi(a_n \mid s_n)
\end{align*}
:::
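
As a sketch of what maximum-likelihood fitting looks like in code, the following trains a softmax policy over linear state features by gradient descent on the negative log likelihood. The linear parameterization, learning rate, and number of steps are illustrative assumptions, not choices made in the text.

```python
import numpy as np

def nll(theta: np.ndarray, states: np.ndarray, actions: np.ndarray) -> float:
    """Negative log likelihood -sum_n log pi_theta(a_n | s_n)
    for a softmax policy pi_theta(a | s) proportional to exp(s @ theta[:, a])."""
    logits = states @ theta                                    # shape (N, |A|)
    log_z = np.log(np.exp(logits).sum(axis=1))                 # log normalizer per state
    return float(-(logits[np.arange(len(actions)), actions] - log_z).sum())

def fit_mle(states: np.ndarray, actions: np.ndarray, n_actions: int,
            lr: float = 0.1, steps: int = 500) -> np.ndarray:
    """Gradient descent on the mean negative log likelihood; returns theta."""
    n, d = states.shape
    theta = np.zeros((d, n_actions))
    one_hot = np.eye(n_actions)[actions]                       # (N, |A|) indicator of a_n
    for _ in range(steps):
        logits = states @ theta
        logits -= logits.max(axis=1, keepdims=True)            # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)               # pi_theta(. | s_n)
        grad = states.T @ (probs - one_hot) / n                 # gradient of mean NLL
        theta -= lr * grad
    return theta
```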

### Performance of behavioral cloning

Can we quantify how well this algorithm works?
For simplicity, let's consider the case where the action space is _finite_ and both the expert policy and learned policy are deterministic.
Suppose the learned policy obtains $\varepsilon$ _classification error_.
That is, for trajectories drawn from the expert policy,
the learned policy chooses a different action at most $\varepsilon$ of the time:

:::{math}
\mathbb{E}_{\tau \sim \rho_{\pi_{\text{expert}}}} \left[ \frac 1 \hor \sum_{\hi=0}^{\hor-1} \ind{ \widetilde{\pi}(s_\hi) \ne \pi_{\text{expert}} (s_\hi) } \right] \le \varepsilon
:::

Then their value functions differ by

:::{math}
| V^{\pi_{\text{expert}}} - V^{\widetilde{\pi}} | \le H^2 \varepsilon
:::

where $H$ is the horizon.

:::{prf:theorem} Performance of behavioral cloning

Recall that the {prf:ref}`pdl` allows us to express the difference between $\pi_{\text{expert}}$ and $\widetilde{\pi}$ as

$$
V_0^{\pi_{\text{expert}}}(s) - V_0^{\widetilde{\pi}} (s) = \E_{\tau \sim \rho^{\pi_{\text{expert}}} \mid s_0 = s} \left[ \sum_{\hi=0}^{\hor-1} A_\hi^{\widetilde{\pi}} (s_\hi, a_\hi) \right].
\label{eq:pdl-rhs}
$$

Since the expert policy is deterministic, we can substitute $a_\hi = \pi_{\text{expert}}(s_\hi)$. Determinism also implies that the advantage of the expert's own action under the expert policy is exactly zero:

$$
A^{\pi_{\text{expert}}}(s, \pi_{\text{expert}}(s)) = Q^{\pi_{\text{expert}}}(s, \pi_{\text{expert}}(s)) - V^{\pi_{\text{expert}}}(s) = 0.
$$

But the right-hand side of [](#eq:pdl-rhs) uses $A^{\widetilde{\pi}}$, not $A^{\pi_{\text{expert}}}$.
To bridge this gap,
we now use the assumption that $\widetilde{\pi}$ obtains $\varepsilon$ classification error.
Note that $A_\hi^{\widetilde{\pi}}(s_\hi, \pi_{\text{expert}}(s_\hi)) = 0$ when $\pi_{\text{expert}}(s_\hi) = \widetilde{\pi}(s_\hi)$.
When the two policies differ on $s_\hi$, which happens at most an $\varepsilon$ fraction of the time in expectation, the advantage is naively upper bounded by $H$ (assuming rewards are bounded between $0$ and $1$).
Summing over the $\hor$ timesteps, the expected number of disagreements is at most $H \varepsilon$, each contributing at most $H$, which gives the desired bound of $H^2 \varepsilon$.
:::
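
As a rough illustration (with hypothetical policy and trajectory interfaces, not defined in the text), one could estimate the classification error $\varepsilon$ on held-out expert trajectories and plug it into the bound:

```python
def estimate_classification_error(expert_trajectories, learned_policy, expert_policy):
    """Fraction of expert-visited states on which the learned policy disagrees
    with the expert: an empirical estimate of epsilon."""
    disagreements = 0
    total = 0
    for trajectory in expert_trajectories:      # each trajectory is a list of states
        for s in trajectory:
            disagreements += int(learned_policy(s) != expert_policy(s))
            total += 1
    return disagreements / total

# For example, with horizon H = 20 and an estimated epsilon of 0.01, the bound gives
# |V^expert - V^cloned| <= H**2 * 0.01 = 4.0 (with per-step rewards in [0, 1]).
```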

<!-- TODO ADD DISTRIBUTION SHIFT EXAMPLE FROM SLIDES -->

## Distribution shift

Let us return to the driving analogy. Suppose you have taken some driving lessons and now feel comfortable in your neighbourhood. But today you have to travel to an area you haven't visited before, such as a highway, where it would be dangerous to try to apply the techniques you've already learned.
This is the issue of _distribution shift_: a policy learned under a certain distribution of states may not perform well if this distribution changes.

This is already a common issue in supervised learning, where the training dataset for a model might not resemble the environment where it gets deployed.
In interactive environments, this issue is further exacerbated by the dependency between the observations and the agent's behavior: if you take a wrong turn early on, it may be difficult or impossible to recover in that trajectory.

How could you learn a strategy for these new settings?
In the driving example, you might decide to install a dashcam to record the car's surroundings. That way, once you make it back to safety, you can show the recording to an expert, who can provide feedback at each step of the way.
Then the next time you go for a drive, you can remember the expert's advice and take a safer route.
You could then repeat this training as many times as desired, thereby collecting the expert's feedback over a diverse range of locations.
This is the key idea behind _dataset aggregation_.

## Dataset aggregation (DAgger)

The DAgger algorithm is due to {cite}`ross_reduction_2010`.
It assumes that we have _query access_ to the expert policy.
That is, for a given state $s$,
we can ask for the expert's action $\pi_{\text{expert}}(s)$ in that state.
We also need access to the environment for rolling out policies.
This makes DAgger an **online** algorithm,
as opposed to pure behavioral cloning,
which is **offline** since we don't need to act in the environment at all.

You can think of DAgger as a specific way of collecting the dataset $\mathcal{D}$; a Python sketch of the full loop is given after the algorithm below.

:::{prf:algorithm} DAgger

Inputs: $\pi_{\text{expert}}$, an initial policy $\pi_{\text{init}}$, the number of iterations $T$, and the number of trajectories $N$ to collect per iteration.

1. Initialize $\mathcal{D} = \{\}$ (the empty set) and $\pi = \pi_{\text{init}}$.
2. For $t = 1, \dots, T$:
   - Collect $N$ trajectories $\tau_1, \dots, \tau_N$ using the current policy $\pi$.
   - For each trajectory $\tau_n$:
     - Replace each action $a_h$ in $\tau_n$ with the **expert action** $\pi_{\text{expert}}(s_h)$.
     - Call the resulting trajectory $\tau^{\text{expert}}_n$.
   - $\mathcal{D} \gets \mathcal{D} \cup \{ \tau^{\text{expert}}_1, \dots, \tau^{\text{expert}}_N \}$.
   - Let $\pi \gets \texttt{fit}(\mathcal{D})$, where $\texttt{fit}$ is a behavioral cloning algorithm.
3. Return $\pi$.
:::
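
Here is a minimal Python sketch of this loop. The helpers `rollout` (runs a policy in the environment and returns the visited states), `expert_policy` (the queryable expert), and `fit` (a behavioral cloning routine such as the ones sketched earlier) are placeholder interfaces, not part of the original algorithm statement.

```python
def dagger(expert_policy, initial_policy, rollout, fit, T: int, N: int):
    """Dataset aggregation: repeatedly roll out the current policy,
    relabel the visited states with expert actions, and refit."""
    dataset = []                  # aggregated dataset D of (state, expert action) pairs
    policy = initial_policy
    for _ in range(T):
        for _ in range(N):
            states = rollout(policy)                      # states visited by the current policy
            # replace the actions actually taken with the expert's actions
            dataset.extend((s, expert_policy(s)) for s in states)
        policy = fit(dataset)     # behavioral cloning on the aggregated dataset
    return policy
```

Note that the dataset keeps growing across iterations, so each refit sees states visited by all previous policies; this is what distinguishes DAgger from simply re-running behavioral cloning on fresh rollouts.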

How well does DAgger perform?
We omit a proof here, but under certain assumptions,
the DAgger algorithm can better approximate the expert policy:

$$
|V^{\pi_{\text{expert}}} - V^{\pi_{\text{DAgger}}}| \le H \varepsilon
$$

where $\varepsilon$ is the "classification error" guaranteed by the supervised learning algorithm.

<!-- TODO -->

## Summary

For tasks where it is too difficult or expensive to learn from scratch,
we can instead start off with a collection of **expert demonstrations**.
Then we can use supervised learning techniques to find a policy that imitates the expert demonstrations.

The simplest way to do this is to apply a supervised learning algorithm to an already-collected dataset of expert state-action pairs.
This is called **behavioral cloning**.
However, given query access to the expert policy,
we can do better by integrating its feedback in an online loop.
The **DAgger** algorithm is one way of doing this,
where we use the expert policy to augment trajectories and then learn from this augmented dataset using behavioral cloning.

149 changes: 0 additions & 149 deletions
build/imitation_learning-bf860cb6679fb159939c7b8b45aabd4b.md
This file was deleted.