Showing 41 changed files with 1,166 additions and 921 deletions.
2 changes: 1 addition & 1 deletion
build/_assets/app-TARM6IJU.css → build/_assets/app-H3NBUYVS.css

78 changes: 39 additions & 39 deletions
build/_shared/chunk-P4DJOY6Q.js → build/_shared/chunk-JLDGA2DL.js

2 changes: 1 addition & 1 deletion
build/_shared/chunk-AC25E3GK.js → build/_shared/chunk-N544LW6X.js

Binary file not shown.
215 changes: 215 additions & 0 deletions
build/imitation_learning-bf09ff59ddcdb66b7ab3f1189910eb31.md

---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
  jupytext_version: 1.16.2
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
numbering:
  enumerator: 7.%s
---

# 7 Imitation Learning

## Introduction

Imagine you are tasked with learning how to drive. How do, or did, you go about it?
At first, this task might seem insurmountable: there is a vast array of controls, and the cost of making a single mistake could be extremely high, making it hard to explore by trial and error.
Luckily, there are already people in the world who know how to drive who can get you started.
In almost every challenge we face,
we "stand on the shoulders of giants" and learn skills from experts who have already mastered them.

![a robot imitating the pose of a young child (Photo by Pavel Danilyuk: https://www.pexels.com/photo/a-robot-imitating-a-girl-s-movement-8294811/)](./shared/robot-imitation-learning.jpg)

In machine learning,
we are often trying to teach machines to accomplish tasks that humans are already proficient at.
In such cases, the machine learning algorithm is the one learning the new skill, and humans are the "experts" who can demonstrate how to perform the task.
**Imitation learning** is a strategy for getting the learner to perform at least as well as the expert.
We'll see that the most naive form of imitation learning, called **behavioral cloning**, is really an application of supervised learning to interactive tasks.
We'll then explore **dataset aggregation** (DAgger) as a way to query an expert and learn even more effectively.

## Behavioral cloning

This notion of "learning from human-provided data" may remind you of the basic premise of [](./supervised_learning.md).
In supervised learning,
there is some mapping from _inputs_ to _outputs_,
such as the task of assigning the correct label to an image,
that humans can implicitly compute.
To teach a machine to calculate this mapping,
we first collect a large _training dataset_ by getting people to label a lot of inputs,
and then use some optimization algorithm to produce a predictor that maps from the inputs to the outputs as closely as possible.

How does this relate to interactive tasks?
Here, the input is the observation seen by the agent and the output is the action it selects,
so the mapping is the agent's _policy_.
What's stopping us from applying supervised learning techniques to mimic the expert's policy?
In principle, nothing!
This is called **behavioral cloning**.

:::{prf:definition} Behavioral cloning
:label: behavioral_cloning

1. Collect a training dataset of trajectories $\mathcal{D} = (s^n, a^n)_{n=1}^{N}$ generated by an **expert policy** $\pi_\text{expert}$. (For example, if the dataset contains $M$ trajectories, each with a finite horizon $H$, then $N = M \times H$.)
2. Use an SL algorithm $\texttt{fit} : \mathcal{D} \mapsto \widetilde{\pi}$ to extract a policy $\widetilde{\pi}$ that approximates the expert policy.
:::

Typically, this second step can be framed as **empirical loss minimization**:

:::{math}
\widetilde{\pi} = \arg\min_{\pi \in \Pi} \sum_{n=1}^{N} \text{loss}(\pi(s^n), a^n)
:::

where $\Pi$ is some class of possible policies, $\text{loss}$ is a loss function that measures how different the policy's prediction is from the true observed action,
and the SL algorithm itself, also known as the **fitting method**, tells us how to compute this $\arg\min$.
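
To make this concrete, here is a minimal sketch of behavioral cloning as empirical loss minimization. It assumes continuous actions, a linear policy class, and the squared-error loss, and the synthetic "expert" data is purely illustrative; none of these choices are prescribed by the text above.

```python
import numpy as np

def fit_bc_least_squares(states: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Fit a linear policy pi(s) = s @ W by minimizing the summed squared error
    over the expert dataset D = {(s^n, a^n)}."""
    # np.linalg.lstsq solves argmin_W sum_n || states[n] @ W - actions[n] ||^2
    W, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return W

# Synthetic "expert" data: N = 1000 state-action pairs with
# 4-dimensional states and 2-dimensional continuous actions.
rng = np.random.default_rng(0)
states = rng.normal(size=(1000, 4))
true_W = rng.normal(size=(4, 2))
actions = states @ true_W + 0.01 * rng.normal(size=(1000, 2))

W_hat = fit_bc_least_squares(states, actions)

def policy(s: np.ndarray) -> np.ndarray:
    """The cloned policy pi-tilde."""
    return s @ W_hat
```

Swapping in a different policy class or loss function only changes the `fit` step; the overall recipe of "collect expert data, then minimize empirical loss" stays the same.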

How should we choose the loss function?
In supervised learning, we saw that the **mean squared error** is a good choice for continuous outputs.
But how should we measure the difference between two actions in a _discrete_ action space?
In this setting, the policy acts more like a _classifier_ that picks the best action in a given state.
Rather than considering a deterministic policy that just outputs a single action,
we'll consider a stochastic policy $\pi$ that outputs a _distribution_ over actions.
This allows us to assign a _likelihood_ to observing the entire dataset $\mathcal{D}$ under the policy $\pi$,
assuming the state-action pairs are independent:

$$
\pr_\pi (\mathcal{D}) = \prod_{n=1}^{N} \pi(a_n \mid s_n)
$$

Note that the states and actions are _not_, however, actually independent! A key property of interactive tasks is that the agent's output -- the action that it takes -- may influence its next observation.
We want to find the policy under which the training dataset $\mathcal{D}$ is most likely.
This is called the **maximum likelihood estimate** of the policy that generated the dataset:

:::{math}
\widetilde{\pi} = \arg\max_{\pi \in \Pi} \pr_{\pi}(\mathcal{D})
:::

Equivalently, we can frame this as empirical loss minimization with the **negative log likelihood** as the loss function:

:::{math}
\begin{align*}
\widetilde{\pi} &= \arg\min_{\pi \in \Pi} - \log \pr_\pi(\mathcal{D}) \\
&= \arg\min_{\pi \in \Pi} \sum_{n=1}^N - \log \pi(a_n \mid s_n)
\end{align*}
:::
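
As a sketch of what maximum-likelihood fitting looks like in code, the following trains a softmax policy over linear state features by gradient descent on the negative log likelihood. The linear parameterization, learning rate, and number of steps are illustrative assumptions, not choices made in the text.

```python
import numpy as np

def nll(theta: np.ndarray, states: np.ndarray, actions: np.ndarray) -> float:
    """Negative log likelihood -sum_n log pi_theta(a_n | s_n)
    for a softmax policy pi_theta(a | s) proportional to exp(s @ theta[:, a])."""
    logits = states @ theta                                    # shape (N, |A|)
    log_z = np.log(np.exp(logits).sum(axis=1))                 # log normalizer per state
    return float(-(logits[np.arange(len(actions)), actions] - log_z).sum())

def fit_mle(states: np.ndarray, actions: np.ndarray, n_actions: int,
            lr: float = 0.1, steps: int = 500) -> np.ndarray:
    """Gradient descent on the mean negative log likelihood; returns theta."""
    n, d = states.shape
    theta = np.zeros((d, n_actions))
    one_hot = np.eye(n_actions)[actions]                       # (N, |A|) indicator of a_n
    for _ in range(steps):
        logits = states @ theta
        logits -= logits.max(axis=1, keepdims=True)            # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)               # pi_theta(. | s_n)
        grad = states.T @ (probs - one_hot) / n                 # gradient of mean NLL
        theta -= lr * grad
    return theta
```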

### Performance of behavioral cloning

Can we quantify how well this algorithm works?
For simplicity, let's consider the case where the action space is _finite_ and both the expert policy and learned policy are deterministic.
Suppose the learned policy obtains $\varepsilon$ _classification error_.
That is, for trajectories drawn from the expert policy,
the learned policy chooses a different action at most $\varepsilon$ of the time:

:::{math}
\mathbb{E}_{\tau \sim \rho_{\pi_{\text{expert}}}} \left[ \frac 1 \hor \sum_{\hi=0}^{\hor-1} \ind{ \widetilde{\pi}(s_\hi) \ne \pi_{\text{expert}} (s_\hi) } \right] \le \varepsilon
:::

Then their value functions differ by

:::{math}
| V^{\pi_{\text{expert}}} - V^{\widetilde{\pi}} | \le H^2 \varepsilon
:::

where $H$ is the horizon.

:::{prf:theorem} Performance of behavioral cloning

Recall that the {prf:ref}`pdl` allows us to express the difference between $\pi_{\text{expert}}$ and $\widetilde{\pi}$ as

$$
V_0^{\pi_{\text{expert}}}(s) - V_0^{\widetilde{\pi}} (s) = \E_{\tau \sim \rho^{\pi_{\text{expert}}} \mid s_0 = s} \left[ \sum_{\hi=0}^{\hor-1} A_\hi^{\widetilde{\pi}} (s_\hi, a_\hi) \right].
\label{eq:pdl-rhs}
$$

Since the expert policy is deterministic, we can substitute $a_\hi = \pi_{\text{expert}}(s_\hi)$. Determinism also implies that the advantage of the expert's own action under the expert policy is exactly zero:

$$
A^{\pi_{\text{expert}}}(s, \pi_{\text{expert}}(s)) = Q^{\pi_{\text{expert}}}(s, \pi_{\text{expert}}(s)) - V^{\pi_{\text{expert}}}(s) = 0.
$$

But the right-hand side of [](#eq:pdl-rhs) uses $A^{\widetilde{\pi}}$, not $A^{\pi_{\text{expert}}}$.
To bridge this gap,
we now use the assumption that $\widetilde{\pi}$ obtains $\varepsilon$ classification error.
Note that $A_\hi^{\widetilde{\pi}}(s_\hi, \pi_{\text{expert}}(s_\hi)) = 0$ when $\pi_{\text{expert}}(s_\hi) = \widetilde{\pi}(s_\hi)$.
When the two policies differ on $s_\hi$, which happens at most an $\varepsilon$ fraction of the time in expectation, the advantage is naively upper bounded by $H$ (assuming rewards are bounded between $0$ and $1$).
Summing over the $\hor$ timesteps, the expected number of disagreements is at most $H \varepsilon$, each contributing at most $H$, which gives the desired bound of $H^2 \varepsilon$.
:::
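
As a rough illustration (with hypothetical policy and trajectory interfaces, not defined in the text), one could estimate the classification error $\varepsilon$ on held-out expert trajectories and plug it into the bound:

```python
def estimate_classification_error(expert_trajectories, learned_policy, expert_policy):
    """Fraction of expert-visited states on which the learned policy disagrees
    with the expert: an empirical estimate of epsilon."""
    disagreements = 0
    total = 0
    for trajectory in expert_trajectories:      # each trajectory is a list of states
        for s in trajectory:
            disagreements += int(learned_policy(s) != expert_policy(s))
            total += 1
    return disagreements / total

# For example, with horizon H = 20 and an estimated epsilon of 0.01, the bound gives
# |V^expert - V^cloned| <= H**2 * 0.01 = 4.0 (with per-step rewards in [0, 1]).
```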

<!-- TODO ADD DISTRIBUTION SHIFT EXAMPLE FROM SLIDES -->

## Distribution shift

Let us return to the driving analogy. Suppose you have taken some driving lessons and now feel comfortable in your neighbourhood. But today you have to travel to an area you haven't visited before, such as a highway, where it would be dangerous to try to apply the techniques you've already learned.
This is the issue of _distribution shift_: a policy learned under a certain distribution of states may not perform well if this distribution changes.

This is already a common issue in supervised learning, where the training dataset for a model might not resemble the environment where it gets deployed.
In interactive environments, this issue is further exacerbated by the dependency between the observations and the agent's behavior: if you take a wrong turn early on, it may be difficult or impossible to recover in that trajectory.

How could you learn a strategy for these new settings?
In the driving example, you might decide to install a dashcam to record the car's surroundings. That way, once you make it back to safety, you can show the recording to an expert, who can provide feedback at each step of the way.
Then the next time you go for a drive, you can remember the expert's advice and take a safer route.
You could then repeat this training as many times as desired, thereby collecting the expert's feedback over a diverse range of locations.
This is the key idea behind _dataset aggregation_.

## Dataset aggregation (DAgger)

The DAgger algorithm is due to {cite}`ross_reduction_2010`.
It assumes that we have _query access_ to the expert policy.
That is, for a given state $s$,
we can ask for the expert's action $\pi_{\text{expert}}(s)$ in that state.
We also need access to the environment for rolling out policies.
This makes DAgger an **online** algorithm,
as opposed to pure behavioral cloning,
which is **offline** since we don't need to act in the environment at all.

You can think of DAgger as a specific way of collecting the dataset $\mathcal{D}$; a Python sketch of the full loop is given after the algorithm below.

:::{prf:algorithm} DAgger

Inputs: $\pi_{\text{expert}}$, an initial policy $\pi_{\text{init}}$, the number of iterations $T$, and the number of trajectories $N$ to collect per iteration.

1. Initialize $\mathcal{D} = \{\}$ (the empty set) and $\pi = \pi_{\text{init}}$.
2. For $t = 1, \dots, T$:
   - Collect $N$ trajectories $\tau_1, \dots, \tau_N$ using the current policy $\pi$.
   - For each trajectory $\tau_n$:
     - Replace each action $a_h$ in $\tau_n$ with the **expert action** $\pi_{\text{expert}}(s_h)$.
     - Call the resulting trajectory $\tau^{\text{expert}}_n$.
   - $\mathcal{D} \gets \mathcal{D} \cup \{ \tau^{\text{expert}}_1, \dots, \tau^{\text{expert}}_N \}$.
   - Let $\pi \gets \texttt{fit}(\mathcal{D})$, where $\texttt{fit}$ is a behavioral cloning algorithm.
3. Return $\pi$.
:::
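
Here is a minimal Python sketch of this loop. The helpers `rollout` (runs a policy in the environment and returns the visited states), `expert_policy` (the queryable expert), and `fit` (a behavioral cloning routine such as the ones sketched earlier) are placeholder interfaces, not part of the original algorithm statement.

```python
def dagger(expert_policy, initial_policy, rollout, fit, T: int, N: int):
    """Dataset aggregation: repeatedly roll out the current policy,
    relabel the visited states with expert actions, and refit."""
    dataset = []                  # aggregated dataset D of (state, expert action) pairs
    policy = initial_policy
    for _ in range(T):
        for _ in range(N):
            states = rollout(policy)                      # states visited by the current policy
            # replace the actions actually taken with the expert's actions
            dataset.extend((s, expert_policy(s)) for s in states)
        policy = fit(dataset)     # behavioral cloning on the aggregated dataset
    return policy
```

Note that the dataset keeps growing across iterations, so each refit sees states visited by all previous policies; this is what distinguishes DAgger from simply re-running behavioral cloning on fresh rollouts.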

How well does DAgger perform?
We omit a proof here, but under certain assumptions,
the DAgger algorithm can better approximate the expert policy:

$$
|V^{\pi_{\text{expert}}} - V^{\pi_{\text{DAgger}}}| \le H \varepsilon
$$

where $\varepsilon$ is the "classification error" guaranteed by the supervised learning algorithm.

<!-- TODO -->

## Summary

For tasks where it is too difficult or expensive to learn from scratch,
we can instead start off with a collection of **expert demonstrations**.
Then we can use supervised learning techniques to find a policy that imitates the expert demonstrations.

The simplest way to do this is to apply a supervised learning algorithm to an already-collected dataset of expert state-action pairs.
This is called **behavioral cloning**.
However, given query access to the expert policy,
we can do better by integrating its feedback in an online loop.
The **DAgger** algorithm is one way of doing this,
where we use the expert policy to augment trajectories and then learn from this augmented dataset using behavioral cloning.

149 changes: 0 additions & 149 deletions
build/imitation_learning-bf860cb6679fb159939c7b8b45aabd4b.md
This file was deleted.