---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.2
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Imitation Learning

Imagine you are tasked with learning how to drive. How do, or did, you go about it?
At first, this task might seem insurmountable: there is a vast array of controls, and the cost of making a single mistake could be extremely high, making it hard to explore by trial and error.
Luckily, there are already people in the world who know how to drive and can get you started.
In this and many other examples, we all "stand on the shoulders of giants" and learn skills from experts who have already mastered them.

Now, in machine learning, much of the time we are trying to teach machines to accomplish tasks that we humans are already proficient at.
In such cases, the machine learning algorithm is the one learning the new skill, and humans are the "experts" who can demonstrate how to perform the task.
**Imitation learning** is a direct application of this idea to machine learning for interactive tasks.
We'll see that the most naive form of imitation learning, called **behavioural cloning**, is really an application of supervised learning to interactive tasks.
We'll then explore **dataset aggregation** (DAgger) as a way to query an expert and learn even more effectively.

## Behavioural cloning

This notion of "learning from human-provided data" may remind you of the basic premise of {ref}`supervised_learning`,
in which there is some mapping from _inputs_ to _outputs_ that we humans can implicitly compute, such as seeing a photo and recognizing the objects in it.
To teach a machine to calculate this mapping, we first collect a large _training dataset_ by getting people to label a lot of inputs,
and then use some optimization algorithm to produce a predictor that maps from the inputs to the outputs as closely as possible.
How does this relate to interactive tasks?
Here, the input is the observation seen by the agent and the output is the action it selects, so the mapping is the agent's policy.
What's stopping us from applying supervised learning techniques?
In practice, nothing! This is called **behavioural cloning**.

:::{prf:algorithm} Behavioural cloning
1. Collect a training dataset of trajectories generated by an expert policy $\pi_\text{data}$. Here, we treat each state-action pair as independent, resulting in a dataset $\mathcal{D} = (s^n, a^n)_{n=0}^{N-1}$. (For concreteness, if there are $M$ trajectories with horizon $H$, then $N = M \times H$.)
   - Note that treating the samples as independent is an inaccurate approximation! A key property of interactive tasks is that the agent's output -- the action that it takes -- may influence its next observation.
2. Use a supervised learning (SL) algorithm $\texttt{fit} : \mathcal{D} \mapsto \tilde \pi$ to extract a policy $\tilde \pi$ that approximates the expert policy.
:::

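To make this concrete, here is a minimal sketch of behavioural cloning in Python. The `rollout_expert` function, assumed to return one expert trajectory as a list of (state, action) pairs, is a hypothetical placeholder for however the expert data is collected, and the scikit-learn classifier stands in for the generic `fit` routine; any supervised learner would do.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def collect_expert_dataset(rollout_expert, n_trajectories: int):
    """Flatten expert trajectories into independent (state, action) pairs."""
    states, actions = [], []
    for _ in range(n_trajectories):
        for state, action in rollout_expert():  # hypothetical data source
            states.append(state)
            actions.append(action)
    return np.array(states), np.array(actions)


def behavioural_cloning(rollout_expert, n_trajectories: int = 100):
    """Fit a policy to the expert dataset via plain supervised learning."""
    X, y = collect_expert_dataset(rollout_expert, n_trajectories)
    classifier = LogisticRegression(max_iter=1_000).fit(X, y)
    # The learned policy simply maps an observation to the predicted action.
    return lambda state: classifier.predict(np.asarray(state).reshape(1, -1))[0]
```
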
Typically, this second step can be framed as **empirical loss minimization**:

:::{math}
\tilde \pi = \arg\min_{\pi \in \Pi} \sum_{n=0}^{N-1} \text{loss}(\pi(s^n), a^n)
:::

where $\Pi$ is some class of possible policies, $\text{loss}$ is the loss function measuring how far off the policy's prediction is, and the SL algorithm tells us how to compute this $\arg\min$.
If we are training a deterministic policy, i.e. a function from inputs to outputs with no randomness, we might try to minimize the **mean squared error**.
More generally, though, we often choose the **negative log likelihood** as our loss function, so that the optimization is equivalent to **maximum likelihood estimation**:
out of the space of all possible mappings, we search for the one according to which the training dataset is the most likely.

:::{math}
\tilde \pi = \arg\max_{\pi \in \Pi} \Pr_{a^n \sim \pi(s^n)}(a^{0:N} \mid s^{0:N})
:::

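As a sketch of what maximum likelihood estimation looks like in code, the snippet below fits a softmax policy over a discrete action space by minimizing the average negative log likelihood of the expert's actions. The choice of PyTorch and a single linear layer mapping states to action logits is illustrative, not something the chapter prescribes.

```python
import torch
import torch.nn as nn


def fit_mle_policy(states, actions, n_actions: int, n_epochs: int = 100, lr: float = 1e-2):
    """Maximum likelihood estimation of a softmax policy pi(a | s)."""
    X = torch.as_tensor(states, dtype=torch.float32)  # shape (N, state_dim)
    y = torch.as_tensor(actions, dtype=torch.long)    # shape (N,)
    policy_logits = nn.Linear(X.shape[1], n_actions)  # logits of pi(. | s)
    # CrossEntropyLoss on logits is exactly the average negative log
    # likelihood of the observed actions under the softmax policy.
    nll = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(policy_logits.parameters(), lr=lr)
    for _ in range(n_epochs):
        optimizer.zero_grad()
        loss = nll(policy_logits(X), y)
        loss.backward()
        optimizer.step()
    return policy_logits
```
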
Can we quantify how well this algorithm works?
For simplicity, let's consider the case where the action space is discrete and both the data and trained policy are deterministic.
(This corresponds to a classification task in SL.)
Suppose the SL algorithm obtains $\varepsilon$ classification error.
That is, for trajectories drawn from the expert policy,
the learned policy chooses a different action at most $\varepsilon$ of the time:

:::{math}
\mathbb{E}_{\tau \sim \rho_{\pi_{\text{data}}}} \left[ \frac 1 \hor \sum_{\hi=0}^{\hor-1} \ind{ \tilde \pi(s_\hi) \ne \pi_{\text{data}} (s_\hi) } \right] \le \varepsilon
:::

Then, their value functions differ by

:::{math}
| V^{\pi_{\text{data}}} - V^{\tilde \pi} | \le H^2 \varepsilon
:::

where $H$ is the horizon.

:::{prf:theorem} Performance of behavioural cloning

Recall that the {prf:ref}`pdl` allows us to express the difference between $\pi_{\text{data}}$ and $\tilde \pi$ as

$$
V_0^{\pi_{\text{data}}}(s) - V_0^{\tilde \pi} (s) = \E_{\tau \sim \rho^{\pi_{\text{data}}} \mid s_0 = s} \left[ \sum_{\hi=0}^{\hor-1} A_\hi^{\tilde \pi} (s_\hi, a_\hi) \right].
$$

Now since the data policy is deterministic, we can substitute $a_\hi = \pi_{\text{data}}(s_\hi)$.
This allows us to make a further simplification:
since $\tilde \pi$ is also deterministic, the advantage of its own action is zero, i.e.

$$
A^{\tilde \pi}(s, \tilde \pi(s)) = Q^{\tilde \pi}(s, \tilde \pi(s)) - V^{\tilde \pi}(s) = 0.
$$

Now we can use the assumption that the SL algorithm obtains $\varepsilon$ classification error. By the above, $A_\hi^{\tilde \pi}(s_\hi, \pi_{\text{data}}(s_\hi)) = 0$ whenever $\pi_{\text{data}}(s_\hi) = \tilde \pi(s_\hi)$. In the case where the two policies differ on $s_\hi$, which occurs with probability at most $\varepsilon$ on average across timesteps, the advantage is naively upper bounded by $H$ (assuming rewards are bounded between $0$ and $1$). Taking the final sum gives the desired bound.
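
Spelling out that final sum: the per-step disagreement probabilities add up to at most $H \varepsilon$ by the classification-error assumption, and each disagreement contributes an advantage of at most $H$, so

$$
V_0^{\pi_{\text{data}}}(s) - V_0^{\tilde \pi}(s) \le \sum_{\hi=0}^{\hor-1} \Pr_{\tau \sim \rho^{\pi_{\text{data}}}} \big( \tilde \pi(s_\hi) \ne \pi_{\text{data}}(s_\hi) \big) \cdot H \le H \cdot (H \varepsilon) = H^2 \varepsilon.
$$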
:::

<!-- TODO ADD DISTRIBUTION SHIFT EXAMPLE FROM SLIDES -->

## Distribution shift

Let us return to the driving analogy. Suppose you have taken some driving lessons and now feel comfortable in your neighbourhood. But today you have to travel to an area you haven't visited before, such as a highway, where it would be dangerous to try to apply the techniques you've already learned.
This is the issue of _distribution shift_: a policy learned under some distribution of states may not perform well if this distribution changes.

This is already a common issue in supervised learning, where the training dataset for a model might not resemble the environment where it gets deployed. In interactive environments, this issue is further exacerbated by the dependency between the observations and the agent's behaviour; if you take a wrong turn early on, it may be difficult or impossible to recover in that trajectory.

How could you learn a strategy for these new settings?
In the driving example, you might decide to install a dashcam to record the car's surroundings. That way, once you make it back to safety, you can show the recording to an expert, who can provide feedback at each step of the way.
Then the next time you go for a drive, you can remember the expert's advice, and take a safer route.
You could then repeat this training as many times as desired, thereby collecting the expert's feedback over a diverse range of locations.
This is the key idea behind _dataset aggregation_.

## Dataset aggregation (DAgger)

The DAgger algorithm is due to {cite}`ross_reduction_2010`. In pseudocode:

```python
def dagger_pseudocode(
    env,  # an episodic environment with horizon env.H
    π_init: Policy,
    π_expert: Policy,
    n_dagger_iterations: int,
    n_trajectories_per_iteration: int,
):
    """DAgger: roll out the learner, relabel the visited states with expert
    actions, aggregate all relabelled trajectories, and refit the policy."""
    π = π_init
    dataset = []  # aggregated across iterations, hence "dataset aggregation"

    for _ in range(n_dagger_iterations):
        for _ in range(n_trajectories_per_iteration):
            # Visit the states reached by the *current learner*...
            τ = collect_trajectory(π, env)
            # ...but label each of them with the expert's action.
            for step in range(env.H):
                obs = τ.state[step]
                τ.action[step] = π_expert(obs)
            dataset.append(τ)

        # Refit on the entire aggregated dataset, not just the latest batch.
        π = fit(dataset)

    return π
```
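
Two design choices in this pseudocode are worth highlighting. The states in the aggregated dataset are the ones visited by the learner's own rollouts, so the expert's relabelled actions cover exactly the distribution of states the learner actually encounters, which is what addresses the distribution shift issue above. Moreover, each round's relabelled trajectories are added to all previously collected data before refitting (hence "dataset aggregation"), rather than training on the latest batch alone.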