diff --git a/book/_toc.yml b/book/_toc.yml index 7a76bf8..8ecb5f0 100644 --- a/book/_toc.yml +++ b/book/_toc.yml @@ -4,7 +4,7 @@ format: jb-book root: index.md options: - numbered: 2 + numbered: true chapters: - file: intro.md - file: bandits.md @@ -13,6 +13,7 @@ chapters: - file: control.md - file: pg.md - file: exploration.md + - file: imitation_learning.md # - file: challenges # - file: appendix - file: bibliography.md diff --git a/book/bandits.md b/book/bandits.md index 56c875c..9645371 100644 --- a/book/bandits.md +++ b/book/bandits.md @@ -484,6 +484,10 @@ plot_strategy(mab, agent) Note that we let $\epsilon$ vary over time. In particular, we might want to gradually *decrease* $\epsilon$ as we learn more about the reward distributions and no longer need to spend time exploring. +:::{attention} +What is the expected regret of the algorithm if we set $\epsilon$ to be a constant? +::: + It turns out that setting $\epsilon_t = \sqrt[3]{K \ln(t)/t}$ also achieves a regret of $\tilde O(t^{2/3} K^{1/3})$ (ignoring the logarithmic factors). (We will not prove this here.) TODO ADD PROOF CITATION In ETC, we had to set $N_{\text{explore}}$ based on the total number of timesteps $T$. But the epsilon-greedy algorithm actually handles the exploration *automatically*: the regret rate holds for *any* $t$, and doesn’t depend on the final horizon $T$. @@ -852,10 +856,11 @@ This has the closed-form solution known as the *ordinary least squares* :::{math} :label: ols_bandit + \begin{aligned} - \hat \theta_t^k & = (A_t^k)^{-1} \sum_{\{ i \in [t] : a_i = k \}} x_i r_i \\ - \text{where} \quad A_t^k & = \sum_{\{ i \in [t] : a_i = k \}} x_i x_i^\top. - \end{aligned} + \hat \theta_t^k & = (A_t^k)^{-1} \sum_{\{ i \in [t] : a_i = k \}} x_i r_i \\ + \text{where} \quad A_t^k & = \sum_{\{ i \in [t] : a_i = k \}} x_i x_i^\top. +\end{aligned} ::: We can now apply the UCB algorithm in this environment in order to @@ -877,7 +882,9 @@ $$|Y| \le \beta \sigma \quad \text{with probability} \ge 1 - \frac{1}{\beta^2}$$ Since the OLS estimator is known to be unbiased (try proving this yourself), we can apply Chebyshev's inequality to -$x_t^\top (\hat \theta_t^k - \theta^k)$: $$\begin{aligned} +$x_t^\top (\hat \theta_t^k - \theta^k)$: + +$$\begin{aligned} x_t^\top \theta^k \le x_t^\top \hat \theta_t^k + \beta \sqrt{x_t^\top (A_t^k)^{-1} x_t} \quad \text{with probability} \ge 1 - \frac{1}{\beta^2} \end{aligned}$$ @@ -901,7 +908,7 @@ We can now substitute these quantities into UCB to get the **LinUCB** algorithm: ```{code-cell} -class LinUCB(Agent): +class LinUCBPseudocode(Agent): def __init__( self, K: int, T: int, D: int, lam: float, get_c: Callable[[int], float] ): @@ -944,6 +951,4 @@ regret bound. The full details of the analysis can be found in Section 3 of {cit ## Summary -```{code-cell} -``` diff --git a/book/exploration.md b/book/exploration.md index b3129f4..7e4fd2a 100644 --- a/book/exploration.md +++ b/book/exploration.md @@ -22,7 +22,11 @@ In the multi-armed bandits chapter {ref}`bandits`, where the state never changes :::{prf:definition} Per-episode regret :label: per_episode_regret -To quantify the performance of a learning algorithm, we will consider its per-episode regret over $T$ timesteps/episodes: $$\text{Regret}_T = \E\left[ \sum_{t=0}^{T-1} V^\star_0(s_0) - V^{\pi^t}_0(s_0) \right]$$ where $\pi^t$ is the policy generated by the algorithm at the $t$th iteration. 
+To quantify the performance of a learning algorithm, we will consider its per-episode regret over $T$ timesteps/episodes: + +$$\text{Regret}_T = \E\left[ \sum_{t=0}^{T-1} V^\star_0(s_0) - V^{\pi^t}_0(s_0) \right]$$ + +where $\pi^t$ is the policy generated by the algorithm at the $t$th iteration. ::: ### Sparse reward @@ -57,6 +61,10 @@ $K \gets \emptyset$ Using our known transitions $K$, compute the shortest path $ The shortest path computation can be implemented using DP. We leave this as an exercise. :::: +```{code-cell} +def explore_then_exploit(mdp: MDP): +``` + :::{prf:theorem} Performance of explore-then-exploitexplore_then_exploit_performance As long as every state can be reached from $s_0$ within a single episode, i.e. $|\mathcal{S}| \le \hor$, this will eventually be able to explore all $|\mathcal{S}| |\mathcal{A}|$ state-action pairs, adding one new transition per episode. We know it will take at most $|\mathcal{S}| |\mathcal{A}|$ iterations to explore the entire MDP, after which $\pi^t = \pi^\star$, incurring no additional regret. For each $\pi^t$ up until then, corresponding to the shortest-path policies $\tilde \pi$, the value of policy $\pi^t$ will differ from that of $\pi^\star$ by at most $\hor$, since the policies will differ by at most $1$ reward at each timestep. So, $$\sum_{t=0}^{T-1} V^\star_0 - V_0^{\pi^t} \le |\mathcal{S}||\mathcal{A}| \hor.$$ (Note that this MDP and algorithm are deterministic, so the regret is not random.) ::: @@ -208,11 +216,16 @@ A polynomial dependency on $|\mathcal{S}|$ and $|\mathcal{A}|$ is manageable whe :::{prf:definition} Linear MDP :label: linear_mdp -We assume that the transition probabilities and rewards are *linear* in some feature vector $\phi(s, a) \in \mathbb{R}^d$: $$\begin{aligned} +We assume that the transition probabilities and rewards are *linear* in some feature vector + +$\phi(s, a) \in \mathbb{R}^d$: + +$$\begin{aligned} P_\hi(s' \mid s, a) & = \phi(s, a)^\top \mu^\star_\hi(s') \\ r_\hi(s, a) & = \phi(s, a)^\top \theta_\hi^\star - -\end{aligned}$$ Note that we can also think of $P_\hi(\cdot \mid s, a) = \mu_\hi^\star$ as an $|\mathcal{S}| \times d$ matrix, and think of $\mu^\star_\hi(s')$ as indexing into the $s'$-th row of this matrix (treating it as a column vector). Thinking of $V^\star_{\hi+1}$ as an $|\mathcal{S}|$-dimensional vector, this allows us to write $$\E_{s' \sim P_\hi(\cdot \mid s, a)}[V^\star_{\hi+1}(s)] = (\mu^\star_\hi \phi(s, a))^\top V^\star_{\hi+1}.$$ The $\phi$ feature mapping can be designed to capture interactions between the state $s$ and action $a$. In this book, we'll assume that the feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ and the reward function (described by $\theta_\hi^\star$) are known to the learner. +\end{aligned}$$ + +Note that we can also think of $P_\hi(\cdot \mid s, a) = \mu_\hi^\star$ as an $|\mathcal{S}| \times d$ matrix, and think of $\mu^\star_\hi(s')$ as indexing into the $s'$-th row of this matrix (treating it as a column vector). Thinking of $V^\star_{\hi+1}$ as an $|\mathcal{S}|$-dimensional vector, this allows us to write $$\E_{s' \sim P_\hi(\cdot \mid s, a)}[V^\star_{\hi+1}(s)] = (\mu^\star_\hi \phi(s, a))^\top V^\star_{\hi+1}.$$ The $\phi$ feature mapping can be designed to capture interactions between the state $s$ and action $a$. In this book, we'll assume that the feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ and the reward function (described by $\theta_\hi^\star$) are known to the learner. 
::: ### Planning in a linear MDP diff --git a/book/imitation_learning.md b/book/imitation_learning.md new file mode 100644 index 0000000..31df7bd --- /dev/null +++ b/book/imitation_learning.md @@ -0,0 +1,141 @@ +--- +jupytext: + text_representation: + extension: .md + format_name: myst + format_version: 0.13 + jupytext_version: 1.16.2 +kernelspec: + display_name: Python 3 (ipykernel) + language: python + name: python3 +--- + +# Imitation Learning + +Imagine you are tasked with learning how to drive. How do, or did, you go about it? +At first, this task might seem insurmountable: there is a vast array of controls, and the cost of making a single mistake could be extremely high, making it hard to explore by trial and error. +Luckily, there are already people in the world who know how to drive who can get you started. +In this and many other examples, we all "stand on the shoulders of giants" and learn skills from experts who have already mastered them. + +Now, in machine learning, much of the time, we are trying to teach machines to accomplish tasks that we humans are already proficient at. +In such cases, the machine learning algorithm is the one learning the new skill, and humans are the "experts" that can demonstrate how to perform the task. +**Imitation learning** is a direct application of this idea to machine learning for interactive tasks. +We'll see that the most naive form of imitation learning, called **behavioural cloning**, is really an application of supervised learning to interactive tasks. +We'll then explore **dataset aggregation** (DAgger) as a way to query an expert and learn even more effectively. + +## Behavioural cloning + +This notion of "learning from human-provided data" may remind you of the basic premise of {ref}`supervised_learning`, +in which there is some mapping from _inputs_ to _outputs_ that we humans can implicitly compute, such as seeing a photo and being able to recognize its constituents. +To teach a machine to calculate this mapping, we first collect a large _training dataset_ by getting people to label a lot of inputs, +and then use some optimization algorithm to produce a predictor that maps from the inputs to the outputs as closely as possible. +How does this relate to interactive tasks? +Here, the input is the observation seen by the agent and the output is the action it selects, so the mapping is the agent's policy. +What's stopping us from applying supervised learning techniques? +In practice, nothing! This is called **behavioural cloning.** + +:::{prf:algorithm} Behavioural cloning +1. Collect a training dataset of trajectories generated by an expert policy $\pi_\text{data}$. Here, we treat each state-action pair as independent, resulting in a dataset $\mathcal{D} = (s^n, a^n)_{n=1}^{N}$. (For concreteness, if there are $M$ trajectories with a horizon $H$, then $N = M \times H$.) + - Note that this is an inaccurate approximation! A key property of interactive tasks is that the agent's output -- the action that it takes -- may influence its next observation. +2. Use an SL algorithm $\texttt{fit} : \mathcal{D} \mapsto \tilde \pi$ to extract a policy $\tilde \pi$ that approximates the expert policy. +:::
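As a concrete illustration, here is a minimal sketch of these two steps for a tabular setting with discrete states and actions. It is only a sketch: it reuses the `collect_trajectory` helper and trajectory attributes that appear in the DAgger pseudocode later in this chapter (which are not defined here), and it uses the 0-1 loss, so that `fit` simply returns the expert's most frequent action in each state.

```python
import numpy as np

def collect_bc_dataset(π_data, env, n_trajectories: int) -> list[tuple[int, int]]:
    """Step 1: roll out the expert and flatten its trajectories
    into independent (state, action) pairs."""
    dataset = []
    for _ in range(n_trajectories):
        τ = collect_trajectory(π_data, env)  # assumed helper, as in the DAgger pseudocode below
        for step in range(env.H):
            dataset.append((τ.state[step], τ.action[step]))
    return dataset

def fit(dataset: list[tuple[int, int]], n_states: int, n_actions: int) -> np.ndarray:
    """Step 2: behavioural cloning under the 0-1 loss,
    i.e. pick the expert's most frequent action in each state."""
    counts = np.zeros((n_states, n_actions))
    for s, a in dataset:
        counts[s, a] += 1
    return counts.argmax(axis=1)  # deterministic tabular policy: state index -> action index
```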
+::: + +Typically, this second task can be framed as **empirical loss minimization**: + +:::{math} +\tilde \pi = \arg\min_{\pi \in \Pi} \sum_{n=0}^{N-1} \text{loss}(\pi(s^n), a^n) +::: + +where $\Pi$ is some class of possible policies, $\text{loss}$ is the loss function to measure how far off the policy's prediction is, and the SL algorithm tells us how to compute this $\arg\min$. +If training a deterministic policy that is just a function from inputs to outputs with no randomness, we might try to minimize the **mean squared error**. +More generally, though, we often choose the **negative log likelihood** as our loss function, so that the optimization is equivalent to **maximum likelihood estimation**: +out of the space of all possible mappings, we search for the one according to which the training dataset is the most likely. + +:::{math} +\tilde \pi = \arg\max_{\pi \in \Pi} \Pr_{a^n \sim \pi(s^n)}(a^{0:N} \mid s^{0:N}) +::: + +Can we quantify how well this algorithm works? +For simplicity, let's consider the case where the action space is discrete and both the data and trained policy are deterministic. +(This corresponds to a classification task in SL.) +Suppose the SL algorithm obtains $\varepsilon$ classification error. +That is, for trajectories drawn from the expert policy, +the learned policy chooses a different action at most $\varepsilon$ of the time: + +:::{math} +\mathbb{E}_{\tau \sim \rho_{\pi_{\text{data}}}} \left[ \frac 1 \hor \sum_{\hi=0}^{\hor-1} \ind{ \tilde \pi(s_\hi) \ne \pi_{\text{data}} (s_\hi) } \right] \le \varepsilon +::: + +Then, their value functions differ by + +:::{math} +| V^{\pi_{\text{data}}} - V^{\tilde \pi} | \le H^2 \varepsilon +::: + +where $H$ is the horizon. + +:::{prf:theorem} Performance of behavioural cloning + +Recall the {prf:ref}`pdl` allows us to express the difference between $\pi_{\text{data}}$ and $\tilde \pi$ as + +$$ +V_0^{\pi_{\text{data}}}(s) - V_0^{\tilde \pi} (s) = \E_{\tau \sim \rho^{\pi_{\text{data}}} \mid s_0 = s} \left[ \sum_{\hi=0}^{\hor-1} A_\hi^{\tilde \pi} (s_\hi, a_\hi) \right]. +$$ + +Now since the data policy is deterministic, we can substitute $a_\hi = \pi_{\text{data}}(s_\hi)$. +This allows us to make a further simplification: +since $\pi_{\text{data}}$ is deterministic, we have + +$$ +A^{\pi_{\text{data}}}(s, \pi_{\text{data}}(s)) = Q^{\pi_{\text{data}}}(s, \pi_{\text{data}}(s)) - V^{\pi_{\text{data}}}(s) = 0. +$$ + +Now we can use the assumption that the SL algorithm obtains $\varepsilon$ classification error. By the above, $A_\hi^{\tilde \pi}(s_\hi, \pi_{\text{data}}(s_\hi)) = 0$ when $\pi_{\text{data}}(s_\hi) = \tilde \pi(s_\hi)$. In the case where the two policies differ on $s_\hi$, which occurs with probability $\varepsilon$, the advantage is naively upper bounded by $H$ (assuming rewards are bounded between $0$ and $1$). Taking the final sum gives the desired bound. +::: + + + +## Distribution shift + +Let us return to the driving analogy. Suppose you have taken some driving lessons and now feel comfortable in your neighbourhood. But today you have to travel to an area you haven't visited before, such as a highway, where it would be dangerous to try and apply the techniques you've already learned. +This is the issue of _distribution shift_: a policy learned under some distribution of states may not perform well if this distribution changes. + +This is already a common issue in supervised learning, where the training dataset for a model might not resemble the environment where it gets deployed. 
+ +How could you learn a strategy for these new settings? +In the driving example, you might decide to install a dashcam to record the car's surroundings. That way, once you make it back to safety, you can show the recording to an expert, who can provide feedback at each step of the way. +Then the next time you go for a drive, you can remember the expert's advice, and take a safer route. +You could then repeat this training as many times as desired, thereby collecting the expert's feedback over a diverse range of locations. +This is the key idea behind _dataset aggregation_. + +## Dataset aggregation (DAgger) + +The DAgger algorithm, due to {cite}`ross_reduction_2010`, makes this idea concrete: we repeatedly roll out the current policy, ask the expert to relabel the visited states with the actions they would have chosen, and retrain on all of the data collected so far. In pseudocode: + +```python +def dagger_pseudocode( + env: MDP, + π_init: Policy, + π_expert: Policy, + n_dagger_iterations: int, + n_trajectories_per_iteration: int +): + π = π_init + dataset = set() + + for _ in range(n_dagger_iterations): + for __ in range(n_trajectories_per_iteration): + # roll out the current policy, then relabel each visited state with the expert's action + τ = collect_trajectory(π, env) + for step in range(env.H): + obs = τ.state[step] + τ.action[step] = π_expert(obs) + dataset.add(τ) + + # retrain on all of the expert-relabelled data collected so far + π = fit(dataset) + + return π +``` + + + diff --git a/book/pg.md b/book/pg.md index 0e9b064..9b5e21c 100644 --- a/book/pg.md +++ b/book/pg.md @@ -46,8 +46,8 @@ Remember that in reinforcement learning, the goal is to *maximize reward.* Speci J(\theta) := \E_{s_0 \sim \mu_0} V^{\pi_\theta} (s_0) = & \E \sum_{t=0}^{T-1} r_t \\ \text{where} \quad & s_0 \sim \mu_0 \\ & s_{t+1} \sim P(s_t, a_t), \\ - & a_h = \pi_\theta(s_h) \\ - & r_h = r(s_h, a_h). + & a_\hi = \pi_\theta(s_\hi) \\ + & r_\hi = r(s_\hi, a_\hi). \end{split} ::: @@ -158,7 +158,7 @@ Note that to avoid correlations between the gradient estimator and the value est Policy gradient with a learned baselinepg_baseline The baseline estimation step can be done using any appropriate supervised learning algorithm. Note that the gradient estimator will be unbiased regardless of the baseline. @@ -229,20 +229,46 @@ To analyze the difference between them, we'll make use of the **performance diff :::{prf:theorem} Performance difference lemma :label: pdl -Let $\rho_{\pi, s}$ denote the distribution induced by the policy $\pi$ over trajectories starting in state $s$. +Suppose Beatrice and Joan are playing a game and want to compare their average rewards starting in state $s$. +However, only Beatrice is allowed to take actions, while Joan can evaluate those actions from her own perspective. That is, she knows how good Beatrice's action is compared to Joan's own typical strategy in that state. (This is her _advantage function_ $A_\hi^{\text{Joan}}(s_\hi, a_\hi)$). -Given two policies $\pi, \tilde pi$, the PDL allows us to express the difference between their value functions as follows: +The performance difference lemma says that this is all they need to compare themselves! That is, -$$V_0^{\tilde \pi}(s) - V_0^\pi(s) = \E_{\tau \sim \rho_{\tilde \pi, s}} \left[ \sum_{h=0}^{H-1} A_h^\pi (s_h, a_h) \right]$$ +:::{math} +:label: pdl_eq +V_0^{\text{Beatrice}}(s) - V_0^{\text{Joan}}(s) = \E_{\tau \sim \rho_{\text{Beatrice}, s}} \left[ \sum_{\hi=0}^{\hor-1} A_\hi^{\text{Joan}} (s_\hi, a_\hi) \right] +::: + +where $\rho_{\text{Beatrice}, s}$ denotes the distribution over trajectories starting in state $s$ when Beatrice is playing.
+ +To see why, consider just a single step $\hi$ of the trajectory. At this step, we compute how much better the action Beatrice takes is than what Joan would typically have done in that state, on average. But this is exactly the average Joan-evaluated advantage of Beatrice's actions, as described in the PDL! + +Formally, this corresponds to a nice telescoping simplification when we expand out the definition of the advantage function. Note that + +$$ +\begin{align*} +A^\pi_\hi(s_\hi, a_\hi) &= Q^\pi_\hi(s_\hi, a_\hi) - V^\pi_\hi(s_\hi) \\ +&= r_\hi(s_\hi, a_\hi) + \E_{s_{\hi+1} \sim P(s_\hi, a_\hi)} [V^\pi_{\hi+1}(s_{\hi+1})] - V^\pi_\hi(s_\hi) +\end{align*} +$$ + +so expanding out the r.h.s. expression of {eq}`pdl_eq` and grouping terms together gives + +$$ +\begin{align*} +\E_{\tau \sim \rho_{\text{Beatrice}, s}} \left[ \sum_{\hi=0}^{\hor-1} A_\hi^{\text{Joan}} (s_\hi, a_\hi) \right] &= \E_{\tau \sim \rho_{\text{Beatrice}, s}} \left[ \left( \sum_{\hi=0}^{\hor-1} r_\hi(s_\hi, a_\hi) \right) + \left( V^{\text{Joan}}_1(s_1) + \cdots + V^{\text{Joan}}_\hor(s_\hor) \right) - \left( V^{\text{Joan}}_0(s_0) + \cdots + V^{\text{Joan}}_{\hor-1}(s_{\hor-1}) \right) \right] \\ +&= V^{\text{Beatrice}}_0(s) - V^{\text{Joan}}_0(s) +\end{align*} +$$ -Some intuition: Recall that $A^\pi_h(s, a)$ tells us how much better the action $a$ is in state $s$ than average, supposing actions are chosen according to $\pi$. How much better is $\tilde \pi$ than $\pi$? To answer this, we break down the trajectory step-by-step. At each step, we compute how much better actions from $\tilde \pi$ are than the actions from $\pi$. But this is exactly the average $\pi$-advantage, where the expectation is taken over actions from $\tilde \pi$. This is exactly what the PDL describes. +as desired, since the intermediate value terms cancel and $V^{\text{Joan}}_\hor \equiv 0$. (Note that the "inner" expectation from expanding the advantage function has the same distribution as the outer one, so omitting it here is valid.) ::: Let's analyze why fitted approaches such as PI don't work as well in the RL setting. To start, let's ask, where *do* fitted approaches work well? They are commonly seen in *supervised learning*, where a prediction rule is fit using some labelled training set, and then assessed on a test set from the same distribution. Does this assumption still hold when doing PI? -Let's consider a single iteration of PI. Suppose the new policy $\tilde \pi$ chooses some action with a negative advantage w.r.t. $\pi$. Define $\Delta_\infty = \min_{s \in \mathcal{S}} A^{\pi}_h(s, \tilde \pi(s))$. If this is negative, then the PDL shows that there may exist some state $s$ and time $h$ such that +Let's consider a single iteration of PI. Suppose the new policy $\tilde \pi$ chooses some action with a negative advantage w.r.t. $\pi$. Define $\Delta_\infty = \min_{s \in \mathcal{S}} A^{\pi}_\hi(s, \tilde \pi(s))$. If this is negative, then the PDL only guarantees the lower bound -$$V_h^{\tilde \pi}(s) \ge V_h^{\pi}(s) - H \cdot |\Delta_\infty|.$$ +$$V_\hi^{\tilde \pi}(s) \ge V_\hi^{\pi}(s) - H \cdot |\Delta_\infty|,$$ +that is, there may be some state $s$ and time $\hi$ at which the new policy performs worse by nearly $H \cdot |\Delta_\infty|$. In general, PI cannot avoid particularly bad situations where the new policy $\tilde \pi$ often visits these bad states, causing an actual degradation. It does not enforce that the trajectory distributions $\rho_\pi$ and $\rho_{\tilde \pi}$ be close to each other.
In other words, the "training distribution" that our prediction rule is fitted on, $\rho_\pi$, may differ significantly from the "evaluation distribution" $\rho_{\tilde \pi}$ --- we must address this issue of *distributional shift*. @@ -264,13 +290,13 @@ Additionally, rather than estimating the $Q$-function of the current policy, we :label: trpo Note that the objective function is not identical to the r.h.s. of the Performance Difference Lemma. Here, we still use the *states* sampled from the old policy, and only use the *actions* from the new policy. This is because it would be computationally infeasible to sample entire trajectories from $\pi_\theta$ as we are optimizing over $\theta$. This approximation is also reasonable in the sense that it matches the r.h.s. of the Performance Difference Lemma to first order in $\theta$. (We will elaborate more on this later.) :::: -Both the objective function and the KLD constraint involve a weighted average over the space of all trajectories. This is intractable in general, so we need to estimate the expectation. As before, we can do this by taking an empirical average over samples from the trajectory distribution. However, the inner expectation over $a_h \sim \pi_{\theta}$ involves the optimizing variable $\theta$, and we'd like an expression that has a closed form in terms of $\theta$ to make optimization tractable. Otherwise, we'd need to resample many times each time we made an update to $\theta$. To address this, we'll use a common technique known as **importance sampling**. +Both the objective function and the KLD constraint involve a weighted average over the space of all trajectories. This is intractable in general, so we need to estimate the expectation. As before, we can do this by taking an empirical average over samples from the trajectory distribution. However, the inner expectation over $a_\hi \sim \pi_{\theta}$ involves the optimizing variable $\theta$, and we'd like an expression that has a closed form in terms of $\theta$ to make optimization tractable. Otherwise, we'd need to resample many times each time we made an update to $\theta$. To address this, we'll use a common technique known as **importance sampling**. :::{prf:definition} Importance sampling :label: importance_sampling @@ -288,9 +314,9 @@ Applying importance sampling allows us to estimate the TRPO objective as follows :label: trpo_implement @@ -313,7 +339,7 @@ Fisher information matrixfisher_matrix Let $p_\theta$ denote a parameterized dis \end{aligned}$$ Recall that the Hessian of a function describes its curvature: That is, for a vector $\delta \in \Theta$, the quantity $\delta^\top F_\theta \delta$ describes how rapidly the negative log-likelihood changes if we move by $\delta$. -In particular, when $p_\theta = \rho_{\theta}$ denotes a trajectory distribution, we can further simplify the expression: $$F_{\theta} = \E_{\tau \sim \rho_\theta} \left[ \sum_{h=0}^{H-1} (\nabla \log \pi_\theta (a_h \mid s_h)) (\nabla \log \pi_\theta(a_h \mid s_h))^\top \right] +In particular, when $p_\theta = \rho_{\theta}$ denotes a trajectory distribution, we can further simplify the expression: $$F_{\theta} = \E_{\tau \sim \rho_\theta} \left[ \sum_{h=0}^{H-1} (\nabla \log \pi_\theta (a_\hi \mid s_\hi)) (\nabla \log \pi_\theta(a_\hi \mid s_\hi))^\top \right] \label{eq:fisher_trajectory}$$ Note that we've used the Markov property to cancel out the cross terms corresponding to two different time steps. 
::: @@ -364,7 +390,7 @@ We can relax the TRPO objective in a different way: Rather than imposing a hard Proximal policy optimization (exact)ppo Note that like the original TRPO algorithm {prf:ref}`trpo`, PPO is not gradient-based; rather, at each step, we try to maximize local advantage relative to the current policy. @@ -374,11 +400,11 @@ Let us now turn this into an implementable algorithm, assuming we can sample tra Let us simplify the $\kl{\rho_{\pi^k}}{\rho_{\pi_{\theta}}}$ term first. Expanding gives $$\begin{aligned} \kl{\rho_{\pi^k}}{\rho_{\pi_{\theta}}} & = \E_{\tau \sim \rho_{\pi^k}} \left[\log \frac{\rho_{\pi^k}(\tau)}{\rho_{\pi_{\theta}}(\tau)}\right] \\ - & = \E_{\tau \sim \rho_{\pi^k}} \left[ \sum_{h=0}^{H-1} \log \frac{\pi^k(a_h \mid s_h)}{\pi_{\theta}(a_h \mid s_h)}\right] & \text{state transitions cancel} \\ - & = \E_{\tau \sim \rho_{\pi^k}} \left[ \sum_{h=0}^{H-1} \log \frac{1}{\pi_{\theta}(a_h \mid s_h)}\right] + c + & = \E_{\tau \sim \rho_{\pi^k}} \left[ \sum_{h=0}^{H-1} \log \frac{\pi^k(a_\hi \mid s_\hi)}{\pi_{\theta}(a_\hi \mid s_\hi)}\right] & \text{state transitions cancel} \\ + & = \E_{\tau \sim \rho_{\pi^k}} \left[ \sum_{h=0}^{H-1} \log \frac{1}{\pi_{\theta}(a_\hi \mid s_\hi)}\right] + c \end{aligned}$$ where $c$ is some constant relative to $\theta$. -As we did for TRPO {prf:ref}`trpo`, we can use importance sampling {prf:ref}`importance_sampling` to rewrite the inner expectation. Combining the expectations together, this gives the (exact) objective $$\max_{\theta} \E_{\tau \sim \rho_{\pi^k}} \left[ \sum_{h=0}^{H-1} \left( \frac{\pi_\theta(a_h \mid s_h)}{\pi^k(a_h \mid s_h)} A^{\pi^k}(s_h, a_h) - \lambda \log \frac{1}{\pi_\theta(a_h \mid s_h)} \right) \right]$$ +As we did for TRPO {prf:ref}`trpo`, we can use importance sampling {prf:ref}`importance_sampling` to rewrite the inner expectation. Combining the expectations together, this gives the (exact) objective $$\max_{\theta} \E_{\tau \sim \rho_{\pi^k}} \left[ \sum_{h=0}^{H-1} \left( \frac{\pi_\theta(a_\hi \mid s_\hi)}{\pi^k(a_\hi \mid s_\hi)} A^{\pi^k}(s_\hi, a_\hi) - \lambda \log \frac{1}{\pi_\theta(a_\hi \mid s_\hi)} \right) \right]$$ Now we can use gradient ascent on the parameters $\theta$ until convergence to maximize this function, completing a single iteration of PPO (i.e. $\theta^{k+1} \gets \theta$). 
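As a rough sketch of what this inner maximization might look like in code (an illustration, not the book's implementation), suppose we parameterize a tabular softmax policy by a logit table $\theta$, and that we have already computed, for a batch of state-action pairs sampled from $\rho_{\pi^k}$, the advantage estimates $A^{\pi^k}(s_\hi, a_\hi)$ and the log-probabilities $\log \pi^k(a_\hi \mid s_\hi)$. Averaging over the batch instead of summing within each trajectory only rescales the objective.

```python
import jax
import jax.numpy as jnp

def log_softmax_policy(theta: jnp.ndarray, s: int, a: int) -> jnp.ndarray:
    """log pi_theta(a | s) for a tabular softmax policy with logit table theta of shape (S, A)."""
    return jax.nn.log_softmax(theta[s])[a]

def ppo_objective(theta, states, actions, log_pi_k, advantages, lam):
    """Monte Carlo estimate of the exact PPO objective over a batch of
    (s, a) pairs sampled from the previous policy pi^k."""
    log_probs = jax.vmap(log_softmax_policy, in_axes=(None, 0, 0))(theta, states, actions)
    ratios = jnp.exp(log_probs - log_pi_k)  # importance weights pi_theta(a|s) / pi^k(a|s)
    return jnp.mean(ratios * advantages - lam * (-log_probs))

def ppo_update(theta, states, actions, log_pi_k, advantages, lam=0.01, lr=0.1, n_steps=100):
    """One PPO iteration: gradient *ascent* on the surrogate objective."""
    for _ in range(n_steps):
        g = jax.grad(ppo_objective)(theta, states, actions, log_pi_k, advantages, lam)
        theta = theta + lr * g
    return theta  # this becomes theta^{k+1}
```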
diff --git a/book/shared/references.bib b/book/shared/references.bib index 8cda677..6656795 100644 --- a/book/shared/references.bib +++ b/book/shared/references.bib @@ -1,3 +1,28 @@ +@article{achiam_spinning_2018, + title = {Spinning {{Up}} in {{Deep Reinforcement Learning}}}, + author = {Achiam, Joshua}, + year = {2018}, + urldate = {2024-07-01}, + file = {/Users/alexandercai/Zotero/storage/UPUMW6XV/index.html} +} + +@misc{adaptive_agent_team_human-timescale_2023, + title = {Human-{{Timescale Adaptation}} in an {{Open-Ended Task Space}}}, + author = {Adaptive Agent Team and Bauer, Jakob and Baumli, Kate and Baveja, Satinder and Behbahani, Feryal and Bhoopchand, Avishkar and {Bradley-Schmieg}, Nathalie and Chang, Michael and Clay, Natalie and Collister, Adrian and Dasagi, Vibhavari and Gonzalez, Lucy and Gregor, Karol and Hughes, Edward and Kashem, Sheleem and {Loks-Thompson}, Maria and Openshaw, Hannah and {Parker-Holder}, Jack and Pathak, Shreya and {Perez-Nieves}, Nicolas and Rakicevic, Nemanja and Rockt{\"a}schel, Tim and Schroecker, Yannick and Sygnowski, Jakub and Tuyls, Karl and York, Sarah and Zacherl, Alexander and Zhang, Lei}, + year = {2023}, + month = jan, + number = {arXiv:2301.07608}, + eprint = {2301.07608}, + primaryclass = {cs}, + publisher = {arXiv}, + urldate = {2023-02-21}, + abstract = {Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. 
We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains.}, + archiveprefix = {arXiv}, + keywords = {Computer Science - Artificial Intelligence,Computer Science - Machine Learning,Computer Science - Neural and Evolutionary Computing}, + annotation = {1 citations (Semantic Scholar/arXiv) [2023-02-20]}, + file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2023/Human-Timescale Adaptation in an Open-Ended Task Space (2023) - Adaptive Agent Team et al.pdf} +} + @book{agarwal_reinforcement_2022, title = {Reinforcement {{Learning}}: {{Theory}} and {{Algorithms}}}, shorttitle = {{{AJKS}}}, @@ -23,6 +48,29 @@ @inproceedings{azar_minimax_2017 file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2017/Minimax Regret Bounds for Reinforcement Learning (2017) - Azar, Osband, Munos.pdf} } +@misc{babuschkin_deepmind_2020, + title = {The {{DeepMind JAX Ecosystem}}}, + author = {Babuschkin, Igor and Baumli, Kate and Bell, Alison and Bhupatiraju, Surya and Bruce, Jake and Buchlovsky, Peter and Budden, David and Cai, Trevor and Clark, Aidan and Danihelka, Ivo and Dedieu, Antoine and Fantacci, Claudio and Godwin, Jonathan and Jones, Chris and Hemsley, Ross and Hennigan, Tom and Hessel, Matteo and Hou, Shaobo and Kapturowski, Steven and Keck, Thomas and Kemaev, Iurii and King, Michael and Kunesch, Markus and Martens, Lena and Merzic, Hamza and Mikulik, Vladimir and Norman, Tamara and Papamakarios, George and Quan, John and Ring, Roman and Ruiz, Francisco and Sanchez, Alvaro and Schneider, Rosalia and Sezener, Eren and Spencer, Stephen and Srinivasan, Srivatsan and Stokowiec, Wojciech and Wang, Luyu and Zhou, Guangyao and Viola, Fabio}, + year = {2020} +} + +@article{barto_neuronlike_1983, + title = {Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems}, + author = {Barto, Andrew G. and Sutton, Richard S. and Anderson, Charles W.}, + year = {1983}, + month = sep, + journal = {IEEE Transactions on Systems, Man, and Cybernetics}, + volume = {SMC-13}, + number = {5}, + pages = {834--846}, + issn = {2168-2909}, + doi = {10.1109/TSMC.1983.6313077}, + urldate = {2024-07-01}, + abstract = {It is shown how a system consisting of two neuronlike adaptive elements can solve a difficult learning control problem. The task is to balance a pole that is hinged to a movable cart by applying forces to the cart's base. It is argued that the learning problems faced by adaptive elements that are components of adaptive networks are at least as difficult as this version of the pole-balancing problem. The learning system consists of a single associative search element (ASE) and a single adaptive critic element (ACE). In the course of learning to balance the pole, the ASE constructs associations between input and output by searching under the influence of reinforcement feedback, and the ACE constructs a more informative evaluation function than reinforcement feedback alone can provide. 
The differences between this approach and other attempts to solve problems using neurolike elements are discussed, as is the relation of this work to classical and instrumental conditioning in animal learning studies and its possible implications for research in the neurosciences.}, + keywords = {Adaptive systems,Biological neural networks,Neurons,Pattern recognition,Problem-solving,Supervised learning,Training}, + file = {/Users/alexandercai/Zotero/storage/GHD9WZXL/6313077.html} +} + @article{degrave_magnetic_2022, title = {Magnetic Control of Tokamak Plasmas through Deep Reinforcement Learning}, author = {Degrave, Jonas and Felici, Federico and Buchli, Jonas and Neunert, Michael and Tracey, Brendan and Carpanese, Francesco and Ewalds, Timo and Hafner, Roland and Abdolmaleki, Abbas and {de las Casas}, Diego and Donner, Craig and Fritz, Leslie and Galperti, Cristian and Huber, Andrea and Keeling, James and Tsimpoukelli, Maria and Kay, Jackie and Merle, Antoine and Moret, Jean-Marc and Noury, Seb and Pesamosca, Federico and Pfau, David and Sauter, Olivier and Sommariva, Cristian and Coda, Stefano and Duval, Basil and Fasoli, Ambrogio and Kohli, Pushmeet and Kavukcuoglu, Koray and Hassabis, Demis and Riedmiller, Martin}, @@ -44,6 +92,24 @@ @article{degrave_magnetic_2022 file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2022/Magnetic control of tokamak plasmas through deep reinforcement learning (2022) - Degrave et al.pdf} } +@inproceedings{freeman_brax_2021, + title = {Brax -- {{A Differentiable Physics Engine}} for {{Large Scale Rigid Body Simulation}}}, + booktitle = {{{NeurIPS Datasets}} and {{Benchmarks}} 2021}, + author = {Freeman, C. Daniel and Frey, Erik and Raichuk, Anton and Girgin, Sertan and Mordatch, Igor and Bachem, Olivier}, + year = {2021}, + month = jun, + eprint = {2106.13281}, + primaryclass = {cs}, + doi = {10.48550/arXiv.2106.13281}, + urldate = {2023-06-26}, + abstract = {We present Brax, an open source library for rigid body simulation with a focus on performance and parallelism on accelerators, written in JAX. We present results on a suite of tasks inspired by the existing reinforcement learning literature, but remade in our engine. Additionally, we provide reimplementations of PPO, SAC, ES, and direct policy optimization in JAX that compile alongside our environments, allowing the learning algorithm and the environment processing to occur on the same device, and to scale seamlessly on accelerators. 
Finally, we include notebooks that facilitate training of performant policies on common OpenAI Gym MuJoCo-like tasks in minutes.}, + archiveprefix = {arXiv}, + pubstate = {preprint {\textbar} DBLP: https://dblp.org/rec/conf/nips/FreemanFRGMB21}, + keywords = {Computer Science - Artificial Intelligence,Computer Science - Robotics}, + annotation = {151 citations (Semantic Scholar/arXiv) [2023-07-22]}, + file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2021/Brax – A Differentiable Physics Engine for Large Scale Rigid Body Simulation (2021) - Freeman et al.pdf} +} + @misc{hausknecht_deep_2017, title = {Deep {{Recurrent Q-Learning}} for {{Partially Observable MDPs}}}, author = {Hausknecht, Matthew and Stone, Peter}, @@ -62,6 +128,18 @@ @misc{hausknecht_deep_2017 file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2017/Deep Recurrent Q-Learning for Partially Observable MDPs (2017) - Hausknecht, Stone.pdf} } +@book{kochenderfer_algorithms_2022, + title = {Algorithms for {{Decision Making}}}, + author = {Kochenderfer, Mykel J and Wheeler, Tim A and Wray, Kyle H}, + year = {2022}, + month = aug, + urldate = {2022-10-23}, + abstract = {A broad introduction to algorithms for decision making under uncertainty, introducing the underlying mathematical problem formulations and the algorithms for...}, + isbn = {978-0-262-04701-2}, + langid = {american}, + file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2022/Algorithms for Decision Making (2022) - Kochenderfer, Wheeler, Wray.pdf} +} + @article{lai_asymptotically_1985, title = {Asymptotically Efficient Adaptive Allocation Rules}, author = {Lai, T. L and Robbins, Herbert}, @@ -77,6 +155,18 @@ @article{lai_asymptotically_1985 file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/1985/Asymptotically efficient adaptive allocation rules (1985) - Lai, Robbins.pdf} } +@inproceedings{lechner_gigastep_2023, + title = {Gigastep - {{One Billion Steps}} per {{Second Multi-agent Reinforcement Learning}}}, + booktitle = {Thirty-Seventh {{Conference}} on {{Neural Information Processing Systems Datasets}} and {{Benchmarks Track}}}, + author = {Lechner, Mathias and Yin, Lianhao and Seyde, Tim and Wang, Tsun-Hsuan and Xiao, Wei and Hasani, Ramin and Rountree, Joshua and Rus, Daniela}, + year = {2023}, + month = nov, + urldate = {2023-12-12}, + abstract = {Multi-agent reinforcement learning (MARL) research is faced with a trade-off: it either uses complex environments requiring large compute resources, which makes it inaccessible to researchers with limited resources, or relies on simpler dynamics for faster execution, which makes the transferability of the results to more realistic tasks challenging. Motivated by these challenges, we present Gigastep, a fully vectorizable, MARL environment implemented in JAX, capable of executing up to one billion environment steps per second on consumer-grade hardware. Its design allows for comprehensive MARL experimentation, including a complex, high-dimensional space defined by 3D dynamics, stochasticity, and partial observations. Gigastep supports both collaborative and adversarial tasks, continuous and discrete action spaces, and provides RGB image and feature vector observations, allowing the evaluation of a wide range of MARL algorithms. 
We validate Gigastep's usability through an extensive set of experiments, underscoring its role in widening participation and promoting inclusivity in the MARL research community.}, + langid = {english}, + file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2023/Gigastep - One Billion Steps per Second Multi-agent Reinforcement Learning (2023) - Lechner et al.pdf} +} + @article{mnih_playing_2013, title = {Playing {{Atari}} with {{Deep Reinforcement Learning}}}, author = {Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Graves, Alex and Antonoglou, Ioannis and Wierstra, Daan and Riedmiller, Martin A.}, @@ -97,6 +187,35 @@ @book{nielsen_neural_2015 urldate = {2024-03-10} } +@inproceedings{ross_reduction_2010, + title = {A {{Reduction}} of {{Imitation Learning}} and {{Structured Prediction}} to {{No-Regret Online Learning}}}, + booktitle = {International {{Conference}} on {{Artificial Intelligence}} and {{Statistics}}}, + author = {Ross, St{\'e}phane and Gordon, Geoffrey J. and Bagnell, J.}, + year = {2010}, + month = nov, + urldate = {2024-08-08}, + abstract = {Sequential prediction problems such as imitation learning, where future observations depend on previous predictions (actions), violate the common i.i.d. assumptions made in statistical learning. This leads to poor performance in theory and often in practice. Some recent approaches provide stronger guarantees in this setting, but remain somewhat unsatisfactory as they train either non-stationary or stochastic policies and require a large number of iterations. In this paper, we propose a new iterative algorithm, which trains a stationary deterministic policy, that can be seen as a no regret algorithm in an online learning setting. We show that any such no regret algorithm, combined with additional reduction assumptions, must find a policy with good performance under the distribution of observations it induces in such sequential settings. We demonstrate that this new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem.}, + file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2010/A Reduction of Imitation Learning and Structured Prediction to No-Regret Online (2010) - Ross, Gordon, Bagnell.pdf} +} + +@misc{sun_easy--hard_2024, + title = {Easy-to-{{Hard Generalization}}: {{Scalable Alignment Beyond Human Supervision}}}, + shorttitle = {Easy-to-{{Hard Generalization}}}, + author = {Sun, Zhiqing and Yu, Longhui and Shen, Yikang and Liu, Weiyang and Yang, Yiming and Welleck, Sean and Gan, Chuang}, + year = {2024}, + month = mar, + number = {arXiv:2403.09472}, + eprint = {2403.09472}, + primaryclass = {cs}, + publisher = {arXiv}, + doi = {10.48550/arXiv.2403.09472}, + urldate = {2024-07-01}, + abstract = {Current AI alignment methodologies rely on human-provided demonstrations or judgments, and the learned capabilities of AI systems would be upper-bounded by human capabilities as a result. This raises a challenging research question: How can we keep improving the systems when their capabilities have surpassed the levels of humans? This paper answers this question in the context of tackling hard reasoning tasks (e.g., level 4-5 MATH problems) via learning from human annotations on easier tasks (e.g., level 1-3 MATH problems), which we term as {\textbackslash}textit\{easy-to-hard generalization\}. 
Our key insight is that an evaluator (reward model) trained on supervisions for easier tasks can be effectively used for scoring candidate solutions of harder tasks and hence facilitating easy-to-hard generalization over different levels of tasks. Based on this insight, we propose a novel approach to scalable alignment, which firstly trains the process-supervised reward models on easy problems (e.g., level 1-3), and then uses them to evaluate the performance of policy models on hard problems. We show that such {\textbackslash}textit\{easy-to-hard generalization from evaluators\} can enable {\textbackslash}textit\{easy-to-hard generalizations in generators\} either through re-ranking or reinforcement learning (RL). Notably, our process-supervised 7b RL model achieves an accuracy of 34.0{\textbackslash}\% on MATH500, despite only using human supervision on easy problems. Our approach suggests a promising path toward AI systems that advance beyond the frontier of human supervision.}, + archiveprefix = {arXiv}, + keywords = {Computer Science - Artificial Intelligence,Computer Science - Computation and Language,Computer Science - Machine Learning}, + file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2024/Easy-to-Hard Generalization (2024) - Sun et al.pdf;/Users/alexandercai/Zotero/storage/J52D59AK/2403.html} +} + @book{sussman_functional_2013, title = {Functional Differential Geometry}, author = {Sussman, Gerald Jay and Wisdom, Jack and Farr, Will}, @@ -140,3 +259,72 @@ @book{vershynin_high-dimensional_2018 keywords = {Business & Economics / Econometrics,Computers / Optical Data Processing,Language Arts & Disciplines / Library & Information Science / General,Mathematics / Probability & Statistics / General,Technology & Engineering / Signals & Signal Processing}, file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2018/High-Dimensional Probability (2018) - Vershynin.pdf} } + +@misc{welleck_decoding_2024, + title = {From {{Decoding}} to {{Meta-Generation}}: {{Inference-time Algorithms}} for {{Large Language Models}}}, + shorttitle = {From {{Decoding}} to {{Meta-Generation}}}, + author = {Welleck, Sean and Bertsch, Amanda and Finlayson, Matthew and Schoelkopf, Hailey and Xie, Alex and Neubig, Graham and Kulikov, Ilia and Harchaoui, Zaid}, + year = {2024}, + month = jun, + number = {arXiv:2406.16838}, + eprint = {2406.16838}, + primaryclass = {cs}, + publisher = {arXiv}, + doi = {10.48550/arXiv.2406.16838}, + urldate = {2024-07-01}, + abstract = {One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during inference. This survey focuses on these inference-time approaches. We explore three areas under a unified mathematical formalism: token-level generation algorithms, meta-generation algorithms, and efficient generation. Token-level generation algorithms, often called decoding algorithms, operate by sampling a single token at a time or constructing a token-level search space and then selecting an output. These methods typically assume access to a language model's logits, next-token distributions, or probability scores. Meta-generation algorithms work on partial or full sequences, incorporating domain knowledge, enabling backtracking, and integrating external information. 
Efficient generation methods aim to reduce token costs and improve the speed of generation. Our survey unifies perspectives from three research communities: traditional natural language processing, modern LLMs, and machine learning systems.}, + archiveprefix = {arXiv}, + keywords = {Computer Science - Computation and Language,Computer Science - Machine Learning}, + file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2024/From Decoding to Meta-Generation (2024) - Welleck et al.pdf;/Users/alexandercai/Zotero/storage/S4Y984R4/2406.html} +} + +@misc{zhai_fine-tuning_2024, + title = {Fine-{{Tuning Large Vision-Language Models}} as {{Decision-Making Agents}} via {{Reinforcement Learning}}}, + author = {Zhai, Yuexiang and Bai, Hao and Lin, Zipeng and Pan, Jiayi and Tong, Shengbang and Zhou, Yifei and Suhr, Alane and Xie, Saining and LeCun, Yann and Ma, Yi and Levine, Sergey}, + year = {2024}, + month = may, + number = {arXiv:2405.10292}, + eprint = {2405.10292}, + primaryclass = {cs}, + publisher = {arXiv}, + doi = {10.48550/arXiv.2405.10292}, + urldate = {2024-07-01}, + abstract = {Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT4-V or Gemini. 
Furthermore, we find that CoT reasoning is a crucial component for performance improvement, as removing the CoT reasoning results in a significant decrease in the overall performance of our method.}, + archiveprefix = {arXiv}, + keywords = {Computer Science - Artificial Intelligence,Computer Science - Computation and Language,Computer Science - Computer Vision and Pattern Recognition,Computer Science - Machine Learning}, + file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2024/Fine-Tuning Large Vision-Language Models as Decision-Making Agents via (2024) - Zhai et al.pdf;/Users/alexandercai/Zotero/storage/2X2WJU4D/2405.html} +} + +@misc{zhang_adaptable_2024, + title = {Adaptable {{Logical Control}} for {{Large Language Models}}}, + author = {Zhang, Honghua and Kung, Po-Nien and Yoshida, Masahiro and den Broeck, Guy Van and Peng, Nanyun}, + year = {2024}, + month = jun, + number = {arXiv:2406.13892}, + eprint = {2406.13892}, + primaryclass = {cs}, + publisher = {arXiv}, + doi = {10.48550/arXiv.2406.13892}, + urldate = {2024-07-01}, + abstract = {Despite the success of Large Language Models (LLMs) on various tasks following human instructions, controlling model generation at inference time poses a persistent challenge. In this paper, we introduce Ctrl-G, an adaptable framework that facilitates tractable and flexible control of LLM generation to reliably follow logical constraints. Ctrl-G combines any production-ready LLM with a Hidden Markov Model, enabling LLM outputs to adhere to logical constraints represented as deterministic finite automata. We show that Ctrl-G, when applied to a TULU2-7B model, outperforms GPT3.5 and GPT4 on the task of interactive text editing: specifically, for the task of generating text insertions/continuations following logical constraints, Ctrl-G achieves over 30\% higher satisfaction rate in human evaluation compared to GPT4. When applied to medium-size language models (e.g., GPT2-large), Ctrl-G also beats its counterparts for constrained generation by large margins on standard benchmarks. Additionally, as a proof-of-concept study, we experiment Ctrl-G on the Grade School Math benchmark to assist LLM reasoning, foreshadowing the application of Ctrl-G, as well as other constrained generation approaches, beyond traditional language generation tasks.}, + archiveprefix = {arXiv}, + keywords = {Computer Science - Computation and Language}, + file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2024/Adaptable Logical Control for Large Language Models (2024) - Zhang, Kung, Yoshida, Broeck, Peng.pdf;/Users/alexandercai/Zotero/storage/38W8T74Y/2406.html} +} + +@misc{zhang_deep_2015, + title = {Deep Learning with {{Elastic Averaging SGD}}}, + author = {Zhang, Sixin and Choromanska, Anna and LeCun, Yann}, + year = {2015}, + month = oct, + number = {arXiv:1412.6651}, + eprint = {1412.6651}, + primaryclass = {cs, stat}, + publisher = {arXiv}, + doi = {10.48550/arXiv.1412.6651}, + urldate = {2024-07-01}, + abstract = {We study the problem of stochastic optimization for deep learning in the parallel computing environment under communication constraints. A new algorithm is proposed in this setting where the communication and coordination of work among concurrent processes (local workers), is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). 
The algorithm enables the local workers to perform more exploration, i.e. the algorithm allows the local variables to fluctuate further from the center variable by reducing the amount of communication between local workers and the master. We empirically demonstrate that in the deep learning setting, due to the existence of many local optima, allowing more exploration can lead to the improved performance. We propose synchronous and asynchronous variants of the new algorithm. We provide the stability analysis of the asynchronous variant in the round-robin scheme and compare it with the more common parallelized method ADMM. We show that the stability of EASGD is guaranteed when a simple stability condition is satisfied, which is not the case for ADMM. We additionally propose the momentum-based version of our algorithm that can be applied in both synchronous and asynchronous settings. Asynchronous variant of the algorithm is applied to train convolutional neural networks for image classification on the CIFAR and ImageNet datasets. Experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to DOWNPOUR and other common baseline approaches and furthermore is very communication efficient.},
+ archiveprefix = {arXiv},
+ keywords = {Computer Science - Machine Learning,Statistics - Machine Learning},
+ file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2015/Deep learning with Elastic Averaging SGD (2015) - Zhang, Choromanska, LeCun.pdf;/Users/alexandercai/Zotero/storage/M4LFKVWK/1412.html}
+}
diff --git a/main.bib b/main.bib
index cbde2a3..1eba405 100644
--- a/main.bib
+++ b/main.bib
@@ -1,3 +1,28 @@
+@inreference{achiam_spinning_2018,
+ title = {Spinning {{Up}} in {{Deep Reinforcement Learning}}},
+ author = {Achiam, Joshua},
+ date = {2018},
+ url = {https://spinningup.openai.com/en/latest/index.html},
+ urldate = {2024-07-01},
+ file = {/Users/alexandercai/Zotero/storage/UPUMW6XV/index.html}
+}
+
+@online{adaptive_agent_team_human-timescale_2023,
+ title = {Human-{{Timescale Adaptation}} in an {{Open-Ended Task Space}}},
+ author = {Adaptive Agent Team and Bauer, Jakob and Baumli, Kate and Baveja, Satinder and Behbahani, Feryal and Bhoopchand, Avishkar and Bradley-Schmieg, Nathalie and Chang, Michael and Clay, Natalie and Collister, Adrian and Dasagi, Vibhavari and Gonzalez, Lucy and Gregor, Karol and Hughes, Edward and Kashem, Sheleem and Loks-Thompson, Maria and Openshaw, Hannah and Parker-Holder, Jack and Pathak, Shreya and Perez-Nieves, Nicolas and Rakicevic, Nemanja and Rocktäschel, Tim and Schroecker, Yannick and Sygnowski, Jakub and Tuyls, Karl and York, Sarah and Zacherl, Alexander and Zhang, Lei},
+ date = {2023-01-18},
+ eprint = {2301.07608},
+ eprinttype = {arXiv},
+ eprintclass = {cs},
+ url = {http://arxiv.org/abs/2301.07608},
+ urldate = {2023-02-21},
+ abstract = {Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains.},
+ pubstate = {prepublished},
+ keywords = {Computer Science - Artificial Intelligence,Computer Science - Machine Learning,Computer Science - Neural and Evolutionary Computing},
+ annotation = {1 citations (Semantic Scholar/arXiv) [2023-02-20]},
+ file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2023/Human-Timescale Adaptation in an Open-Ended Task Space (2023) - Adaptive Agent Team et al.pdf}
+}
+
 @book{agarwal_reinforcement_2022,
  title = {Reinforcement {{Learning}}: {{Theory}} and {{Algorithms}}},
  shorttitle = {{{AJKS}}},
@@ -24,6 +49,31 @@ @inproceedings{azar_minimax_2017
  file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2017/Minimax Regret Bounds for Reinforcement Learning (2017) - Azar, Osband, Munos.pdf}
 }

+@software{babuschkin_deepmind_2020,
+ title = {The {{DeepMind JAX Ecosystem}}},
+ author = {Babuschkin, Igor and Baumli, Kate and Bell, Alison and Bhupatiraju, Surya and Bruce, Jake and Buchlovsky, Peter and Budden, David and Cai, Trevor and Clark, Aidan and Danihelka, Ivo and Dedieu, Antoine and Fantacci, Claudio and Godwin, Jonathan and Jones, Chris and Hemsley, Ross and Hennigan, Tom and Hessel, Matteo and Hou, Shaobo and Kapturowski, Steven and Keck, Thomas and Kemaev, Iurii and King, Michael and Kunesch, Markus and Martens, Lena and Merzic, Hamza and Mikulik, Vladimir and Norman, Tamara and Papamakarios, George and Quan, John and Ring, Roman and Ruiz, Francisco and Sanchez, Alvaro and Schneider, Rosalia and Sezener, Eren and Spencer, Stephen and Srinivasan, Srivatsan and Stokowiec, Wojciech and Wang, Luyu and Zhou, Guangyao and Viola, Fabio},
+ date = {2020},
+ url = {http://github.com/deepmind}
+}
+
+@article{barto_neuronlike_1983,
+ title = {Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems},
+ author = {Barto, Andrew G. and Sutton, Richard S. and Anderson, Charles W.},
+ date = {1983-09},
+ journaltitle = {IEEE Transactions on Systems, Man, and Cybernetics},
+ volume = {SMC-13},
+ number = {5},
+ pages = {834--846},
+ issn = {2168-2909},
+ doi = {10.1109/TSMC.1983.6313077},
+ url = {https://ieeexplore.ieee.org/document/6313077},
+ urldate = {2024-07-01},
+ abstract = {It is shown how a system consisting of two neuronlike adaptive elements can solve a difficult learning control problem. The task is to balance a pole that is hinged to a movable cart by applying forces to the cart's base. It is argued that the learning problems faced by adaptive elements that are components of adaptive networks are at least as difficult as this version of the pole-balancing problem. The learning system consists of a single associative search element (ASE) and a single adaptive critic element (ACE). In the course of learning to balance the pole, the ASE constructs associations between input and output by searching under the influence of reinforcement feedback, and the ACE constructs a more informative evaluation function than reinforcement feedback alone can provide. The differences between this approach and other attempts to solve problems using neurolike elements are discussed, as is the relation of this work to classical and instrumental conditioning in animal learning studies and its possible implications for research in the neurosciences.},
+ eventtitle = {{{IEEE Transactions}} on {{Systems}}, {{Man}}, and {{Cybernetics}}},
+ keywords = {Adaptive systems,Biological neural networks,Neurons,Pattern recognition,Problem-solving,Supervised learning,Training},
+ file = {/Users/alexandercai/Zotero/storage/GHD9WZXL/6313077.html}
+}
+
 @article{degrave_magnetic_2022,
  title = {Magnetic Control of Tokamak Plasmas through Deep Reinforcement Learning},
  author = {Degrave, Jonas and Felici, Federico and Buchli, Jonas and Neunert, Michael and Tracey, Brendan and Carpanese, Francesco and Ewalds, Timo and Hafner, Roland and Abdolmaleki, Abbas and family=Casas, given=Diego, prefix=de las, useprefix=true and Donner, Craig and Fritz, Leslie and Galperti, Cristian and Huber, Andrea and Keeling, James and Tsimpoukelli, Maria and Kay, Jackie and Merle, Antoine and Moret, Jean-Marc and Noury, Seb and Pesamosca, Federico and Pfau, David and Sauter, Olivier and Sommariva, Cristian and Coda, Stefano and Duval, Basil and Fasoli, Ambrogio and Kohli, Pushmeet and Kavukcuoglu, Koray and Hassabis, Demis and Riedmiller, Martin},
@@ -45,6 +95,25 @@ @article{degrave_magnetic_2022
  file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2022/Magnetic control of tokamak plasmas through deep reinforcement learning (2022) - Degrave et al.pdf}
 }

+@inproceedings{freeman_brax_2021,
+ title = {Brax – {{A Differentiable Physics Engine}} for {{Large Scale Rigid Body Simulation}}},
+ booktitle = {{{NeurIPS Datasets}} and {{Benchmarks}} 2021},
+ author = {Freeman, C. Daniel and Frey, Erik and Raichuk, Anton and Girgin, Sertan and Mordatch, Igor and Bachem, Olivier},
+ date = {2021-06-24},
+ eprint = {2106.13281},
+ eprinttype = {arXiv},
+ eprintclass = {cs},
+ doi = {10.48550/arXiv.2106.13281},
+ url = {https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/d1f491a404d6854880943e5c3cd9ca25-Abstract-round1.html},
+ urldate = {2023-06-26},
+ abstract = {We present Brax, an open source library for rigid body simulation with a focus on performance and parallelism on accelerators, written in JAX. We present results on a suite of tasks inspired by the existing reinforcement learning literature, but remade in our engine. Additionally, we provide reimplementations of PPO, SAC, ES, and direct policy optimization in JAX that compile alongside our environments, allowing the learning algorithm and the environment processing to occur on the same device, and to scale seamlessly on accelerators. Finally, we include notebooks that facilitate training of performant policies on common OpenAI Gym MuJoCo-like tasks in minutes.},
+ eventtitle = {{{NeurIPS Datasets}} and {{Benchmarks}}},
+ pubstate = {preprint | DBLP: https://dblp.org/rec/conf/nips/FreemanFRGMB21},
+ keywords = {Computer Science - Artificial Intelligence,Computer Science - Robotics},
+ annotation = {151 citations (Semantic Scholar/arXiv) [2023-07-22]},
+ file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2021/Brax – A Differentiable Physics Engine for Large Scale Rigid Body Simulation (2021) - Freeman et al.pdf}
+}
+
 @online{hausknecht_deep_2017,
  title = {Deep {{Recurrent Q-Learning}} for {{Partially Observable MDPs}}},
  author = {Hausknecht, Matthew and Stone, Peter},
@@ -56,12 +125,24 @@ @online{hausknecht_deep_2017
  url = {http://arxiv.org/abs/1507.06527},
  urldate = {2023-06-04},
  abstract = {Deep Reinforcement Learning has yielded proficient controllers for complex tasks. However, these controllers have limited memory and rely on being able to perceive the complete game screen at each decision point. To address these shortcomings, this article investigates the effects of adding recurrency to a Deep Q-Network (DQN) by replacing the first post-convolutional fully-connected layer with a recurrent LSTM. The resulting \textbackslash textit\{Deep Recurrent Q-Network\} (DRQN), although capable of seeing only a single frame at each timestep, successfully integrates information through time and replicates DQN's performance on standard Atari games and partially observed equivalents featuring flickering game screens. Additionally, when trained with partial observations and evaluated with incrementally more complete observations, DRQN's performance scales as a function of observability. Conversely, when trained with full observations and evaluated with partial observations, DRQN's performance degrades less than DQN's. Thus, given the same length of history, recurrency is a viable alternative to stacking a history of frames in the DQN's input layer and while recurrency confers no systematic advantage when learning to play the game, the recurrent net can better adapt at evaluation time if the quality of observations changes.},
- pubstate = {preprint},
+ pubstate = {prepublished},
  keywords = {Computer Science - Machine Learning},
  annotation = {1274 citations (Semantic Scholar/arXiv) [2023-06-04]},
  file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2017/Deep Recurrent Q-Learning for Partially Observable MDPs (2017) - Hausknecht, Stone.pdf}
 }

+@book{kochenderfer_algorithms_2022,
+ title = {Algorithms for {{Decision Making}}},
+ author = {Kochenderfer, Mykel J and Wheeler, Tim A and Wray, Kyle H},
+ date = {2022-08-16},
+ url = {https://mitpress.mit.edu/9780262047012/algorithms-for-decision-making/},
+ urldate = {2022-10-23},
+ abstract = {A broad introduction to algorithms for decision making under uncertainty, introducing the underlying mathematical problem formulations and the algorithms for...},
+ isbn = {978-0-262-04701-2},
+ langid = {american},
+ file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2022/Algorithms for Decision Making (2022) - Kochenderfer, Wheeler, Wray.pdf}
+}
+
 @article{lai_asymptotically_1985,
  title = {Asymptotically Efficient Adaptive Allocation Rules},
  author = {Lai, T. L and Robbins, Herbert},
@@ -78,6 +159,18 @@ @article{lai_asymptotically_1985
  file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/1985/Asymptotically efficient adaptive allocation rules (1985) - Lai, Robbins.pdf}
 }

+@inproceedings{lechner_gigastep_2023,
+ title = {Gigastep - {{One Billion Steps}} per {{Second Multi-agent Reinforcement Learning}}},
+ author = {Lechner, Mathias and Yin, Lianhao and Seyde, Tim and Wang, Tsun-Hsuan and Xiao, Wei and Hasani, Ramin and Rountree, Joshua and Rus, Daniela},
+ date = {2023-11-02},
+ url = {https://openreview.net/forum?id=UgPAaEugH3},
+ urldate = {2023-12-12},
+ abstract = {Multi-agent reinforcement learning (MARL) research is faced with a trade-off: it either uses complex environments requiring large compute resources, which makes it inaccessible to researchers with limited resources, or relies on simpler dynamics for faster execution, which makes the transferability of the results to more realistic tasks challenging. Motivated by these challenges, we present Gigastep, a fully vectorizable, MARL environment implemented in JAX, capable of executing up to one billion environment steps per second on consumer-grade hardware. Its design allows for comprehensive MARL experimentation, including a complex, high-dimensional space defined by 3D dynamics, stochasticity, and partial observations. Gigastep supports both collaborative and adversarial tasks, continuous and discrete action spaces, and provides RGB image and feature vector observations, allowing the evaluation of a wide range of MARL algorithms. We validate Gigastep's usability through an extensive set of experiments, underscoring its role in widening participation and promoting inclusivity in the MARL research community.},
+ eventtitle = {Thirty-Seventh {{Conference}} on {{Neural Information Processing Systems Datasets}} and {{Benchmarks Track}}},
+ langid = {english},
+ file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2023/Gigastep - One Billion Steps per Second Multi-agent Reinforcement Learning (2023) - Lechner et al.pdf}
+}
+
 @article{mnih_playing_2013,
  title = {Playing {{Atari}} with {{Deep Reinforcement Learning}}},
  author = {Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Graves, Alex and Antonoglou, Ioannis and Wierstra, Daan and Riedmiller, Martin A.},
@@ -100,6 +193,34 @@ @book{nielsen_neural_2015
  urldate = {2024-03-10}
 }

+@inproceedings{ross_reduction_2010,
+ title = {A {{Reduction}} of {{Imitation Learning}} and {{Structured Prediction}} to {{No-Regret Online Learning}}},
+ author = {Ross, Stéphane and Gordon, Geoffrey J. and Bagnell, J.},
+ date = {2010-11-02},
+ url = {https://www.semanticscholar.org/paper/A-Reduction-of-Imitation-Learning-and-Structured-to-Ross-Gordon/79ab3c49903ec8cb339437ccf5cf998607fc313e},
+ urldate = {2024-08-08},
+ abstract = {Sequential prediction problems such as imitation learning, where future observations depend on previous predictions (actions), violate the common i.i.d. assumptions made in statistical learning. This leads to poor performance in theory and often in practice. Some recent approaches provide stronger guarantees in this setting, but remain somewhat unsatisfactory as they train either non-stationary or stochastic policies and require a large number of iterations. In this paper, we propose a new iterative algorithm, which trains a stationary deterministic policy, that can be seen as a no regret algorithm in an online learning setting. We show that any such no regret algorithm, combined with additional reduction assumptions, must find a policy with good performance under the distribution of observations it induces in such sequential settings. We demonstrate that this new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem.},
+ eventtitle = {International {{Conference}} on {{Artificial Intelligence}} and {{Statistics}}},
+ file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2010/A Reduction of Imitation Learning and Structured Prediction to No-Regret Online (2010) - Ross, Gordon, Bagnell.pdf}
+}
+
+@online{sun_easy--hard_2024,
+ title = {Easy-to-{{Hard Generalization}}: {{Scalable Alignment Beyond Human Supervision}}},
+ shorttitle = {Easy-to-{{Hard Generalization}}},
+ author = {Sun, Zhiqing and Yu, Longhui and Shen, Yikang and Liu, Weiyang and Yang, Yiming and Welleck, Sean and Gan, Chuang},
+ date = {2024-03-14},
+ eprint = {2403.09472},
+ eprinttype = {arXiv},
+ eprintclass = {cs},
+ doi = {10.48550/arXiv.2403.09472},
+ url = {http://arxiv.org/abs/2403.09472},
+ urldate = {2024-07-01},
+ abstract = {Current AI alignment methodologies rely on human-provided demonstrations or judgments, and the learned capabilities of AI systems would be upper-bounded by human capabilities as a result. This raises a challenging research question: How can we keep improving the systems when their capabilities have surpassed the levels of humans? This paper answers this question in the context of tackling hard reasoning tasks (e.g., level 4-5 MATH problems) via learning from human annotations on easier tasks (e.g., level 1-3 MATH problems), which we term as \textbackslash textit\{easy-to-hard generalization\}. Our key insight is that an evaluator (reward model) trained on supervisions for easier tasks can be effectively used for scoring candidate solutions of harder tasks and hence facilitating easy-to-hard generalization over different levels of tasks. Based on this insight, we propose a novel approach to scalable alignment, which firstly trains the process-supervised reward models on easy problems (e.g., level 1-3), and then uses them to evaluate the performance of policy models on hard problems. We show that such \textbackslash textit\{easy-to-hard generalization from evaluators\} can enable \textbackslash textit\{easy-to-hard generalizations in generators\} either through re-ranking or reinforcement learning (RL). Notably, our process-supervised 7b RL model achieves an accuracy of 34.0\textbackslash\% on MATH500, despite only using human supervision on easy problems. Our approach suggests a promising path toward AI systems that advance beyond the frontier of human supervision.},
+ pubstate = {prepublished},
+ keywords = {Computer Science - Artificial Intelligence,Computer Science - Computation and Language,Computer Science - Machine Learning},
+ file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2024/Easy-to-Hard Generalization (2024) - Sun et al.pdf;/Users/alexandercai/Zotero/storage/J52D59AK/2403.html}
+}
+
 @book{sussman_functional_2013,
  title = {Functional Differential Geometry},
  author = {Sussman, Gerald Jay and Wisdom, Jack and Farr, Will},
@@ -145,3 +266,68 @@ @book{vershynin_high-dimensional_2018
  keywords = {Business & Economics / Econometrics,Computers / Optical Data Processing,Language Arts & Disciplines / Library & Information Science / General,Mathematics / Probability & Statistics / General,Technology & Engineering / Signals & Signal Processing},
  file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2018/High-Dimensional Probability (2018) - Vershynin.pdf}
 }
+
+@online{welleck_decoding_2024,
+ title = {From {{Decoding}} to {{Meta-Generation}}: {{Inference-time Algorithms}} for {{Large Language Models}}},
+ shorttitle = {From {{Decoding}} to {{Meta-Generation}}},
+ author = {Welleck, Sean and Bertsch, Amanda and Finlayson, Matthew and Schoelkopf, Hailey and Xie, Alex and Neubig, Graham and Kulikov, Ilia and Harchaoui, Zaid},
+ date = {2024-06-24},
+ eprint = {2406.16838},
+ eprinttype = {arXiv},
+ eprintclass = {cs},
+ doi = {10.48550/arXiv.2406.16838},
+ url = {http://arxiv.org/abs/2406.16838},
+ urldate = {2024-07-01},
+ abstract = {One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during inference. This survey focuses on these inference-time approaches. We explore three areas under a unified mathematical formalism: token-level generation algorithms, meta-generation algorithms, and efficient generation. Token-level generation algorithms, often called decoding algorithms, operate by sampling a single token at a time or constructing a token-level search space and then selecting an output. These methods typically assume access to a language model's logits, next-token distributions, or probability scores. Meta-generation algorithms work on partial or full sequences, incorporating domain knowledge, enabling backtracking, and integrating external information. Efficient generation methods aim to reduce token costs and improve the speed of generation. Our survey unifies perspectives from three research communities: traditional natural language processing, modern LLMs, and machine learning systems.},
+ pubstate = {prepublished},
+ keywords = {Computer Science - Computation and Language,Computer Science - Machine Learning},
+ file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2024/From Decoding to Meta-Generation (2024) - Welleck et al.pdf;/Users/alexandercai/Zotero/storage/S4Y984R4/2406.html}
+}
+
+@online{zhai_fine-tuning_2024,
+ title = {Fine-{{Tuning Large Vision-Language Models}} as {{Decision-Making Agents}} via {{Reinforcement Learning}}},
+ author = {Zhai, Yuexiang and Bai, Hao and Lin, Zipeng and Pan, Jiayi and Tong, Shengbang and Zhou, Yifei and Suhr, Alane and Xie, Saining and LeCun, Yann and Ma, Yi and Levine, Sergey},
+ date = {2024-05-16},
+ eprint = {2405.10292},
+ eprinttype = {arXiv},
+ eprintclass = {cs},
+ doi = {10.48550/arXiv.2405.10292},
+ url = {http://arxiv.org/abs/2405.10292},
+ urldate = {2024-07-01},
+ abstract = {Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7b models to outperform commercial models such as GPT4-V or Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement, as removing the CoT reasoning results in a significant decrease in the overall performance of our method.},
+ pubstate = {prepublished},
+ keywords = {Computer Science - Artificial Intelligence,Computer Science - Computation and Language,Computer Science - Computer Vision and Pattern Recognition,Computer Science - Machine Learning},
+ file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2024/Fine-Tuning Large Vision-Language Models as Decision-Making Agents via (2024) - Zhai et al.pdf;/Users/alexandercai/Zotero/storage/2X2WJU4D/2405.html}
+}
+
+@online{zhang_adaptable_2024,
+ title = {Adaptable {{Logical Control}} for {{Large Language Models}}},
+ author = {Zhang, Honghua and Kung, Po-Nien and Yoshida, Masahiro and family=Broeck, given=Guy Van, prefix=den, useprefix=false and Peng, Nanyun},
+ date = {2024-06-19},
+ eprint = {2406.13892},
+ eprinttype = {arXiv},
+ eprintclass = {cs},
+ doi = {10.48550/arXiv.2406.13892},
+ url = {http://arxiv.org/abs/2406.13892},
+ urldate = {2024-07-01},
+ abstract = {Despite the success of Large Language Models (LLMs) on various tasks following human instructions, controlling model generation at inference time poses a persistent challenge. In this paper, we introduce Ctrl-G, an adaptable framework that facilitates tractable and flexible control of LLM generation to reliably follow logical constraints. Ctrl-G combines any production-ready LLM with a Hidden Markov Model, enabling LLM outputs to adhere to logical constraints represented as deterministic finite automata. We show that Ctrl-G, when applied to a TULU2-7B model, outperforms GPT3.5 and GPT4 on the task of interactive text editing: specifically, for the task of generating text insertions/continuations following logical constraints, Ctrl-G achieves over 30\% higher satisfaction rate in human evaluation compared to GPT4. When applied to medium-size language models (e.g., GPT2-large), Ctrl-G also beats its counterparts for constrained generation by large margins on standard benchmarks. Additionally, as a proof-of-concept study, we experiment Ctrl-G on the Grade School Math benchmark to assist LLM reasoning, foreshadowing the application of Ctrl-G, as well as other constrained generation approaches, beyond traditional language generation tasks.},
+ pubstate = {prepublished},
+ keywords = {Computer Science - Computation and Language},
+ file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2024/Adaptable Logical Control for Large Language Models (2024) - Zhang, Kung, Yoshida, Broeck, Peng.pdf;/Users/alexandercai/Zotero/storage/38W8T74Y/2406.html}
+}
+
+@online{zhang_deep_2015,
+ title = {Deep Learning with {{Elastic Averaging SGD}}},
+ author = {Zhang, Sixin and Choromanska, Anna and LeCun, Yann},
+ date = {2015-10-25},
+ eprint = {1412.6651},
+ eprinttype = {arXiv},
+ eprintclass = {cs, stat},
+ doi = {10.48550/arXiv.1412.6651},
+ url = {http://arxiv.org/abs/1412.6651},
+ urldate = {2024-07-01},
+ abstract = {We study the problem of stochastic optimization for deep learning in the parallel computing environment under communication constraints. A new algorithm is proposed in this setting where the communication and coordination of work among concurrent processes (local workers), is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration, i.e. the algorithm allows the local variables to fluctuate further from the center variable by reducing the amount of communication between local workers and the master. We empirically demonstrate that in the deep learning setting, due to the existence of many local optima, allowing more exploration can lead to the improved performance. We propose synchronous and asynchronous variants of the new algorithm. We provide the stability analysis of the asynchronous variant in the round-robin scheme and compare it with the more common parallelized method ADMM. We show that the stability of EASGD is guaranteed when a simple stability condition is satisfied, which is not the case for ADMM. We additionally propose the momentum-based version of our algorithm that can be applied in both synchronous and asynchronous settings. Asynchronous variant of the algorithm is applied to train convolutional neural networks for image classification on the CIFAR and ImageNet datasets. Experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to DOWNPOUR and other common baseline approaches and furthermore is very communication efficient.},
+ pubstate = {prepublished},
+ keywords = {Computer Science - Machine Learning,Statistics - Machine Learning},
+ file = {/Users/alexandercai/Library/CloudStorage/GoogleDrive-alexcai@college.harvard.edu/My Drive/Vault/papers/assets/2015/Deep learning with Elastic Averaging SGD (2015) - Zhang, Choromanska, LeCun.pdf;/Users/alexandercai/Zotero/storage/M4LFKVWK/1412.html}
+}