Update documentation

adzcai committed Sep 2, 2024
1 parent 3dd1cd2 commit e3fceb6
Showing 26 changed files with 829 additions and 680 deletions.
Binary file added _images/npg_line.png
2 changes: 1 addition & 1 deletion _sources/exploration.md
@@ -75,7 +75,7 @@ Performance of explore-then-exploit

We also explored the exploration-exploitation tradeoff in the chapter on {ref}`bandits`. Recall that in the MAB setting, we have $K$ arms, each of which has an unknown reward distribution, and we want to learn which of the arms is *optimal*, i.e. has the highest mean reward.

One algorithm that struck a good balance between exploration and exploitation was the **upper confidence bound** algorithm {ref}`ucb`: For each arm, we construct a *confidence interval* for its true mean reward, and then choose the arm with the highest upper confidence bound. In summary, $$k_{t+1} \gets \arg\max_{k \in [K]} \frac{R^{k}_t}{N^{k}_t} + \sqrt{\frac{\ln(2t/\delta)}{2 N^{k}_t}}$$ where $N_t^k$ indicates the number of times arm $k$ has been pulled up until time $t$, $R_t^k$ indicates the total reward obtained by pulling arm $k$ up until time $t$, and $\delta > 0$ controls the width of the confidence interval. How might we extend UCB to the MDP case?
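
As a concrete reminder of the bandit rule, here is a minimal sketch of the UCB arm selection above; the array names (`totals`, `counts`) and the assumption that every arm has already been pulled at least once are illustrative, not from the text.

```python
import numpy as np

def ucb_choice(totals: np.ndarray, counts: np.ndarray, t: int, δ: float) -> int:
    """Pick the next arm by the UCB rule above.

    totals[k] plays the role of R_t^k and counts[k] of N_t^k;
    assumes every arm has been pulled at least once (counts > 0).
    """
    means = totals / counts
    bonus = np.sqrt(np.log(2 * t / δ) / (2 * counts))
    return int(np.argmax(means + bonus))
```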

Let us formally describe an unknown MDP as an MAB problem. In an unknown MDP, we want to learn which *policy* is optimal. So if we want to apply MAB techniques to solving an MDP, it makes sense to think of *arms* as *policies*. There are $K = (|\mathcal{A}|^{|\mathcal{S}|})^\hor$ deterministic policies in a finite MDP. Then, "pulling" arm $\pi$ corresponds to using $\pi$ to act through a trajectory in the MDP, and observing the total reward.
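
For concreteness (a made-up toy instance), an MDP with $|\mathcal{S}| = 2$ states, $|\mathcal{A}| = 2$ actions, and horizon $\hor = 3$ already has $K = (2^2)^3 = 64$ deterministic policies, so treating each policy as an arm quickly becomes intractable.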

104 changes: 1 addition & 103 deletions _sources/fitted_dp.md
@@ -366,106 +366,4 @@ def fitted_policy_iteration(
return π
```

(supervised_learning)=
## Supervised learning

This section will cover the details of implementing the `fit` function above:
That is, how to use a dataset of labelled samples $(x_1, y_1), \dots, (x_N, y_N)$ to find a function $f$ that minimizes the empirical risk.
This requires two ingredients:

1. A **function class** $\mathcal{F}$ to search over
2. A **fitting method** for minimizing the empirical risk over this class

The two main function classes we will cover are **linear models** and **neural networks**.
Both of these function classes are *parameterized* by some parameters $\theta$,
and the fitting method will search over these parameters to minimize the empirical risk:

:::{prf:definition} Parameterized empirical risk minimization
:label: parameterized_empirical_risk_minimization

Given a dataset of samples $(x_1, y_1), \dots, (x_N, y_N)$ and a class of functions $\mathcal{F}$ parameterized by $\theta$,
we want to find a parameter (vector) $\hat \theta$ that minimizes the empirical risk:

$$
\hat \theta = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^N (y_i - f_\theta(x_i))^2
$$
:::

The most common fitting method for parameterized models is **gradient descent**.

:::{prf:algorithm} Gradient descent
Letting $L(\theta) \in \mathbb{R}$ denote the empirical risk in terms of the parameters,
the gradient descent algorithm updates the parameters according to the rule

$$
\theta^{t+1} = \theta^t - \eta \nabla_\theta L(\theta^t)
$$

where $\eta > 0$ is the **learning rate**.
:::

```{code-cell}
Params = Float[Array, " D"]
def gradient_descent(
loss: Callable[[Params], float],
θ_init: Params,
η: float,
epochs: int,
):
"""
Run gradient descent to minimize the given loss function
(expressed in terms of the parameters).
"""
θ = θ_init
for _ in range(epochs):
θ = θ - η * grad(loss)(θ)
return θ
```

### Linear regression

In linear regression, we assume that the function $f$ is linear in the parameters:

$$
\mathcal{F} = \{ x \mapsto \theta^\top x \mid \theta \in \mathbb{R}^D \}
$$

This function class is extremely simple and only contains linear functions.
To expand its expressivity, we can _transform_ the input $x$ using some feature function $\phi$,
i.e. $\widetilde x = \phi(x)$, and then fit a linear model in the transformed space instead.

```{code-cell}
def fit_linear(X: Float[Array, "N D"], y: Float[Array, " N"], φ=lambda x: x):
"""Fit a linear model to the given dataset using ordinary least squares."""
X = vmap(φ)(X)
θ = np.linalg.lstsq(X, y, rcond=None)[0]
return lambda x: np.dot(φ(x), θ)
```

### Neural networks

In neural networks, we assume that the function $f$ is a composition of linear functions (represented by matrices $W_i$) and non-linear activation functions (denoted by $\sigma$):

$$
\mathcal{F} = \{ x \mapsto \sigma(W_L \sigma(W_{L-1} \dots \sigma(W_1 x + b_1) \dots + b_{L-1}) + b_L) \}
$$

where $W_i \in \mathbb{R}^{D_{i+1} \times D_i}$ and $b_i \in \mathbb{R}^{D_{i+1}}$ are the parameters of the $i$-th layer, and $\sigma$ is the activation function.

This function class is much more expressive and contains many more parameters.
This makes it more susceptible to overfitting on smaller datasets,
but also allows it to represent more complex functions.
In practice, however, neural networks exhibit interesting phenomena during training,
and are often able to generalize well even with many parameters.

Another reason for their popularity is the efficient **backpropagation** algorithm
for computing the gradient of the empirical risk with respect to the parameters.
Essentially, the hierarchical structure of the neural network, i.e. computing the
output of the network as a composition of functions, allows us to use the chain rule
to compute the gradient of the output with respect to the parameters of each layer.

{cite}`nielsen_neural_2015` provides a comprehensive introduction to neural networks and backpropagation.

## Bias correction for Q-learning
## Summary
43 changes: 24 additions & 19 deletions _sources/index.md
@@ -128,25 +128,30 @@ We will extend ideas from multi-armed bandits to the MDP setting.

## Notation

We will use the following notation throughout the book. This notation is
inspired by {cite}`sutton_reinforcement_2018` and {cite}`agarwal_reinforcement_2022`.

| Notation | Definition |
|:-------------:|:--------------------------|
| $s$ | A state. |
| $a$ | An action. |
| $r$ | A reward. |
| $p$ | A probability. |
| $\pi$ | A policy. |
| $V$ | A value function. |
| $Q$ | An action-value function. |
| $A$ | An advantage function. |
| $\gamma$ | A discount factor. |
| $\tau$ | A trajectory. |
| $\mathcal{S}$ | A state space. |
| $\mathcal{A}$ | An action space. |

Note that throughout the text, certain symbols will stand for either random variables or fixed values. We aim to clarify in ambiguous settings.
We will use the following notation throughout the book.
This notation is inspired by {cite}`sutton_reinforcement_2018` and {cite}`agarwal_reinforcement_2022`.
We use $[N]$ as shorthand for the set $\{ 0, 1, \dots, N-1 \}$.

| Element | Space | Definition (of element) |
|:------------:|:------------------------:|:--------------------------|
| $s$ | $\mathcal{S}$ | A state. |
| $a$ | $\mathcal{A}$ | An action. |
| $r$ | | A reward. |
| $\gamma$ | | A discount factor. |
| $\tau$ | $\mathcal{T}$ | A trajectory. |
| $\pi$ | $\Pi$ | A policy. |
| $V^\pi$ | $\mathcal{S} \to \mathbb{R}$ | The value function of policy $\pi$. |
| $Q^\pi$ | $\mathcal{S} \times \mathcal{A} \to \mathbb{R}$ | The action-value function (a.k.a. Q function) of policy $\pi$. |
| $A^\pi$ | | The advantage function of policy $\pi$. |
| | $\triangle(\mathcal{X})$ | A distribution supported on $\mathcal{X}$. |
| $\mu$ | $\triangle(\mathcal{S})$ | A distribution over states. |
| $\hi$ | $[\hor]$ | Time horizon index of an MDP. |
| $k$ | $[K]$ | Arm index of a multi-armed bandit. |
| $t$ | $[T]$ | Iteration index of an algorithm. |
| $\theta$ | $\Theta$ | A set of parameters. |

Note that throughout the text, certain symbols will stand for either random variables or fixed values.
We aim to clarify in ambiguous settings.

+++

20 changes: 5 additions & 15 deletions _sources/pg.md
@@ -847,21 +847,11 @@

We can think of the space of such distributions as the line segment from $(0, 1)$ to $(1, 0)$ on the Cartesian plane:

```{code-cell}
# Coordinates of the points
x = [0, 1]
y = [1, 0]
# Plotting the line
plt.plot(x, y, marker='o')
plt.title("Line between (0, 1) and (1, 0)")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.grid(True)
# Display the plot
plt.show()
```
:::{image} shared/npg_line.png
:alt: a line from (0, 1) to (1, 0)
:width: 240px
:align: center
:::

Clearly the optimal distribution is the one that places all probability mass on action $1$, i.e. $\pi(1) = 1$. Suppose we optimize over the parameterized family $\pi_\theta(1) = \frac{\exp(\theta)}{1+\exp(\theta)}$.
Then our optimization algorithm should set $\theta$ to be unboundedly large.
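
As a quick numerical sanity check (a minimal sketch; the particular values of $\theta$ are arbitrary), evaluating $\pi_\theta(1)$ for growing $\theta$ shows that it only approaches the optimum asymptotically:

```python
import numpy as np

for θ in [0.0, 2.0, 5.0, 10.0]:
    π_1 = np.exp(θ) / (1 + np.exp(θ))  # π_θ(1) under the softmax parameterization
    print(f"θ = {θ:5.1f}  ->  π_θ(1) = {π_1:.6f}")
```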
8 changes: 8 additions & 0 deletions _sources/planning.md
@@ -1,4 +1,12 @@



+++

+++

+++

(planning)=
# Planning

120 changes: 120 additions & 0 deletions _sources/supervised_learning.md
@@ -0,0 +1,120 @@
---
jupytext:
text_representation:
extension: .md
format_name: myst
format_version: 0.13
jupytext_version: 1.16.2
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---

(supervised_learning)=
# Supervised learning

This section will cover the details of implementing the `fit` function used in the fitted dynamic programming chapter:
That is, how to use a dataset of labelled samples $(x_1, y_1), \dots, (x_N, y_N)$ to find a function $f$ that minimizes the empirical risk.
This requires two ingredients:

1. A **function class** $\mathcal{F}$ to search over
2. A **fitting method** for minimizing the empirical risk over this class

The two main function classes we will cover are **linear models** and **neural networks**.
Both of these function classes are *parameterized* by some parameters $\theta$,
and the fitting method will search over these parameters to minimize the empirical risk:

:::{prf:definition} Parameterized empirical risk minimization
:label: parameterized_empirical_risk_minimization

Given a dataset of samples $(x_1, y_1), \dots, (x_N, y_N)$ and a class of functions $\mathcal{F}$ parameterized by $\theta$,
we want to find a parameter (vector) $\hat \theta$ that minimizes the empirical risk:

$$
\hat \theta = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^N (y_i - f_\theta(x_i))^2
$$
:::

The most common fitting method for parameterized models is **gradient descent**.

:::{prf:algorithm} Gradient descent
Letting $L(\theta) \in \mathbb{R}$ denote the empirical risk in terms of the parameters,
the gradient descent algorithm updates the parameters according to the rule

$$
\theta^{t+1} = \theta^t - \eta \nabla_\theta L(\theta^t)
$$

where $\eta > 0$ is the **learning rate**.
:::

```{code-cell}
:tags: [hide-input]
from jaxtyping import Float, Array
from collections.abc import Callable
from jax import grad, vmap
import jax.numpy as np  # assumed alias; the cells below only use np.linalg.lstsq and np.dot
```

```{code-cell}
Params = Float[Array, " D"]
def gradient_descent(
loss: Callable[[Params], float],
θ_init: Params,
η: float,
epochs: int,
):
"""
Run gradient descent to minimize the given loss function
(expressed in terms of the parameters).
"""
θ = θ_init
for _ in range(epochs):
θ = θ - η * grad(loss)(θ)
return θ
```
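
As a usage sketch (the quadratic loss, its minimizer, and the hyperparameters below are made-up illustrations, not from the text), minimizing a simple convex loss shows the interface:

```python
import jax.numpy as jnp

# Toy quadratic loss with minimum at θ = (1, -2).
target = jnp.array([1.0, -2.0])
loss = lambda θ: jnp.sum((θ - target) ** 2)

θ_hat = gradient_descent(loss, θ_init=jnp.zeros(2), η=0.1, epochs=100)
# θ_hat is approximately [1.0, -2.0]
```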

## Linear regression

In linear regression, we assume that the function $f$ is linear in the parameters:

$$
\mathcal{F} = \{ x \mapsto \theta^\top x \mid \theta \in \mathbb{R}^D \}
$$

This function class is extremely simple and only contains linear functions.
To expand its expressivity, we can _transform_ the input $x$ using some feature function $\phi$,
i.e. $\widetilde x = \phi(x)$, and then fit a linear model in the transformed space instead.

```{code-cell}
def fit_linear(X: Float[Array, "N D"], y: Float[Array, " N"], φ=lambda x: x):
"""Fit a linear model to the given dataset using ordinary least squares."""
X = vmap(φ)(X)
θ = np.linalg.lstsq(X, y, rcond=None)[0]
return lambda x: np.dot(φ(x), θ)
```
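
For example (a hypothetical usage sketch; the quadratic feature map and the noiseless toy data are assumptions), we can fit a quadratic by choosing $\phi(x) = (1, x, x^2)$:

```python
import jax.numpy as jnp

# Noiseless toy data generated from y = 2 + 3x - x^2.
x = jnp.linspace(-1.0, 1.0, 20)
X = x[:, None]  # shape (N, 1)
y = 2.0 + 3.0 * x - x**2

# Quadratic feature map applied to a single input of shape (1,).
φ = lambda x: jnp.array([1.0, x[0], x[0] ** 2])

f_hat = fit_linear(X, y, φ)
f_hat(jnp.array([0.5]))  # approximately 2 + 1.5 - 0.25 = 3.25
```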

## Neural networks

In neural networks, we assume that the function $f$ is a composition of linear functions (represented by matrices $W_i$) and non-linear activation functions (denoted by $\sigma$):

$$
\mathcal{F} = \{ x \mapsto \sigma(W_L \sigma(W_{L-1} \dots \sigma(W_1 x + b_1) \dots + b_{L-1}) + b_L) \}
$$

where $W_i \in \mathbb{R}^{D_{i+1} \times D_i}$ and $b_i \in \mathbb{R}^{D_{i+1}}$ are the parameters of the $i$-th layer, and $\sigma$ is the activation function.
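
To make the shapes concrete, here is a minimal sketch of a two-layer network of this form; the layer widths, ReLU activation, and random initialization are illustrative assumptions.

```python
import jax.numpy as jnp
from jax import random

σ = lambda z: jnp.maximum(z, 0.0)  # ReLU activation

def mlp(params, x):
    """Forward pass σ(W_2 σ(W_1 x + b_1) + b_2), matching the formula above with L = 2."""
    (W1, b1), (W2, b2) = params
    h = σ(W1 @ x + b1)
    return σ(W2 @ h + b2)

key1, key2 = random.split(random.PRNGKey(0))
params = [
    (random.normal(key1, (16, 4)), jnp.zeros(16)),  # layer 1: R^4 -> R^16
    (random.normal(key2, (1, 16)), jnp.zeros(1)),   # layer 2: R^16 -> R^1
]
mlp(params, jnp.ones(4))  # output of shape (1,)
```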

This function class is much more expressive and contains many more parameters.
This makes it more susceptible to overfitting on smaller datasets,
but also allows it to represent more complex functions.
In practice, however, neural networks exhibit interesting phenomena during training,
and are often able to generalize well even with many parameters.

Another reason for their popularity is the efficient **backpropagation** algorithm for computing the gradient of the empirical risk with respect to the parameters.
Essentially, the hierarchical structure of the neural network,
i.e. computing the output of the network as a composition of functions,
allows us to use the chain rule to compute the gradient of the output with respect to the parameters of each layer.
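
Continuing the sketch above (the toy data and squared-error risk here are assumptions), JAX's `grad` performs exactly this chain-rule computation, returning gradients with the same nested structure as the parameters:

```python
from jax import grad
import jax.numpy as jnp

X_toy = jnp.ones((8, 4))
y_toy = jnp.ones((8, 1))

def empirical_risk(params):
    preds = jnp.stack([mlp(params, x) for x in X_toy])
    return jnp.mean((preds - y_toy) ** 2)

# Backpropagation: the chain rule applied layer by layer, giving
# one gradient array per weight matrix and bias vector.
grads = grad(empirical_risk)(params)
```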

{cite}`nielsen_neural_2015` provides a comprehensive introduction to neural networks and backpropagation.