Commit

Update codes
ZhiqingXiao committed Feb 25, 2024
1 parent ef940e5 commit 58d6e93
Showing 11 changed files with 669 additions and 57 deletions.
2 changes: 2 additions & 0 deletions en2023/abbreviation.md
@@ -42,6 +42,7 @@
| GAIL | Generative Adversarial Imitation Learning |
| GAN | Generative Adversarial Network |
| GP | Gaussian Process |
| GPT | Generative Pre-trained Transformer |
| GPU | Graphics Processing Unit |
| HRL | Hierarchical Reinforcement Learning |
| IL | Imitation Learning |
@@ -78,6 +79,7 @@
| RAM | Random Access Memory |
| ReLU | Rectified Linear Unit |
| RL | Reinforcement Learning |
| RLHF | Reinforcement Learning with Human Feedback |
| SAC | Soft Actor–Critic |
| SARSA | State-Action-Reward-State-Action |
| SGD | Stochastic Gradient Descent |
2 changes: 2 additions & 0 deletions en2023/abbreviation_zh.md
@@ -42,6 +42,7 @@
| GAIL | 生成对抗模仿学习 | Generative Adversarial Imitation Learning |
| GAN | 生成对抗网络 | Generative Adversarial Network |
| GP | Gaussian过程 | Gaussian Process |
| GPT | 生成性预变换模型 | Generative Pre-trained Transformer |
| GPU | 图形处理器 | Graphics Processing Unit |
| HRL | 分层强化学习 | Hierarchical Reinforcement Learning |
| IL | 模仿学习 | Imitation Learning |
@@ -78,6 +79,7 @@
| RAM | 随机存取存储器 | Random Access Memory |
| ReLU | 修正线性单元 | Rectified Linear Unit |
| RL | 强化学习 | Reinforcement Learning |
| RLHF | 人类反馈强化学习 | Reinforcement Learning with Human Feedback |
| SAC | 柔性执行者/评论者算法 | Soft Actor–Critic |
| SARSA | 状态/动作/奖励/状态/动作 | State-Action-Reward-State-Action |
| SGD | 随机梯度下降 | Stochastic Gradient Descent |
109 changes: 109 additions & 0 deletions en2023/notation.md
@@ -0,0 +1,109 @@
# Notation

### General rules

- Upper-case letters denote random events or random variables, while lower-case letters denote deterministic events or deterministic variables.
- The serif typeface, such as $X$, denotes numerical values. The sans-serif typeface, such as $\mathsfit{X}$, denotes events in general, which may or may not be numerical.
- Bold letters denote vectors (such as $\mathbf{w}$) or matrices (such as $\mathbf{F}$); matrices are always upper-case, even when they are deterministic.
- Calligraphic letters, such as $\mathcal{X}$, denote sets.
- Fraktur letters, such as $\mathfrak{f}$, denote mappings.
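
As an illustration of these rules (a sketch written with standard MDP definitions, not an equation quoted verbatim from the book), the Bellman expectation equation for state values combines several of the conventions above:

$$v_ \pi(\mathsfit{s}) = \sum_ {\mathsfit{a} \in \mathcal{A}} \pi(\mathsfit{a} \mid \mathsfit{s}) \sum_ {\mathsfit{s}' \in \mathcal{S}} \sum_ {r \in \mathcal{R}} p(\mathsfit{s}', r \mid \mathsfit{s}, \mathsfit{a}) \left[ r + \gamma v_ \pi(\mathsfit{s}') \right]$$

Here the lower-case serif $v_ \pi$ is a deterministic numerical value, the sans-serif $\mathsfit{s}$ and $\mathsfit{a}$ are events (a state and an action), the calligraphic $\mathcal{A}$, $\mathcal{S}$, and $\mathcal{R}$ are sets, and $\gamma$, $p$, and $\pi$ are listed in the table below.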

### Table

The following table lists the notations used throughout the book. Some sections occasionally define additional notations locally; those local definitions take precedence.

| English Letters | Description |
| :---: | --- |
| $A$, $a$ | advantage |
| $\mathsfit{A}$, $\mathsfit{a}$ | action |
| $\mathcal{A}$ | action space |
| $B$, $b$ | baseline in policy gradient; numerical belief in partially observable tasks; (lower case only) bonus; behavior policy in off-policy learning |
| $\mathsfit{B}$, $\mathsfit{b}$ | belief in partially observable tasks |
| $\mathfrak{B}_ \pi$, $\mathfrak{b}_ \pi$ | Bellman expectation operator of policy $\pi$ (upper case only used in distributional RL) |
| $\mathfrak{B}_ \ast$, $\mathfrak{b}_ \ast$ | Bellman optimal operator (upper case only used in distributional RL) |
| $\mathcal{B}$ | a batch of transitions sampled by experience replay; belief space in partially observable tasks |
| $\mathcal{B}^+$ | belief space with terminal belief in partially observable tasks |
| $c$ | counting; coefficients in linear programming |
| $\text{Cov}$ | covariance |
| $d$, $d_ \infty$ | metrics |
| $d_ f$ | $f$-divergence |
| $d_ \text{KL}$ | KL divergence |
| $d_ \text{JS}$ | JS divergence |
| $d_ \text{TV}$ | total variation |
| $D_ t$ | indicator of episode end |
| $\mathcal{D}$ | set of experience |
| $\mathrm{e}$ | the constant $\mathrm{e}$ ( $\approx2.72$ ) |
| $e$ | eligibility trace |
| $\text{E}$ | expectation |
| $\mathfrak{f}$ | a mapping |
| $\mathbf{F}$ | Fisher information matrix |
| $G$, $g$ | return |
| $\mathbf{g}$ | gradient vector |
| $h$ | action preference |
| $\text{H}$ | entropy |
| $k$ | index of iteration |
| $\ell$ | loss |
| $\mathbb{N}$ | set of natural numbers |
| $o$ | observation probability in partially observable tasks; infinitesimal in asymptotic notations |
| $O$, $\tilde{O}$ | asymptotic upper bound in asymptotic notations |
| $\mathsfit{O}$, $\mathsfit{o}$ | observation |
| $\mathcal{O}$ | observation space |
| $p$ | probability; dynamics |
| $\mathbf{P}$ | transition matrix |
| $\Pr$ | probability |
| $Q$, $q$ | action value |
| $Q_ \pi$, $q_ \pi$ | action value of policy $\pi$ (upper case only used in distributional RL) |
| $Q_ \ast$, $q_ \ast$ | optimal action values (upper case only used in distributional RL) |
| $\mathbf{q}$ | vector representation of action values |
| $R$, $r$ | reward |
| $\mathcal{R}$ | reward space |
| $\mathbb{R}$ | set of real numbers |
| $\mathsfit{S}$, $\mathsfit{s}$ | state |
| $\mathcal{S}$ | state space |
| $\mathcal{S}^+$ | state space with terminal state |
| $T$ | steps in an episode |
| $\mathsfit{T}$, $\Tiny\mathsfit{T}$ | trajectory |
| $\mathcal{T}$ | time index set |
| $\mathfrak{u}$ | belief update operator in partially observable tasks |
| $U$, $u$ | TD target; (lower case only) upper bound |
| $V$, $v$ | state value |
| $V_ \pi$, $v_ \pi$ | state value of the policy $\pi$ (upper case only used in distributional RL) |
| $V_ \ast$, $v_ \ast$ | optimal state values (upper case only used in distributional RL) |
| $\mathbf{v}$ | vector representation of state values |
| $\text{Var}$ | variance |
| $\mathbf{w}$ | parameters of value function estimate |
| $\mathsfit{X}$, $\mathsfit{x}$ | an event |
| $\mathcal{X}$ | event space |
| $\mathbf{z}$ | parameters for eligibility trace |
| **Greek Letters** | **Description** |
| $\alpha$ | learning rate |
| $\beta$ | reinforcement strength in eligibility traces; distortion function in distributional RL |
| $\gamma$ | discount factor |
| $\mathit\Delta$, $\delta$ | TD error |
| $\varepsilon$ | parameters for exploration |
| $\eta$ | state visitation frequency |
| $\boldsymbol\upeta$ | vector representation of state visitation frequency |
| $\lambda$ | decay strength of eligibility trace |
| $\boldsymbol\uptheta$ | parameters for policy function estimates |
| $\vartheta$ | threshold for value iteration |
| $\uppi$ | the constant $\uppi$ ( $\approx3.14$ ) |
| $\mathit\Pi$, $\pi$ | policy |
| $\pi_ \ast$ | optimal policy |
| $\pi_ \text{E}$ | expert policy in imitation learning |
| $\rho$ | state–action visitation frequency; importance sampling ratio in off-policy learning |
| $\phi$ | quantile |
| $\boldsymbol\uprho$ | vector representation of state–action visitation frequency |
| $\huge\tau$, $\tau$ | sojourn time of SMDP |
| $\mathit\Psi$ | Generalized Advantage Estimate (GAE) |
| $\mathit\Omega$, $\omega$ | accumulated probability in distributional RL; (lower case only) conditional probability for partially observable tasks |
| **Other Notations** | **Description** |
| $\stackrel{\text{a.e.}}{=}$ | equal almost everywhere |
| $\stackrel{\text{d}}{=}$ | share the same distribution |
| $\stackrel{\text{def}}{=}$ | define |
| $\lt$, $\le$, $\ge$, $\gt$ | compare numbers; element-wise comparison |
| $\prec$, $\preccurlyeq$, $\succcurlyeq$, $\succ$ | partial order comparison |
| $\ll$ | absolutely continuous |
| $\varnothing$ | empty set |
| $\nabla$ | gradient |
| $\sim$ | follows a distribution; utility equivalence in distributional RL |
| $\left\|\quad\right\|$ | absolute value of a real number; element-wise absolute values of a vector or a matrix; the number of elements in a set |
109 changes: 109 additions & 0 deletions en2023/notation_zh.md
@@ -0,0 +1,109 @@
# 《强化学习:原理与Python实现》数学记号

### 一般规律

- 大写是随机事件或随机变量,小写是确定性事件或确定性变量。
- 衬线体(如Times New Roman字体,如 $X$ )是数值,非衬线体(如Open Sans字体,如 $\mathsfit{X}$ )则不一定是数值。
- 粗体是向量(如 $\mathbf{w}$ )或矩阵(如 $\mathbf{F}$ )(矩阵用大写,即使是确定量也是如此)。
- 花体(如 $\mathcal{X}$ )是集合。
- 哥特体(如 $\mathfrak{f}$ )是映射。

### 数学记号表

下表列出常用记号。部分小节会有局部定义的记号,以该局部定义为准。

| 英语字母 | 含义 | 英文含义 |
| :---: | --- | --- |
| $A$, $a$ | 优势 | advantage |
| $\mathsfit{A}$, $\mathsfit{a}$ | 动作 | action |
| $\mathcal{A}$ | 动作空间 | action space |
| $B$, $b$ | 策略梯度算法中的基线;部分可观测任务中的数值化信念;(仅小写)额外量;异策学习时的行为策略 | baseline in policy gradient; numerical belief in partially observable tasks; (lower case only) bonus; behavior policy in off-policy learning |
| $\mathsfit{B}$, $\mathsfit{b}$ | 部分可观测任务中的信念 | belief in partially observable tasks |
| $\mathfrak{B}_ \pi$, $\mathfrak{b}_ \pi$ | 策略 $\pi$ 的Bellman期望算子(大写只用于值分布学习) | Bellman expectation operator of policy $\pi$ (upper case only used in distributional RL) |
| $\mathfrak{B}_ \ast$, $\mathfrak{b}_ \ast$ | Bellman最优算子(大写只用于值分布学习) | Bellman optimal operator (upper case only used in distributional RL) |
| $\mathcal{B}$ | 经验回放中抽取的一批经验;部分可观测任务中的信念空间 | a batch of transitions sampled by experience replay; belief space in partially observable tasks |
| $\mathcal{B}^+$ | 部分可观测任务中带终止信念的信念空间 | belief space with terminal belief in partially observable tasks |
| $c$ | 计数值;线性规划的目标系数 | counting; coefficients in linear programming |
| $\text{Cov}$ | 协方差 | covariance |
| $d$, $d_ \infty$ | 度量 | metrics |
| $d_ f$ | $f$散度 | $f$-divergence |
| $d_ \text{KL}$ | KL散度 | KL divergence |
| $d_ \text{JS}$ | JS散度 | JS divergence |
| $d_ \text{TV}$ | 全变差 | total variation |
| $D_ t$ | 回合结束指示 | indicator of episode end |
| $\mathcal{D}$ | 经验集 | set of experience |
| $\mathrm{e}$ | 自然常数 | the constant $\mathrm{e}$ ( $\approx2.72$ ) |
| $e$ | 资格迹 | eligibility trace |
| $\text{E}$ | 期望 | expectation |
| $\mathfrak{f}$ | 一般的映射 | a mapping |
| $\mathbf{F}$ | Fisher信息矩阵 | Fisher information matrix |
| $G$, $g$ | 回报 | return |
| $\mathbf{g}$ | 梯度向量 | gradient vector |
| $h$ | 动作偏好 | action preference |
| $\text{H}$ | 熵 | entropy |
| $k$ | 迭代次数指标 | index of iteration |
| $\ell$ | 损失 | loss |
| $\mathbb{N}$ | 自然数集 | set of natural numbers |
| $o$ | 部分可观测环境的观测概率;渐近无穷小 | observation probability in partially observable tasks; infinitesimal in asymptotic notations |
| $O$, $\tilde{O}$ | 渐近无穷大 | infinite in asymptotic notations |
| $\mathsfit{O}$, $\mathsfit{o}$ | 观测 | observation |
| $\mathcal{O}$ | 观测空间 | observation space |
| $p$ | 概率值,动力 | probability, dynamics |
| $\mathbf{P}$ | 转移矩阵 | transition matrix |
| $\Pr$ | 概率 | probability |
| $Q$, $q$ | 动作价值 | action value |
| $Q_ \pi$, $q_ \pi$ | 策略 $\pi$ 的动作价值(大写只用于值分布学习) | action value of policy $\pi$ (upper case only used in distributional RL) |
| $Q_ \ast$, $q_ \ast$ | 最优动作价值(大写只用于值分布学习) | optimal action values (upper case only used in distributional RL) |
| $\mathbf{q}$ | 动作价值的向量表示 | vector representation of action values |
| $R$, $r$ | 奖励 | reward |
| $\mathcal{R}$ | 奖励空间 | reward space |
| $\mathbb{R}$ | 实数集 | set of real numbers |
| $\mathsfit{S}$, $\mathsfit{s}$ | 状态 | state |
| $\mathcal{S}$ | 状态空间 | state space |
| $\mathcal{S}^+$ | 带终止状态的状态空间 | state space with terminal state |
| $T$ | 回合步数 | steps in an episode |
| $\mathsfit{T}$, $\Tiny\mathsfit{T}$ | 轨迹 | trajectory |
| $\mathcal{T}$ | 时间指标 | time index set |
| $\mathfrak{u}$ | 部分可观测任务中的信念更新算子 | belief update operator in partially observable tasks |
| $U$, $u$ | 用自益得到的回报估计随机变量;小写的$u$还表示置信上界 | TD target; (lower case only) upper bound |
| $V$, $v$ | 状态价值 | state value |
| $V_ \pi$, $v_ \pi$ | 策略 $\pi$ 的状态价值(大写只用于值分布学习) | state value of the policy $\pi $ (upper case only used in distributional RL) |
| $V_ \ast$, $v_ \ast$ | 最优状态价值(大写只用于值分布学习) | optimal state values (upper case only used in distributional RL) |
| $\mathbf{v}$ | 状态价值的向量表示 | vector representation of state values |
| $\text{Var}$ | 方差 | variance |
| $\mathbf{w}$ | 价值估计参数 | parameters of value function estimate |
| $\mathsfit{X}$, $\mathsfit{x}$ | 一般的事件 | an event |
| $\mathcal{X}$ | 一般的事件空间 | event space |
| $\mathbf{z}$ | 资格迹参数 | parameters for eligibility trace |
| **希腊字母** | **含义** | **英文含义** |
| $\alpha$ | 学习率 | learning rate |
| $\beta$ | 资格迹算法中的强化强度;值分布学习中的扭曲函数 | reinforcement strength in eligibility traces; distortion function in distributional RL |
| $\gamma$ | 折扣因子 | discount factor |
| $\mathit\Delta$, $\delta$ | 时序差分误差 | TD error |
| $\varepsilon$ | 探索参数 | parameters for exploration |
| $\eta$ | 状态访问频次 | state visitation frequency |
| $\boldsymbol\upeta$ | 状态访问频次的向量表示 | vector representation of state visitation frequency |
| $\lambda$ | 资格迹衰减强度 | decay strength of eligibility trace |
| $\boldsymbol\uptheta$ | 策略估计参数 | parameters for policy function estimates |
| $\vartheta$ | 价值迭代终止阈值 | threshold for value iteration |
| $\uppi$ | 圆周率 | the constant $\uppi$ ( $\approx3.14$ ) |
| $\mathit\Pi$, $\pi$ | 策略 | policy |
| $\pi_ \ast$ | 最优策略 | optimal policy |
| $\pi_ \text{E}$ | 模仿学习中的专家策略 | expert policy in imitation learning |
| $\rho$ | 状态动作对访问频次;异策算法中的重要性采样比率 | state–action visitation frequency; importance sampling ratio in off-policy learning |
| $\phi$ | 分位数 | quantile |
| $\boldsymbol\uprho$ | 状态动作对访问频次的向量表示 | vector representation of state–action visitation frequency |
| $\huge\tau$, $\tau$ | 半Markov决策过程中的逗留时间 | sojourn time of SMDP |
| $\mathit\Psi$ | 扩展的优势估计 | Generalized Advantage Estimate (GAE) |
| $\mathit\Omega$, $\omega$ | 值分布学习中的累积概率;(仅小写)部分可观测任务中的条件概率 | accumulated probability in distributional RL; (lower case only) conditional probability for partially observable tasks |
| **其他符号** | **含义** | **英文含义** |
| $\stackrel{\text{a.e.}}{=}$ | 几乎处处相等 | equal almost everywhere |
| $\stackrel{\text{d}}{=}$ | 分布相同 | share the same distribution |
| $\stackrel{\text{def}}{=}$ | 定义 | define |
| $\lt$, $\le$, $\ge$, $\gt$ | 普通数值比较;向量逐元素比较 | compare numbers; element-wise comparison |
| $\prec$, $\preccurlyeq$, $\succcurlyeq$, $\succ$ | 偏序关系 | partial order comparison |
| $\ll$ | 绝对连续 | absolutely continuous |
| $\varnothing$ | 空集 | empty set |
| $\nabla$ | 梯度 | gradient |
| $\sim$ | 服从分布;效用相同 | follows a distribution; utility equivalence in distributional RL |
| $\left\|\quad\right\|$ | 实数的绝对值;向量或矩阵的逐元素求绝对值;集合的元素个数 | absolute value of a real number; element-wise absolute values of a vector or a matrix; the number of elements in a set |
8 changes: 6 additions & 2 deletions en2023/setup/setupmac.md
@@ -10,7 +10,7 @@ This part will show how to set up a minimum environment. After this step, you ar

**Steps:**

- Download the installer from https://www.anaconda.com/products/distribution (pick the macOS Graphical Installer). The installer name is similar to `Anaconda3-2022.10-MacOSX-x86_64.pkg` (or `Anaconda3-2022.10-MacOSX-arm64.pkg` for Apple Silicon M-series chips), and the size is about 0.7 GB (or 0.5 GB for the M-series version).
- Download the installer from https://www.anaconda.com/products/distribution (pick the macOS Graphical Installer). The installer name is similar to `Anaconda3-2023.09-0-MacOSX-x86_64.pkg` (or `Anaconda3-2023.09-0-MacOSX-arm64.pkg` for Apple Silicon M-series chips), and the size is about 0.6 GB.
- Double-click the installer to start the install wizard and follow it to complete the installation. The disk should have at least 13 GB of free space. (With less free space, you may still be able to install Anaconda 3 itself, but you may not have enough room for the follow-up steps; 13 GB is the storage requirement for all steps in this article.) Record the location of the Anaconda installation. The default location is `/opt/anaconda3`. We will use this location in the sequel.
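
As an optional sanity check (a sketch assuming the default install location mentioned above; adjust the path if you installed elsewhere), you can verify the installation from a terminal:
```
# Confirm that conda is available and list the existing environments
/opt/anaconda3/bin/conda --version
/opt/anaconda3/bin/conda info --envs
```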

#### Create a New Conda Environment
@@ -144,10 +144,14 @@ Please install the latest version of Xcode from AppStore.

The code in Chapter 15 uses PyBullet. You can skip this part if you are not interested in the code in Chapter 15.

This part shows how to install PyBullet on top of the environment with PyTorch and/or TensorFlow from Part 2. Once completed, you will be able to run the code in Chapters 1-9, 12-13, and 15-16. If you complete all of Parts 3.1-3.3, you will be able to run the code in all chapters.
Since PyBullet depends on an old version of Gym, it is better to install it in a new conda environment so that it does not pollute the current conda environment. A combined sketch of these steps is given after the list below.

**Steps:**

- Create a new conda environment.

- Install packages such as Gym in the new environment.

- Execute the following command in the target conda environment:
```
pip install --upgrade pybullet
```
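
A minimal combined sketch of these steps (the environment name and the Gym version pin are illustrative assumptions, not taken from the book; adjust them to your setup):
```
# Create and activate a dedicated conda environment so that the old Gym does not affect the main one
conda create --name pybulletenv python=3.8
conda activate pybulletenv
# Install an older Gym release that PyBullet's bundled environments expect, then PyBullet itself
pip install gym==0.25.2
pip install --upgrade pybullet
```
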
10 changes: 7 additions & 3 deletions en2023/setup/setupmac_zh.md
@@ -10,7 +10,7 @@

**步骤:**

-https://www.anaconda.com/products/distribution 下载Anaconda 3安装包(选择MacOS Graphical版的安装包)。安装包名字像 `Anaconda3-2023.07-2-MacOSX-x86_64.pkg`(M芯片版安装包名字像`Anaconda3-2023.07-2-MacOSX-amd64.pkg`),大小约0.7 GB(Mx芯片版约0.5 GB)
-https://www.anaconda.com/products/distribution 下载Anaconda 3安装包(选择MacOS Graphical版的安装包)。安装包名字像 `Anaconda3-2023.09-0-MacOSX-x86_64.pkg`(M芯片版安装包名字像`Anaconda3-2023.09-0-MacOSX-amd64.pkg`),大小约0.6 GB。
- 双击安装包启动安装向导完成安装。需要安装在剩余空间大于13GB的硬盘上。(如果空间小于这个数,虽然也能完成Anaconda 3的安装,但是后续步骤的空间就不够了。13GB是后续所有步骤需要的空间。)安装过程中记下Anaconda的安装路径。默认路径为:`/opt/anaconda3`。后续操作会用到这个路径。

#### 新建conda环境
@@ -97,7 +97,7 @@

第10-11章代码需要用到`gym[box2d]`,这个部分安装`gym[box2d]`。如果您不想看这两章代码,可以略过此步,不影响其他部分。

该安装需要基于第2部分装好的环境。完成此步后可以运行第1-13章和第16章代码。完成了全部的第3.1-3.3部分后可以运行所有章节的代码。
该安装需要基于第2部分装好的环境。完成此步后可以运行第1-13章和第16章代码。

请从App Store里安装Xcode。

@@ -147,10 +147,14 @@

第15章代码需要PyBullet,本部分安装PyBullet。如果您不想看这章代码,可以略过此步,不影响其他部分。

本步骤可基于第2部分安装好的环境。完成此步后可以运行第1-9章、第12-13章、第15-16章代码。完成了全部的第3.1-3.3部分后可以运行所有章节的代码
由于PyBullet需要用到旧版的Gym,所以最好为PyBullet单独建一个环境,以免污染现有环境

**步骤:**

- 新建Anaconda环境。

- 在新Anaconda环境里安装Gym等。

- 在目标conda环境中执行下列命令:
```
pip install --upgrade pybullet
```