# Notation

### General rules

- Upper-case letters denote random events or random variables, while lower-case letters denote deterministic events or deterministic variables.
- The serif typeface, such as $X$, denotes numerical values. The sans-serif typeface, such as $\mathsfit{X}$, denotes events in general, which may or may not be numerical.
- Bold letters denote vectors (such as $\mathbf{w}$) or matrices (such as $\mathbf{F}$); matrices are always upper-case, even when they are deterministic.
- Calligraphic letters, such as $\mathcal{X}$, denote sets.
- Fraktur letters, such as $\mathfrak{f}$, denote mappings.
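For reference, here is a minimal LaTeX sketch of how these typeface conventions can be reproduced. The package choices (`upgreek` for upright Greek such as $\uppi$, `isomath` for the italic sans-serif $\mathsfit{X}$) are one possible setup, not necessarily the one used to typeset the book.

```latex
% Minimal sketch of the typeface conventions above (one possible setup;
% the book's own sources may use different packages or macros).
\documentclass{article}
\usepackage{amsmath,amssymb} % \mathfrak, \mathcal, general math support
\usepackage{upgreek}         % upright Greek letters such as \uppi, \uptheta
\usepackage{isomath}         % \mathsfit: italic sans-serif math alphabet
\begin{document}
Random number $X$ versus deterministic value $x$;
general, possibly non-numerical event $\mathsfit{X}$;
vector $\mathbf{w}$ and matrix $\mathbf{F}$ (bold, matrices upper-case);
set $\mathcal{X}$; mapping $\mathfrak{f}$;
the constant $\uppi \approx 3.14$ typeset upright.
\end{document}
```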
### Table

The following table lists the notation used throughout the book. A few sections occasionally use locally defined notation, which takes precedence there.
| English Letters | Description |
| :---: | --- |
| $A$, $a$ | advantage |
| $\mathsfit{A}$, $\mathsfit{a}$ | action |
| $\mathcal{A}$ | action space |
| $B$, $b$ | baseline in policy gradient algorithms; numerical belief in partially observable tasks; (lower case only) bonus; behavior policy in off-policy learning |
| $\mathsfit{B}$, $\mathsfit{b}$ | belief in partially observable tasks |
| $\mathfrak{B}_ \pi$, $\mathfrak{b}_ \pi$ | Bellman expectation operator of policy $\pi$ (upper case used only in distributional RL) |
| $\mathfrak{B}_ \ast$, $\mathfrak{b}_ \ast$ | Bellman optimality operator (upper case used only in distributional RL) |
| $\mathcal{B}$ | a batch of transitions sampled in experience replay; belief space in partially observable tasks |
| $\mathcal{B}^+$ | belief space with the terminal belief in partially observable tasks |
| $c$ | count; objective coefficients in linear programming |
| $\text{Cov}$ | covariance |
| $d$, $d_ \infty$ | metrics |
| $d_ f$ | $f$-divergence |
| $d_ \text{KL}$ | KL divergence |
| $d_ \text{JS}$ | JS divergence |
| $d_ \text{TV}$ | total variation |
| $D_ t$ | indicator of episode end |
| $\mathcal{D}$ | set of experiences |
| $\mathrm{e}$ | the constant $\mathrm{e}$ ( $\approx2.72$ ) |
| $e$ | eligibility trace |
| $\text{E}$ | expectation |
| $\mathfrak{f}$ | a mapping |
| $\mathbf{F}$ | Fisher information matrix |
| $G$, $g$ | return |
| $\mathbf{g}$ | gradient vector |
| $h$ | action preference |
| $\text{H}$ | entropy |
| $k$ | index of iteration |
| $\ell$ | loss |
| $\mathbb{N}$ | set of natural numbers |
| $o$ | observation probability in partially observable tasks; infinitesimal in asymptotic notation |
| $O$, $\tilde{O}$ | asymptotic upper bounds |
| $\mathsfit{O}$, $\mathsfit{o}$ | observation |
| $\mathcal{O}$ | observation space |
| $p$ | probability; dynamics |
| $\mathbf{P}$ | transition matrix |
| $\Pr$ | probability |
| $Q$, $q$ | action value |
| $Q_ \pi$, $q_ \pi$ | action value of policy $\pi$ (upper case used only in distributional RL) |
| $Q_ \ast$, $q_ \ast$ | optimal action values (upper case used only in distributional RL) |
| $\mathbf{q}$ | vector representation of action values |
| $R$, $r$ | reward |
| $\mathcal{R}$ | reward space |
| $\mathbb{R}$ | set of real numbers |
| $\mathsfit{S}$, $\mathsfit{s}$ | state |
| $\mathcal{S}$ | state space |
| $\mathcal{S}^+$ | state space with the terminal state |
| $T$ | number of steps in an episode |
| $\mathsfit{T}$, $\Tiny\mathsfit{T}$ | trajectory |
| $\mathcal{T}$ | time index set |
| $\mathfrak{u}$ | belief update operator in partially observable tasks |
| $U$, $u$ | TD target; (lower case only) upper confidence bound |
| $V$, $v$ | state value |
| $V_ \pi$, $v_ \pi$ | state value of policy $\pi$ (upper case used only in distributional RL) |
| $V_ \ast$, $v_ \ast$ | optimal state values (upper case used only in distributional RL) |
| $\mathbf{v}$ | vector representation of state values |
| $\text{Var}$ | variance |
| $\mathbf{w}$ | parameters of the value function estimate |
| $\mathsfit{X}$, $\mathsfit{x}$ | an event |
| $\mathcal{X}$ | event space |
| $\mathbf{z}$ | parameters for eligibility traces |
| **Greek Letters** | **Description** |
| $\alpha$ | learning rate |
| $\beta$ | reinforcement strength in eligibility traces; distortion function in distributional RL |
| $\gamma$ | discount factor |
| $\mathit\Delta$, $\delta$ | TD error |
| $\varepsilon$ | parameters for exploration |
| $\eta$ | state visitation frequency |
| $\boldsymbol\upeta$ | vector representation of state visitation frequency |
| $\lambda$ | decay strength of eligibility traces |
| $\boldsymbol\uptheta$ | parameters of policy function estimates |
| $\vartheta$ | threshold for value iteration |
| $\uppi$ | the constant $\uppi$ ( $\approx3.14$ ) |
| $\mathit\Pi$, $\pi$ | policy |
| $\pi_ \ast$ | optimal policy |
| $\pi_ \text{E}$ | expert policy in imitation learning |
| $\rho$ | state–action visitation frequency; importance sampling ratio in off-policy learning |
| $\phi$ | quantile |
| $\boldsymbol\uprho$ | vector representation of state–action visitation frequency |
| $\huge\tau$, $\tau$ | sojourn time in an SMDP |
| $\mathit\Psi$ | generalized advantage estimate (GAE) |
| $\mathit\Omega$, $\omega$ | accumulated probability in distributional RL; (lower case only) conditional probability in partially observable tasks |
| **Other Notations** | **Description** |
| $\stackrel{\text{a.e.}}{=}$ | equal almost everywhere |
| $\stackrel{\text{d}}{=}$ | equal in distribution |
| $\stackrel{\text{def}}{=}$ | defined as |
| $\lt$, $\le$, $\ge$, $\gt$ | numerical comparison; element-wise comparison |
| $\prec$, $\preccurlyeq$, $\succcurlyeq$, $\succ$ | partial-order comparison |
| $\ll$ | absolutely continuous |
| $\varnothing$ | empty set |
| $\nabla$ | gradient |
| $\sim$ | follows a distribution; utility equivalence in distributional RL |
| $\left\|\quad\right\|$ | absolute value of a real number; element-wise absolute value of a vector or matrix; number of elements in a set |
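
To show how a few of these symbols combine in a formula, the block below writes the one-step semi-gradient TD update for a parametric state-value estimate. This is a standard textbook form included only as an illustration of the notation; the book's own definitions and time indexing may differ.

```latex
% One-step semi-gradient TD update, written with symbols from the table above.
% Illustrative standard form only; not necessarily the book's exact convention.
\documentclass{article}
\usepackage{amsmath}
\usepackage{isomath} % \mathsfit for the (possibly non-numerical) state symbol
\begin{document}
\begin{align*}
  \delta_t &= U_t - v(\mathsfit{S}_t; \mathbf{w})
    && \text{TD error: TD target minus the current value estimate} \\
  \mathbf{w} &\leftarrow \mathbf{w}
    + \alpha \, \delta_t \, \nabla v(\mathsfit{S}_t; \mathbf{w})
    && \text{update of the value parameters with learning rate } \alpha
\end{align*}
\end{document}
```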