<!DOCTYPE html>
<html>
<head>
<title>Ch. 20 - Model-Free Policy Search</title>
<meta name="Ch. 20 - Model-Free Policy Search" content="text/html; charset=utf-8;" />
<link rel="canonical" href="http://underactuated.mit.edu/rl_policy_search.html" />
<script src="https://hypothes.is/embed.js" async></script>
<script type="text/javascript" src="chapters.js"></script>
<script type="text/javascript" src="htmlbook/book.js"></script>
<script src="htmlbook/mathjax-config.js" defer></script>
<script type="text/javascript" id="MathJax-script" defer
src="htmlbook/MathJax/es5/tex-chtml.js">
</script>
<script>window.MathJax || document.write('<script type="text/javascript" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml.js" defer><\/script>')</script>
<link rel="stylesheet" href="htmlbook/highlight/styles/default.css">
<script src="htmlbook/highlight/highlight.pack.js"></script> <!-- http://highlightjs.readthedocs.io/en/latest/css-classes-reference.html#language-names-and-aliases -->
<script>hljs.initHighlightingOnLoad();</script>
<link rel="stylesheet" type="text/css" href="htmlbook/book.css" />
</head>
<body onload="loadChapter('underactuated');">
<div data-type="titlepage">
<header>
<h1><a href="index.html" style="text-decoration:none;">Underactuated Robotics</a></h1>
<p data-type="subtitle">Algorithms for Walking, Running, Swimming, Flying, and Manipulation</p>
<p style="font-size: 18px;"><a href="http://people.csail.mit.edu/russt/">Russ Tedrake</a></p>
<p style="font-size: 14px; text-align: right;">
© Russ Tedrake, 2022<br/>
Last modified <span id="last_modified"></span>.<br/>
<script>
var d = new Date(document.lastModified);
document.getElementById("last_modified").innerHTML = d.getFullYear() + "-" + (d.getMonth()+1) + "-" + d.getDate();</script>
<a href="misc.html">How to cite these notes, use annotations, and give feedback.</a><br/>
</p>
</header>
</div>
<p><b>Note:</b> These are working notes used for <a
href="http://underactuated.csail.mit.edu/Spring2022/">a course being taught
at MIT</a>. They will be updated throughout the Spring 2022 semester. <a
href="https://www.youtube.com/channel/UChfUOAhz7ynELF-s_1LPpWg">Lecture videos are available on YouTube</a>.</p>
<table style="width:100%;"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=state_estimation.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=drake.html>Next Chapter</a></td>
</tr></table>
<script type="text/javascript">document.write(notebook_header('rl_policy_search'))
</script>
<!-- EVERYTHING ABOVE THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<chapter style="counter-reset: chapter 19"><h1>Model-Free Policy Search</h1>
<p>Reinforcement learning (RL) is a collection of algorithms for solving the
same optimal control problem that we've focused on through the text, but the
real gems from the RL literature are the algorithms for (almost) <i>black-box
optimization</i> of stochastic optimal control problems. The idea of an
algorithm that only has a "black-box" interface to the optimization problem
means that it can obtain (potentially noisy) samples of the optimal cost via
trial and error, but does not have access to the underlying model, and does
not have direct access to complete gradient information.</p>
<p>This is a hard problem! In general we cannot expect RL algorithms to
optimize as quickly as more structured optimization, and we can typically only
guarantee convergence to a local optimum at best. But the framework is
extremely general, so it can be applied to problems that are inaccessible to
any of the other algorithms we've examined so far. My favorite examples for
reinforcement learning are control in complex fluid dynamics (e.g.
<elib>Roberts09a</elib>). These systems are often very difficult to model, or
the model is so high-dimensional and complicated that it's prohibitive for
control design. In these problems, it might actually be faster to optimize via
trial-and-error in physical experiments. </p>
<p>In this chapter we will examine one particular style of reinforcement
learning that attempts to find good controllers by explicitly parameterizing a
family of policies (e.g. via a parameter vector $\alpha$), then searching
directly for the parameters that optimize the long-term cost. For a
stochastic optimal control problem with our favorite additive form of the
objective, this might look like: \begin{equation} \min_\alpha E \left[
\sum_{n=0}^N
\ell(\bx[n],\bu[n]) \right]\end{equation} where the random variables are drawn
from
probability densities \begin{gather*}\bx[0] \sim p_0(\bx),\\ \bx[n] \sim
p(\bx[n] | \bx[n-1], \bu[n-1]), \\ \bu[n] \sim p_\alpha(\bu[n] |
\bx[n]).\end{gather*} The last equation is a probabilistic representation of
the control policy -- on each time-step actions $\bu$ are drawn from a
distribution that is conditioned on the current state, $\bx$.</p>
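<p>To make this concrete, here is a minimal (hypothetical) sketch of estimating the
objective above by Monte Carlo rollouts; the functions <code>sample_x0</code>,
<code>dynamics</code>, <code>policy</code>, and <code>cost</code> are placeholder
stand-ins for $p_0$, $p(\bx[n]|\bx[n-1],\bu[n-1])$, $p_\alpha(\bu[n]|\bx[n])$, and
$\ell$.</p>
<pre><code class="language-python">
import numpy as np

rng = np.random.default_rng(0)

def sample_x0():                          # x[0] ~ p_0(x)
    return rng.normal(size=2)

def dynamics(x, u):                       # x[n] ~ p(x[n] | x[n-1], u[n-1])
    return x + 0.1 * u + 0.01 * rng.normal(size=x.shape)

def policy(alpha, x):                     # u[n] ~ p_alpha(u[n] | x[n]): linear mean plus noise
    return alpha @ x + 0.1 * rng.normal(size=alpha.shape[0])

def cost(x, u):                           # l(x[n], u[n])
    return x @ x + 0.1 * u @ u

def rollout_cost(alpha, N=50):            # one sample of sum_n l(x[n], u[n])
    x, total = sample_x0(), 0.0
    for n in range(N + 1):
        u = policy(alpha, x)
        total += cost(x, u)
        x = dynamics(x, u)
    return total

alpha = np.zeros((2, 2))                  # policy parameters
print(np.mean([rollout_cost(alpha) for _ in range(100)]))   # Monte Carlo estimate of the objective
</code></pre>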
<p>Of course, the controls community has investigated ideas like this, too,
for instance under the umbrellas of <i>extremum-seeking control</i> and
<i>iterative learning control</i>. I'll try to make connections whenever
possible.</p>
<section><h1>Policy Gradient Methods</h1>
<p>One of the standard approaches to policy search in RL is to estimate the
gradient of the expected long-term cost with respect to the policy
parameters by evaluating some number of sample trajectories, then performing
(stochastic) gradient descent. Many of these so-called "policy gradient"
algorithms leverage a derivation called the <em>likelihood ratio method</em>
that was perhaps first described in <elib>Glynn90</elib> then popularized in
the REINFORCE algorithm <elib>Williams92</elib>. It is based on what
looks like a trick with logarithms to estimate the gradient; I feel like
this trick is often presented with an air of mystery about it. Let's try to
make sure we understand it.</p>
<subsection><h1>The Likelihood Ratio Method (aka
REINFORCE)</h1>
<p>Let's start with a simpler optimization over a stochastic function:
$$\min_\alpha E\left[ g(\bx) \right] \text{ with } \bx \sim p_\alpha(\bx)
$$ I hope the notation is clear? $\bx$ is a random vector, drawn from
the distribution $p_\alpha(\bx)$, with the subscript indicating that the
distribution depends on the parameter vector $\alpha$. What is the
gradient of this function? The REINFORCE derivation is \begin{align*}
\frac{\partial}{\partial \alpha} E \left[ g(\bx) \right] =&
\frac{\partial}{\partial \alpha} \int d\bx ~ g(\bx) p_\alpha(\bx) \\ =&
\int d\bx ~ g(\bx) \frac{\partial}{\partial \alpha} p_\alpha(\bx) \\ =&
\int d\bx ~ g(\bx) p_\alpha(\bx) \frac{\partial}{\partial \alpha} \log
p_\alpha(\bx) \\ =& E\left[ g(\bx) \frac{\partial}{\partial \alpha} \log
p_\alpha(\bx) \right]. \end{align*} To achieve this, we used the
derivative of the logarithm: \begin{gather*} y = \log u \\ \frac{\partial
y}{\partial x} = \frac{1}{u} \frac{\partial u}{\partial x} \end{gather*}
to write $$\frac{\partial}{\partial \alpha} p_\alpha(\bx) = p_\alpha(\bx)
\frac{\partial}{\partial \alpha} \log p_\alpha(\bx).$$ This suggests a
simple Monte Carlo algorithm for estimating the policy gradient: draw $N$
random samples $\bx_i$ then estimate the gradient as $$
\frac{\partial}{\partial \alpha} E \left[ g(\bx) \right] \approx
\frac{1}{N} \sum_i g(\bx_i) \frac{\partial}{\partial \alpha} \log
p_\alpha(\bx_i).$$</p>
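<p>Here is a minimal sketch of that estimator for the particular case $\bx \sim
N(\balpha, \sigma^2 {\bf I})$, where $\frac{\partial}{\partial \balpha} \log
p_{\balpha}(\bx) = \frac{1}{\sigma^2}(\bx - \balpha)^T$. The quadratic $g()$ is a
hypothetical stand-in for a black-box cost; note that it is only ever sampled, never
differentiated.</p>
<pre><code class="language-python">
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1

def g(x):                            # black-box cost: only sampled, never differentiated
    return 0.5 * x @ x

def reinforce_gradient(alpha, num_samples=10000):
    grad = np.zeros_like(alpha)
    for i in range(num_samples):
        x = alpha + sigma * rng.normal(size=alpha.shape)   # x_i ~ p_alpha(x)
        dlogp = (x - alpha) / sigma**2                     # the score function
        grad += g(x) * dlogp
    return grad / num_samples

alpha = np.array([1.0, -2.0])
# high variance: even 10,000 samples gives only a rough estimate of dg/dalpha = alpha
print(reinforce_gradient(alpha))
</code></pre>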
<p>This trick is even more potent in the optimal control case. For a
finite-horizon problem we have \begin{align*} \frac{\partial}{\partial
\alpha} E \left[ \sum_{n=0}^N \ell(\bx[n], \bu[n]) \right] =& \int
d\bx[\cdot] d\bu[\cdot] \left[ \sum_{n=0}^N \ell(\bx[n],\bu[n]) \right]
\frac{\partial}{\partial \alpha} p_\alpha(\bx[\cdot], \bu[\cdot]) \\ =&
E\left[ \left(\sum_{n=0}^N \ell(\bx[n],\bu[n])\right)
\frac{\partial}{\partial \alpha} \log p_\alpha(\bx[\cdot],\bu[\cdot])
\right], \end{align*} where $\bx[\cdot]$ is shorthand for the entire
trajectory $\bx[0], ..., \bx[N]$, and $$ p_\alpha(\bx[\cdot],\bu[\cdot]) =
p_0(\bx[0]) \left(\prod_{n=1}^N p\left(\bx[n] \Big{|} \bx[n-1],\bu[n-1]\right) \right)
\left(\prod_{n=0}^N p_\alpha(\bu[n] \Big{|} \bx[n]) \right).$$ Taking the $\log$
we have $$\log p_\alpha\left(\bx[\cdot],\bu[\cdot]\right) = \log p_0(\bx[0]) +
\sum_{n=1}^N \log p(\bx[n] \Big{|} \bx[n-1],\bu[n-1]) + \sum_{n=0}^N \log
p_\alpha(\bu[n] \Big{|} \bx[n]).$$ Only the last terms depend on $\alpha$,
which yields \begin{align*} \frac{\partial}{\partial \alpha} E \left[
\sum_{n=0}^N \ell(\bx[n], \bu[n]) \right] &= E\left[ \left( \sum_{n=0}^N
\ell(\bx[n],\bu[n]) \right) \left(\sum_{k=0}^N \frac{\partial}{\partial
\alpha} \log p_\alpha\left(\bu[k]\Big{|}\bx[k]\right) \right) \right] \\ &= E\left[
\sum_{n=0}^N \left( \ell(\bx[n],\bu[n]) \sum_{k=0}^n
\frac{\partial}{\partial \alpha} \log p_\alpha(\bu[k]\Big{|}\bx[k]) \right)
\right], \end{align*} where the last step uses the fact that
$E\left[\ell(\bx[n],\bu[n])\frac{\partial}{\partial \alpha} \log
p_\alpha(\bu[k]\Big{|}\bx[k]) \right] = 0$ for all $k > n$.</p>
<p>This update should surprise you. It says that I can find the gradient
of the long-term cost by taking only the gradient of the policy... but not
the gradient of the plant, nor the cost! The intuition is that one can
obtain the gradient by evaluating the policy along a number of (random)
trajectory roll-outs of the closed-loop system, evaluating the cost on
each, then increasing the probability in the policy of taking the actions
correlated with lower long-term costs.</p>
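<p>A short sketch of that recipe, for a hypothetical problem with a Gaussian policy
whose mean is linear in the state. It uses the simpler (first) form of the identity
above, multiplying the total cost of each rollout by the summed score functions of
the policy; the dynamics and cost are treated as black boxes.</p>
<pre><code class="language-python">
import numpy as np

rng = np.random.default_rng(0)
sigma_u, N = 0.1, 50

def dynamics(x, u):                        # treated as a black box by the algorithm
    return x + 0.1 * u + 0.01 * rng.normal(size=x.shape)

def cost(x, u):                            # instantaneous cost l(x, u)
    return x @ x + 0.1 * u @ u

def rollout_gradient(alpha):               # one-sample estimate of the policy gradient
    x = rng.normal(size=2)                 # x[0] ~ p_0(x)
    total_cost, sum_score = 0.0, np.zeros_like(alpha)
    for n in range(N + 1):
        mean = alpha @ x
        u = mean + sigma_u * rng.normal(size=mean.shape)    # u[n] ~ p_alpha(u[n] | x[n])
        total_cost += cost(x, u)
        # d/dalpha log p_alpha(u | x) for a Gaussian policy with mean alpha @ x:
        sum_score += np.outer(u - mean, x) / sigma_u**2
        x = dynamics(x, u)
    return total_cost * sum_score

# average many (noisy) one-sample estimates, then take a gradient step:
grad = np.mean([rollout_gradient(np.zeros((2, 2))) for _ in range(200)], axis=0)
</code></pre>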
<p>This derivation is often presented as <b>the</b> policy gradient
derivation. The identity is certainly correct, but I'd prefer that you
think of this as just one way to obtain the policy gradient. It's a
particularly clever one in that it uses exactly the information that we
have available in reinforcement learning -- we have access to the
instantaneous costs, $\ell(\bx[n], \bu[n])$, and to the policy -- so we are
allowed to take gradients of the policy -- but it does not require any
understanding whatsoever of the plant model. Although it's clever, it is
not particularly efficient -- the Monte Carlo approximations of the
expected value have a high variance, so many samples are required for an
accurate estimate. Other derivations are possible, some even simpler, and
others that <b>do</b> make use of the plant gradients if you have access
to them, and these will perform differently in the finite-sample
approximations.</p>
<p>For the remainder of this section, I'd like to dig in and try to
understand the nature of the stochastic update a little better, to help
you understand what I mean. </p>
</subsection>
<subsection><h1>Sample efficiency</h1>
<p>Let's take a step back and think more generally about how one can use
gradient descent in black-box (unconstrained) optimization. Imagine you
have a simple optimization problem: $$\min_\alpha g(\alpha),$$ and you can
directly evaluate $g()$, but not $\frac{\partial g}{\partial \alpha}$.
How can you perform gradient descent?</p>
<p>A standard technique for estimating the gradients, in lieu of
analytical gradient information, is the method of finite
differences<elib>Press92</elib>. The finite differences approach to
estimating the gradient involves making the same small perturbation,
$\epsilon$ to the input parameters in every dimension independently, and
using: $$\pd{g}{\alpha_i} \approx \frac{g(\balpha+{\bf \epsilon}_i) -
g(\alpha)}{\epsilon},$$ where ${\bf \epsilon_i}$ is the column vector with
$\epsilon$ in the $i$th row, and zeros everywhere else. Finite difference
methods can be computationally very expensive, requiring $n+1$ evaluations
of the function for every gradient step, where $n$ is the length of the
input vector.</p>
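<p>For reference, here is a minimal sketch of the finite-difference estimator, with a
trivial quadratic standing in for the expensive black-box evaluation of $g()$:</p>
<pre><code class="language-python">
import numpy as np

def g(alpha):                                  # hypothetical expensive black-box evaluation
    return 0.5 * alpha @ alpha

def finite_difference_gradient(g, alpha, eps=1e-4):
    g0 = g(alpha)                              # the nominal evaluation, g(alpha)
    grad = np.zeros_like(alpha)
    for i in range(alpha.size):
        e_i = np.zeros_like(alpha)
        e_i[i] = eps                           # perturb only the i-th parameter
        grad[i] = (g(alpha + e_i) - g0) / eps  # one extra evaluation per dimension
    return grad

alpha = np.array([1.0, -2.0, 0.5])
print(finite_difference_gradient(g, alpha))    # approximately [1.0, -2.0, 0.5]
</code></pre>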
<p>What if each function evaluation is expensive? Perhaps it means
picking up a physical robot and running it for 10 seconds for each
evaluation of $g()$. Suddenly there is a premium on trying to optimize
the cost function with the fewest number of evaluations on $g()$. This is
the name of the game in reinforcement learning -- it's often called RL's
<i>sample complexity</i>. Can we perform gradient descent using less
evaluations?</p>
</subsection>
<subsection><h1>Stochastic Gradient Descent</h1>
<p>This leads us to the question of doing approximate gradient descent, or
"stochastic" gradient descent. If we think of the cost landscape as a
Lyapunov function†<sidenote>† the Lyapunov function $V
(\alpha) = g(\alpha) - g(\alpha^*)$ is commonly used in
convergence/stability analysis of optimization algorithms. <todo>cite e.g.
Ben Recht?</todo></sidenote>, then any update that moves downhill on each
step will eventually reach the optimum. More generally, any update that
moves downhill on average will eventually reach a minimum... and sometimes
"stochastic" gradient descent updates that occasionally go uphill but go
downhill on average can even have desirable properties like bouncing out
of small local minima. The figure below gives some graphical intuition;
for a formal treatment of stochastic gradient methods, see e.g.
<elib>Bertsekas99</elib>.</p>
<figure>
<img width="90%" src="figures/weight_perturbation_quadratic_a.jpg"/>
<figcaption>Stochastic gradient descent of a quadratic cost function.</figcaption>
</figure>
</subsection>
<subsection><h1>The Weight Perturbation Algorithm</h1>
<p>Instead of sampling each dimension independently, consider making a
single small random change, $\bbeta$, to the parameter vector, $\balpha$.
Now consider the update of the form: $$\Delta \balpha= -\eta [ g(\balpha +
\bbeta) - g(\balpha)]\bbeta.$$ The intuition here is that if
$g(\balpha+\bbeta) < g(\balpha)$, then $\bbeta$ was a good change and
we'll move in the direction of $\bbeta$. If the cost increased, then
we'll move in the opposite direction. Assuming the function is smooth and
$\bbeta$ is small, then we will always move downhill on the Lyapunov
function.</p>
<p>Even more interesting, on average, the update is actually in the
direction of the true gradient. To see this, again we can assume the
function is locally smooth and $\bbeta$ is small to write the Taylor
expansion: $$g(\balpha + \bbeta) \approx g(\balpha) +
\pd{g}{\balpha}\bbeta.$$ Now we have \begin{align*}\Delta \balpha
\approx& -\eta \left[ \pd{g}{\balpha} \bbeta \right] \bbeta = -\eta \bbeta
\bbeta^T \pd{g}{\balpha}^T \\ \avg{\Delta \balpha} \approx& -\eta
\avg{\bbeta \bbeta^T} \pd{g}{\balpha}^T \end{align*} If we select each
element of $\bbeta$ independently from a distribution with zero mean and
variance $\sigma_\beta^2$, or $\avg{ \beta_i } = 0, \avg{ \beta_i \beta_j
} = \sigma_\beta^2 \delta_{ij}$, then we have $$\avg{\Delta \balpha}
\approx -\eta \sigma_\beta^2 \pd{g}{\balpha}^T.$$ Note that the
distribution of $\bbeta$ need not be Gaussian, but it is the
variance of the distribution which determines the scaling on the
gradient.</p>
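<p>Here is the weight perturbation update in a few lines, again on a hypothetical
quadratic cost; each update needs only two evaluations of $g()$, independent of the
number of parameters:</p>
<pre><code class="language-python">
import numpy as np

rng = np.random.default_rng(0)
eta, sigma_beta = 0.5, 0.1

def g(alpha):                                    # hypothetical black-box cost
    return 0.5 * alpha @ alpha

alpha = np.array([1.0, -2.0])
for n in range(2000):
    beta = sigma_beta * rng.normal(size=alpha.shape)    # zero-mean perturbation
    alpha -= eta * (g(alpha + beta) - g(alpha)) * beta  # two evaluations per update
print(alpha)                                            # close to the minimum at the origin
</code></pre>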
</subsection>
<subsection><h1>Weight Perturbation with an Estimated
Baseline</h1>
<p>The weight perturbation update above requires us to evaluate the
function $g$ twice for every update of the parameters. This is low
compared to the method of finite differences, but let us see if we can do
better. What if, instead of evaluating the function twice per update
[$g(\balpha+\bbeta)$ and $g(\balpha)$], we replace the evaluation of
$g(\balpha)$ with an estimator, $b = \hat{g}(\balpha)$, obtained from
previous trials? In other words, consider an update of the form:
\begin{equation} \Delta \balpha= - \frac{\eta}{\sigma_\beta^2}
[ g(\balpha + \bbeta) - b]\bbeta
\label{eq:weight_perturbation_baseline} \end{equation} The estimator can
take any of a number of forms, but perhaps the simplest is based on the
update (after the $n$th trial): $$b[n+1] = \gamma g[n] + (1 -
\gamma)b[n],\quad b[0] = 0, 0\le\gamma\le1,$$ where $\gamma$ parameterizes
the moving average. Let's compute the expected value of the update, using
the same Taylor expansion that we used above: \begin{align*} E[\Delta
\balpha] =& -\frac{\eta}{\sigma_\beta^2} \avg{ [g(\balpha) +
\pd{g}{\balpha}\bbeta - b]\bbeta} \\ =& -\frac{\eta}{\sigma_\beta^2}
\avg{[g(\balpha)-b]\bbeta} -\frac{\eta}{\sigma_\beta^2} \avg{\bbeta
\bbeta^T}\pd{g}{\balpha}^T \\
=& - \eta \pd{g}{\balpha}^T \end{align*} In other words, the baseline does
not affect our basic result -- that the expected update is in the direction
of the gradient. Note that this computation works for any baseline
estimator which is uncorrelated with $\bbeta$, as it should be if the
estimator is a function of the performance in previous trials.</p>
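<p>A sketch of this baseline variant, using the moving-average estimator above so
that only a single (possibly noisy) evaluation of $g()$ is required per update; the
cost, noise level, and gains are hypothetical choices for illustration:</p>
<pre><code class="language-python">
import numpy as np

rng = np.random.default_rng(0)
eta, sigma_beta, gamma = 0.05, 0.1, 0.5

def g(alpha):                                    # noisy black-box evaluation of the cost
    return 0.5 * alpha @ alpha + 0.01 * rng.normal()

alpha = np.array([1.0, -2.0])
b = g(alpha)                                     # initialize the baseline with one evaluation
for n in range(2000):
    beta = sigma_beta * rng.normal(size=alpha.shape)
    y = g(alpha + beta)                          # a single evaluation of g per update
    alpha -= eta / sigma_beta**2 * (y - b) * beta
    b = gamma * y + (1 - gamma) * b              # moving-average baseline update
print(alpha)                                     # near the minimum at the origin, up to noise
</code></pre>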
<p>Although using an estimated baseline doesn't affect the average update,
it can have a dramatic effect on the performance of the algorithm.
As we will see in a later section, if the evaluation of $g$ is
stochastic, then the update with a baseline estimator can actually
outperform an update using straight function evaluations.</p>
<todo>
% Note: it would be very interesting to look at the performance of
% estimated baselines vs. perfect baselines (one has higher variance,
% but allows 2x the updates for the same number of function evaluations).
</todo>
<p>Let's take the extreme case of $b=0$. This seems like a very bad
idea... on each step we will move in the direction of every random
perturbation, but we will move more or less in that direction based on the
evaluated cost. On average, we will still move in the direction of the
true gradient, but only because we will eventually move more downhill than
we move uphill. It feels very naive.</p>
</subsection>
<subsection><h1>REINFORCE w/ additive Gaussian noise</h1>
<p>Now let's consider the simple form of the REINFORCE update:
\begin{align*} \frac{\partial}{\partial \alpha} E \left[ g(\bx) \right] =
E\left[ g(\bx) \frac{\partial}{\partial \alpha} \log p_\alpha(\bx)
\right]. \end{align*} It turns out that weight perturbation is a
REINFORCE algorithm. To see this, take $\bx = \balpha + \bbeta$, $\bbeta
\sim N({\bf 0},\sigma^2 {\bf I})$, to have \begin{gather*} p_{\balpha}(\bx) =
\frac{1}{(2\pi\sigma^2)^{N/2}} e^{\frac{-(\bx-\balpha)^T(\bx -
\balpha)}{2\sigma^2}} \\ \log p_{\balpha}(\bx) =
\frac{-(\bx-\balpha)^T(\bx - \balpha)}{2\sigma^2} + \text{terms that don't
depend on }\balpha \\ \pd{}{\balpha} \log p_{\balpha}(\bx) =
\frac{1}{\sigma^2} (\bx - \balpha)^T = \frac{1}{\sigma^2} \bbeta^T. \\
\end{gather*} If we use only a single trial for each Monte-Carlo
evaluation, then the REINFORCE update is $$\Delta \balpha =
-\frac{\eta}{\sigma^2} g(\balpha + \bbeta) \bbeta,$$ which is precisely
the weight perturbation update (the crazy version with $b=0$ discussed
above). Although it will move in the direction of the gradient on
average, it can be highly inefficient. In practice, people use many samples to estimate the policy gradient.</p>
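<p>Since the two updates are algebraically identical, the interesting comparison is
their spread. The short (hypothetical) check below samples both updates at a fixed
$\balpha$: they agree in expectation, but the $b=0$ version has dramatically larger
variance.</p>
<pre><code class="language-python">
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1
alpha = np.array([1.0, -2.0])

def g(x):                                      # hypothetical black-box cost
    return 0.5 * x @ x

betas = sigma * rng.normal(size=(10000, alpha.size))
# single-sample REINFORCE update (weight perturbation with b = 0), with eta = 1:
reinforce = np.array([-g(alpha + beta) * beta / sigma**2 for beta in betas])
# two-point weight perturbation update, also with eta = 1:
two_point = np.array([-(g(alpha + beta) - g(alpha)) * beta / sigma**2 for beta in betas])
print(reinforce.mean(axis=0), reinforce.std(axis=0))   # means agree (roughly -dg/dalpha) ...
print(two_point.mean(axis=0), two_point.std(axis=0))   # ... but the b = 0 spread is far larger
</code></pre>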
</subsection>
<subsection><h1>Summary</h1>
<p>The policy gradient "trick" from REINFORCE using log probabilities
provides one means of estimating the true policy gradient. It is not the
only way to obtain the policy gradient... in fact the trivial
weight-perturbation update obtains the same gradient for the case of a
policy with a mean linear in $\alpha$ and a fixed diagonal covariance.
Its cleverness lies in the fact that it makes use of the information we
do have (instantaneous cost values and the gradients of the policy), and
that it provides an unbiased estimate of the gradient (note that taking
the gradients of a model that is only slightly wrong might not have this
virtue). But its inefficiency comes from having a potentially very high
variance. Variance reduction for policy gradient, often through baseline
estimation, continues to be an active area of research.</p>
</subsection>
</section>
<todo>REINFORCE for LTI systems</todo>
<section><h1>Sample performance via the signal-to-noise
ratio.</h1>
<subsection><h1>Performance of Weight Perturbation</h1>
<p>The simplicity of the REINFORCE / weight perturbation updates makes it
tempting to apply them to problems of arbitrary complexity. But a major
concern for the algorithm is its performance - although we have shown that
the update is in the direction of the true gradient <i>on average</i>, it
may still require a prohibitive number of computations to obtain a local
minimum.</p>
<p>In this section, we will investigate the performance of the weight
perturbation algorithm by investigating its <i>signal-to-noise</i> ratio
(SNR). This idea was explored in <elib>Roberts09a</elib> -- my goal is
just to give you a taste here. The SNR is the ratio of the power in the
signal (here the desired update in the direction of the true gradient) and the
power in the noise (the remaining component of the update, so that
$\Delta\balpha = -\eta\pd{g}{\balpha}^T + $
noise)†<sidenote>† SNR could alternatively be defined as the
ratio of the power of the component of the update in the direction of the
signal and the component orthogonal to the signal (does not penalize the
magnitudes of the updates): $$\frac{E[a^2]}{E\left[\left|\Delta\balpha -
a{\bf v} \right|^2\right]},$$ where ${\bf v} =
\frac{\pd{g}{\balpha}^T}{\left|\pd{g}{\balpha}^T\right|},$ $a = \Delta
\balpha \cdot {\bf v}$. This form is probably a better definition, but is
more difficult to work with.</sidenote> $$\text{SNR} = \frac{\left|-\eta
\pd{g}{\balpha}^T\right|^2}{\avg{\left|\Delta \balpha + \eta
\pd{g}{\balpha}^T \right|^2}}.$$ In the special case of the unbiased
update, the equation reduces to: $$\text{SNR} =
\frac{\avg{\Delta\balpha}^T\avg{\Delta\balpha}}
{\avg{(\Delta\balpha)^T(\Delta\balpha)} -
\avg{\Delta\balpha}^T\avg{\Delta\balpha}}.$$</p>
<p>For the weight perturbation update we have: $$\avg{\Delta
\balpha}^T\avg{\Delta \balpha} = \eta^2 \pd{g}{\balpha} \pd{g}{\balpha}^T
= \eta^2 \sum_{i=1}^{N} \left( \pd{g}{\alpha_i} \right)^2,$$
\begin{align*} \avg{(\Delta \balpha)^T(\Delta \balpha)} =&
\frac{\eta^2}{\sigma_\beta^4} \avg{\left[g(\balpha+\bbeta) -
g(\balpha)\right]^2 \bbeta^T\bbeta} \\ \approx&
\frac{\eta^2}{\sigma_\beta^4} \avg{ \left[ \pd{g}{\balpha} \bbeta
\right]^2 \bbeta^T\bbeta } \\ =& \frac{\eta^2}{\sigma_\beta^4} \avg{
\left(\sum_i \pd{g}{\alpha_i}\beta_i\right) \left( \sum_j \pd{g}{\alpha_j}
\beta_j\right) \left(\sum_k \beta_k^2 \right) } \\ =&
\frac{\eta^2}{\sigma_\beta^4} \avg{ \sum_i \pd{g}{\alpha_i}\beta_i \sum_j
\pd{g}{\alpha_j} \beta_j \sum_k \beta_k^2 } \\ =&
\frac{\eta^2}{\sigma_\beta^4} \sum_{i,j,k} \pd{g}{\alpha_i}
\pd{g}{\alpha_j} \avg{ \beta_i \beta_j \beta_k^2 }\\ \avg{\beta_i \beta_j
\beta_k^2} = \begin{cases} 0 & i \neq j \\ \sigma_\beta^4 & i=j\neq k \\
\mu_4(\beta) & i = j =k \end{cases}& \\ =& \eta^2 (N-1) \sum_i \left(
\pd{g}{\alpha_i} \right)^2 + \frac{\eta^2 \mu_4(\beta_i)}{\sigma_\beta^4}
\sum_i \left(\pd{g}{\alpha_i}\right)^2 \end{align*} where $\mu_n(z)$ is
the $n$th central moment of $z$: $$\mu_n(z) = \avg{(z - \avg{z})^n}.$$
Putting it all together, we have: $$\text{SNR} = \frac{1}{N - 2 +
\frac{\mu_4(\beta_i)}{\sigma_\beta^4}}.$$</p>
<example><h1>Signal-to-noise ratio for additive Gaussian
noise</h1>
<p>For $\beta_i$ drawn from a Gaussian distribution, we have $\mu_1 = 0,
\mu_2 = \sigma^2_\beta, \mu_3 = 0, \mu_4 = 3 \sigma^4_\beta,$
simplifying the above expression to: $$\text{SNR} = \frac{1}{N+1}.$$</p>
</example>
<example><h1>Signal-to-noise ratio for additive uniform
noise</h1>
<p>For $\beta_i$ drawn from a uniform distribution over [-a,a], we have
$\mu_1 = 0, \mu_2 = \frac{a^2}{3} = \sigma_\beta^2, \mu_3 = 0, \mu_4 =
\frac{a^4}{5} = \frac{9}{5}\sigma_\beta^4,$ simplifying the above to:
$$\text{SNR} = \frac{1}{N-\frac{1}{5}}.$$</p>
</example>
<p>Performance calculations using the SNR can be used to design the
parameters of the algorithm in practice. For example, based on these
results it is apparent that noise added through a uniform distribution
yields better gradient estimates than the Gaussian noise case for very
small $N$, but that these differences are negligible for large $N$.</p>
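<p>These predictions are easy to check numerically. The sketch below (on a
hypothetical quadratic cost) estimates the SNR of the weight perturbation update by
sampling many updates at a fixed $\balpha$, and compares against $\frac{1}{N+1}$ and
$\frac{1}{N-\frac{1}{5}}$:</p>
<pre><code class="language-python">
import numpy as np

rng = np.random.default_rng(0)
N, sigma_beta, trials = 10, 0.01, 200000
alpha = rng.normal(size=N)                    # hold the parameters fixed

def g(a):                                     # hypothetical quadratic cost
    return 0.5 * a @ a

def snr(betas):                               # betas has shape (trials, N)
    gvals = 0.5 * np.einsum('ij,ij->i', alpha + betas, alpha + betas)
    updates = -((gvals - g(alpha)) / sigma_beta**2)[:, None] * betas
    mean = updates.mean(axis=0)
    signal = mean @ mean
    noise = (updates**2).sum(axis=1).mean() - signal
    return signal / noise

gaussian = sigma_beta * rng.normal(size=(trials, N))
uniform = rng.uniform(-np.sqrt(3) * sigma_beta, np.sqrt(3) * sigma_beta, size=(trials, N))
print(snr(gaussian), 1 / (N + 1))             # the two numbers should roughly agree
print(snr(uniform), 1 / (N - 0.2))            # likewise, slightly better than the Gaussian case
</code></pre>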
<p>Similar calculations can yield insight into picking the size of the
additive noise (scaled by $\sigma_\beta$). The calculations in this
section seem to imply that larger $\sigma_\beta$ can only reduce the
variance, overcoming errors or noise in the baseline estimator,
$\tilde{b}$; this is a shortcoming of our first-order Taylor expansion.
If the cost function is not linear in the parameters, then an examination
of higher-order terms reveals that large $\sigma_\beta$ can increase the
SNR. The derivation with a second-order Taylor expansion is left for an
exercise.</p>
</subsection>
</section>
<todo>Better baselines with Importance sampling</todo>
<!--
\begin{exercises}
\begin{matlabex}[Weight perturbation on a quadratic cost]
\label{ex:weight_perturbation_quadratic}
In this problem we will optimize a quadratic cost function using
weight perturbation.
$$y(\balpha) = \frac{1}{2} \balpha^T {\bf A} \balpha\text{, with } {\bf A} = {\bf A}^T$$
\begin{subex}
Simulate the weight perturbation algorithm for ${\bf A} = {\bf I}_{2\times2}$.
Plot the values of $\balpha$ at each iteration of the algorithm on top of
the cost function, $y$. Your results should resemble
figure {weight_perturbation_quadratic}, which was made with
these parameters. Try using additive Gaussian noise, and additive
uniform noise. How does the performance of convergence depend on
$\eta$? on $\sigma_\beta^2$?
\end{subex}
\begin{subex}[Signal-to-noise ratio]
Estimate the signal-to-noise ratio of your weight perturbation update
by holding $\balpha$ fixed and computing $\Delta \balpha$ for as many trials
as are required for your SNR statistics to converge. Using ${\bf A} =
{\bf I}_{N \times N}$, make a plot of the SNR for Gaussian and uniform
additive noise as a function of $N$. Compare these plots with the
theoretical predictions.
\end{subex}
\begin{solution}
\begin{itemize}
\item[(a)]
\verbatiminput{figures//weight_perturbation_quadratic_a.m}
\item[(b)]
\begin{figure}[!h]
\centerline{\includegraphics[width=4in]{figures/weight_perturbation_quadratic_b.jpg}}
\caption{Signal-to-noise ratio calculations}
\label{f:weight_perturbation_quadratic_b}
\end{figure}
The results are plotted in figure \ref{f:weight_perturbation_quadratic_b}
\verbatiminput{figures/weight_perturbation_quadratic_b.m}
\end{itemize}
\end{solution}
\end{matlabex}
\begin{ex}[Weight perturbation performance with a baseline estimator]
\label{p:snr_baseline}
\begin{solution}
Using $\tilde{b} = y(\balpha) - b$, we have:
$$\avg{\Delta \balpha}^T\avg{\Delta \balpha} = \eta^2 \pd{J}{\balpha} \pd{J}{\balpha}^T = \eta^2
\sum_{i=1}^{N} \left( \pd{J}{\alpha_i} \right)^2,$$
\begin{align*}
\avg{(\Delta \balpha)^T(\Delta \balpha)} =& \frac{\eta^2}{\sigma_\beta^4}
\avg{\left[y(\balpha+\bbeta) - y(\balpha)\right]^2 \bbeta^T\bbeta} \\
\approx& \frac{\eta^2}{\sigma_\beta^4} \avg{ \left[ \tilde{b} + \pd{J}{\balpha} \bbeta \right]^2 \bbeta^T\bbeta } \\
=& \frac{\eta^2}{\sigma_\beta^2} \avg{\tilde{b}^2} +
2 \frac{\eta^2}{\sigma_\beta^4} \avg{ \tilde{b} \left(\sum_i \pd{J}{\alpha_i}\beta_i\right)
\left(\sum_j \beta_j^2 \right) } + \text{ previous
terms} \\
=& \frac{\eta^2}{\sigma_\beta^2} \avg{\tilde{b}^2} +
2 \frac{\eta^2}{\sigma_\beta^4} \sum_{i,j} \pd{J}{\alpha_i} \avg{ \beta_i \beta_j^2 }
+ \text{ previous
terms} \\
=& \frac{\eta^2}{\sigma_\beta^2}\sigma^2_{\tilde{b}} +
2 \frac{\eta^2}{\sigma^4} \sum_i \pd{J}{\alpha_i} \mu_3(\beta_i) + \eta^2 (N-1)
\sum_i \left( \pd{J}{\alpha_i} \right)^2 + \frac{\eta^2}{\sigma_\beta^4}
\sum_i \left(\pd{J}{\alpha_i}\right)^2 \mu_4(\beta_i)
\end{align*}
For Gaussian (and all distributions which are symmetric about their
mean), we have $\mu_3(\beta_i)=0$, and therefore: $$\text{SNR} =
\frac{1}{N+1+n_b},$$ where $n_b$ is the contribution from errors in
the baseline estimate $b$: $$n_b =
\frac{\sigma^2_{\tilde{b}}}{\sigma_\beta^2 \sum_i
\left( \pd{J}{\alpha_i} \right)^2}.$$ For uniform, we have $$\text{SNR} =
\frac{1}{N-\frac{1}{5} +n_b}.$$
\end{solution}
\end{ex}
\begin{ex}[Weight perturbation - optimal noise calculations]
\label{p:optimal_noise}
Use a second-order Taylor expansion...
$$y(\balpha + \bbeta) = y(\balpha) + \pd{J}{\balpha} \bbeta + \frac{1}{2}\bbeta^T {\bf h} \bbeta,$$
where ${\bf h}$ is the Hessian of the function evaluated at $\balpha$:
$${\bf h} = \pd{}{\balpha} \left[ \pd{J}{\balpha}^T \right] = \begin{bmatrix}
\pd{^2 y}{\alpha_1 \partial \alpha_1} & \cdots & \pd{^2 y}{\alpha_1 \partial \alpha_N} \\
\vdots & \ddots & \vdots \\ \pd{^2 y}{\alpha_N \partial \alpha_1} & \cdots &
\pd{^2 y}{\alpha_N \partial \alpha_N} \end{bmatrix}.$$
You may assume that the distribution on $\beta_i$ is symmetric about the
mean, which zeros all odd central-moments starting at $\mu_3(\beta_i) = 0$.
\begin{solution}
\begin{align*}
\avg{\Delta\balpha} \approx& -\frac{\eta}{\sigma_\beta^2} \avg{ \left[ \tilde{b} +
\pd{J}{\balpha}\bbeta + \frac{1}{2} \bbeta^T {\bf h} \bbeta \right] \bbeta } \\
=& - \eta \pd{J}{\balpha}^T + \sum_i \mu_3(\beta_i) h_{ii} = -\eta \pd{J}{\balpha}^T
\end{align*}
\begin{align*}
\avg{\Delta \balpha^T \Delta \balpha} \approx& \frac{\eta^2}{\sigma_\beta^4}
\avg{ \left[ \tilde{b} + \pd{J}{\balpha}\bbeta + \frac{1}{2}\bbeta^T {\bf h} \bbeta \right]^2 \bbeta^T
\bbeta }
\end{align*}
This introduces a number of new terms to the soln from problem
\ref{p:snr_baseline}:
\begin{align*}
\tilde{b} ( \bbeta^T {\bf h} \bbeta ) \bbeta^T \bbeta =&
\tilde{b} \sum_{i,j,k} h_{ij} \avg{\beta_i \beta_j \beta_k^2} \\
=& \tilde{b} \left[ \sigma_\beta^4 (N-1) \sum_i h_{ii} + \sum_i h_{ii}
\mu_4(\beta_i) \right]
\end{align*}
\begin{align*}
(\pd{J}{\balpha} \bbeta ) (\bbeta^T {\bf h} \bbeta) ) \bbeta^T \bbeta =&
\sum_{i,j,k,l} \pd{J}{\alpha_i} h_{jk} \avg{\beta_i \beta_j \beta_k \beta_l^2} = 0
\end{align*}
because $$ \avg{\beta_i \beta_j \beta_k \beta_l^2} = \begin{cases} 0 & i \neq j, \text{ or } i
\neq k, \text{ or } j \neq k
\\ \mu_3(\beta_i) \mu_2(\beta_i) & i=j=k \neq l \\
\mu_5(\beta_i) & i=j=k=l. \end{cases}$$
\begin{align*}
\frac{1}{4}(\bbeta^T {\bf h} \bbeta) (\bbeta^T {\bf h} \bbeta)
\bbeta^T \bbeta =&
\sum_{i,j,k,l,m} h_{ij} h_{kl} \avg{\beta_i \beta_j \beta_k \beta_l \beta_m^2} \\ =&
\sigma_\beta^6 (N-2) \sum_{i,j \neq i} (h_{ii}h_{jj} + 2h_{ij}^2) +
\sigma_\beta^2 \mu_4(\beta_i) \left[ (N-1)\sum_i h_{ii}^2 + \sum_{i,j \neq i} (2h_{ii}h_{jj}
+ 2h_{ij}^2) \right]
\end{align*}
because $$\avg{\beta_i \beta_j \beta_k \beta_l \beta_m^2} = \begin{cases} \sigma_\beta^6 &
\text{three distinct pairs} \\ \sigma_\beta^2 \mu_4(\beta_i) & \text{two
pairs} \\ 0 & \text{otherwise}.\end{cases}$$
yuck!
\end{solution}
\end{ex}
\end{exercises}
-->
<todo>Covariance Matrix Adaptation</todo>
<todo>Policy Improvement with Path Integrals</todo>
</chapter>
<!-- EVERYTHING BELOW THIS LINE IS OVERWRITTEN BY THE INSTALL SCRIPT -->
<div id="references"><section><h1>References</h1>
<ol>
<li id=Roberts09a>
<span class="author">John W. Roberts</span>,
<span class="title">"Motor Learning on a Heaving Plate via Improved-{SNR} Algorithms"</span>,
February, <span class="year">2009</span>.
[ <a href="http://groups.csail.mit.edu/robotics-center/public_papers/Roberts09a.pdf">link</a> ]
</li><br>
<li id=Glynn90>
<span class="author">Peter W. Glynn</span>,
<span class="title">"Likelihood ratio gradient estimation for stochastic systems"</span>,
<span class="publisher">Communications of the ACM</span>, vol. 33, no. 10, pp. 75--84, oct, <span class="year">1990</span>.
</li><br>
<li id=Williams92>
<span class="author">R.J. Williams</span>,
<span class="title">"Simple statistical gradient-following algorithms for connectionist reinforcement learning"</span>,
<span class="publisher">Machine Learning</span>, vol. 8, pp. 229-256, <span class="year">1992</span>.
</li><br>
<li id=Press92>
<span class="author">William H. Press and Saul A. Teukolsky and William T. Vetterling and Brian P. Flannery</span>,
<span class="title">"Numerical Recipes in C: The Art of Scientific Computing"</span>, Cambridge University Press
, <span class="year">1992</span>.
</li><br>
<li id=Bertsekas99>
<span class="author">Dimitri P Bertsekas</span>,
<span class="title">"Nonlinear programming"</span>, Athena scientific Belmont
, <span class="year">1999</span>.
</li><br>
</ol>
</section><p/>
</div>
<table style="width:100%;"><tr style="width:100%">
<td style="width:33%;text-align:left;"><a class="previous_chapter" href=state_estimation.html>Previous Chapter</a></td>
<td style="width:33%;text-align:center;"><a href=index.html>Table of contents</a></td>
<td style="width:33%;text-align:right;"><a class="next_chapter" href=drake.html>Next Chapter</a></td>
</tr></table>
<div id="footer">
<hr>
<table style="width:100%;">
<tr><td><a href="https://accessibility.mit.edu/">Accessibility</a></td><td style="text-align:right">© Russ
Tedrake, 2022</td></tr>
</table>
</div>
</body>
</html>