\documentclass[a4paper,11pt]{article}
\pdfoutput=1 % if you are submitting a pdflatex (i.e. if you have
% images in pdf, png or jpg format)
\usepackage{jheppub} % for details on the use of the package, please
% see the JHEP-author-manual
\usepackage{hyperref}
\usepackage{amssymb,amsmath,amsfonts}
\usepackage{epsfig}
\usepackage{array}
%\usepackage{verbatim}
\usepackage{tikz}
\usetikzlibrary{calc,through,backgrounds,patterns,decorations.pathmorphing,decorations.markings,%
trees,positioning,arrows,chains,shapes.geometric,%
decorations.pathreplacing,shapes}
\usepackage{pgfplots}
%\usepgfplotslibrary{patchplots,fillbetween}
\usepgfplotslibrary{fillbetween}
\pgfplotsset{compat=1.10}
%\usepackage[dvipsnames]{xcolor}
\usepackage{multirow}
\usepackage{bm}
\newcommand{\vect}[1]{\boldsymbol{\mathbf{#1}}}
\newcommand{\CA}{\mathcal{A}}
\newcommand{\CB}{\mathcal{B}}
\newcommand{\CC}{\mathcal{C}}
\newcommand{\CD}{\mathcal{D}}
%\newcommand{\CE}{\mathcal{E}}
\newcommand{\CF}{\mathcal{F}}
\def\CJ{\mathcal{J}}
\newcommand{\CG}{\mathcal{G}}
\newcommand{\CL}{\mathcal{L}}
\newcommand{\CO}{\mathcal{O}}
\newcommand{\CN}{\mathcal{N}}
\newcommand{\CI}{\mathcal{I}}
\newcommand{\CT}{\mathcal{T}}
\newcommand{\CS}{\mathcal{S}}
\newcommand{\CM}{\mathcal{M}}
\newcommand{\CQ}{\mathcal{Q}}
\newcommand{\CE}{\mathcal{E}}
\newcommand{\CZ}{\mathcal{Z}}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\C}{\mathbb{C}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\gym}{g_{\rm YM}}
\newcommand{\eps}{\epsilon}
\newcommand{\grad}{\vec{\nabla}}
\newcommand{\el}{\kappa}
\newcommand{\ads}{\mbox{AdS}}
\newcommand{\nn}{\nonumber}
\newcommand{\lspa}{\ \ ,\ \ \ \ }
\newcommand{\spa}{\ , \ \ }
\newcommand{\ds}{\displaystyle}
\newcommand{\tr}{\mathop{{\rm Tr}}}
\DeclareMathOperator{\sgn}{sgn} \DeclareMathOperator{\diag}{diag}
\newcommand{\vecto}[2]{\left( \begin{array}{c} #1 \\ #2 \end{array}
\right) }
\newcommand{\matrto}[4]{\left( \begin{array}{cc} #1 & #2 \\
#3 & #4 \end{array} \right) }
\newcommand{\vectre}[3]{\left( \begin{array}{c} #1 \\ #2\\ #3 \end{array}
\right) }
\newcommand{\matrtre}[9]{\left( \begin{array}{ccc} #1 & #2 & #3 \\ #4 & #5 & #6 \\ #7 & #8 & #9 \end{array} \right) }
\newcommand{\hB}{{}^* \! B}
\newcommand{\hC}{{}^* \! C}
\newcommand{\hF}{{}^* \! F}
\newcommand{\hJ}{{}^* \! J}
\newtheorem{definition}{Definition}[section]
\newtheorem{proposition}[definition]{Proposition}
\newtheorem{theorem}[definition]{Theorem}
\newtheorem{lemma}[definition]{Lemma}
\newtheorem{corollary}[definition]{Corollary}
\newtheorem{comment}[definition]{Comment}
\newtheorem{example}[definition]{Example}
\newtheorem{conjecture}[definition]{Conjecture}
\newcommand{\proof}{\noindent {\bf Proof:}\ }
\newcommand{\squ}{\noindent $\square$}
\newcommand{\bin}[2]{\Big( \begin{array}{c} #1 \\[-1mm] #2 \end{array}
\Big) }
\def\Tr{\mathrm{Tr}}
%---------------------------------------------------------
\numberwithin{equation}{section}
%---------------------------------------------------
\begin{document}
In this note, I review \href{http://www.inference.phy.cam.ac.uk/mackay/itila/}{the book of David Mackay} and solve some of its more challenging exercises. To review and summarize the book, I quote or paraphrase its sentences without the author's approval; this note is not intended for any financial profit. You can find the materials for this note, such as Mathematica notebooks, their pdf copies, and some relevant Python code, at \href{https://github.com/physhik/Study-of-David-Mackay-s-book-}{\vect{my github}}.
\section{Introduction}
\subsection{Some formulae from Chapter 1 of the book}
\begin{figure}[h!]
\centerline{\includegraphics[scale=0.6]{probability_formulae.png}}
\caption{Some formulae}
\end{figure}
\subsection{Assumptions in inference}
First, once assumptions are made, the inferences are objective and unique,
reproducible with complete agreement by anyone who has the same information
and makes the same assumptions. For example, given the assumptions
listed above, $\mathcal{H}$, and the data D, everyone will agree about the posterior probability
of the decay length
$$
P(\lambda | D, \mathcal{H})= {P(D|\lambda , \mathcal{H})P(\lambda | \mathcal{H}) \over P(D|\mathcal{H})}
$$
Second, when the assumptions are explicit, they are easier to criticize and
easier to modify; indeed, we can quantify the sensitivity of our inferences to
the details of the assumptions.
Third, when we are not sure which of various alternative assumptions is
the most appropriate for a problem, we can treat this question as another
inference task. Thus, given data D, we can compare alternative assumptions
$\mathcal{H}$ using Bayes' theorem
$$
P( \mathcal{H}|D,I)= {P(D| \mathcal{H},I)P(\mathcal{H}|I) \over P(D|I)},
$$
where $I$ denotes the highest assumptions, which we are not questioning.
Fourth, we can take into account our uncertainty regarding such assumptions
when we make subsequent predictions. Rather than choosing one particular
assumption $\mathcal{H}^*$, and working out our predictions about some quantity $t$,
$P(t |D, \mathcal{H},I)$, we obtain predictions that take into account our uncertainty
about $\mathcal{H}$ by using the sum rule
$$
P({\textbf t} | D, I)= \sum_{\mathcal{H}} P({\textbf t}|D, \mathcal{H},I)P( \mathcal{H}|D,I)
$$
This is another contrast with orthodox statistics, in which it is conventional
to `test' a default model, and then, if the test `accepts the model' at some
`significance level', to use exclusively that model to make predictions.
{\it probability theory reaches parts that ad hoc methods cannot reach.}
Model comparison as inference. Assume we have two hypotheses. To perform model comparison, we wish to
know how probable $\mathcal{H}_1$ is given the data. By Bayes' theorem,
$$
P( \mathcal{H}_1|{\textbf s},F)= {P({\textbf s}|F, \mathcal{H}_1)P(\mathcal{H}_1) \over P({\textbf s}|F)},
$$
and
$$
P( \mathcal{H}_0|{\textbf s},F)= {P({\textbf s}|F, \mathcal{H}_0)P(\mathcal{H}_0) \over P({\textbf s}|F)}
$$
The normalizing constant in both cases is the total probability of getting the observed data,
$$
P({\textbf s}|F)=P({\textbf s}|F, \mathcal{H}_1)P(\mathcal{H}_1) +P({\textbf s}|F, \mathcal{H}_0)P(\mathcal{H}_0)
$$
To evaluate the posterior probabilities of the hypotheses we need to assign
values to the prior probabilities $P(\mathcal{H}_1)$ and $P(\mathcal{H}_0)$; in this case, we might
set these to 1/2 each. And we need to evaluate the data-dependent terms
$P({\textbf s}|F, \mathcal{H}_1)$ and $P({\textbf s}|F,\mathcal{H}_0)$. We can give names to these quantities. The
quantity $P({\textbf s}|F, \mathcal{H}_1)$ is a measure of how much the data favour $\mathcal{H}_1$, and we
call it the evidence for model $\mathcal{H}_1$. {\it How model comparison works: the evidence for a model is usually the normalizing constant of an earlier Bayesian inference.}
\section{Clustering and Maximum Likelihood}
\subsection{Motivation of clustering}
First, a good clustering has
predictive power. Second, clusters can be a useful aid to communication because they allow
lossy compression. A third reason for making a cluster model is that failures of the cluster
model may highlight interesting objects that deserve special attention. A fourth reason for liking clustering algorithms is that they may serve
as models of learning processes in neural systems.
\subsection{K-means}
The K-means algorithm is an algorithm for putting $N$ data points in an $M$-dimensional space into $K$ clusters. Each cluster is parameterized by a vector
${\textbf m}^{(k)}$ called its mean.
First of all, set the $K$ means $\{{\textbf m}^{(k)}\}$ to random values. In the assignment step, each data point $n$ is
assigned to the nearest mean.
$$
\hat k^{(n)} = \mathop{\mathrm{argmin}}_k d({\textbf m}^{(k)}, {\textbf x}^{(n)})
$$
An alternative, equivalent representation of this assignment of
points to clusters is given by `responsibilities', which are indicator
variables $r^{(n)}_k$. In the assignment step, we set $r^{(n)}_k$ to one if mean k is the closest mean to datapoint $ {\textbf x}^{(n)}$; otherwise, $r^{(n)}_k$ is zero.
$$
r^{(n)}_k = \left\{
\begin{array}{ll}
1 & \mbox{if } \hat k^{(n)}=k\\
0 & \mbox{if } \hat k^{(n)}\ne k\\
\end{array}
\right.
$$
In the update step, the means are adjusted to
match the sample means of the data points that they are responsible for. The update step is very similar to finding the centre of mass in physics.
$$
{\textbf m}^{(k)}= {\sum_n ~ r^{(n)}_k {\textbf x}^{(n)} \over R^{(k)}}
$$
where $R^{(k)}$ is the total responsibility of mean $k$
$$
R^{(k)}=\sum_n~r^{(n)}_k
$$
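To make the two steps concrete, here is a minimal NumPy sketch of the algorithm (the function name, the random initialization, and the fixed iteration count are my own illustrative choices, not the book's code):
\begin{verbatim}
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: X has shape (N, M); returns K means and assignments."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]  # random initial means
    for _ in range(n_iters):
        # Assignment step: each point n goes to its nearest mean k_hat[n].
        d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        k_hat = d.argmin(axis=1)
        # Update step: each mean moves to the centre of mass of its points.
        for k in range(K):
            if (k_hat == k).any():
                means[k] = X[k_hat == k].mean(axis=0)
    return means, k_hat
\end{verbatim}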
\subsection{Exercise 22.5}
$$
P(k_n = 1|x_n, \vect{\theta})= {P(x_n|k_n=1,\vect{\theta})P(k_n=1|\vect{\theta}) \over P(x_n|\vect{\theta})}
$$
$$
= {P(x_n|k_n=1,\vect{\theta})P(k_n=1|\vect{\theta}) \over P(x_n|k_n=1,\vect{\theta})P(k_n=1|\vect{\theta}) +P(x_n|k_n=2,\vect{\theta})P(k_n=2|\vect{\theta}) }
$$
where $\vect{\theta}=(\mu_k, ~\sigma_k)$. Define
$$
P(k_n=1|\vect{\theta}) \equiv p_1,~~P(k_n=2|\vect{\theta}) \equiv p_2
$$
Then,
$$
P(k_n=1|x_n,\vect{\theta})= {p_1 \over p_1+p_2 \exp[-(w_1x_n+w_0)]}
$$
$$
P(k_n=2|x_n,\vect{\theta})= {p_2 \over p_2+p_1 \exp[w_1x_n+w_0]}
$$
where $w_1=2(\mu_1-\mu_2)$ and $w_0=-(\mu_1-\mu_2)(\mu_1+\mu_2)$.
$$
P(k_n=k|x_n,\vect{\theta}) \equiv p_{k|n}
$$
By assumption, the prior probability $p_1=p_2=1/2$ then, (22.17) of the book is satisfied.
$$
L\equiv \log \prod_n P(x_n|\{\mu_k\},~\sigma)
$$
then trivially,
$$
{\partial \over \partial{\mu_k}}L = \sum_n {p_{k|n} (x_n-\mu_k)\over \sigma^2}
$$
$$
{\partial^2 \over \partial{\mu_k}^2}L = -\sum_n {p_{k|n} \over \sigma^2}
$$
The new updated $\vect \mu'$ should maximize the likelihood. Then,
$$
{\partial \over \partial{\mu_k'}}L = \sum_n {p_{k|n} (x_n-\mu_k')\over \sigma^2}=0
$$
$$
\sum_n p_{k|n} ~x_n - \sum_n p_{k|n} ~\mu'_k=\sum_n p_{k|n} ~x_n - \mu'_k\sum_n p_{k|n}=0
$$
Therefore,
$$
\mu'_k={\sum_n p_{k|n} ~x_n \over \sum_n p_{k|n} }
$$
Note that this equation is exactly the same as the update of the means from the responsibilities and data points in soft K-means clustering; $p_{k|n}$ is the responsibility $r^{(n)}_k$.
$$
{{\partial \over \partial{\mu_k}}L \over {\partial^2 \over \partial{\mu_k}^2}L } = {\sum_n p_{k|n} ~x_n - \mu_k\sum_n p_{k|n} \over -\sum_n p_{k|n}}= -\mu'_k+\mu_k
$$
Thus,
$$
\mu'_k=\mu_k-{{\partial \over \partial{\mu_k}}L \over {\partial^2 \over \partial{\mu_k}^2}L }
$$
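The responsibility and mean updates derived above can be iterated directly. Below is a minimal sketch for two one-dimensional clusters with equal priors $p_1 = p_2 = 1/2$ and a common width $\sigma$ (the names and the data are illustrative, not the book's code):
\begin{verbatim}
import numpy as np

def soft_kmeans_1d(x, mu, sigma=1.0, n_iters=50):
    """Soft K-means for two 1-D clusters with equal priors p1 = p2 = 1/2."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float).copy()
    for _ in range(n_iters):
        # Responsibilities p_{k|n} proportional to exp(-(x_n - mu_k)^2 / 2 sigma^2).
        logp = -(x[None, :] - mu[:, None]) ** 2 / (2 * sigma ** 2)  # shape (2, N)
        p = np.exp(logp - logp.max(axis=0))
        p /= p.sum(axis=0)
        # Update derived above: mu'_k = sum_n p_{k|n} x_n / sum_n p_{k|n}.
        mu = (p * x).sum(axis=1) / p.sum(axis=1)
    return mu, p

mu, p = soft_kmeans_1d(x=[0.1, 0.4, 0.9, 4.0, 4.2, 4.9], mu=[0.0, 1.0])
\end{verbatim}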
\subsection{Exercise 22.15}
Question : N datapoints $\{x_n\}$ are drawn from
N distributions, all of which are Gaussian with a common mean $\mu$ but
with different unknown standard deviations $\sigma_n$. What are the maximum
likelihood parameters $\mu$, $\{\sigma_n\}$ given the data? For example, seven scientists (A, B, C, D, E, F, G) with wildly-differing experimental skills
measure $\mu$.
\begin{eqnarray*}
x_n(A) &=& -27.020\\
x_n(B) &=& 3.570\\
x_n(C) &=& 8.191\\
x_n(D) &=& 9.898\\
x_n(E) &=& 9.603\\
x_n(F) &=& 9.945\\
x_n(G) &=& 10.056
\end{eqnarray*}
You expect some of them to do accurate work (i.e., to have
small $\sigma_n$), and some of them to turn in wildly inaccurate answers (i.e.,
to have enormous $\sigma_n$). What is $\mu$, and how reliable is each scientist?
$$
$$
Answer :
$$
N=7
$$
$$
x_n=(-27.02, 3.57, 8.191, 9.898, 9.603, 9.945, 10.056)
$$
$$
{\sum_n x_n\over N}=3.46329
$$
This is clearly not the correct mean.
From the description of the problem, the likelihood is
$$
P(\{x_n\}|\{\sigma_n\},\mu) = \left({1 \over 2\pi }\right)^{N/2} \prod_n{1\over \sigma_n } \exp\left(-\sum_n {(x_n-\mu)^2 \over 2\sigma^2_n}\right)
$$
To find the maximum likelihood,
$$
L \equiv \log P(\{x_n\}|\{\sigma_n\},\mu) = -\sum_n \log~\sigma_n-\sum_n {(x_n -\mu)^2 \over 2\sigma_n^2}
$$
$$
{\partial L \over \partial \mu} = \sum_n {x_n-\mu \over \sigma^2_n}=\sum_n{x_n \over \sigma_n^2}-\mu\sum_n {1 \over \sigma_n^2} = 0
$$
$$
\mu= {\sum_n{x_n\over \sigma_n^2}\over \sum_n{1\over \sigma_n^2}}
$$
Similarly, setting $\partial L/\partial \sigma_n = -1/\sigma_n + (x_n-\mu)^2/\sigma_n^3$ to zero gives the maximum-likelihood noise levels $\sigma_n = |x_n-\mu|$, which tell us how reliable each scientist is. For $x_n=(-27.02,~3.57)$ the corresponding $\sigma_n$ are huge, so these points contribute very little to $\mu$; the mean should therefore be close to the mean of $(8.191, 9.898, 9.603, 9.945, 10.056)$.
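Given guesses for the noise levels, this formula is just a precision-weighted mean. A small numerical illustration, with $\sigma_n$ chosen by hand (these values are hypothetical, not fitted):
\begin{verbatim}
import numpy as np

x = np.array([-27.020, 3.570, 8.191, 9.898, 9.603, 9.945, 10.056])
# Hypothetical noise levels: large for the outliers A and B, small otherwise.
sigma = np.array([100.0, 10.0, 1.0, 0.3, 0.3, 0.3, 0.3])

w = 1.0 / sigma ** 2            # precisions
mu = (w * x).sum() / w.sum()    # precision-weighted mean
print(mu)                       # about 9.8, near the five consistent measurements
\end{verbatim}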
\subsection{Exercise 24.3}
Question : [This exercise requires macho integration capabilities.] Give
a Bayesian solution to exercise 22.15, where seven scientists of
varying capabilities have measured $\mu$ with personal noise levels $\sigma_n$,
and we are interested in inferring $\mu$. Let the prior on each $\sigma_n$ be a
broad prior, for example a gamma distribution with parameters $(s,c) =
(10, 0.1)$. Find the posterior distribution of $\mu$. Plot it, and explore its
properties for a variety of data sets such as the one given, and the data
set $\{x_n\} = \{13.01, 7.39\}$.
$$
$$
Answer :
Bayes' theorem for the posterior probability of $\sigma_n$ is
$$
P(\sigma_n|\{x_n\},\mu)= {P(\{x_n\}|\sigma_n,\mu) P(\sigma_n)\over P(\{x_n\}|\mu)}
$$
Given
$$
P(\{x_n\}|\sigma_n,\mu) = ({1 \over 2\pi })^{N/2} \left( \Pi_n{1\over \sigma_n }\right) \exp\left(-\sum_n {(x_n-\mu)^2 \over 2\sigma^2_n}\right)
$$
and
$$
P(\{\sigma_n\})=\left({1\over \Gamma(c)~s^c}\right)^N\prod_n~(\sigma_n)^{c-1}~\exp\left(-{\sum_n \sigma_n \over s}\right)$$
where $(s,~c)=(10,~0.1)$,
and the normalizing constant is
$$
P(\{x_n\}|\mu)=\int^{\infty}_0~\prod_n~d\sigma_n ~P(\{x_n\}|\{\sigma_n\},\mu) P(\{\sigma_n\})
$$
\begin{figure}
\centerline{
\begin{tabular}{cc}
\includegraphics[width=.5\textwidth]{7scientistplot.png} &
\includegraphics[width=.5\textwidth]{7scientistplot2.png}
\end{tabular}}
\caption[The plot of exercise 24.3]{\label{7scientist} \small The distribution functions are not normalized yet, but the plot already shows that the posterior maximum should be around $\mu=10$. The small bumps around $x_n=(-27,~3.6,~8)$ are also visible in the graph. }
\end{figure}
The posterior probability of $\mu$ is
$$
P(\mu|x_n)={P(\{x_n\}|\mu)P(\mu)\over P(\{x_n\})}
$$
and the prior is determined by some assumption.
$P(\mu) = {1 \over \sigma_{\mu}}=const.$, then the normalizing constant is
$$
P(\{x_n\})=\int^{\infty}_{-\infty} ~d\mu ~P(\{x_n\}|\mu) {1 \over \sigma_{\mu}}
$$
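The posterior can also be evaluated numerically. The sketch below integrates each $\sigma_n$ out on a logarithmic grid, with the flat prior on $\mu$ assumed above; the grid ranges and resolution are my own choices:
\begin{verbatim}
import numpy as np

x_data = np.array([-27.020, 3.570, 8.191, 9.898, 9.603, 9.945, 10.056])
s, c = 10.0, 0.1  # gamma prior parameters (s, c) from the exercise

sig = np.logspace(-4, 3, 2000)               # grid for integrating sigma_n out
log_prior = (c - 1) * np.log(sig) - sig / s  # unnormalized gamma prior
trap = lambda y, t: np.sum((y[:-1] + y[1:]) * np.diff(t)) / 2  # trapezoid rule

def likelihood(mu):
    """P({x_n}|mu) with every sigma_n integrated out, up to a constant."""
    out = 1.0
    for xn in x_data:
        f = np.exp(-(xn - mu) ** 2 / (2 * sig ** 2) + log_prior) / sig
        out *= trap(f, sig)
    return out

mus = np.linspace(-30, 30, 601)
post = np.array([likelihood(m) for m in mus])
post /= trap(post, mus)      # normalize under the flat prior on mu
print(mus[np.argmax(post)])  # the peak is near mu = 10, with bumps at the data
\end{verbatim}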
\section{Gaussian approximation and model comparison}
\subsection{Laplace approximation}
\subsubsection{Exercise 27.1}
$r$ is a positive integer, and the posterior over $\lambda$ is
$$
P(\lambda|r)= {P(r|\lambda) P(\lambda)\over P(r)}
$$
$$
P(r|\lambda)=\exp(-\lambda){\lambda^r \over r!}
$$
By assumption
$$
P(\lambda)={1\over \lambda}
$$
Then, the normalizing constant is
$$
P(r)=\int^{\infty}_0~d\lambda~ \exp(-\lambda){\lambda^r \over r!}{1\over \lambda}= {\Gamma(r) \over r!} = {1\over r}
$$
We need to find the $\lambda$ that maximizes the posterior. Differentiating the posterior probability distribution function with respect to $\lambda$,
it has a maximum at $\lambda = \lambda_0=r-1$, and
$$
P(\lambda=\lambda_0=r-1|r)={(r-1)^{(r-1)}\exp(-(r-1))\over (r-1)!}
$$
$$
c=-{\partial^2 \over \partial \lambda^2}\log~P(\lambda|r)|_{\lambda=\lambda_0=r-1}={r-1 \over \lambda^2}|_{\lambda=\lambda_0=r-1}={1\over r-1}
$$
Then, the Laplace approximation is
$$
G(\lambda|r)= P(\lambda=\lambda_0=r-1|r)\exp\left(-{c\over 2}{(\lambda-\lambda_0)^2}\right)
$$
\begin{figure}
\centerline{
\begin{tabular}{cc}
\includegraphics[width=.5\textwidth]{laplace.pdf} &
\includegraphics[width=.5\textwidth]{laplace2.pdf}
\end{tabular}}
\caption[The plot of exercise 27.1]{\label{laplace} \small $r=10$ is set. The blue plot is the exact posterior and the orange one is the Laplace approximation. They match very well around the maximum at $\lambda=9$; the exact posterior is not symmetric about $\lambda=9$, whereas the Laplace approximation is symmetric with the same maximum.
}
\end{figure}
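For completeness, a short sketch that computes the two curves of Figure \ref{laplace} from the corrected formulas above (plotting is omitted, and the variable names are mine):
\begin{verbatim}
import numpy as np
from scipy.special import gammaln

r = 10
lam = np.linspace(0.01, 30, 500)

# Exact posterior: P(lambda|r) = exp(-lambda) lambda^(r-1) / (r-1)!
post = np.exp(-lam + (r - 1) * np.log(lam) - gammaln(r))

# Laplace approximation: Gaussian at lambda_0 = r-1 with curvature c = 1/(r-1).
lam0, c = r - 1.0, 1.0 / (r - 1.0)
peak = np.exp(-(r - 1) + (r - 1) * np.log(r - 1.0) - gammaln(r))
laplace = peak * np.exp(-c * (lam - lam0) ** 2 / 2)
\end{verbatim}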
\subsubsection{Exercise 27.3}
Bayes' theorem for the posterior of $\omega_0, \omega_1$; here $y(x_n)$ is the mean of the data point $t_n$.
$$
P(\{\omega_i\}|t_n,x_n, \sigma_{\nu})= { P(t_n|\{\omega_i\},x_n,\sigma_{\nu})P(\omega_0)P(\omega_1)\over P(t_n|x_n,\sigma_{\nu})}
$$
$P(t_n|\{\omega_i\},\sigma_{\nu})$ is Gaussian, and the priors $P(\omega_0)$ and $P(\omega_1)$ are assumed to be Gaussian as well.
$$
P(t_n|\sigma_{\nu})= \int^{\infty}_{-\infty} d\omega_0 \int^{\infty}_{-\infty} d\omega_1~ P(t_n|\{\omega_i\},\sigma_{\nu})P(\omega_0)P(\omega_1)
$$
$$
P(\{t_n\}|\{\omega_i\},x_n,\sigma_{\nu}) = \left({1 \over 2\pi \sigma^2_{\nu}}\right)^{N/2} \exp\left(-\sum_n {(t_n-y(x_n))^2 \over 2\sigma^2_{\nu}}\right)
$$
$$
y(x_n)=\omega_0 + \omega_1 x_n
$$
$$
P(\omega_i)= \left({1 \over 2\pi \sigma_i^2}\right)^{1/2} \exp\left(-{\omega_i^2 \over 2\sigma_i^2}\right)
$$
The mean of each $\omega_i$ should obviously be zero, and the variances of $\omega_0$ and $\omega_1$, i.e.\ $\sigma_0^2$ and $\sigma_1^2$, can be taken equal without loss of generality. The variance of $t_n$ can take a broad range of values depending on $x_n$; I assumed $\sigma_i^2 = \sigma/3$ in Figure \ref{BayesianRegression}.
To use the Laplace approximation, find the $\{\omega_i\}$ that satisfy ${\partial \over \partial \omega_i}\log P(\{\omega_i\}|t_n,x_n, \sigma)=0$. \footnote{Do you have any better suggestion?}
\begin{equation}\label{blr}
\log P(\{\omega_i\}|t_n,x_n, \sigma) = -\left(\sum_n {(t_n-\omega_0 - \omega_1 x_n)^2 \over 2\sigma^2}\right)-{\omega_0^2 \over 2\sigma_0^2}-{\omega_1^2 \over 2\sigma_1^2}
\end{equation}
Some reader might notice that $-\sum_n {(t_n-\omega_0 - \omega_1 x_n)^2 \over 2\sigma^2}$ term is minimized for the linear regression. We have two extra terms from the priors. It is {\it Bayesian linear regression}.
By the way, the $\omega_0^*$ and $\omega_1^*$ that maximize Eq.~(\ref{blr}) are quite complicated. I tried to find a simplified form but could not, so I just keep the symbolic expressions of $\omega_0^*$, $\omega_1^*$.
Using the Eq. (27.6) of the book,
$$
\vect A={1\over \sigma^2}\left(
\begin{array}{cc}
N+1 & \sum_n x_n \\
\sum_n x_n & \sum_n x_n^2 + 1 \\
\end{array}
\right)
$$
Then, the Laplace approximation is
$$
G(\{\omega_i\}|t_n)= P(\{\omega_i^*\}|t_n,x_n, \sigma) \exp\left( -{1 \over 2} (\vect \omega - \vect \omega^*)^T \vect A (\vect \omega - \vect \omega^*)\right)
$$
where $ \vect \omega = (\omega_0, \omega_1)$ and $ \vect \omega^* = (\omega_0^*, \omega_1^*)$.
\begin{figure}
\centerline{
\begin{tabular}{cccc}
\includegraphics[width=.4\textwidth]{Likelihood.png} &
\includegraphics[width=.4\textwidth]{BayesianRegression.png} &
\includegraphics[width=.4\textwidth]{BayesianRegression2.png}&
\includegraphics[width=0.1\textwidth]{BayesianRegression3.png}
\end{tabular}}
\caption[The plot of exercise 27.3]{\label{BayesianRegression} \small The likelihood and posterior probability distributions and the Laplace approximation. I have put in 10 random data points $x_n$ with $t_n \approx 3x_n+2 \pm 2$. The left plot has its maximum likelihood around $(\omega_0, \omega_1) = (1.79794, 3.02217)$, and the central posterior peaks at about $(\omega_0, \omega_1) = (0.812961, 3.16977)$. The discrepancy comes from the assumed priors: zero-mean Gaussians with equal variances $\sigma_i^2 = \sigma/3$, chosen on the basis of the mean value of $x_n$. The last plot is obtained by the Laplace approximation. None of the plots is normalized.}
\end{figure}
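Since the log posterior of Eq.~(\ref{blr}) is quadratic in $(\omega_0, \omega_1)$, its mode solves a linear system built from the Hessian $\vect A$ above. A minimal sketch, with synthetic data drawn as in the caption (the noise level and the prior variances are the illustrative assumptions stated above):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=10)
t = 3 * x + 2 + rng.normal(0, 2, size=10)  # t_n ~ 3 x_n + 2 +- 2

sigma = 2.0             # noise level sigma_nu
sigma_w2 = sigma / 3.0  # prior variances sigma_i^2 = sigma/3, as assumed above

Phi = np.column_stack([np.ones_like(x), x])          # design matrix [1, x_n]
A = Phi.T @ Phi / sigma ** 2 + np.eye(2) / sigma_w2  # Hessian of -log posterior
b = Phi.T @ t / sigma ** 2
w_star = np.linalg.solve(A, b)  # posterior mode (omega_0*, omega_1*)
print(w_star)                   # shrunk toward zero relative to least squares
\end{verbatim}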
\subsection{Occam's razor}
This section is very well summarized by the figures in the book. The Occam factor is defined in the process of model comparison. The Occam factor ${\sigma_{\omega|D}\over \sigma_{\omega}}$ is equal to the ratio of the posterior accessible volume of $\mathcal{H}_i$'s parameter space to the prior accessible volume, or the factor by which $\mathcal{H}_i$'s hypothesis space collapses when the data arrive. The factor is smaller, i.e.\ the penalty is stronger, for more complex models. The logarithm of the Occam factor is a measure of the amount of
information we gain about the model's parameters when the data arrive.
{\it Occam factor for several parameters}
If the posterior is well approximated by a Gaussian, then the Occam factor
is obtained from the determinant of the corresponding covariance matrix (cf.
equation (28.8) and Chapter 27):
$$
P(D|\mathcal{H}_i) \sim P(D|\vect w_{MP}, \mathcal{H}_i) \times P(\vect w_{MP}| \mathcal{H}_i) \det (\vect A/2\pi)^{-1/2}
$$
$~~~~~~~~~~~$Evidence $\sim$ Best fit likelihood $\times$ Occam factor
where $\vect A = -\nabla \nabla \log P(\vect w|D, \mathcal{H}_i)$ is the Hessian at $\vect w_{MP}$. Bayesian model comparison is a simple extension of maximum likelihood model selection: the evidence is obtained by multiplying the best-fit likelihood by the Occam factor. To evaluate the Occam factor we need only the Hessian $\vect A$, if the Gaussian
approximation is good. Thus the Bayesian method of model comparison by
evaluating the evidence is no more computationally demanding than the task
of finding for each model the best-fit parameters and their error bars.
\subsubsection{Exercise 28.1}
For the given data $x= 0.3,0.5,0.7,0.8,0.9$, which hypothesis is more probable? By intuition, $\mathcal{H}_0$, because the data look rather uniformly distributed.
$P(D|\mathcal{H}_0)$ is ${1 \over 2}\times {1 \over 2} \times{1 \over 2} \times{1 \over 2}\times {1 \over 2} = 0.03125$ since the probability distribution is uniform.
$P(D|\mathcal{H}_1)$ is the product of ${1 \over 2} (1- 0.4 x)$ over $x= 0.3,0.5,0.7,0.8,0.9$, which is about $0.00689357$. The ratio is ${P(D|\mathcal{H}_0) \over P(D|\mathcal{H}_1)} \approx 4.53321$, so $\mathcal{H}_0$ is about 4.5 times more probable.
\subsubsection{Exercise 28.2}
Look at the data points plot.
\begin{figure}[h!]
\centerline{
\includegraphics[width=.7\textwidth]{datapointsExercise28-2.png}}
\caption[The plot of exercise 28.2-1]{\label{Exercise28-2-1} \small }
\end{figure}
The best fit would be almost flat, but the slope is not zero, and the $y$-intercept would be around 10. Which hypothesis is more probable by your intuition?
The given data points are $D= \{(x, t)\} = \{(-8, 8), (-2, 10), (6, 11)\}$.
The evidence is
{\small
$$
P(D|\sigma_{\nu}, \sigma_{\omega_0},\sigma_{\omega_1},\mathcal{H}_i) = \int^{\infty}_{-\infty} d\omega_0 \int^{\infty}_{-\infty} d\omega_1 P(D|\omega_0, \omega_1,\sigma_{\nu}, \mathcal{H}_i) P(\omega_0| \sigma_{\omega_0},\sigma_{\omega_1},\mathcal{H}_i)P(\omega_1| \sigma_{\omega_0},\sigma_{\omega_1},\mathcal{H}_i)
$$ }
For $\mathcal{H}_2$,
{\tiny
$$
P(D|\sigma_{\nu}, \sigma_{\omega_0},\sigma_{\omega_1},\mathcal{H}_2) = \int^{\infty}_{-\infty} d\omega_0 \int^{\infty}_{-\infty} d\omega_1
{1\over \sqrt{2\pi }\sigma_{\nu}}
\exp\left(-{(\omega_0+\omega_1 x - t)^2\over 2\sigma^2_{\nu}}\right){1\over \sqrt{2\pi }\sigma_{\omega_0}}
\exp\left(-{(\omega_0-\mu_{\omega_0})^2\over 2\sigma^2_{{\omega_0}}}\right){1\over \sqrt{2\pi }\sigma_{\omega_1}}
\exp\left(-{(\omega_1-\mu_{\omega_1})^2\over 2\sigma^2_{{\omega_1}}}\right)
$$ }
Then the evidence factors for $\mathcal{H}_2$ for the three data points are about $0.0302393$, $0.0000391484$, and $0.0131697$; their product, the total evidence, is about $1.55905 \times 10^{-8}$.
For $\mathcal{H}_1$, $\omega_1 =0$ is fixed, so the corresponding integral and prior factor drop out:
{\small $$
P(D|\sigma_{\nu}, \sigma_{\omega_0},\mathcal{H}_1) = \int^{\infty}_{-\infty} d\omega_0
{1\over \sqrt{2\pi }\sigma_{\nu}}
\exp\left(-{(\omega_0 - t)^2\over 2\sigma^2_{\nu}}\right){1\over \sqrt{2\pi }\sigma_{\omega_0}}
\exp\left(-{(\omega_0-\mu_{\omega_0})^2\over 2\sigma^2_{{\omega_0}}}\right)
$$ }
The evidence factors for $\mathcal{H}_1$ for the three data points are about $3.17456 \times 10^{-8}$, $3.91772 \times 10^{-12}$, and $2.05583 \times 10^{-14}$; the total evidence is about $2.55684 \times 10^{-33}$.
Therefore, the horizontal line has much smaller evidence, which agrees with the intuition.
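The same comparison can be done with the closed form for a linear-Gaussian model: with zero-mean Gaussian priors, the evidence is a Gaussian in the data vector with covariance $\Phi \Sigma_\omega \Phi^T + \sigma_\nu^2 I$. The sketch below uses prior widths of my own choosing (they are not specified above, and the resulting ratio depends on them):
\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([-8.0, -2.0, 6.0])
t = np.array([8.0, 10.0, 11.0])

sigma_nu = 1.0      # assumed noise level
s0, s1 = 10.0, 0.5  # assumed prior widths on omega_0 and omega_1

Phi = np.column_stack([np.ones_like(x), x])

# H_2 (sloped line): covariance Phi diag(s0^2, s1^2) Phi^T + sigma_nu^2 I.
C2 = Phi @ np.diag([s0 ** 2, s1 ** 2]) @ Phi.T + sigma_nu ** 2 * np.eye(3)
ev2 = multivariate_normal(mean=np.zeros(3), cov=C2).pdf(t)

# H_1 (horizontal line): only omega_0 remains.
C1 = s0 ** 2 * np.ones((3, 3)) + sigma_nu ** 2 * np.eye(3)
ev1 = multivariate_normal(mean=np.zeros(3), cov=C1).pdf(t)

print(ev2 / ev1)  # > 1 for these choices: the sloped line is favoured
\end{verbatim}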
\section{Monte Carlo method}
When the likelihood function
is multimodal and has nasty, unboundedly high spikes in certain locations in
the parameter space, maximizing the posterior probability and fitting a
Gaussian is not always going to work. This difficulty with Laplace's method is
one motivation for being interested in Monte Carlo methods. In fact, Monte
Carlo methods provide a general-purpose set of tools with applications in
Bayesian data modelling and many other fields.
The book uses the word {\it sample} in the following sense: a sample
from a distribution $P(\vect x)$ is a single realization $\vect x$ whose probability distribution
is $P(\vect x)$. This contrasts with the alternative usage in statistics, where {\it sample}
refers to a collection of realizations $\{\vect x\}$.
The aims of Monte Carlo methods are to solve one or both of the
following problems.
\vect{Problem 1}: to generate samples $\{x^{(r)}\}^R_{r=1}$ from a given probability distribution
$P(x)$.
\vect{Problem 2}: to estimate expectations of functions under this distribution, for
example
$$
\Phi = \int d^N \vect x P(\vect x) \phi(\vect x)
$$
Estimator
$$
\hat \Phi = {1 \over R} \sum_r \phi(\vect x^{(r)})
$$
If the vectors $\{x^{(r)}\}^R_{r=1}$ are generated from $P(x)$, then the expectation of $\hat \Phi$ is $\Phi$, and the variance of $\hat \Phi$ is the variance $\sigma^2$ of $\phi$ divided by $R$.\footnote{Look at \href{https://github.com/physhik/Study-of-David-Mackay-s-book-/blob/master/PopulationAndSampling.pdf}{\vect{this}} for a simple example.}
$$
\sigma^2 = \int d^N \vect x P(\vect x) (\phi(\vect x) - \Phi)^2
$$
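A quick demonstration of the estimator and its $\sigma^2/R$ error bar (the target distribution and the function $\phi$ are illustrative choices):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
R = 100_000

# Estimate Phi = E[phi(x)] under P = Normal(0,1) with phi(x) = x^2, so Phi = 1.
x = rng.normal(size=R)
phi = x ** 2
print(phi.mean(), phi.std() / np.sqrt(R))  # estimate and its standard error
\end{verbatim}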
We assume that we can evaluate $P^*(\vect x)$, where
$$
P(\vect x)=P^*(\vect x)/Z
$$
but we do not know $Z$. Secondly, even if we did know $Z$, the problem of drawing samples
from $P(\vect x)$ is still a challenging one, especially in high-dimensional spaces,
because there is no obvious way to sample from $P$ without enumerating most
or all of the possible states. Correct samples from $P$ will by definition tend
to come from places in $\vect x$-space where $P(\vect x)$ is big; how can we identify those
places where $P(\vect x)$ is big, without evaluating $P(\vect x)$ everywhere? There are only
a few high-dimensional densities from which it is easy to draw samples, for
example the Gaussian distribution.
\subsection{Exercise 29.3}
Problem : Consider a one-dimensional random walk, on each step of
which the state moves randomly to the left or to the right with equal
probability. Show that after N steps of size a, the state is likely to have
moved only a distance about $\sqrt{N}a$. (Compute the root mean square
distance travelled.)
Answer : Count only the right steps; each occurs with probability $p = 1/2$, so the number of right steps $R$ is binomial with mean $Np = N/2$ and standard deviation $\sqrt{Np(1-p)}=\sqrt{N}/2$. The displacement is $(2R-N)a$, whose mean is zero and whose standard deviation is $2a \times \sqrt{N}/2 = \sqrt{N}a$. The root mean square distance travelled is therefore about $\sqrt{N}a$.
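A quick simulation confirms this (the step count and number of trials are arbitrary):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N, a, trials = 10_000, 1.0, 2_000

steps = rng.choice([-a, a], size=(trials, N))  # N steps of +-a per walk
final = steps.sum(axis=1)
rms = np.sqrt((final ** 2).mean())
print(rms, np.sqrt(N) * a)  # the two agree: rms distance is about sqrt(N) a
\end{verbatim}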
\subsection{Rejection sampling}
We assume
that we have a simpler proposal density $Q(x)$ which we can evaluate (within a
multiplicative factor $Z_Q$, as before), and from which we can generate samples.
We further assume that we know the value of a constant $c$ such that
$$
c Q^*(x) > P^*(x) \quad \mbox{for all } x.
$$
We generate two random numbers. The first, $x$, is generated from the
proposal density $Q(x)$. We then evaluate $cQ^*(x)$ and generate a uniformly
distributed random variable $u$ from the interval $[0, cQ^*(x)]$. These two random
numbers can be viewed as selecting a point in the two-dimensional plane.
We now evaluate $P^*(x)$ and accept or reject the sample x by comparing the
value of u with the value of $P^*(x)$. If $u > P^*(x)$ then $x$ is rejected; otherwise
it is accepted, which means that we add $x$ to our set of samples $\{x^{(r)}\}$. The
value of $u$ is discarded.
\href{https://github.com/physhik/Study-of-David-Mackay-s-book-/blob/master/RejectSamplingMC.ipynb}{\vect{The rejection sampling IPython notebook}} presents the Python code for the sampling; the bold text links to the notebook on GitHub. In the example, $Q^*(x)$ is trivially given as a constant, and $P^*(x)$ is given as a specific function.
\begin{figure}[h!]
\centerline{\includegraphics[scale=0.4]{rejectSampling.png}}
\caption{Rejection sampling. $cQ^*(x)$ is a constant equal to the maximum value of the function $P^*(x)$. }\label{rejectsampling}
\end{figure}
The
acceptance rate is the ratio of the volume under the curve $P(x)$ to the volume
under $cQ(x)$; since $P$ and $Q$ are both normalized here, the
acceptance rate will be $1/c$. In an $N$-dimensional problem, if a one-dimensional bound $c$ applies in each dimension, the acceptance rate becomes $(1/c)^N$.
Rejection sampling, therefore, whilst a useful method for one-dimensional
problems, is not expected to be a practical technique for generating samples
from high-dimensional distributions $P(x)$.
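A minimal sketch of the procedure described above, with an illustrative $P^*(x)$ and a uniform proposal (this is my own rendering, not the notebook's code):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def p_star(x):
    # Unnormalized target density (an arbitrary illustrative choice).
    return np.exp(-x ** 2 / 2) * (1 + np.sin(3 * x) ** 2)

c_q = 2.0  # constant cQ*(x); it must exceed P*(x) everywhere
samples = []
while len(samples) < 10_000:
    x = rng.uniform(-4, 4)   # draw from the proposal density Q
    u = rng.uniform(0, c_q)  # uniform height in [0, cQ*(x)]
    if u <= p_star(x):       # accept if the point falls under P*
        samples.append(x)
\end{verbatim}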
\subsection{The Metropolis-Hastings method}
The Metropolis-Hastings algorithm instead makes use of a proposal density
$Q$ which depends on the current state $x^{(t)}$. The density $Q(x', x^{(t)})$ might
be a simple distribution such as a Gaussian centred on the current $x^{(t)}$. The
proposal density $Q(x', x^{(t)})$ can be any fixed density from which we can draw
samples. In contrast to importance sampling and rejection sampling, it is not
necessary that $Q(x', x^{(t)})$ look at all similar to $P(x)$ in order for the algorithm
to be practically useful.
We assume that we can evaluate $P^*(x)$ for any $x$. A tentative
new state $x'$ is generated from the proposal density $Q(x', x^{(t)})$. To decide whether to accept the new state, we compute the quantity
$$
a= {P^*(x')~Q( x^{(t)},x') \over P^*(x^{(t)})~Q(x', x^{(t)}) }
$$
The algorithm is
\begin{itemize}
\item If $a \ge 1$ then the new state is accepted.
\item Otherwise, the new state is accepted with probability $a$.
\item If the step is accepted, we set $x^{(t+1)}=x'$.
\item If the step is rejected, then we set $x^{(t+1)}=x^{(t)}$. (The rejected points are not discarded.)
\end{itemize}
If the
proposal density is a simple symmetrical density such as a Gaussian centred on
the current point, then the factor ${Q( x^{(t)},x') \over Q(x', x^{(t)}) }$ is unity, and the Metropolis-Hastings
method simply involves comparing the value of the target density at the two
points, {\it i.e.}\ $a={P^*(x') \over P^*(x^{(t)})}$.
Rule of thumb: lower bound on number of iterations of a
Metropolis method. If the largest length scale of the space of
probable states is $L$, a Metropolis method whose proposal distribution
generates a random walk with step size $\epsilon$ must be run for at
least $T \sim (L/\epsilon)^2$ iterations to obtain an independent sample. This rule of thumb gives only a lower bound; the situation may be much
worse, if, for example, the probability distribution consists of several islands
of high probability separated by regions of low probability.
This is good news because, unlike the
cases of rejection sampling and importance sampling, there is no catastrophic
dependence on the dimensionality $N$ in Metropolis-Hastings. Roughly, $T \sim (\sigma_{\rm max}/\sigma_{\rm min})^2$ from the rule of thumb. But this quadratic dependence on the lengthscale ratio may still force us
to make very lengthy simulations. We will discuss how to suppress the random walks in Monte Carlo simulations later.
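A minimal Metropolis sketch with a symmetric Gaussian proposal, so that the $Q$ factors cancel as noted above (the target density and the step size are illustrative):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def p_star(x):
    # Unnormalized bimodal target (an arbitrary illustrative choice).
    return np.exp(-(x - 3) ** 2 / 2) + 0.5 * np.exp(-(x + 3) ** 2 / 2)

eps = 1.0  # step size of the Gaussian proposal
x = 0.0
chain = []
for t in range(50_000):
    x_new = x + eps * rng.normal()   # symmetric proposal: Q factor is unity
    a = p_star(x_new) / p_star(x)    # a = P*(x') / P*(x^(t))
    if a >= 1 or rng.uniform() < a:  # accept with probability min(1, a)
        x = x_new
    chain.append(x)                  # rejected steps repeat the old state
\end{verbatim}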
\subsection{Gibbs sampling}
In the general case of a system with K variables, a single iteration involves
sampling one parameter at a time.
$$
x^{(t+1)}_1 \sim P(x_1|x^{(t)}_2,x^{(t)}_3,x^{(t)}_4,...,x^{(t)}_K),
$$
$$
x^{(t+1)}_2 \sim P(x_2|x^{(t)}_1,x^{(t)}_3,x^{(t)}_4,...,x^{(t)}_K),
$$
$$
\cdots
$$
Gibbs sampling suffers from the same defect as simple Metropolis algorithms - the state space is explored by a slow random walk, unless a fortuitous parameterization
has been chosen that makes the probability distribution $P(x)$
separable. The progress made by Gibbs sampling is slow when $ L \gg \epsilon$. However, Gibbs sampling involves no adjustable parameters, so it is an attractive
strategy when one wants to get a model running quickly.
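A minimal sketch for a case where the conditionals are known exactly: a zero-mean bivariate Gaussian with correlation $\rho$ (my own illustrative example). When $\rho$ is close to 1, the conditional width plays the role of $\epsilon$ while the long axis plays the role of $L$, and the slow random walk is visible in the chain:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
rho = 0.95  # correlation of the target bivariate Gaussian

x1, x2 = 0.0, 0.0
samples = []
for t in range(20_000):
    # x1 | x2 ~ Normal(rho * x2, 1 - rho^2), and symmetrically for x2.
    x1 = rho * x2 + np.sqrt(1 - rho ** 2) * rng.normal()
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal()
    samples.append((x1, x2))
\end{verbatim}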
\subsection{Markov chain}
We denote by $p^{(t)}(\vect x)$ the probability
distribution of the state of a Markov chain simulator. Our aim is to find a Markov chain such that as $t \rightarrow \infty$, $p^{(t)}(\vect x)$ tends
to the desired distribution $P(x)$.
A Markov chain can be specified by an initial probability distribution $p^{(0)}(\vect x)$ and a transition probability $T(\vect x' ; \vect x)$.
Then,
$$
p^{(t+1)}(\vect x') = \int d^N \vect x ~T(\vect x' ; \vect x)\,p^{(t)}(\vect x)
$$
Look at \href{https://github.com/physhik/Study-of-David-Mackay-s-book-/blob/master/reproduceFigure29.14.pdf/}{the link} for a simple example of Markov chain and Metropolis demonstration of section 29.4 of \href{http://www.inference.phy.cam.ac.uk/mackay/itila/}{the book of David Mackay.}
The Markov chain must have the following properties.
1. The desired distribution $P(\vect x)$ is an invariant distribution of the chain's transition. An invariant distribution is an eigenvector of the transition probability matrix with eigenvalue 1.
2. The chain must also be ergodic, that is, $p^{(t)}(\vect x) \rightarrow \pi(\vect x)$ as $t \rightarrow \infty$, for any initial $p^{(0)}(\vect x)$.
A chain can fail to be ergodic if the transition matrix is reducible or if the chain is periodic.
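A small numerical check of these two properties (the $3\times 3$ transition matrix is an arbitrary ergodic example whose columns sum to one):
\begin{verbatim}
import numpy as np

# T[i, j] = probability of moving to state i from state j.
T = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])

p = np.array([1.0, 0.0, 0.0])  # arbitrary initial distribution p^(0)
for t in range(100):
    p = T @ p                  # p^(t+1) = T p^(t)

w, v = np.linalg.eig(T)        # the invariant distribution has eigenvalue 1
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()
print(p, pi)                   # the iterates converge to the eigenvector
\end{verbatim}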
\subsection{Exercise 29.7}
Detailed balance condition.
$$
T(\vect x_a ; \vect x_b)P(\vect x_b) = T(\vect x_b ; \vect x_a)P(\vect x_a),
$$
for all $\vect x_a$ and $\vect x_b$.
Then,
$$
\int d^N x_b~T(\vect x_a ; \vect x_b)P(\vect x_b) = \int d^N x_b~T(\vect x_b ; \vect x_a)P(\vect x_a),
$$
for all $\vect x_a$.
Then, using (29.45) of the book,
$$
\int d^N x_b ~d^N x_c~B(\vect x_a ; \vect x_c)B(\vect x_c ; \vect x_b)P(\vect x_b) =\int d^N x_b ~d^N x_c~B(\vect x_b ; \vect x_c)B(\vect x_c ; \vect x_a)P(\vect x_a),
$$
for all $\vect x_a$.
Both the left-hand and the right-hand side equal $P(\vect x_a)$ for all $\vect x_a$.
\subsection{Slice sampling}
A single transition $(x,u) \rightarrow (x',u')$ of a one-dimensional slice sampling
algorithm has the following steps, of which steps 3 and 8 will require further
elaboration.
1. evaluate $P^*(x)$
2. draw a vertical coordinate $u' \sim$ Uniform$(0,P^*(x))$
3. create a horizontal interval $(x_l, x_r)$ enclosing $x$
~~3a. draw $r \sim$ Uniform$(0,1)$
~~3b. $x_l := x-rw$
~~3c. $x_r := x+(1-r)w$
~~3d. while $(P^*(x_l) > u')$ $\{x_l := x_l-w\}$
~~3e. while $(P^*(x_r) > u')$ $\{x_r:= x_r+w\}$
4. loop $\{$
5. ~~~~~~~~~ draw $x' \sim$ Uniform$(x_l,x_r)$
6. ~~~~~~~~~ evaluate $P^*(x')$
7. ~~~~~~~~~ if $P^*(x') >u'$ break out of loop 4-9
8. ~~~~~~~~~ else modify the interval $(x_l,x_r)$
~~8a. ~~~~~~~~~if $(x' > x)\{x_r:=x'\}$
~~8b. ~~~~~~~~~else $\{x_l:=x'\}$
9. $\}$
$$
$$
Like a standard Metropolis method, slice sampling gets around by a random
walk, but whereas in the Metropolis method the choice of the step size is critical to the rate of progress, in slice sampling the step size is self-tuning. If
the initial interval size $w$ is too small by a factor $f$ compared with the width of
the probable region, then the stepping-out procedure expands the interval size.
The cost of this stepping-out is only linear in $f$, whereas in the Metropolis
method the computer time scales as the square of $f$ if the step size is too
small.
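A direct Python transcription of the pseudocode above (the unnormalized Gaussian target in the example is an illustrative choice):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def slice_sample(p_star, x, w, n_samples):
    """One-dimensional slice sampling, following the steps listed above."""
    samples = []
    for _ in range(n_samples):
        u = rng.uniform(0, p_star(x))  # vertical level under P*(x)
        r = rng.uniform(0, 1)
        xl, xr = x - r * w, x + (1 - r) * w
        while p_star(xl) > u:          # step out to the left
            xl -= w
        while p_star(xr) > u:          # step out to the right
            xr += w
        while True:
            x_new = rng.uniform(xl, xr)
            if p_star(x_new) > u:      # accept and move on
                x = x_new
                break
            elif x_new > x:            # otherwise shrink the interval
                xr = x_new
            else:
                xl = x_new
        samples.append(x)
    return np.array(samples)

out = slice_sample(lambda y: np.exp(-y ** 2 / 2), x=0.0, w=1.0, n_samples=10_000)
\end{verbatim}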
\subsubsection{Exercise 29.10}
Problem : Investigate the properties of slice sampling applied to the
density function $P^*(x)$.
$$
P^*(x)= \left\{
\begin{array}{ll}
10 & \mbox{for } 0\leq x<1\\
1 & \mbox{for } 1\leq x \leq 11\\
\end{array}
\right.
$$
$x$ is a real variable between 0.0 and 11.0.
How long does it take typically for slice sampling to get from an $x$ in
the peak region $x \in (0,~ 1)$ to an $x$ in the tail region $x \in (1,~11)$, and vice
versa? Confirm that the probabilities of these transitions do yield an
asymptotic probability density that is correct.
$$
$$
Answer : I have explained the answer
\href{https://github.com/physhik/Study-of-David-Mackay-s-book-/blob/master/Exercise29.10.ipynb}{here (click it)}.
\subsection{Practicalities}
{\it Can we predict how long a Markov chain Monte Carlo simulation
will take to equilibrate?}
By considering the random walks involved in a
Markov chain Monte Carlo simulation we can obtain simple lower bounds on
the time required for convergence. But predicting this time more precisely is a
difficult problem, and most of the theoretical results giving upper bounds on
the convergence time are of little practical use. The exact sampling methods,
discussed later, offer a solution to this problem for certain Markov chains.
$$
$$
{\it Can we diagnose or detect convergence in a running simulation?}
This is also a difficult problem. There are a few practical tools available, but
none of them is perfect (Cowles and Carlin, 1996).
$$
$$
{\it Can we speed up the convergence time and time between independent samples of a Markov chain Monte Carlo method? }
Here, there is
good news, as described in the next chapter, which describes the Hamiltonian
Monte Carlo method, overrelaxation, and simulated annealing.
$$
$$
{\it Can the normalizing constant be evaluated?}
If the target density $P(\vect x)$ is given in the form of an unnormalized density
$P^*(\vect x)$ with $P(\vect x) = P^*(\vect x) /Z $, the value of $Z$ may well be of interest. Monte
Carlo methods do not readily yield an estimate of this quantity, and it is an
area of active research to find ways of evaluating it. Techniques for evaluating
$Z$ include:
$$
$$
1. Importance sampling (reviewed by Neal (1993b)) and annealed importance
sampling (Neal, 1998).
$$
$$
2. `Thermodynamic integration' during simulated annealing, the `acceptance ratio' method, and `umbrella sampling' (reviewed by Neal (1993b)).
$$
$$
3. `Reversible jump Markov chain Monte Carlo' (Green, 1995).
$$
$$
One way of dealing with $Z$, however, may be to find a solution to one's
task that does not require that $Z$ be evaluated. In Bayesian data modelling
one might be able to avoid the need to evaluate $Z$ (which would be important
for model comparison) by not having more than one model. Instead of using
several models (differing in complexity, for example) and evaluating their relative
posterior probabilities, one can make a single hierarchical model having,
for example, various continuous hyperparameters which play a role similar to
that played by the distinct models (Neal, 1996). In noting the possibility of
not computing Z, I am not endorsing this approach. The normalizing constant
Z is often the single most important number in the problem, and I think every
effort should be devoted to calculating it.
\section{Efficient Monte Carlo Methods}
\subsection{Summary of Monte Carlo methods}
In high-dimensional problems the only satisfactory methods are those
based on Markov chains, such as the Metropolis method, Gibbs sampling
and slice sampling. Gibbs sampling is an attractive method because
it has no adjustable parameters but its use is restricted to cases
where samples can be generated from the conditional distributions. Slice
sampling is attractive because, whilst it has step-length parameters, its
performance is not very sensitive to their values.
Simple Metropolis algorithms and Gibbs sampling algorithms, although
widely used, perform poorly because they explore the space by a slow
random walk. The next chapter will discuss methods for speeding up
Markov chain Monte Carlo simulations.
Slice sampling does not avoid random walk behaviour, but it automatically
chooses the largest appropriate step size, thus reducing the bad
effects of the random walk compared with, say, a Metropolis method
with a tiny step size.
\subsection{Hamiltonian Monte Carlo}
The Hamiltonian Monte Carlo method is a Metropolis method, applicable
to continuous state spaces, that makes use of gradient information to reduce
random walk behaviour.
Using Algorithm 30.1 (Figure \ref{HMC}), figure 30.2 of the book is reproduced in \href{https://github.com/physhik/Study-of-David-Mackay-s-book-/blob/master/Exercise29.10.ipynb}{this IPython notebook (click it)}.
\begin{figure}[h!]
\centerline{\includegraphics[scale=0.6]{algorithmhmc.png}}
\caption{Algorithm 30.1 of the book. Octave source
code for the Hamiltonian Monte
Carlo method.}\label{HMC}
\end{figure}
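A minimal Python rendering of the same idea, leapfrog integration followed by a Metropolis accept/reject on the total energy, is sketched below. It is my own transcription in the spirit of Algorithm 30.1, not the book's Octave code, and the quadratic energy is an illustrative choice:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def hmc(E, grad_E, x0, eps=0.055, L=19, n_samples=1000):
    """Hamiltonian Monte Carlo with leapfrog steps."""
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.normal(size=x.shape)        # fresh Gaussian momentum
        H_old = E(x) + 0.5 * p @ p
        x_new = x.copy()
        for _ in range(L):                  # leapfrog integration
            p -= 0.5 * eps * grad_E(x_new)  # half step in momentum
            x_new += eps * p                # full step in position
            p -= 0.5 * eps * grad_E(x_new)  # half step in momentum
        H_new = E(x_new) + 0.5 * p @ p
        if H_new <= H_old or rng.uniform() < np.exp(H_old - H_new):
            x = x_new                       # Metropolis accept/reject
        samples.append(x.copy())
    return np.array(samples)

# Illustrative target: a strongly correlated 2-D Gaussian, E(x) = x^T A x / 2.
A = np.array([[250.25, -249.75], [-249.75, 250.25]])
out = hmc(lambda x: 0.5 * x @ A @ x, lambda x: A @ x, x0=np.zeros(2))
\end{verbatim}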
\subsection{Over-relaxation}
The method of overrelaxation is a method for reducing random walk behaviour
in Gibbs sampling. Overrelaxation was originally introduced for systems in
which all the conditional distributions are Gaussian.
In Adler's overrelaxation method, one instead samples $x^{(t+1)}_i$
from a Gaussian that is biased to the opposite side of the conditional distribution.
If the conditional distribution of $x_i$ is Normal($\mu$, $\sigma^2$) and the current value of
$x_i$ is $x^{(t)}_i$, then Adler's method sets $x_i$ to
$$
x^{(t+1)}_i = \mu +\alpha (x^{(t)}_i - \mu)+ (1-\alpha^2)^{1/2}\sigma \nu
$$
where $\nu$ $\sim$ Normal(0, 1) and $\alpha$ is a parameter between -1 and 1, usually set to
a negative value. (If $\alpha$ is positive, then the method is called under-relaxation.)
\subsubsection{Exercise 30.2}
Show that this individual transition leaves invariant the conditional
distribution $x_i \sim$ Normal($\mu$, $\sigma^2$).
$$
$$
Answer : $x^{(t+1)}_i = \mu +\alpha (x^{(t)}_i - \mu)+ (1-\alpha^2)^{1/2}\sigma \nu$.
$$
$$
The new $x^{(t+1)}_i$ is a linear combination of Gaussian variables, so its distribution is still Gaussian. Its mean is $\mu + \alpha(\mu-\mu) + 0 = \mu$, and its variance is $\alpha^2 \sigma^2 + (1-\alpha^2)\sigma^2 = \sigma^2$, so the standard deviation is $\sigma$ as well. To compute the mean and variance, you can use $ \nu \equiv {(x-\mu) \over \sigma}$.
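A quick numerical check of this invariance (all parameter values are arbitrary):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, alpha = 2.0, 3.0, -0.98

x = mu + sigma * rng.normal(size=1_000_000)  # x ~ Normal(mu, sigma^2)
nu = rng.normal(size=x.shape)
x_new = mu + alpha * (x - mu) + np.sqrt(1 - alpha ** 2) * sigma * nu

print(x_new.mean(), x_new.std())  # close to mu and sigma: invariant
\end{verbatim}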
\section{Ising Model}
\subsection{3 reasons for relevance}
1. Studying the Ising model is useful for understanding the phase transitions of various systems.
$$
$$
2. The Hopfield network and the Boltzmann machine in neural networks are generalized forms of the Ising model.
$$