\documentclass[a4paper,11pt]{article}
\pdfoutput=1 % if you are submitting a pdflatex (i.e. if you have
% images in pdf, png or jpg format)
\usepackage{jheppub} % for details on the use of the package, please
% see the JHEP-author-manual
\usepackage{hyperref}
\usepackage{amssymb,amsmath,amsfonts}
\usepackage{epsfig}
\usepackage{array}
%\usepackage{verbatim}
\usepackage{tikz}
\usetikzlibrary{calc,through,backgrounds,patterns,decorations.pathmorphing,decorations.markings,%
trees,positioning,arrows,chains,shapes.geometric,%
decorations.pathreplacing,shapes}
\usepackage{pgfplots}
%\usepgfplotslibrary{patchplots,fillbetween}
\usepgfplotslibrary{fillbetween}
\pgfplotsset{compat=1.10}
%\usepackage[dvipsnames]{xcolor}
\usepackage{multirow}
\usepackage{bm}
\newcommand{\vect}[1]{\boldsymbol{\mathbf{#1}}}
\newcommand{\CA}{\mathcal{A}}
\newcommand{\CB}{\mathcal{B}}
\newcommand{\CC}{\mathcal{C}}
\newcommand{\CD}{\mathcal{D}}
%\newcommand{\CE}{\mathcal{E}}
\newcommand{\CF}{\mathcal{F}}
\def\CJ{\mathcal{J}}
\newcommand{\CG}{\mathcal{G}}
\newcommand{\CL}{\mathcal{L}}
\newcommand{\CO}{\mathcal{O}}
\newcommand{\CN}{\mathcal{N}}
\newcommand{\CI}{\mathcal{I}}
\newcommand{\CT}{\mathcal{T}}
\newcommand{\CS}{\mathcal{S}}
\newcommand{\CM}{\mathcal{M}}
\newcommand{\CQ}{\mathcal{Q}}
\newcommand{\CE}{\mathcal{E}}
\newcommand{\CZ}{\mathcal{Z}}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\C}{\mathbb{C}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\gym}{g_{\rm YM}}
\newcommand{\eps}{\epsilon}
\newcommand{\grad}{\vec{\nabla}}
\newcommand{\el}{\kappa}
\newcommand{\ads}{\mbox{AdS}}
\newcommand{\nn}{\nonumber}
\newcommand{\lspa}{\ \ ,\ \ \ \ }
\newcommand{\spa}{\ , \ \ }
\newcommand{\ds}{\displaystyle}
\newcommand{\tr}{\mathop{{\rm Tr}}}
\DeclareMathOperator{\sgn}{sgn} \DeclareMathOperator{\diag}{diag}
\newcommand{\vecto}[2]{\left( \begin{array}{c} #1 \\ #2 \end{array}
\right) }
\newcommand{\matrto}[4]{\left( \begin{array}{cc} #1 & #2 \\
#3 & #4 \end{array} \right) }
\newcommand{\vectre}[3]{\left( \begin{array}{c} #1 \\ #2\\ #3 \end{array}
\right) }
\newcommand{\matrtre}[9]{\left( \begin{array}{ccc} #1 & #2 & #3 \\ #4 & #5 & #6 \\ #7 & #8 & #9 \end{array} \right) }
\newcommand{\hB}{{}^* \! B}
\newcommand{\hC}{{}^* \! C}
\newcommand{\hF}{{}^* \! F}
\newcommand{\hJ}{{}^* \! J}
\newtheorem{definition}{Definition}[section]
\newtheorem{proposition}[definition]{Proposition}
\newtheorem{theorem}[definition]{Theorem}
\newtheorem{lemma}[definition]{Lemma}
\newtheorem{corollary}[definition]{Corollary}
\newtheorem{comment}[definition]{Comment}
\newtheorem{example}[definition]{Example}
\newtheorem{conjecture}[definition]{Conjecture}
\newcommand{\proof}{\noindent {\bf Proof:}\ }
\newcommand{\squ}{\noindent $\square$}
\newcommand{\bin}[2]{\Big( \begin{array}{c} #1 \\[-1mm] #2 \end{array}
\Big) }
\def\Tr{\mathrm{Tr}}
%---------------------------------------------------------
\numberwithin{equation}{section}
%---------------------------------------------------
\begin{document}
In this note, I review \href{http://www.inference.phy.cam.ac.uk/mackay/itila/}{the book of David Mackay} and solve some of its more challenging exercises. To review and summarize the book, I quote or paraphrase its sentences without the author's approval; this note is not intended for any financial profit. You can find the materials for this note, such as Mathematica notebooks, their pdf copies, and some relevant Python code, at \href{https://github.com/physhik/Study-of-David-Mackay-s-book-}{\vect{my github}}.
\section{Introduction}
\subsection{Some formulae from Chapter 1 of the book}
\begin{figure}[h!]
\centerline{\includegraphics[scale=0.6]{probability_formulae.png}}
\caption{Some formulae}
\end{figure}
\subsection{Assumptions in inference}
First, once assumptions are made, the inferences are objective and unique,
reproducible with complete agreement by anyone who has the same information
and makes the same assumptions. For example, given the assumptions
listed above, $\mathcal{H}$, and the data D, everyone will agree about the posterior probability
of the decay length
$$
P(\lambda | D, \mathcal{H})= {P(D|\lambda , \mathcal{H})P(\lambda | \mathcal{H}) \over P(D|\mathcal{H})}
$$
Second, when the assumptions are explicit, they are easier to criticize and
easier to modify; indeed, we can quantify the sensitivity of our inferences to
the details of the assumptions.
Third, when we are not sure which of various alternative assumptions is
the most appropriate for a problem, we can treat this question as another
inference task. Thus, given data D, we can compare alternative assumptions
$\mathcal{H}$ using Bayes' theorem
$$
P( \mathcal{H}|D,I)= {P(D| \mathcal{H},I)P(\mathcal{H}|I) \over P(D|I)},
$$
where $I$ denotes the highest assumptions, which we are not questioning.
Fourth, we can take into account our uncertainty regarding such assumptions
when we make subsequent predictions. Rather than choosing one particular
assumption $\mathcal{H}^*$, and working out our predictions about some quantity $t$,
$P(t |D, \mathcal{H},I)$, we obtain predictions that take into account our uncertainty
about $\mathcal{H}$ by using the sum rule
$$
P({\textbf t} | D, I)= \sum_{\mathcal{H}} P({\textbf t}|D, \mathcal{H},I)P( \mathcal{H}|D,I)
$$
This is another contrast with orthodox statistics, in which it is conventional
to `test' a default model, and then, if the test `accepts the model' at some
`significance level', to use exclusively that model to make predictions.
{\it probability theory reaches parts that ad hoc methods cannot reach.}
Model comparison as inference. Assume we have two hypotheses. To perform model comparison, we wish to
know how probable $\mathcal{H}_1$ is given the data. By Bayes' theorem,
$$
P( \mathcal{H}_1|{\textbf s},F)= {P({\textbf s}|F, \mathcal{H}_1)P(\mathcal{H}_1) \over P({\textbf s}|F)},
$$
and
$$
P( \mathcal{H}_0|{\textbf s},F)= {P({\textbf s}|F, \mathcal{H}_0)P(\mathcal{H}_0) \over P({\textbf s}|F)}
$$
The normalizing constant in both cases is the total probability of getting the observed data,
$$
P({\textbf s}|F)=P({\textbf s}|F, \mathcal{H}_1)P(\mathcal{H}_1) +P({\textbf s}|F, \mathcal{H}_0)P(\mathcal{H}_0)
$$
To evaluate the posterior probabilities of the hypotheses we need to assign
values to the prior probabilities $P(\mathcal{H}_1)$ and $P(\mathcal{H}_0)$; in this case, we might
set these to 1/2 each. And we need to evaluate the data-dependent terms
$P({\textbf s}|F, \mathcal{H}_1)$ and $P({\textbf s}|F,\mathcal{H}_0)$. We can give names to these quantities. The
quantity $P({\textbf s}|F, \mathcal{H}_1)$ is a measure of how much the data favour $\mathcal{H}_1$, and we
call it the evidence for model $\mathcal{H}_1$. {\it How model comparison works: the evidence for a model is usually the normalizing constant of an earlier Bayesian inference.}
\section{Clustering and Maximum Likelihood}
\subsection{Motivation of clustering}
First, a good clustering has
predictive power. Second, clusters can be a useful aid to communication because they allow
lossy compression. A third reason for making a cluster model is that failures of the cluster
model may highlight interesting objects that deserve special attention. A fourth reason for liking clustering algorithms is that they may serve
as models of learning processes in neural systems.
\subsection{K-means}
The K-means algorithm is an algorithm for putting $N$ data points in an $M$-dimensional space into $K$ clusters. Each cluster is parameterized by a vector
${\textbf m}^{(k)}$ called its mean.
First of all, set the $K$ means $\{{\textbf m}^{(k)}\}$ to random values. In the assignment step, each data point $n$ is
assigned to the nearest mean.
$$
\hat k^{(n)} = \mathop{\mathrm{argmin}}_k d({\textbf m}^{(k)}, {\textbf x}^{(n)})
$$
An alternative, equivalent representation of this assignment of
points to clusters is given by `responsibilities', which are indicator
variables $r^{(n)}_k$. In the assignment step, we set $r^{(n)}_k$ to one if mean k is the closest mean to datapoint $ {\textbf x}^{(n)}$; otherwise, $r^{(n)}_k$ is zero.
$$
r^{(n)}_k = \left\{
\begin{array}{ll}
1 & \mbox{if } \hat k^{(n)}=k\\
0 & \mbox{if } \hat k^{(n)}\ne k\\
\end{array}
\right.
$$
In the update step, the means are adjusted to
match the sample means of the data points that they are responsible for. The update step is very similar to finding the centre of mass in physics.
$$
{\textbf m}^{(k)}= {\sum_n ~ r^{(n)}_k {\textbf x}^{(n)} \over R^{(k)}}
$$
where $R^{(k)}$ is the total responsibility of mean $k$
$$
R^{(k)}=\sum_n~r^{(n)}_k
$$
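To make the two steps concrete, here is a minimal NumPy sketch of the algorithm (the function name, the random initialization, and the fixed iteration count are my own illustrative choices, not the book's code):
\begin{verbatim}
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: X has shape (N, M); returns K means and assignments."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]  # random initial means
    for _ in range(n_iters):
        # Assignment step: each point n goes to its nearest mean k_hat[n].
        d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        k_hat = d.argmin(axis=1)
        # Update step: each mean moves to the centre of mass of its points.
        for k in range(K):
            if (k_hat == k).any():
                means[k] = X[k_hat == k].mean(axis=0)
    return means, k_hat
\end{verbatim}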
\subsection{Exercise 22.5}
$$
P(k_n = 1|x_n, \vect{\theta})= {P(x_n|k_n=1,\vect{\theta})P(k_n=1|\vect{\theta}) \over P(x_n|\vect{\theta})}
$$
$$
= {P(x_n|k_n=1,\vect{\theta})P(k_n=1|\vect{\theta}) \over P(x_n|k_n=1,\vect{\theta})P(k_n=1|\vect{\theta}) +P(x_n|k_n=2,\vect{\theta})P(k_n=2|\vect{\theta}) }
$$
where $\vect{\theta}=(\mu_k, ~\sigma_k)$. Define
$$
P(k_n=1|\vect{\theta}) \equiv p_1,~~P(k_n=2|\vect{\theta}) \equiv p_2
$$
Then,
$$
P(k_n=1|x_n,\vect{\theta})= {p_1 \over p_1+p_2 \exp[-(w_1x_n+w_0)]}
$$
$$
P(k_n=2|x_n,\vect{\theta})= {p_2 \over p_2+p_1 \exp[w_1x_n+w_0]}
$$
where $w_1=2(\mu_1-\mu_2)$ and $w_0=-(\mu_1-\mu_2)(\mu_1+\mu_2)$.
$$
P(k_n=k|x_n,\vect{\theta}) \equiv p_{k|n}
$$
By assumption, the prior probability $p_1=p_2=1/2$ then, (22.17) of the book is satisfied.
$$
L\equiv \log \prod_n P(x_n|\{\mu_k\},~\sigma)
$$
then trivially,
$$
{\partial \over \partial{\mu_k}}L = \sum_n {p_{k|n} (x_n-\mu_k)\over \sigma^2}
$$
$$
{\partial^2 \over \partial{\mu_k}^2}L = -\sum_n {p_{k|n} \over \sigma^2}
$$
The new updated $\vect \mu'$ should maximize the likelihood. Then,
$$
{\partial \over \partial{\mu_k'}}L = \sum_n {p_{k|n} (x_n-\mu_k')\over \sigma^2}=0
$$
$$
\sum_n p_{k|n} ~x_n - \sum_n p_{k|n} ~\mu'_k=\sum_n p_{k|n} ~x_n - \mu'_k\sum_n p_{k|n}=0
$$
Therefore,
$$
\mu'_k={\sum_n p_{k|n} ~x_n \over \sum_n p_{k|n} }
$$
Note that this equation is exactly the same as the update of the means from the responsibilities and data points in soft K-means clustering; $p_{k|n}$ is the responsibility $r^{(n)}_k$.
$$
{{\partial \over \partial{\mu_k}}L \over {\partial^2 \over \partial{\mu_k}^2}L } = {\sum_n p_{k|n} ~x_n - \mu_k\sum_n p_{k|n} \over -\sum_n p_{k|n}}= -\mu'_k+\mu_k
$$
Thus,
$$
\mu'_k=\mu_k-{{\partial \over \partial{\mu_k}}L \over {\partial^2 \over \partial{\mu_k}^2}L }
$$
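The responsibility and mean updates derived above can be iterated directly. Below is a minimal sketch for two one-dimensional clusters with equal priors $p_1 = p_2 = 1/2$ and a common width $\sigma$ (the names and the data are illustrative, not the book's code):
\begin{verbatim}
import numpy as np

def soft_kmeans_1d(x, mu, sigma=1.0, n_iters=50):
    """Soft K-means for two 1-D clusters with equal priors p1 = p2 = 1/2."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float).copy()
    for _ in range(n_iters):
        # Responsibilities p_{k|n} proportional to exp(-(x_n - mu_k)^2 / 2 sigma^2).
        logp = -(x[None, :] - mu[:, None]) ** 2 / (2 * sigma ** 2)  # shape (2, N)
        p = np.exp(logp - logp.max(axis=0))
        p /= p.sum(axis=0)
        # Update derived above: mu'_k = sum_n p_{k|n} x_n / sum_n p_{k|n}.
        mu = (p * x).sum(axis=1) / p.sum(axis=1)
    return mu, p

mu, p = soft_kmeans_1d(x=[0.1, 0.4, 0.9, 4.0, 4.2, 4.9], mu=[0.0, 1.0])
\end{verbatim}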
\subsection{Exercise 22.15}
Question : N datapoints $\{x_n\}$ are drawn from
N distributions, all of which are Gaussian with a common mean $\mu$ but
with different unknown standard deviations $\sigma_n$. What are the maximum
likelihood parameters $\mu$, $\{\sigma_n\}$ given the data? For example, seven scientists (A, B, C, D, E, F, G) with wildly-differing experimental skills
measure $\mu$.
\begin{eqnarray*}
x_n(A) &=& -27.020\\
x_n(B) &=& 3.570\\
x_n(C) &=& 8.191\\
x_n(D) &=& 9.898\\
x_n(E) &=& 9.603\\
x_n(F) &=& 9.945\\
x_n(G) &=& 10.056
\end{eqnarray*}
You expect some of them to do accurate work (i.e., to have
small $\sigma_n$), and some of them to turn in wildly inaccurate answers (i.e.,
to have enormous $\sigma_n$). What is $\mu$, and how reliable is each scientist?
$$
$$
Answer :
$$
N=7
$$
$$
x_n=(-27.02, 3.57, 8.191, 9.898, 9.603, 9.945, 10.056)
$$
$$
{\sum_n x_n\over N}=3.46329
$$
This is clearly not the correct mean.
From the description of the problem, the likelihood is
$$
P(\{x_n\}|\{\sigma_n\},\mu) = \left({1 \over 2\pi }\right)^{N/2} \prod_n{1\over \sigma_n } \exp\left(-\sum_n {(x_n-\mu)^2 \over 2\sigma^2_n}\right)
$$
To find the maximum likelihood,
$$
L \equiv \log P(\{x_n\}|\{\sigma_n\},\mu) = -\sum_n \log~\sigma_n-\sum_n {(x_n -\mu)^2 \over 2\sigma_n^2}
$$
$$
{\partial L \over \partial \mu} = \sum_n {x_n-\mu \over \sigma^2_n}=\sum_n{x_n \over \sigma_n^2}-\mu\sum_n {1 \over \sigma_n^2} = 0
$$
$$
\mu= {\sum_n{x_n\over \sigma_n^2}\over \sum_n{1\over \sigma_n^2}}
$$
Similarly, setting $\partial L/\partial \sigma_n = -1/\sigma_n + (x_n-\mu)^2/\sigma_n^3$ to zero gives the maximum-likelihood noise levels $\sigma_n = |x_n-\mu|$, which tell us how reliable each scientist is. For $x_n=(-27.02,~3.57)$ the corresponding $\sigma_n$ are huge, so these points contribute very little to $\mu$; the mean should therefore be close to the mean of $(8.191, 9.898, 9.603, 9.945, 10.056)$.
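Given guesses for the noise levels, this formula is just a precision-weighted mean. A small numerical illustration, with $\sigma_n$ chosen by hand (these values are hypothetical, not fitted):
\begin{verbatim}
import numpy as np

x = np.array([-27.020, 3.570, 8.191, 9.898, 9.603, 9.945, 10.056])
# Hypothetical noise levels: large for the outliers A and B, small otherwise.
sigma = np.array([100.0, 10.0, 1.0, 0.3, 0.3, 0.3, 0.3])

w = 1.0 / sigma ** 2            # precisions
mu = (w * x).sum() / w.sum()    # precision-weighted mean
print(mu)                       # about 9.8, near the five consistent measurements
\end{verbatim}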
\subsection{Exercise 24.3}
Question : [This exercise requires macho integration capabilities.] Give
a Bayesian solution to exercise 22.15, where seven scientists of
varying capabilities have measured $\mu$ with personal noise levels $\sigma_n$,
and we are interested in inferring $\mu$. Let the prior on each $\sigma_n$ be a
broad prior, for example a gamma distribution with parameters $(s,c) =
(10, 0.1)$. Find the posterior distribution of $\mu$. Plot it, and explore its
properties for a variety of data sets such as the one given, and the data
set $\{x_n\} = \{13.01, 7.39\}$.
$$
$$
Answer :
Bayes' theorem for the posterior probability of $\sigma_n$ is
$$
P(\sigma_n|\{x_n\},\mu)= {P(\{x_n\}|\sigma_n,\mu) P(\sigma_n)\over P(\{x_n\}|\mu)}
$$
Given
$$
P(\{x_n\}|\sigma_n,\mu) = ({1 \over 2\pi })^{N/2} \left( \Pi_n{1\over \sigma_n }\right) \exp\left(-\sum_n {(x_n-\mu)^2 \over 2\sigma^2_n}\right)
$$
and
$$
P(\{\sigma_n\})=\left({1\over \Gamma(c)~s^c}\right)^N\prod_n~(\sigma_n)^{c-1}~\exp\left(-{\sum_n \sigma_n \over s}\right)$$
where $(s,~c)=(10,~0.1)$,
and the normalizing constant is
$$
P(\{x_n\}|\mu)=\int^{\infty}_0~\prod_n~d\sigma_n ~P(\{x_n\}|\{\sigma_n\},\mu) P(\{\sigma_n\})
$$
\begin{figure}
\centerline{
\begin{tabular}{cc}
\includegraphics[width=.5\textwidth]{7scientistplot.png} &
\includegraphics[width=.5\textwidth]{7scientistplot2.png}
\end{tabular}}
\caption[The plot of exercise 24.3]{\label{7scientist} \small The distribution functions are not normalized yet, but the plot already shows that the posterior maximum should be around $\mu=10$. The small bumps around $x_n=(-27,~3.6,~8)$ are also visible in the graph. }
\end{figure}
The posterior probability of $\mu$ is
$$
P(\mu|x_n)={P(\{x_n\}|\mu)P(\mu)\over P(\{x_n\})}
$$
and the prior is determined by some assumption.
$P(\mu) = {1 \over \sigma_{\mu}}=const.$, then the normalizing constant is
$$
P(\{x_n\})=\int^{\infty}_{-\infty} ~d\mu ~P(\{x_n\}|\mu) {1 \over \sigma_{\mu}}
$$
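The posterior can also be evaluated numerically. The sketch below integrates each $\sigma_n$ out on a logarithmic grid, with the flat prior on $\mu$ assumed above; the grid ranges and resolution are my own choices:
\begin{verbatim}
import numpy as np

x_data = np.array([-27.020, 3.570, 8.191, 9.898, 9.603, 9.945, 10.056])
s, c = 10.0, 0.1  # gamma prior parameters (s, c) from the exercise

sig = np.logspace(-4, 3, 2000)               # grid for integrating sigma_n out
log_prior = (c - 1) * np.log(sig) - sig / s  # unnormalized gamma prior
trap = lambda y, t: np.sum((y[:-1] + y[1:]) * np.diff(t)) / 2  # trapezoid rule

def likelihood(mu):
    """P({x_n}|mu) with every sigma_n integrated out, up to a constant."""
    out = 1.0
    for xn in x_data:
        f = np.exp(-(xn - mu) ** 2 / (2 * sig ** 2) + log_prior) / sig
        out *= trap(f, sig)
    return out

mus = np.linspace(-30, 30, 601)
post = np.array([likelihood(m) for m in mus])
post /= trap(post, mus)      # normalize under the flat prior on mu
print(mus[np.argmax(post)])  # the peak is near mu = 10, with bumps at the data
\end{verbatim}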
\section{Gaussian approximation and model comparison}
\subsection{Laplace approximation}
\subsubsection{Exercise 27.1}
$r$ is a positive integer, and the posterior over $\lambda$ is
$$
P(\lambda|r)= {P(r|\lambda) P(\lambda)\over P(r)}
$$
$$
P(r|\lambda)=\exp(-\lambda){\lambda^r \over r!}
$$
By assumption
$$
P(\lambda)={1\over \lambda}
$$
Then, the normalizing constant is
$$
P(r)=\int^{\infty}_0~d\lambda~ \exp(-\lambda){\lambda^r \over r!}{1\over \lambda}= {\Gamma(r) \over r!} = {1\over r}
$$
We need to find the $\lambda$ that maximizes the posterior. Differentiating the posterior probability distribution function with respect to $\lambda$,
it has a maximum at $\lambda = \lambda_0=r-1$, and
$$
P(\lambda=\lambda_0=r-1|r)={(r-1)^{(r-1)}\exp(-(r-1))\over (r-1)!}
$$
$$
c=-{\partial^2 \over \partial \lambda^2}\log~P(\lambda|r)|_{\lambda=\lambda_0=r-1}={r-1 \over \lambda^2}|_{\lambda=\lambda_0=r-1}={1\over r-1}
$$
Then, the Laplace approximation is
$$
G(\lambda|r)= P(\lambda=\lambda_0=r-1|r)\exp\left(-{c\over 2}{(\lambda-\lambda_0)^2}\right)
$$
\begin{figure}
\centerline{
\begin{tabular}{cc}
\includegraphics[width=.5\textwidth]{laplace.pdf} &
\includegraphics[width=.5\textwidth]{laplace2.pdf}
\end{tabular}}
\caption[The plot of exercise 27.1]{\label{laplace} \small $r=10$ is set. The blue plot is the exact posterior and the orange one is the Laplace approximation. They match very well around the maximum at $\lambda=9$; the exact posterior is not symmetric about $\lambda=9$, whereas the Laplace approximation is symmetric with the same maximum.
}
\end{figure}
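For completeness, a short sketch that computes the two curves of Figure \ref{laplace} from the corrected formulas above (plotting is omitted, and the variable names are mine):
\begin{verbatim}
import numpy as np
from scipy.special import gammaln

r = 10
lam = np.linspace(0.01, 30, 500)

# Exact posterior: P(lambda|r) = exp(-lambda) lambda^(r-1) / (r-1)!
post = np.exp(-lam + (r - 1) * np.log(lam) - gammaln(r))

# Laplace approximation: Gaussian at lambda_0 = r-1 with curvature c = 1/(r-1).
lam0, c = r - 1.0, 1.0 / (r - 1.0)
peak = np.exp(-(r - 1) + (r - 1) * np.log(r - 1.0) - gammaln(r))
laplace = peak * np.exp(-c * (lam - lam0) ** 2 / 2)
\end{verbatim}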
\subsubsection{Exercise 27.3}
Bayes' theorem for the posterior of $\omega_0, \omega_1$; here $y(x_n)$ is the mean of the data point $t_n$.
$$
P(\{\omega_i\}|t_n,x_n, \sigma_{\nu})= { P(t_n|\{\omega_i\},x_n,\sigma_{\nu})P(\omega_0)P(\omega_1)\over P(t_n|x_n,\sigma_{\nu})}
$$
$P(t_n|\{\omega_i\},\sigma_{\nu})$ is Gaussian, and the priors $P(\omega_0)$ and $P(\omega_1)$ are assumed to be Gaussian as well.
$$
P(t_n|\sigma_{\nu})= \int^{\infty}_{-\infty} d\omega_0 \int^{\infty}_{-\infty} d\omega_1~ P(t_n|\{\omega_i\},\sigma_{\nu})P(\omega_0)P(\omega_1)
$$
$$
P(\{t_n\}|\{\omega_i\},x_n,\sigma_{\nu}) = \left({1 \over 2\pi \sigma^2_{\nu}}\right)^{N/2} \exp\left(-\sum_n {(t_n-y(x_n))^2 \over 2\sigma^2_{\nu}}\right)
$$
$$
y(x_n)=\omega_0 + \omega_1 x_n
$$
$$
P(\omega_i)= \left({1 \over 2\pi \sigma_i^2}\right)^{1/2} \exp\left(-{\omega_i^2 \over 2\sigma_i^2}\right)
$$
The mean of each $\omega_i$ should obviously be zero, and the variances of $\omega_0$ and $\omega_1$, i.e.\ $\sigma_0^2$ and $\sigma_1^2$, can be taken equal without loss of generality. The variance of $t_n$ can take a broad range of values depending on $x_n$; I assumed $\sigma_i^2 = \sigma/3$ in Figure \ref{BayesianRegression}.
To use the Laplace approximation, find the $\{\omega_i\}$ that satisfy ${\partial \over \partial \omega_i}\log P(\{\omega_i\}|t_n,x_n, \sigma)=0$. \footnote{Do you have any better suggestion?}
\begin{equation}\label{blr}
\log P(\{\omega_i\}|t_n,x_n, \sigma) = -\left(\sum_n {(t_n-\omega_0 - \omega_1 x_n)^2 \over 2\sigma^2}\right)-{\omega_0^2 \over 2\sigma_0^2}-{\omega_1^2 \over 2\sigma_1^2}
\end{equation}
Some reader might notice that $-\sum_n {(t_n-\omega_0 - \omega_1 x_n)^2 \over 2\sigma^2}$ term is minimized for the linear regression. We have two extra terms from the priors. It is {\it Bayesian linear regression}.
By the way, the $\omega_0^*$ and $\omega_1^*$ that maximize Eq.~(\ref{blr}) are quite complicated. I tried to find a simplified form but could not, so I just keep the symbolic expressions of $\omega_0^*$, $\omega_1^*$.
Using the Eq. (27.6) of the book,
$$
\vect A={1\over \sigma^2}\left(
\begin{array}{cc}
N+1 & \sum_n x_n \\
\sum_n x_n & \sum_n x_n^2 + 1 \\
\end{array}
\right)
$$
Then, the Laplace approximation is
$$
G(\{\omega_i\}|t_n)= P(\{\omega_i^*\}|t_n,x_n, \sigma) \exp\left( -{1 \over 2} (\vect \omega - \vect \omega^*)^T \vect A (\vect \omega - \vect \omega^*)\right)
$$
where $ \vect \omega = (\omega_0, \omega_1)$ and $ \vect \omega^* = (\omega_0^*, \omega_1^*)$.
\begin{figure}
\centerline{
\begin{tabular}{cccc}
\includegraphics[width=.4\textwidth]{Likelihood.png} &
\includegraphics[width=.4\textwidth]{BayesianRegression.png} &
\includegraphics[width=.4\textwidth]{BayesianRegression2.png}&
\includegraphics[width=0.1\textwidth]{BayesianRegression3.png}
\end{tabular}}
\caption[The plot of exercise 27.3]{\label{BayesianRegression} \small The likelihood and posterior probability distributions and the Laplace approximation. I have put in 10 random data points $x_n$ with $t_n \approx 3x_n+2 \pm 2$. The left plot has its maximum likelihood around $(\omega_0, \omega_1) = (1.79794, 3.02217)$, and the central posterior peaks at about $(\omega_0, \omega_1) = (0.812961, 3.16977)$. The discrepancy comes from the assumed priors: zero-mean Gaussians with equal variances $\sigma_i^2 = \sigma/3$, chosen on the basis of the mean value of $x_n$. The last plot is obtained by the Laplace approximation. None of the plots is normalized.}
\end{figure}
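Since the log posterior of Eq.~(\ref{blr}) is quadratic in $(\omega_0, \omega_1)$, its mode solves a linear system built from the Hessian $\vect A$ above. A minimal sketch, with synthetic data drawn as in the caption (the noise level and the prior variances are the illustrative assumptions stated above):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=10)
t = 3 * x + 2 + rng.normal(0, 2, size=10)  # t_n ~ 3 x_n + 2 +- 2

sigma = 2.0             # noise level sigma_nu
sigma_w2 = sigma / 3.0  # prior variances sigma_i^2 = sigma/3, as assumed above

Phi = np.column_stack([np.ones_like(x), x])          # design matrix [1, x_n]
A = Phi.T @ Phi / sigma ** 2 + np.eye(2) / sigma_w2  # Hessian of -log posterior
b = Phi.T @ t / sigma ** 2
w_star = np.linalg.solve(A, b)  # posterior mode (omega_0*, omega_1*)
print(w_star)                   # shrunk toward zero relative to least squares
\end{verbatim}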
\subsection{Occam's razor}
This section is very well summarized by the figures in the book. The Occam factor is defined in the process of model comparison. The Occam factor ${\sigma_{\omega|D}\over \sigma_{\omega}}$ is equal to the ratio of the posterior accessible volume of $\mathcal{H}_i$'s parameter space to the prior accessible volume, or the factor by which $\mathcal{H}_i$'s hypothesis space collapses when the data arrive. The factor is smaller, i.e.\ the penalty is stronger, for more complex models. The logarithm of the Occam factor is a measure of the amount of
information we gain about the model's parameters when the data arrive.
{\it Occam factor for several parameters}
If the posterior is well approximated by a Gaussian, then the Occam factor
is obtained from the determinant of the corresponding covariance matrix (cf.
equation (28.8) and Chapter 27):
$$
P(D|\mathcal{H}_i) \sim P(D|\vect w_{MP}, \mathcal{H}_i) \times P(\vect w_{MP}| \mathcal{H}_i) \det (\vect A/2\pi)^{-1/2}
$$
$~~~~~~~~~~~$Evidence $\sim$ Best fit likelihood $\times$ Occam factor
where $\vect A = -\nabla \nabla \log P(\vect w|D, \mathcal{H}_i)$ is the Hessian at $\vect w_{MP}$. Bayesian model comparison is a simple extension of maximum likelihood model selection: the evidence is obtained by multiplying the best-fit likelihood by the Occam factor. To evaluate the Occam factor we need only the Hessian $\vect A$, if the Gaussian
approximation is good. Thus the Bayesian method of model comparison by
evaluating the evidence is no more computationally demanding than the task
of finding for each model the best-fit parameters and their error bars.
\subsubsection{Exercise 28.1}
For the given data $x= 0.3,0.5,0.7,0.8,0.9$, which hypothesis is more probable? By intuition, $\mathcal{H}_0$, because the data look rather uniformly distributed.
$P(D|\mathcal{H}_0)$ is ${1 \over 2}\times {1 \over 2} \times{1 \over 2} \times{1 \over 2}\times {1 \over 2} = 0.03125$ since the probability distribution is uniform.
$P(D|\mathcal{H}_1)$ is the product of ${1 \over 2} (1- 0.4 x)$ over $x= 0.3,0.5,0.7,0.8,0.9$, which is about $0.00689357$. The ratio is ${P(D|\mathcal{H}_0) \over P(D|\mathcal{H}_1)} \approx 4.53321$, so $\mathcal{H}_0$ is about 4.5 times more probable.
\subsubsection{Exercise 28.2}
Look at the data points plot.
\begin{figure}[h!]
\centerline{
\includegraphics[width=.7\textwidth]{datapointsExercise28-2.png}}
\caption[The plot of exercise 28.2-1]{\label{Exercise28-2-1} \small }
\end{figure}
The best fit would be almost flat, but the slope is not zero, and the $y$-intercept would be around 10. Which hypothesis is more probable by your intuition?
The given data points are $D= \{(x, t)\} = \{(-8, 8), (-2, 10), (6, 11)\}$.
The evidence is
{\small
$$
P(D|\sigma_{\nu}, \sigma_{\omega_0},\sigma_{\omega_1},\mathcal{H}_i) = \int^{\infty}_{-\infty} d\omega_0 \int^{\infty}_{-\infty} d\omega_1 P(D|\omega_0, \omega_1,\sigma_{\nu}, \mathcal{H}_i) P(\omega_0| \sigma_{\omega_0},\sigma_{\omega_1},\mathcal{H}_i)P(\omega_1| \sigma_{\omega_0},\sigma_{\omega_1},\mathcal{H}_i)
$$ }
For $\mathcal{H}_2$,
{\tiny
$$
P(D|\sigma_{\nu}, \sigma_{\omega_0},\sigma_{\omega_1},\mathcal{H}_2) = \int^{\infty}_{-\infty} d\omega_0 \int^{\infty}_{-\infty} d\omega_1
{1\over \sqrt{2\pi }\sigma_{\nu}}
\exp\left(-{(\omega_0+\omega_1 x - t)^2\over 2\sigma^2_{\nu}}\right){1\over \sqrt{2\pi }\sigma_{\omega_0}}
\exp\left(-{(\omega_0-\mu_{\omega_0})^2\over 2\sigma^2_{{\omega_0}}}\right){1\over \sqrt{2\pi }\sigma_{\omega_1}}
\exp\left(-{(\omega_1-\mu_{\omega_1})^2\over 2\sigma^2_{{\omega_1}}}\right)
$$ }
Then the evidence factors for $\mathcal{H}_2$ for the three data points are about $0.0302393$, $0.0000391484$, and $0.0131697$; their product, the total evidence, is about $1.55905 \times 10^{-8}$.
For $\mathcal{H}_1$, $\omega_1 =0$ is fixed, so the corresponding integral and prior factor drop out:
{\small $$
P(D|\sigma_{\nu}, \sigma_{\omega_0},\mathcal{H}_1) = \int^{\infty}_{-\infty} d\omega_0
{1\over \sqrt{2\pi }\sigma_{\nu}}
\exp\left(-{(\omega_0 - t)^2\over 2\sigma^2_{\nu}}\right){1\over \sqrt{2\pi }\sigma_{\omega_0}}
\exp\left(-{(\omega_0-\mu_{\omega_0})^2\over 2\sigma^2_{{\omega_0}}}\right)
$$ }
The evidence factors for $\mathcal{H}_1$ for the three data points are about $3.17456 \times 10^{-8}$, $3.91772 \times 10^{-12}$, and $2.05583 \times 10^{-14}$; the total evidence is about $2.55684 \times 10^{-33}$.
Therefore, the horizontal line has much smaller evidence, which agrees with the intuition.
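The same comparison can be done with the closed form for a linear-Gaussian model: with zero-mean Gaussian priors, the evidence is a Gaussian in the data vector with covariance $\Phi \Sigma_\omega \Phi^T + \sigma_\nu^2 I$. The sketch below uses prior widths of my own choosing (they are not specified above, and the resulting ratio depends on them):
\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

x = np.array([-8.0, -2.0, 6.0])
t = np.array([8.0, 10.0, 11.0])

sigma_nu = 1.0      # assumed noise level
s0, s1 = 10.0, 0.5  # assumed prior widths on omega_0 and omega_1

Phi = np.column_stack([np.ones_like(x), x])

# H_2 (sloped line): covariance Phi diag(s0^2, s1^2) Phi^T + sigma_nu^2 I.
C2 = Phi @ np.diag([s0 ** 2, s1 ** 2]) @ Phi.T + sigma_nu ** 2 * np.eye(3)
ev2 = multivariate_normal(mean=np.zeros(3), cov=C2).pdf(t)

# H_1 (horizontal line): only omega_0 remains.
C1 = s0 ** 2 * np.ones((3, 3)) + sigma_nu ** 2 * np.eye(3)
ev1 = multivariate_normal(mean=np.zeros(3), cov=C1).pdf(t)

print(ev2 / ev1)  # > 1 for these choices: the sloped line is favoured
\end{verbatim}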
\section{Monte Carlo method}
When the likelihood function
is multimodal and has nasty, unboundedly high spikes in certain locations in
the parameter space, maximizing the posterior probability and fitting a
Gaussian is not always going to work. This difficulty with Laplace's method is
one motivation for being interested in Monte Carlo methods. In fact, Monte
Carlo methods provide a general-purpose set of tools with applications in
Bayesian data modelling and many other fields.
The book uses the word {\it sample} in the following sense: a sample
from a distribution $P(\vect x)$ is a single realization $\vect x$ whose probability distribution
is $P(\vect x)$. This contrasts with the alternative usage in statistics, where {\it sample}
refers to a collection of realizations $\{\vect x\}$.
The aims of Monte Carlo methods are to solve one or both of the
following problems.
\vect{Problem 1}: to generate samples $\{x^{(r)}\}^R_{r=1}$ from a given probability distribution
$P(x)$.
\vect{Problem 2}: to estimate expectations of functions under this distribution, for
example
$$
\Phi = \int d^N \vect x P(\vect x) \phi(\vect x)
$$
Estimator
$$
\hat \Phi = {1 \over R} \sum_r \phi(\vect x^{(r)})
$$
If the vectors $\{x^{(r)}\}^R_{r=1}$ are generated from $P(x)$, then the expectation of $\hat \Phi$ is $\Phi$, and the variance of $\hat \Phi$ is the variance $\sigma^2$ of $\phi$ divided by $R$.\footnote{Look at \href{https://github.com/physhik/Study-of-David-Mackay-s-book-/blob/master/PopulationAndSampling.pdf}{\vect{this}} for a simple example.}
$$
\sigma^2 = \int d^N \vect x P(\vect x) (\phi(\vect x) - \Phi)^2
$$
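A quick demonstration of the estimator and its $\sigma^2/R$ error bar (the target distribution and the function $\phi$ are illustrative choices):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
R = 100_000

# Estimate Phi = E[phi(x)] under P = Normal(0,1) with phi(x) = x^2, so Phi = 1.
x = rng.normal(size=R)
phi = x ** 2
print(phi.mean(), phi.std() / np.sqrt(R))  # estimate and its standard error
\end{verbatim}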
We assume that we can evaluate $P^*(\vect x)$, where
$$
P(\vect x)=P^*(\vect x)/Z
$$
but we do not know $Z$. Secondly, even if we did know $Z$, the problem of drawing samples
from $P(\vect x)$ is still a challenging one, especially in high-dimensional spaces,
because there is no obvious way to sample from $P$ without enumerating most
or all of the possible states. Correct samples from $P$ will by definition tend
to come from places in $\vect x$-space where $P(\vect x)$ is big; how can we identify those
places where $P(\vect x)$ is big, without evaluating $P(\vect x)$ everywhere? There are only
a few high-dimensional densities from which it is easy to draw samples, for
example the Gaussian distribution.
\subsection{Exercise 29.3}
Problem : Consider a one-dimensional random walk, on each step of
which the state moves randomly to the left or to the right with equal
probability. Show that after N steps of size a, the state is likely to have
moved only a distance about $\sqrt{N}a$. (Compute the root mean square
distance travelled.)
Answer : Count only the right steps; each occurs with probability $p = 1/2$, so the number of right steps $R$ is binomial with mean $Np = N/2$ and standard deviation $\sqrt{Np(1-p)}=\sqrt{N}/2$. The displacement is $(2R-N)a$, whose mean is zero and whose standard deviation is $2a \times \sqrt{N}/2 = \sqrt{N}a$. The root mean square distance travelled is therefore about $\sqrt{N}a$.
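A quick simulation confirms this (the step count and number of trials are arbitrary):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
N, a, trials = 10_000, 1.0, 2_000

steps = rng.choice([-a, a], size=(trials, N))  # N steps of +-a per walk
final = steps.sum(axis=1)
rms = np.sqrt((final ** 2).mean())
print(rms, np.sqrt(N) * a)  # the two agree: rms distance is about sqrt(N) a
\end{verbatim}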
\subsection{Rejection sampling}
We assume
that we have a simpler proposal density $Q(x)$ which we can evaluate (within a
multiplicative factor $Z_Q$, as before), and from which we can generate samples.
We further assume that we know the value of a constant $c$ such that
$$
c Q^*(x) > P^*(x) \quad \mbox{for all } x.
$$
We generate two random numbers. The first, $x$, is generated from the
proposal density $Q(x)$. We then evaluate $cQ^*(x)$ and generate a uniformly
distributed random variable $u$ from the interval $[0, cQ^*(x)]$. These two random
numbers can be viewed as selecting a point in the two-dimensional plane.
We now evaluate $P^*(x)$ and accept or reject the sample x by comparing the
value of u with the value of $P^*(x)$. If $u > P^*(x)$ then $x$ is rejected; otherwise
it is accepted, which means that we add $x$ to our set of samples $\{x^{(r)}\}$. The
value of $u$ is discarded.
\href{https://github.com/physhik/Study-of-David-Mackay-s-book-/blob/master/RejectSamplingMC.ipynb}{\vect{The rejection sampling IPython notebook}} presents the Python code for the sampling; the bold text links to the notebook on GitHub. In the example, $Q^*(x)$ is trivially given as a constant, and $P^*(x)$ is given as a specific function.
\begin{figure}[h!]
\centerline{\includegraphics[scale=0.4]{rejectSampling.png}}
\caption{Rejection sampling. $cQ^*(x)$ is a constant equal to the maximum value of the function $P^*(x)$. }\label{rejectsampling}
\end{figure}
The
acceptance rate is the ratio of the volume under the curve $P(x)$ to the volume
under $cQ(x)$; since $P$ and $Q$ are both normalized here, the
acceptance rate will be $1/c$. In an $N$-dimensional problem, if a one-dimensional bound $c$ applies in each dimension, the acceptance rate becomes $(1/c)^N$.
Rejection sampling, therefore, whilst a useful method for one-dimensional
problems, is not expected to be a practical technique for generating samples
from high-dimensional distributions $P(x)$.
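A minimal sketch of the procedure described above, with an illustrative $P^*(x)$ and a uniform proposal (this is my own rendering, not the notebook's code):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def p_star(x):
    # Unnormalized target density (an arbitrary illustrative choice).
    return np.exp(-x ** 2 / 2) * (1 + np.sin(3 * x) ** 2)

c_q = 2.0  # constant cQ*(x); it must exceed P*(x) everywhere
samples = []
while len(samples) < 10_000:
    x = rng.uniform(-4, 4)   # draw from the proposal density Q
    u = rng.uniform(0, c_q)  # uniform height in [0, cQ*(x)]
    if u <= p_star(x):       # accept if the point falls under P*
        samples.append(x)
\end{verbatim}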
\subsection{The Metropolis-Hastings method}
The Metropolis-Hastings algorithm instead makes use of a proposal density
$Q$ which depends on the current state $x^{(t)}$. The density $Q(x', x^{(t)})$ might
be a simple distribution such as a Gaussian centred on the current $x^{(t)}$. The
proposal density $Q(x', x^{(t)})$ can be any fixed density from which we can draw
samples. In contrast to importance sampling and rejection sampling, it is not
necessary that $Q(x', x^{(t)})$ look at all similar to $P(x)$ in order for the algorithm
to be practically useful.
We assume that we can evaluate $P^*(x)$ for any $x$. A tentative
new state $x'$ is generated from the proposal density $Q(x', x^{(t)})$. To decide whether to accept the new state, we compute the quantity
$$
a= {P^*(x')~Q( x^{(t)},x') \over P^*(x^{(t)})~Q(x', x^{(t)}) }
$$
The algorithm is
\begin{itemize}
\item If $a \ge 1$ then the new state is accepted.
\item Otherwise, the new state is accepted with probability $a$.
\item If the step is accepted, we set $x^{(t+1)}=x'$.
\item If the step is rejected, then we set $x^{(t+1)}=x^{(t)}$. (The rejected points are not discarded.)
\end{itemize}
If the
proposal density is a simple symmetrical density such as a Gaussian centred on
the current point, then the factor ${Q( x^{(t)},x') \over Q(x', x^{(t)}) }$ is unity, and the Metropolis-Hastings
method simply involves comparing the value of the target density at the two
points, {\it i.e.}\ $a={P^*(x') \over P^*(x^{(t)})}$.
Rule of thumb: lower bound on number of iterations of a
Metropolis method. If the largest length scale of the space of
probable states is $L$, a Metropolis method whose proposal distribution
generates a random walk with step size $\epsilon$ must be run for at
least $T \sim (L/\epsilon)^2$ iterations to obtain an independent sample. This rule of thumb gives only a lower bound; the situation may be much
worse, if, for example, the probability distribution consists of several islands
of high probability separated by regions of low probability.
This is good news because, unlike the
cases of rejection sampling and importance sampling, there is no catastrophic
dependence on the dimensionality $N$ in Metropolis-Hastings. Roughly, $T \sim (\sigma_{\rm max}/\sigma_{\rm min})^2$ from the rule of thumb. But this quadratic dependence on the lengthscale ratio may still force us
to make very lengthy simulations. We will discuss how to suppress the random walks in Monte Carlo simulations later.
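A minimal Metropolis sketch with a symmetric Gaussian proposal, so that the $Q$ factors cancel as noted above (the target density and the step size are illustrative):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def p_star(x):
    # Unnormalized bimodal target (an arbitrary illustrative choice).
    return np.exp(-(x - 3) ** 2 / 2) + 0.5 * np.exp(-(x + 3) ** 2 / 2)

eps = 1.0  # step size of the Gaussian proposal
x = 0.0
chain = []
for t in range(50_000):
    x_new = x + eps * rng.normal()   # symmetric proposal: Q factor is unity
    a = p_star(x_new) / p_star(x)    # a = P*(x') / P*(x^(t))
    if a >= 1 or rng.uniform() < a:  # accept with probability min(1, a)
        x = x_new
    chain.append(x)                  # rejected steps repeat the old state
\end{verbatim}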
\subsection{Gibbs sampling}
In the general case of a system with K variables, a single iteration involves
sampling one parameter at a time.
$$
x^{(t+1)}_1 \sim P(x_1|x^{(t)}_2,x^{(t)}_3,x^{(t)}_4,...,x^{(t)}_K),
$$
$$
x^{(t+1)}_2 \sim P(x_2|x^{(t)}_1,x^{(t)}_3,x^{(t)}_4,...,x^{(t)}_K),
$$
$$
\cdots
$$
Gibbs sampling suffers from the same defect as simple Metropolis algorithms - the state space is explored by a slow random walk, unless a fortuitous parameterization
has been chosen that makes the probability distribution $P(x)$
separable. The progress made by Gibbs sampling is slow when $ L \gg \epsilon$. However, Gibbs sampling involves no adjustable parameters, so it is an attractive
strategy when one wants to get a model running quickly.
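A minimal sketch for a case where the conditionals are known exactly: a zero-mean bivariate Gaussian with correlation $\rho$ (my own illustrative example). When $\rho$ is close to 1, the conditional width plays the role of $\epsilon$ while the long axis plays the role of $L$, and the slow random walk is visible in the chain:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
rho = 0.95  # correlation of the target bivariate Gaussian

x1, x2 = 0.0, 0.0
samples = []
for t in range(20_000):
    # x1 | x2 ~ Normal(rho * x2, 1 - rho^2), and symmetrically for x2.
    x1 = rho * x2 + np.sqrt(1 - rho ** 2) * rng.normal()
    x2 = rho * x1 + np.sqrt(1 - rho ** 2) * rng.normal()
    samples.append((x1, x2))
\end{verbatim}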
\subsection{Markov chain}
We denote by $p^{(t)}(\vect x)$ the probability
distribution of the state of a Markov chain simulator. Our aim is to find a Markov chain such that as $t \rightarrow \infty$, $p^{(t)}(\vect x)$ tends
to the desired distribution $P(x)$.
A Markov chain can be specified by an initial probability distribution $p^{(0)}(\vect x)$ and a transition probability $T(\vect x' ; \vect x)$.
Then,
$$
p^{(t+1)}(\vect x') = \int d^N \vect x ~T(\vect x' ; \vect x)\,p^{(t)}(\vect x)
$$
Look at \href{https://github.com/physhik/Study-of-David-Mackay-s-book-/blob/master/reproduceFigure29.14.pdf/}{the link} for a simple example of Markov chain and Metropolis demonstration of section 29.4 of \href{http://www.inference.phy.cam.ac.uk/mackay/itila/}{the book of David Mackay.}
The Markov chain must have the following properties.
1. The desired distribution $P(\vect x)$ is an invariant distribution of the chain's transition. An invariant distribution is an eigenvector of the transition probability matrix with eigenvalue 1.
2. The chain must also be ergodic, that is, $p^{(t)}(\vect x) \rightarrow \pi(\vect x)$ as $t \rightarrow \infty$, for any initial $p^{(0)}(\vect x)$.
A chain can fail to be ergodic if the transition matrix is reducible or if the chain is periodic.
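A small numerical check of these two properties (the $3\times 3$ transition matrix is an arbitrary ergodic example whose columns sum to one):
\begin{verbatim}
import numpy as np

# T[i, j] = probability of moving to state i from state j.
T = np.array([[0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])

p = np.array([1.0, 0.0, 0.0])  # arbitrary initial distribution p^(0)
for t in range(100):
    p = T @ p                  # p^(t+1) = T p^(t)

w, v = np.linalg.eig(T)        # the invariant distribution has eigenvalue 1
pi = np.real(v[:, np.argmax(np.real(w))])
pi /= pi.sum()
print(p, pi)                   # the iterates converge to the eigenvector
\end{verbatim}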
\subsection{Exercise 29.7}
Detailed balance condition.
$$
T(\vect x_a ; \vect x_b)P(\vect x_b) = T(\vect x_b ; \vect x_a)P(\vect x_a),
$$
for all $\vect x_a$ and $\vect x_b$.
Then,
$$
\int d^N x_b~T(\vect x_a ; \vect x_b)P(\vect x_b) = \int d^N x_b~T(\vect x_b ; \vect x_a)P(\vect x_a),
$$
for all $\vect x_a$.
Then, using (29.45) of the book,
$$
\int d^N x_b ~d^N x_c~B(\vect x_a ; \vect x_c)B(\vect x_c ; \vect x_b)P(\vect x_b) =\int d^N x_b ~d^N x_c~B(\vect x_b ; \vect x_c)B(\vect x_c ; \vect x_a)P(\vect x_a),
$$
for all $\vect x_a$.
Both the left-hand and the right-hand side equal $P(\vect x_a)$ for all $\vect x_a$.
\subsection{Slice sampling}
A single transition $(x,u) \rightarrow (x',u')$ of a one-dimensional slice sampling
algorithm has the following steps, of which steps 3 and 8 will require further
elaboration.
1. evaluate $P^*(x)$
2. draw a vertical coordinate $u' \sim$ Uniform$(0,P^*(x))$
3. create a horizontal interval $(x_l, x_r)$ enclosing $x$
~~3a. draw $r \sim$ Uniform$(0,1)$
~~3b. $x_l := x-rw$
~~3c. $x_r := x+(1-r)w$
~~3d. while $(P^*(x_l) > u')$ $\{x_l := x_l-w\}$
~~3e. while $(P^*(x_r) > u')$ $\{x_r:= x_r+w\}$
4. loop $\{$
5. ~~~~~~~~~ draw $x' \sim$ Uniform$(x_l,x_r)$
6. ~~~~~~~~~ evaluate $P^*(x')$
7. ~~~~~~~~~ if $P^*(x') >u'$ break out of loop 4-9
8. ~~~~~~~~~ else modify the interval $(x_l,x_r)$
~~8a. ~~~~~~~~~if $(x' > x)\{x_r:=x'\}$
~~8b. ~~~~~~~~~else $\{x_l:=x'\}$
9. $\}$
$$
$$
Like a standard Metropolis method, slice sampling gets around by a random
walk, but whereas in the Metropolis method the choice of the step size is critical to the rate of progress, in slice sampling the step size is self-tuning. If
the initial interval size $w$ is too small by a factor $f$ compared with the width of
the probable region, then the stepping-out procedure expands the interval size.
The cost of this stepping-out is only linear in $f$, whereas in the Metropolis
method the computer time scales as the square of $f$ if the step size is too
small.
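A direct Python transcription of the pseudocode above (the unnormalized Gaussian target in the example is an illustrative choice):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def slice_sample(p_star, x, w, n_samples):
    """One-dimensional slice sampling, following the steps listed above."""
    samples = []
    for _ in range(n_samples):
        u = rng.uniform(0, p_star(x))  # vertical level under P*(x)
        r = rng.uniform(0, 1)
        xl, xr = x - r * w, x + (1 - r) * w
        while p_star(xl) > u:          # step out to the left
            xl -= w
        while p_star(xr) > u:          # step out to the right
            xr += w
        while True:
            x_new = rng.uniform(xl, xr)
            if p_star(x_new) > u:      # accept and move on
                x = x_new
                break
            elif x_new > x:            # otherwise shrink the interval
                xr = x_new
            else:
                xl = x_new
        samples.append(x)
    return np.array(samples)

out = slice_sample(lambda y: np.exp(-y ** 2 / 2), x=0.0, w=1.0, n_samples=10_000)
\end{verbatim}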
\subsubsection{Exercise 29.10}
Problem : Investigate the properties of slice sampling applied to the
density function $P^*(x)$.
$$
P^*(x)= \left\{
\begin{array}{ll}
10 & \mbox{for } 0\leq x<1\\
1 & \mbox{for } 1\leq x \leq 11\\
\end{array}
\right.
$$
$x$ is a real variable between 0.0 and 11.0.
How long does it take typically for slice sampling to get from an $x$ in
the peak region $x \in (0,~ 1)$ to an $x$ in the tail region $x \in (1,~11)$, and vice
versa? Confirm that the probabilities of these transitions do yield an
asymptotic probability density that is correct.
$$
$$
Answer : I have explained the answer
\href{https://github.com/physhik/Study-of-David-Mackay-s-book-/blob/master/Exercise29.10.ipynb}{here (click it)}.
\subsection{Practicalities}
{\it Can we predict how long a Markov chain Monte Carlo simulation
will take to equilibrate?}
By considering the random walks involved in a
Markov chain Monte Carlo simulation we can obtain simple lower bounds on
the time required for convergence. But predicting this time more precisely is a
difficult problem, and most of the theoretical results giving upper bounds on
the convergence time are of little practical use. The exact sampling methods,
discussed later, offer a solution to this problem for certain Markov chains.
$$
$$
{\it Can we diagnose or detect convergence in a running simulation?}
This is also a difficult problem. There are a few practical tools available, but
none of them is perfect (Cowles and Carlin, 1996).
$$
$$
{\it Can we speed up the convergence time and time between independent samples of a Markov chain Monte Carlo method? }
Here, there is
good news, as described in the next chapter, which describes the Hamiltonian
Monte Carlo method, overrelaxation, and simulated annealing.
$$
$$
{\it Can the normalizing constant be evaluated?}
If the target density $P(\vect x)$ is given in the form of an unnormalized density
$P^*(\vect x)$ with $P(\vect x) = P^*(\vect x) /Z $, the value of $Z$ may well be of interest. Monte
Carlo methods do not readily yield an estimate of this quantity, and it is an
area of active research to find ways of evaluating it. Techniques for evaluating
$Z$ include:
$$
$$
1. Importance sampling (reviewed by Neal (1993b)) and annealed importance
sampling (Neal, 1998).
$$
$$
2. `Thermodynamic integration' during simulated annealing, the `acceptance ratio' method, and `umbrella sampling' (reviewed by Neal (1993b)).
$$
$$
3. `Reversible jump Markov chain Monte Carlo' (Green, 1995).
$$
$$
One way of dealing with $Z$, however, may be to find a solution to one's
task that does not require that $Z$ be evaluated. In Bayesian data modelling
one might be able to avoid the need to evaluate $Z$ (which would be important
for model comparison) by not having more than one model. Instead of using
several models (differing in complexity, for example) and evaluating their relative
posterior probabilities, one can make a single hierarchical model having,
for example, various continuous hyperparameters which play a role similar to
that played by the distinct models (Neal, 1996). In noting the possibility of
not computing Z, I am not endorsing this approach. The normalizing constant
Z is often the single most important number in the problem, and I think every
effort should be devoted to calculating it.
\section{Efficient Monte Carlo Methods}
\subsection{Summary of Monte Carlo methods}
In high-dimensional problems the only satisfactory methods are those
based on Markov chains, such as the Metropolis method, Gibbs sampling
and slice sampling. Gibbs sampling is an attractive method because
it has no adjustable parameters but its use is restricted to cases
where samples can be generated from the conditional distributions. Slice
sampling is attractive because, whilst it has step-length parameters, its
performance is not very sensitive to their values.
Simple Metropolis algorithms and Gibbs sampling algorithms, although
widely used, perform poorly because they explore the space by a slow
random walk. The next chapter will discuss methods for speeding up
Markov chain Monte Carlo simulations.
Slice sampling does not avoid random walk behaviour, but it automatically
chooses the largest appropriate step size, thus reducing the bad
effects of the random walk compared with, say, a Metropolis method
with a tiny step size.
\subsection{Hamiltonian Monte Carlo}
The Hamiltonian Monte Carlo method is a Metropolis method, applicable
to continuous state spaces, that makes use of gradient information to reduce
random walk behaviour.
Using Algorithm 30.1 (Figure \ref{HMC}), figure 30.2 of the book is reproduced in \href{https://github.com/physhik/Study-of-David-Mackay-s-book-/blob/master/Exercise29.10.ipynb}{this IPython notebook (click it)}.
\begin{figure}[h!]
\centerline{\includegraphics[scale=0.6]{algorithmhmc.png}}
\caption{Algorithm 30.1 of the book. Octave source
code for the Hamiltonian Monte
Carlo method.}\label{HMC}
\end{figure}
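A minimal Python rendering of the same idea, leapfrog integration followed by a Metropolis accept/reject on the total energy, is sketched below. It is my own transcription in the spirit of Algorithm 30.1, not the book's Octave code, and the quadratic energy is an illustrative choice:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def hmc(E, grad_E, x0, eps=0.055, L=19, n_samples=1000):
    """Hamiltonian Monte Carlo with leapfrog steps."""
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.normal(size=x.shape)        # fresh Gaussian momentum
        H_old = E(x) + 0.5 * p @ p
        x_new = x.copy()
        for _ in range(L):                  # leapfrog integration
            p -= 0.5 * eps * grad_E(x_new)  # half step in momentum
            x_new += eps * p                # full step in position
            p -= 0.5 * eps * grad_E(x_new)  # half step in momentum
        H_new = E(x_new) + 0.5 * p @ p
        if H_new <= H_old or rng.uniform() < np.exp(H_old - H_new):
            x = x_new                       # Metropolis accept/reject
        samples.append(x.copy())
    return np.array(samples)

# Illustrative target: a strongly correlated 2-D Gaussian, E(x) = x^T A x / 2.
A = np.array([[250.25, -249.75], [-249.75, 250.25]])
out = hmc(lambda x: 0.5 * x @ A @ x, lambda x: A @ x, x0=np.zeros(2))
\end{verbatim}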
\subsection{Over-relaxation}
The method of overrelaxation is a method for reducing random walk behaviour
in Gibbs sampling. Overrelaxation was originally introduced for systems in
which all the conditional distributions are Gaussian.
In Adler's overrelaxation method, one instead samples $x^{(t+1)}_i$
from a Gaussian that is biased to the opposite side of the conditional distribution.
If the conditional distribution of $x_i$ is Normal($\mu$, $\sigma^2$) and the current value of
$x_i$ is $x^{(t)}_i$, then Adler's method sets $x_i$ to
$$
x^{(t+1)}_i = \mu +\alpha (x^{(t)}_i - \mu)+ (1-\alpha^2)^{1/2}\sigma \nu
$$
where $\nu$ $\sim$ Normal(0, 1) and $\alpha$ is a parameter between -1 and 1, usually set to
a negative value. (If $\alpha$ is positive, then the method is called under-relaxation.)
\subsubsection{Exercise 30.2}
Show that this individual transition leaves invariant the conditional
distribution $x_i \sim$ Normal($\mu$, $\sigma^2$).
$$
$$
Answer : $x^{(t+1)}_i = \mu +\alpha (x^{(t)}_i - \mu)+ (1-\alpha^2)^{1/2}\sigma \nu$.
$$
$$
The new $x^{(t+1)}_i$ is a linear combination of Gaussian variables, so its distribution is still Gaussian. Its mean is $\mu + \alpha(\mu-\mu) + 0 = \mu$, and its variance is $\alpha^2 \sigma^2 + (1-\alpha^2)\sigma^2 = \sigma^2$, so the standard deviation is $\sigma$ as well. To compute the mean and variance, you can use $ \nu \equiv {(x-\mu) \over \sigma}$.
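A quick numerical check of this invariance (all parameter values are arbitrary):
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, alpha = 2.0, 3.0, -0.98

x = mu + sigma * rng.normal(size=1_000_000)  # x ~ Normal(mu, sigma^2)
nu = rng.normal(size=x.shape)
x_new = mu + alpha * (x - mu) + np.sqrt(1 - alpha ** 2) * sigma * nu

print(x_new.mean(), x_new.std())  # close to mu and sigma: invariant
\end{verbatim}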
\section{Ising Model}
\subsection{3 reasons for relevance}
1. Studying the Ising model is useful for understanding the phase transitions of various systems.
$$
$$
2. The Hopfield network and the Boltzmann machine in neural networks are generalized forms of the Ising model.
$$