\documentclass[12pt]{article}
\usepackage{pmmeta}
\pmcanonicalname{DerivationOfMutualInformation}
\pmcreated{2013-03-22 15:13:38}
\pmmodified{2013-03-22 15:13:38}
\pmowner{tdunning}{9331}
\pmmodifier{tdunning}{9331}
\pmtitle{derivation of mutual information}
\pmrecord{5}{36994}
\pmprivacy{1}
\pmauthor{tdunning}{9331}
\pmtype{Derivation}
\pmcomment{trigger rebuild}
\pmclassification{msc}{94A17}
\endmetadata
% this is the default PlanetMath preamble. as your knowledge
% of TeX increases, you will probably want to edit this, but
% it should be fine as is for beginners.
% almost certainly you want these
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{amsfonts}
% used for TeXing text within eps files
%\usepackage{psfrag}
% need this for including graphics (\includegraphics)
%\usepackage{graphicx}
% for neatly defining theorems and propositions
%\usepackage{amsthm}
% making logically defined graphics
%%%\usepackage{xypic}
% there are many more packages, add them here as you need them
% define commands here
\begin{document}
The maximum likelihood estimator for mutual information is identical (up to a scale factor) to the generalized log-likelihood ratio statistic for multinomials and is closely related to Pearson's
$\chi^2$ test. This implies that observed values of mutual information, computed using maximum likelihood estimates of the probabilities, are asymptotically $\chi^2$ distributed apart from that scale factor.
In particular, suppose we sample each of $X$ and $Y$ and combine the samples to form $N$ tuples drawn from $X \times Y$. Define $T(x, y)$ to be the total number of times the tuple $(x, y)$ was observed. Further, define $T(x, *)$ to be the number of times a tuple starting with $x$ was observed and $T(*, y)$ to be the number of times a tuple ending with $y$ was observed. Clearly, $T(*, *)$ is just $N$, the total number of tuples in the sample. From the definition, the generalized log-likelihood ratio test of independence for $X$ and $Y$ (based on the sample of tuples) is
\begin{equation*}
-2 \log \lambda =
2 \sum_{xy} T(x,y) \log \frac {\pi_{x|y}} {\mu_x}
\end{equation*}
where
\begin{equation*}
\pi_{x|y} = T(x,y) / \sum_x T(x,y)
\end{equation*}
and
\begin{equation*}
\mu_x = T(x,*) / T(*, *)
\end{equation*}
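Substituting these definitions, and noting that $\sum_x T(x,y) = T(*, y)$, the ratio inside the logarithm becomes
\begin{equation*}
\frac{\pi_{x|y}}{\mu_x}
= \frac{T(x,y) / T(*, y)}{T(x,*) / T(*, *)}
= \frac{T(x,y) \, T(*, *)}{T(x,*) \, T(*, y)}.
\end{equation*}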
This allows the log-likelihood ratio to be expressed in terms of row and column sums,
\begin{equation*}
-2 \log \lambda =
2 \sum_{xy} T(x,y)
\log {\frac
{T(x,y) T(*, *)}
{T(x, *) T(*, y) }
}
\end{equation*}
Dividing each count by $N = T(*, *)$ reduces this to an expression in the maximum likelihood estimates of the cell, row and column probabilities, $\pi_{xy} = T(x,y)/N$, $\mu_{x*} = T(x,*)/N$ and $\mu_{*y} = T(*, y)/N$:
\begin{equation*}
-2 \log \lambda =
2 \sum_{xy} T(x,y)
\log {\frac
{\pi_{xy}}
{ \mu_{x*} \mu_{*y} }
}
\end{equation*}
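Expanding the logarithm, writing $T(x,y) = N \pi_{xy}$, and using the marginal identities $\sum_y \pi_{xy} = \mu_{x*}$ and $\sum_x \pi_{xy} = \mu_{*y}$, the sum splits into three entropy-like terms:
\begin{equation*}
-2 \log \lambda =
2 N \sum_{xy} \pi_{xy}
\left( \log \pi_{xy} - \log \mu_{x*} - \log \mu_{*y} \right)
\end{equation*}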
This can be rearranged into
\begin{equation*}
-2 \log \lambda =
2 N \left[
\sum_{xy} \pi_{xy} \log \pi_{xy}
- \sum_{x} \mu_{x*} \log \mu_{x*}
- \sum_{y} \mu_{*y} \log \mu_{*y}
\right] = 2 N \hat I (X;Y)
\end{equation*}
where the hat indicates a maximum likelihood estimate of $I(X;Y)$.
Since $-2 \log \lambda$ is asymptotically a $\chi^2$ deviate with $(|X|-1)(|Y|-1)$ degrees of freedom under independence, this also gives the asymptotic distribution of $\hat I(X;Y)$ as $\frac{1}{2N}$ times such a $\chi^2$ deviate.
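The identity $-2 \log \lambda = 2 N \hat I(X;Y)$ is easy to check numerically. The following is a minimal sketch in Python (the $2 \times 2$ table of counts is hypothetical, and all cells are assumed nonzero so the logarithms are finite); it computes both sides directly from the counts $T(x,y)$:
\begin{verbatim}
import numpy as np

# Hypothetical 2x2 contingency table of counts T(x, y)
T = np.array([[30.0, 10.0],
              [15.0, 45.0]])
N = T.sum()                      # T(*, *) = N

pi = T / N                       # cell MLEs    pi_{xy}
mu_x = pi.sum(axis=1)            # row MLEs     mu_{x*}
mu_y = pi.sum(axis=0)            # column MLEs  mu_{*y}

# Maximum likelihood estimate of the mutual information I(X; Y)
I_hat = (pi * np.log(pi / np.outer(mu_x, mu_y))).sum()

# Generalized log-likelihood ratio statistic -2 log(lambda),
# computed from the raw counts and their row/column sums
G = 2.0 * (T * np.log(T * N / np.outer(T.sum(axis=1),
                                       T.sum(axis=0)))).sum()

print(G, 2 * N * I_hat)          # the two agree up to rounding
\end{verbatim}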
%%%%%
%%%%%
\end{document}