\section{Data as Matrices}\label{pre:dam}
Matrices are ubiquitous in data\hyp{}related fields, such as computer science,
statistics, and machine learning. From different perspectives, a matrix can
represent a variety of objects:
\begin{itemize}
\item A simple and compact way of storing information. For example, each
column of a matrix corresponds to a data point and each row stands for a
feature.
\item A linear map $A\Colon\calV\to\calU$ from one vector space to another
through matrix\hyp{}vector multiplication $v\mapsto u = Av$.
\item A bilinear map $A\Colon\calV\times\calU\to \BBR$ such that $(v,
u)\mapsto u^TAv$.
\end{itemize}
Most importantly, these representations are compatible with one another, and it
is this capacity for multiple simultaneous interpretations that makes matrices
such a powerful tool in applications. For instance, an adjacency matrix
$A\in\BBR^{N\Times N}$ can describe an unweighted graph where $A_{ij} = 1$ if
nodes $i$ and $j$ are connected by an edge, and $0$ otherwise. Each row of $A$
can be viewed as the connectivity pattern of the corresponding node.
Alternatively, applying $A$ to a vector representing the initial condition of a
path\hyp{}counting process captures the dynamics on this graph. Finally, the
bilinear map of $A$ on $\calV = \calU = \{0, 1\}^N$ measures the number of edges
from one set of nodes to the other and thus can be used in graph partitioning
tasks.
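
To make these simultaneous interpretations concrete, the following sketch (a
minimal NumPy illustration on a hypothetical four\hyp{}node path graph, not code
used elsewhere in this dissertation) reads node connectivity from the rows of
$A$, counts paths of length two with $A^2$, and counts cut edges with the
bilinear form.
\begin{verbatim}
# Sketch: one adjacency matrix, three interpretations (4-node path graph 0-1-2-3).
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# 1. Storage: row i lists the neighbors of node i.
neighbors_of_1 = np.nonzero(A[1])[0]          # -> array([0, 2])

# 2. Linear map: (A @ A)[i, j] counts paths of length two from node j to node i.
paths_len2 = A @ A

# 3. Bilinear map: u^T A v counts edges from the node set indicated by v
#    to the node set indicated by u.
u = np.array([1.0, 1.0, 0.0, 0.0])            # nodes {0, 1}
v = np.array([0.0, 0.0, 1.0, 1.0])            # nodes {2, 3}
cut_edges = u @ A @ v                         # -> 1.0 (only edge 1-2 crosses)
\end{verbatim}
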
Therefore, it is inadequate to categorize data matrices solely by their linear
algebraic roles. Instead, we introduce the matrices encountered in this
dissertation by the properties of the datasets they encapsulate. Regardless of
the type of data matrix we are dealing with, the most critical task is
identifying its intrinsic numerical properties, so that we can adapt our
algorithms to specific applications.
\paragraph{Matrices for Features}
An object\hyp{}feature matrix $A\in\BBR^{M\Times N}$ encodes $N$ data points as
column vectors, each with $M$ features. Many algorithms involve linear algebraic
operations on data, and such a representation naturally facilitates their
computation. For example, data compression and dimensionality reduction
techniques, such as frequency sampling and principal component analysis, are
usually implemented as low\hyp{}rank decompositions of data matrices. Neural
networks can be viewed as a sequence of linear transformations and
element\hyp{}wise operations, which rely on matrix computations for efficient
parallelization. As a result, matrices, together with their higher\hyp{}order
generalizations, are easily the most popular mathematical objects for keeping
track of data during computation.
In this dissertation, the majority of matrices are of this type, even though
they are not always the focus of study. For example, multi\hyp{}output Gaussian
process regression uses object\hyp{}feature matrices for both input and output.
In topic modeling, the topic\hyp{}word matrix is a latent matrix of this type,
and plays a crucial role in interpreting the model. Finally, the main matrix of
interest in \cref{ch6} is formed by taking electron orbital pairs as objects and
the real\hyp{}space discretization as features.
\paragraph{Matrices for Correlations}
This type of matrix describes the pairwise relationship of objects rather
than object\hyp{}feature relationships. Examples in this dissertation include
the co\hyp{}occurrence matrix in \cref{ch5}, which indicates the probability of
two objects occurring together in a document, and the kernel matrices in
\cref{ch4}, which measure the covariance between two points for a Gaussian
process. The former can be derived from the document\hyp{}object matrix, which
suggests that this type of data matrix can sometimes be viewed as
higher\hyp{}order information extracted from object\hyp{}feature matrices.
Modeling the correlation between objects has various benefits and is widely used
in statistical models. In our case, it leads to uncertainty quantification,
a theoretical optimality guarantee, and algorithmic robustness.
\paragraph{Matrices for Dynamics}
The last type of data matrix is generally associated with linear
time\hyp{}invariant systems. These matrices appear in \cref{ch6} both
explicitly as the discretization of the Laplace operator, and implicitly as
integral operators---e.g., the Hartree\hyp{}Fock operator. Here, we focus on
exploiting structural information to compress these matrices and gain
computational efficiency. On the other hand, in \cref{ch3} the graph Laplacian
and the random walk matrix describe the dynamics on the original graph, which we
use to reveal structural properties. There is a deep connection between the
dynamics on an object and its structure. In general, we can leverage our
knowledge of one to gain valuable insights into the other.
\section{Functions of Matrices}\label{pre:fom}
The main mathematical problems in both \cref{ch3,ch4} reduce to estimating key
quantities of functions of matrices. In this dissertation, a
matrix function $f(A)$ is produced by a spectral mapping of $f$ on a matrix
$A\in\BBR^{N\Times N}$. \citet{higham2008functions} gives a rigorous and general
definition of $f(A)$, which only requires $f$ to be \emph{defined}\footnote{
Defined is a formal term here. Given a matrix $A$ with distinct
eigenvalues $\lambda_1,\cdots,\lambda_s$ and $n_i$ the size of the largest
Jordan block for $\lambda_i$, the function $f$ is defined on the spectrum of $A$
if $f^{(j)}(\lambda_i)$ exists for $j = 0\Colon n_i-1$ and $i = 1\Colon s$.
Please refer to \cite[Definition~1.1]{higham2008functions} for a detailed
discussion.} on the spectrum of $A$. However, we mostly work with a special
case, where $A$ is Hermitian and $f$ is analytic over the spectrum of $A$. Given
the eigendecomposition $A = V^T\Lambda V$ where $V$ is orthogonal and $\Lambda =
\diag(\lambda_1,\cdots,\lambda_N)$ is diagonal, the spectral mapping theorem
tells us
\begin{equation}\label{eqn:smt}
f(A) = f(V^T\Lambda V) = V^T f(\Lambda)V = V^T\diag(f(\lambda_1),\cdots,f(\lambda_N))V\,.
\end{equation}
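
For small Hermitian matrices, \cref{eqn:smt} can be implemented directly. The
sketch below (NumPy/SciPy, for illustration only; note that
\texttt{numpy.linalg.eigh} returns eigenvectors as columns, so the factorization
reads $A = V\Lambda V^T$) forms $f(A)$ from an eigendecomposition and checks it
against a dense matrix exponential.
\begin{verbatim}
# Sketch: direct O(N^3) evaluation of f(A) via the spectral mapping theorem.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
B = rng.standard_normal((200, 200))
A = (B + B.T) / (2.0 * np.sqrt(200))   # symmetric test matrix, spectral radius O(1)

lam, V = np.linalg.eigh(A)             # columns of V are eigenvectors: A = V diag(lam) V^T
f = np.exp                             # any function defined on the spectrum of A
fA = (V * f(lam)) @ V.T                # V diag(f(lam)) V^T

assert np.allclose(fA, expm(A))        # matches the dense matrix exponential
\end{verbatim}
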
Numerical methods for computing functions of matrices are important in many
applications, such as differential equations, Markov models, and network
science. The straightforward approach in \cref{eqn:smt} takes an $\calO(N^3)$
eigendecomposition, which is not always feasible. Many tailored algorithms have
been developed for a few specific but prevalent functions, like the matrix
square root and matrix exponential \cite[Chapter 6 \& 10]{higham2008functions}.
In a remarkable work, \citet{moler2003nineteen} presented nineteen ways of
computing the matrix exponential, comparing these in terms of both efficiency
and stability.
Given the context in this dissertation, we adopt two methods for evaluating
functions of matrices. The first is a polynomial expansion method in the
Chebyshev polynomial basis. The Chebyshev polynomials $T_m(x)$ can be defined
with the recurrence:
\begin{equation}\label{eqn:cheb_3term}
T_0(x) = 1,\quad T_1(x) = x,\quad T_{m+1}(x) = 2xT_m(x) - T_{m-1}(x)\,.
\end{equation}
The corresponding expansion is $f(x) \approx \tilde{f}(x) =
\sum_{m=0}^{M}c_mT_m(x)$. For functions analytic in a neighborhood of the
expansion interval, the approximation error decays geometrically with the number
of terms, and for merely smooth functions it still decays faster than any fixed
polynomial rate~\cite{Trefethen-2013-ATAP}. Many functions we deal with satisfy
these conditions, thereby making this method very attractive in our
applications. To
evaluate the matrix function $f(A)$, we take advantage of the three\hyp{}term
recurrence relation
\begin{equation}\label{eqn:3term_fom}
T_{m+1}(A) = 2AT_m(A)-T_{m-1}(A)\,,
\end{equation}
and avoid storing all $T_m(A)$ at once. When we are only interested in applying
$f(A)$ to a vector $v$, we can further reduce the computation to just
calculating $T_m(A)v$ for $m = 1,\cdots, M$, and exploit fast matrix\hyp{}vector
products with $A$ if possible.
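
A minimal sketch of this procedure is given below, assuming the spectrum of the
symmetric matrix lies in a known interval $[\lambda_{\min}, \lambda_{\max}]$
that is mapped to $[-1,1]$ before applying the recurrence; the function and
helper names are hypothetical, and the coefficients are those of the Chebyshev
interpolant at first\hyp{}kind points.
\begin{verbatim}
# Sketch: approximate f(A) v with a degree-M Chebyshev expansion, using only
# matrix-vector products with the symmetric matrix A (spectrum in [lmin, lmax]).
import numpy as np

def chebyshev_coefficients(f, M):
    """Coefficients c_0..c_M of the Chebyshev interpolant of f on [-1, 1]."""
    k = np.arange(M + 1)
    theta = np.pi * (k + 0.5) / (M + 1)        # first-kind points x_k = cos(theta_k)
    fx = f(np.cos(theta))                      # f must accept NumPy arrays
    c = 2.0 / (M + 1) * np.cos(np.outer(k, theta)) @ fx
    c[0] /= 2.0
    return c

def chebyshev_fAv(matvec, v, f, lmin, lmax, M):
    """Approximate f(A) v via the three-term recurrence."""
    # Expand g(x) = f(shift + scale * x), which maps [-1, 1] back to [lmin, lmax].
    scale, shift = (lmax - lmin) / 2.0, (lmax + lmin) / 2.0
    c = chebyshev_coefficients(lambda x: f(scale * x + shift), M)
    # Matvec with the shifted and scaled operator, whose spectrum lies in [-1, 1].
    tilde_matvec = lambda x: (matvec(x) - shift * x) / scale
    t_prev, t_curr = v, tilde_matvec(v)        # T_0 v and T_1 v
    result = c[0] * t_prev + c[1] * t_curr
    for m in range(2, M + 1):
        t_prev, t_curr = t_curr, 2.0 * tilde_matvec(t_curr) - t_prev
        result += c[m] * t_curr
    return result
\end{verbatim}
With a fast black\hyp{}box product \texttt{matvec}, a call such as
\texttt{chebyshev\_fAv(matvec, v, np.exp, lmin, lmax, 50)} costs $M$
matrix\hyp{}vector products plus $\calO(NM)$ additional work.
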
The second method is the Gauss quadrature and Lanczos (GQL) algorithm proposed by
\citet{golub1997matrices}. It is based on the Lanczos algorithm, which produces
the decomposition
\begin{equation}\label{eqn:lan_decomp}
AZ_M = Z_M\Gamma_M + r_Me_M^T\,,
\end{equation}
where $Z_M^TZ_M = I_M$, $Z_M^Tr_M = 0$, and $\Gamma_M$ is tridiagonal. If $A$ is
numerically low\hyp{}rank, or has relatively few distinct eigenvalues, only a
small number of Lanczos iterations is needed to obtain the approximation
$A\approx Z_M\Gamma_MZ_M^T$. Afterwards, we can quickly compute $f(\Gamma_M)$
through eigendecomposition since $\Gamma_M$ is $M$\hyp{}by\hyp{}$M$ and
tridiagonal. The final approximation is $f(A) \approx Z_Mf(\Gamma_M)Z_M^T$.
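
The sketch below illustrates the Lanczos\hyp{}based approximation of $f(A)v$
with full reorthogonalization and no breakdown handling; it is a simplified
stand\hyp{}in for the GQL machinery of \citet{golub1997matrices}, not a
reproduction of it.
\begin{verbatim}
# Sketch: Lanczos decomposition A Z_M = Z_M Gamma_M + r_M e_M^T with full
# reorthogonalization, then f(A) v ~= Z_M f(Gamma_M) Z_M^T v.
import numpy as np

def lanczos_fAv(matvec, v, f, M):
    n = v.shape[0]
    Z = np.zeros((n, M))
    alpha = np.zeros(M)                        # diagonal of Gamma_M
    beta = np.zeros(max(M - 1, 0))             # off-diagonal of Gamma_M
    Z[:, 0] = v / np.linalg.norm(v)
    for m in range(M):
        w = matvec(Z[:, m])
        alpha[m] = Z[:, m] @ w
        w -= alpha[m] * Z[:, m]
        if m > 0:
            w -= beta[m - 1] * Z[:, m - 1]
        w -= Z[:, :m + 1] @ (Z[:, :m + 1].T @ w)   # full reorthogonalization
        if m < M - 1:                              # no breakdown handling here
            beta[m] = np.linalg.norm(w)
            Z[:, m + 1] = w / beta[m]
    Gamma = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
    lam, U = np.linalg.eigh(Gamma)                 # cheap: Gamma is M-by-M tridiagonal
    f_Gamma = (U * f(lam)) @ U.T
    e1 = np.zeros(M)
    e1[0] = np.linalg.norm(v)                      # Z_M^T v = ||v|| e_1
    return Z @ (f_Gamma @ e1)
\end{verbatim}
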
Both methods we introduce here have clear advantages and drawbacks. The
Chebyshev expansion method has strong theoretical guarantees on the convergence
rate for $f$ with some degree of smoothness over the spectrum of $A$. Many terms
may be required to accurately represent less smooth functions. On the other
hand, GQL is well suited for matrices with a few distinct eigenvalues, which
allows the Lanczos algorithm to converge quickly. However, it requires extra
computation, such as reorthogonalization, to ensure numerical stability in
floating\hyp{}point arithmetic.
Therefore, it is crucial for us to determine the numerical characteristics of
the matrices we are interested in, and choose the appropriate method for each
application.
\section{Stochastic Estimation}\label{pre:ste}
Along with the approximation of matrix functions, stochastic estimation is
one of the key tools in this dissertation. Given a linear operator
$A\in\BBR^{N\Times N}$, often available only in implicit form through
matrix\hyp{}vector products, stochastic estimators allow us to efficiently
extract the trace and diagonal by probing the operator with random vectors.
The stochastic trace estimator was first proposed by~
\citet{hutchinson1990stochastic}, who used the technique to approximate the
trace of the influence matrix for Laplacian smoothing splines. In his case,
applying the influence matrix $A$ to a vector requires solving an expensive
smoothing spline problem, so obtaining all diagonal elements exactly --- by
calculating $e_i^TAe_i$ for $i=1,\cdots, N$, with $e_i$ the $i$\hyp{}th column of
the identity matrix --- carries a significant cost. Instead, Hutchinson applied
$A$ to a set of independent probe vectors $z$ whose entries $z_i$ are i.i.d. with
mean $0$ and variance $1$. Consequently,
\begin{equation}\label{eq:probe_trace}
\BBE[z^TAz] = \sum_{i,j}A_{ij}\BBE[z_iz_j] = \tr(A)\,.
\end{equation}
Choosing $N_z$ independent probe vectors $Z_j$, we obtain the unbiased estimator
\begin{equation}\label{eqn:trace_est}
\tr(A) = \BBE[z^TAz] \approx \frac{1}{N_z}\sum_{j=1}^{N_z}Z_j^TAZ_j\,.
\end{equation}
Hutchinson also showed that sampling the entries of $Z_j$ uniformly from
$\{-1, +1\}$ (a symmetric Bernoulli, or Rademacher, distribution) leads to
minimal variance in the estimated trace.
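
A minimal implementation of the estimator in \cref{eqn:trace_est} with
Rademacher probes might look as follows (illustrative NumPy code, not from this
dissertation).
\begin{verbatim}
# Sketch: Hutchinson trace estimator with Rademacher (+/-1) probe vectors.
import numpy as np

def hutchinson_trace(matvec, n, num_probes, seed=None):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)   # i.i.d. entries, mean 0, variance 1
        total += z @ matvec(z)                # one sample of z^T A z
    return total / num_probes

# Sanity check against the exact trace of a dense test matrix.
rng = np.random.default_rng(0)
B = rng.standard_normal((500, 500))
A = B @ B.T
estimate = hutchinson_trace(lambda x: A @ x, n=500, num_probes=200, seed=1)
print(estimate, np.trace(A))                  # should agree to within a few percent
\end{verbatim}
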
Since then, \citet{avron2011randomized} reviewed many possible choices of probes
for \cref{eq:probe_trace,eq:probe_diag}; another common choice is vectors with
independent standard normal entries. \citet{bekas2007estimator} extended the idea to diagonal estimation by
observing
\begin{equation}\label{eq:probe_diag}
\BBE[z\odot Az] = \diag(A)\,,
\end{equation}
where $\odot$ represents the Hadamard (elementwise) product. Let
$Z\in\BBR^{N\Times N_z}$ be the matrix whose columns are independent probe
vectors. Inspecting the $k$\hyp{}th diagonal element gives
\begin{equation}\label{eqn:exact_cond}
A_{kk} = \BBE[z_k(Az)_k] \approx \frac{1}{N_z}\sum_{i}A_{ki}\sum_{j=1}^{N_z}Z_{kj}Z_{ij}\,.
\end{equation}
Based on \cref{eqn:exact_cond}, \citet{bekas2007estimator} proposed the exact
condition for the diagonal estimator.
\begin{theorem}[Exact Condition for the Diagonal Estimator]\label{thm:exact_cond}
Let $Z\in\BBR^{N\times N_z}$ be the matrix whose columns are the probe vectors of
the diagonal estimator in \cref{eqn:exact_cond}, and let $Z_{i\ast}$ denote the
$i$\hyp{}th row of $Z$. The estimate of $A_{kk}$ is exact if
$\norm{Z_{k\ast}} = 1$ and $Z_{k\ast}^TZ_{i\ast} = 0$ whenever $A_{ki}\neq 0$ for
$i\neq k$.
\end{theorem}
The direct consequence of \cref{thm:exact_cond} is that we can design a
specific set of probe vectors satisfying the exact condition by solving a graph
coloring problem. Treating $A$ as the adjacency matrix of a graph, with the
non\hyp{}zero off\hyp{}diagonal entries as edges, a graph coloring partitions the
nodes into disjoint subsets such that no edge connects two nodes in the same
subset. If we assign each subset a distinct row vector chosen from an
orthonormal set, the resulting $Z$ yields the exact diagonal through the same
calculation as in \cref{eqn:exact_cond}. Although obtaining an optimal coloring
is NP\hyp{}hard, a greedy coloring algorithm based on breadth\hyp{}first search
can produce reasonable results for some applications
\cite[Theorem~28.33]{arumugam2016handbook}.
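
The sketch below illustrates this construction with a simple sequential greedy
coloring (rather than the BFS\hyp{}based variant cited above): each color class
is mapped to a standard basis vector, so the rows of $Z$ assigned to adjacent
nodes are orthogonal and the probing formula recovers the diagonal exactly.
\begin{verbatim}
# Sketch: exact diagonal recovery with probe vectors built from a greedy coloring.
import numpy as np

def greedy_coloring(A):
    """Sequential greedy coloring: adjacent nodes (A_ki != 0) get distinct colors."""
    n = A.shape[0]
    colors = -np.ones(n, dtype=int)
    for k in range(n):
        taken = {int(colors[i]) for i in np.nonzero(A[k])[0]
                 if i != k and colors[i] >= 0}
        c = 0
        while c in taken:
            c += 1
        colors[k] = c
    return colors

def diagonal_by_coloring(A):
    colors = greedy_coloring(A)
    Z = np.zeros((A.shape[0], colors.max() + 1))
    Z[np.arange(A.shape[0]), colors] = 1.0    # row k is a standard basis vector
    # Rows have unit norm and rows of adjacent nodes are orthogonal, so the
    # probing formula is exact (no 1/N_z averaging is needed for these probes).
    return (Z * (A @ Z)).sum(axis=1)

# Sanity check on a small sparse symmetric matrix.
rng = np.random.default_rng(0)
A = np.triu(rng.standard_normal((50, 50)) * (rng.random((50, 50)) < 0.1))
A = A + A.T + np.diag(rng.standard_normal(50))
assert np.allclose(diagonal_by_coloring(A), np.diag(A))
\end{verbatim}
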
Finally, we also attempted to use a control variate to improve the stochastic
estimation. In \cref{ch4}, where the goal is to estimate the predictive
variance, we take the more accurate results from a pivoted Cholesky
approximation on a few data points as a control variate. Although the variance
reduction turns out to be modest, the resulting estimator remains unbiased.
\section{Interpretable Low\hyp{}Rank Approximations}\label{pre:lra}
Low\hyp{}rank structure of matrices plays an essential role in many of our
methods. If a matrix $A\in\BBR^{N\times N}$ has rank $k$, it can be
decomposed into a product of two smaller matrices $W\in\BBR^{N\times k}$ and
$H\in\BBR^{k\times N}$:
\begin{equation}\label{eqn:low_rank}
A = WH\,.
\end{equation}
Hence, we are able to store $A$ efficiently and multiply it with a vector in
$\calO(Nk)$ rather than $\calO(N^2)$ operations. Furthermore, the columns of $W$
provide useful insight into the data, since they form a basis that spans the
columns of $A$. Low\hyp{}rank approximation is one of the most
well\hyp{}studied problems in linear algebra, and it is widely used in data
science and machine learning. High\hyp{}dimensional data matrices encountered in
these fields commonly have low numerical rank. Correspondingly, researchers
have proposed low\hyp{}rank models for a wide variety of applications, such as
principal component analysis, natural language processing, and recommender
systems. \citet{udell2019big} formalized this notion by showing that a nice
latent variable model produces data that can be well approximated by
low\hyp{}rank matrices.
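
The computational savings are immediate to realize in code: a rank\hyp{}$k$
factorization is stored as the pair $(W, H)$ and never multiplied out, as in the
following minimal NumPy sketch.
\begin{verbatim}
# Sketch: store a rank-k matrix as the pair (W, H) and never form A explicitly.
import numpy as np

rng = np.random.default_rng(0)
N, k = 10_000, 20
W = rng.standard_normal((N, k))               # N*k numbers
H = rng.standard_normal((k, N))               # k*N numbers, instead of N*N for A
v = rng.standard_normal(N)

u = W @ (H @ v)                               # O(Nk) matvec; never compute W @ H
\end{verbatim}
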
Among the extensive literature on computing and exploiting low\hyp{}rank
structures, there is a growing interest in interpretability. Low\hyp{}rank
approximation via the truncated singular value decomposition optimally
approximates the matrix, but often fails to preserve intrinsic properties of
datasets, e.g., non\hyp{}negativity, sparsity, and convexity. Thus, many
constrained low\hyp{}rank approximation algorithms have been developed that
retain these properties, and allow us to interpret the output in a natural
setting. A prominent example is the non\hyp{}negative matrix approximation, for
which $W$ and $H$ are required to be element\hyp{}wise non\hyp{}negative.
\citet{lee1999learning} applied it to images of faces, and the resulting factors
can be visualized as components of facial structure. In \cref{ch5}, the topic
model is also a non\hyp{}negative low\hyp{}rank decomposition, and we are able
to analyze each topic as a probability distribution over words.
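
For illustration, a small non\hyp{}negative factorization can be computed with
an off\hyp{}the\hyp{}shelf solver such as scikit\hyp{}learn's \texttt{NMF}; the
data and parameters below are synthetic, and this is a generic example rather
than the topic model developed in \cref{ch5}.
\begin{verbatim}
# Sketch: a small non-negative factorization with scikit-learn's NMF solver.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
A = rng.random((100, 40))                     # non-negative synthetic data

model = NMF(n_components=5, init="nndsvda", max_iter=500)
W = model.fit_transform(A)                    # 100 x 5, element-wise non-negative
H = model.components_                         # 5 x 40, element-wise non-negative
print(np.linalg.norm(A - W @ H) / np.linalg.norm(A))   # relative residual
\end{verbatim}
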
One attractive approach to interpretable low\hyp{}rank approximations
forces $W$ to consist of columns from $A$. This allows us to describe all data
points in terms of a small subset, maintaining the fundamental characteristics.
The CUR matrix decomposition, a well\hyp{}known example of this type, is often
preferred over singular value decomposition in certain applications because of
its efficiency and interpretability. The biggest challenge here is to select a
good subset of data points to obtain an accurate representation. The column
subset selection problem is also a heavily discussed topic in numerical linear
algebra \cite{boutsidis2009improved,deshpande2010efficient}. One of the most
popular building blocks for solving it is QR factorization with column pivoting
(QRCP). The column pivoting procedure generally gives QR factorization the
rank\hyp{}revealing property. Combined with randomized sampling techniques, QRCP
is able to effectively select columns that lead to near\hyp{}optimal
low\hyp{}rank approximations. Both \cref{ch5,ch6} tackle the column subset
selection problem, but from different perspectives. \Cref{ch5} develops a
scalable pre\hyp{}processing framework to overcome the robustness issues of
QRCP on noisy datasets. \Cref{ch6} exploits the physical attributes of the
system to replace the pivots from QRCP with centroids from a weighted
K\hyp{}means algorithm. The resulting decomposition is much more efficient
without any loss in accuracy.
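
As a closing illustration, the sketch below selects $k$ columns with QRCP (via
\texttt{scipy.linalg.qr} with \texttt{pivoting=True}) and fits the remaining
columns by least squares; the test matrix and its perturbation level are
hypothetical.
\begin{verbatim}
# Sketch: column subset selection with QR with column pivoting (QRCP).
import numpy as np
from scipy.linalg import qr, lstsq

rng = np.random.default_rng(0)
# Synthetic matrix with numerical rank ~30 plus a small perturbation.
A = rng.standard_normal((300, 30)) @ rng.standard_normal((30, 200))
A += 1e-6 * rng.standard_normal((300, 200))

k = 30
_, _, piv = qr(A, mode="economic", pivoting=True)
cols = piv[:k]                     # indices of the k selected columns
C = A[:, cols]                     # interpretable basis: actual data points
X, *_ = lstsq(C, A)                # coefficients expressing all columns in terms of C
rel_err = np.linalg.norm(A - C @ X) / np.linalg.norm(A)
print(cols, rel_err)               # rel_err should be on the order of the perturbation
\end{verbatim}
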