@def title = "Hutch++ | Raphael A. Meyer" @def hascode = true
\enabletheorems \newcounter{NumAlgorithms}
\newcommand{\trace}{\text{tr}}
At SOSA 2021, I presented the paper Hutch++: Optimal Stochastic Trace Estimation (with Cameron Musco, Christopher Musco, and David P. Woodruff). This webpage contains some short updates that may be useful to people.
Table of Contents:
\tableofcontents
I believe that the analysis of both Hutchinson's Estimator and Hutch++ is simple and approachable. This section contains an analysis that an advanced undergraduate should hopefully be able to understand.
\begin{definition}{Hutchinson's Estimator}{}
Fix $\ell$ i.i.d. random vectors $\vx_1,\ldots,\vx_\ell \sim \mathcal{N}(0,\eye)$. Then Hutchinson's Estimator is $H_\ell(\mA) = \frac{1}{\ell}\sum_{i=1}^{\ell} \vx_i^\intercal \mA \vx_i$.
We can easily verify two properties of this classical estimator for symmetric
\begin{lemma}{Hutchinson's Mean and Variance}{}
In order to analyze the variance, we present the analysis of Lemma 9 from \cite{avron2011randomized}.
Let $\mA = \mU\mLambda\mU^\intercal$ be the eigendecomposition of $\mA$, and let $\vy = \mU^\intercal\vx$, which is also distributed as $\mathcal{N}(0,\eye)$ by the rotation invariance of the Gaussian. Then,
\begin{align} \Var[H_\ell(\mA)] &= \frac1\ell \Var[\vx^\intercal\mA\vx] \\ &= \frac1\ell \Var[\vx^\intercal\mU\mLambda\mU^\intercal\vx] \\ &= \frac1\ell \Var[(\mU^\intercal\vx)^\intercal\mLambda(\mU^\intercal\vx)] \\ &= \frac1\ell \Var[\vy^\intercal\mLambda\vy] \\ &= \frac1\ell \Var\left[\sum_{i=1}^n \lambda_i y_i^2\right] \\ &= \frac1\ell \sum_{i=1}^n \Var[\lambda_i y_i^2] \\ &= \frac1\ell \sum_{i=1}^n 2\lambda_i^2 \\ &= \frac2\ell \onormof{\mA}{F}^2 \end{align}
Note we cannot just use linearity of variance naively on the sum; it applies here because the entries of $\vy = \mU^\intercal\vx$ are independent.
This variance is in terms of the Frobenius norm of
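Both the mean and the variance above are easy to check numerically. A minimal sketch (assuming numpy, with Gaussian test vectors to match the rotation-invariance step in the derivation; the function name `hutchinson` is ours):

```python
import numpy as np

def hutchinson(A, ell, rng):
    """H_ell(A): average of x^T A x over ell independent Gaussian vectors x."""
    X = rng.standard_normal((A.shape[0], ell))
    return np.sum(X * (A @ X)) / ell

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = B @ B.T  # a symmetric test matrix

# Repeat the estimator many times: the sample mean should match trace(A),
# and the sample variance should match (2 / ell) * ||A||_F^2.
ell = 10
runs = np.array([hutchinson(A, ell, rng) for _ in range(20000)])
print(runs.mean(), np.trace(A))                            # nearly equal
print(runs.var(), 2 / ell * np.linalg.norm(A, 'fro')**2)   # nearly equal
```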
\begin{lemma}{Hutchinson's Estimator}{hutch-rate}
Fix
So, if we want to ensure
This was a section in the first draft of the paper, but was unfortunately cut. This is a slow, methodical intuition for the geometry Hutch++ takes advantage of.
Hutchinson's Estimator is powerful, and gives a nice rate of convergence. However, the proof of \lemmaref{hutch-rate} has a suspicious step. If we write out three lines of analysis: \begin{align} |\trace(\mA) - H_\ell(\mA)| &\leq \tsfrac{\sqrt 2}{\sqrt \ell} \normof{\mA}_F & \hspace{1cm} & (\text{Standard Deviation})\\ &\leq \tsfrac{\sqrt 2}{\sqrt \ell} \trace(\mA) & & (\normof{\mA}_F \leq \trace(\mA)) \\ &= \eps \ \trace(\mA) & & (\ell = O(\tsfrac{1}{\eps^2})) \end{align}
Looking at these inequalities,
- The analysis of line 1 is always tight, since we know the variance of $H_\ell(\mA)$ exactly.
- The analysis of line 3 is always tight, since we just set $\ell$ to the smallest value that gets error $\eps$.
- The analysis of line 2 is not always tight.
The second line is tight only if
Try to show this yourself. If you're having trouble, look at the argument below.
\end{claim}
\begin{dropdown}{Proof}
\begin{proof}
First, we rewrite
We now need to show that if a vector
There are many intuitions here; we will look at the extreme case, and assume that
We now finish the discussion with a comparison of two ideas:
- So, Hutchinson's Estimator only needs to use many samples (i.e. $O(\frac{1}{\eps^2})$ samples) if $\mA$ has very special structure: it has a small number of large eigenvalues.
- If a matrix has a small number of large eigenvalues, the trace must be well approximated by the sum of those eigenvalues.
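To make this dichotomy concrete, here is a tiny numerical example (a sketch assuming numpy; the two hand-picked spectra have the same trace):

```python
import numpy as np

n = 10000
flat = np.ones(n)                  # n equal eigenvalues, trace = n
spiky = np.zeros(n); spiky[0] = n  # one huge eigenvalue, trace = n

for name, lam in [("flat", flat), ("spiky", spiky)]:
    tr = lam.sum()                   # trace(A) = sum of eigenvalues
    fro = np.sqrt((lam**2).sum())    # ||A||_F = l2 norm of eigenvalues
    print(name, fro / tr)
# flat:  ||A||_F / trace(A) = 1/sqrt(n), so Hutchinson is already very accurate
# spiky: ||A||_F / trace(A) = 1, but the top eigenvalue alone gives the trace
```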
With this, we conclude the fundamental intuition behind Hutch++:
\begin{claim}{Hutch++ Intuition}{l2-l1-l0-intuit}
If
This leads us to pick the following rough design of an algorithm:
- Find a good low-rank approximation $\tilde\mA_k$
- Notice that $\trace(\mA) = \trace(\tilde \mA_k) + \trace(\mA - \tilde\mA_k)$
- Compute $\trace(\tilde \mA_k)$ exactly
- Approximate $\trace(\mA-\tilde\mA_k)$ with Hutchinson's Estimator
- Return $\text{Hutch++}(\mA) = \trace(\tilde\mA_k) + H_\ell(\mA-\tilde\mA_k)$
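The steps above can be sketched in a few lines (assuming numpy; the function name `hutchpp`, the even split of queries, and the sign-vector sketch are illustrative choices, not the paper's reference implementation):

```python
import numpy as np

def hutchpp(A, num_queries, rng):
    """Sketch of Hutch++: exact trace on a low-rank part, Hutchinson on the rest."""
    d = A.shape[0]
    k = num_queries // 3
    # Step 1: low-rank approximation via a randomized range finder:
    # Q is an orthonormal basis for the range of A S.
    S = rng.choice([-1.0, 1.0], size=(d, k))
    Q, _ = np.linalg.qr(A @ S)
    # Steps 2-3: trace(Q Q^T A) = trace(Q^T A Q), computed exactly.
    exact_part = np.trace(Q.T @ A @ Q)
    # Step 4: Hutchinson's estimator on the residual (I - Q Q^T) A.
    G = rng.choice([-1.0, 1.0], size=(d, num_queries - 2 * k))
    G = G - Q @ (Q.T @ G)  # project the test vectors off the range of Q
    residual_part = np.sum(G * (A @ G)) / G.shape[1]
    # Step 5: combine.
    return exact_part + residual_part
```

Projecting the test vectors off the range of $\mQ$ keeps the residual estimator unbiased, since $\trace((\eye-\mQ\mQ^\intercal)\mA(\eye-\mQ\mQ^\intercal)) = \trace((\eye-\mQ\mQ^\intercal)\mA)$.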
In the next section, we state a formal version of \claimref{l2-l1-l0-intuit} (see \lemmaref{l2-l1-l0}), show how to compute such a matrix
Special thanks to Joel A. Tropp for working most of this out, and encouraging us to look into sharpening the constants in the analysis of Hutch++.
By expressing the variance of Hutch++ with all of its constants laid bare, and by using a very simple analysis, we hope to make it easy for practitioners to choose the exact parameters in Hutch++ code.
Before bounding the variance of Hutch++, we include the proof of one lemma and import another theorem:
\begin{lemma}{L2/L1/L0 Bounds}{l2-l1-l0}
Let
\begin{dropdown}{Proof: Lemma 7 from \cite{gilbert2007one}}
\begin{proof}
Let
Hölder's inequality states that
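The $\ell_2/\ell_1$ half of this lemma, in the form $\|\vlambda - \vlambda_k\|_2 \leq \|\vlambda\|_1 / (2\sqrt{k})$ for the best $k$-term approximation $\vlambda_k$, is easy to probe numerically. A minimal sketch (assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    # Sorted magnitudes play the role of eigenvalues lambda_1 >= lambda_2 >= ...
    lam = np.sort(np.abs(rng.standard_normal(200)))[::-1]
    for k in (1, 5, 20):
        tail = lam[k:]  # everything except the k largest entries
        assert np.linalg.norm(tail) <= lam.sum() / (2 * np.sqrt(k))
print("||tail||_2 <= ||lambda||_1 / (2 sqrt(k)) held in every trial")
```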
We also import the following theorem, which controls the expected error from our low-rank approximation, and which we do not prove:
\begin{theorem}{Expected Projection Error}{tropp-error}
Sample
Now, we proceed to the analysis of the variance of Hutch++:
\begin{theorem}{Variance of Hutch++}{thm-hutchpp-variance}
Fix parameters
We first look at the expectation by using the Tower Law, the expectation of Hutchinson's estimator, the cyclic property of trace, and the linearity of trace:
\begin{align}
\E[\text{Hutch++}(\mA)]
&= \E[\trace(\mQ^\intercal\mA\mQ) + H_\ell((\eye-\mQ\mQ^\intercal)\mA)] \\
&= \E[\trace(\mQ^\intercal\mA\mQ)] + \E\left[\E[H_\ell((\eye-\mQ\mQ^\intercal)\mA)|\mQ]\right] \\
&= \E[\trace(\mQ^\intercal\mA\mQ)] + \E[\trace((\eye-\mQ\mQ^\intercal)\mA)] \\
&= \E[\trace(\mA\mQ\mQ^\intercal)] + \E[\trace(\mA-\mQ\mQ^\intercal\mA)] \\
&= \E[\trace(\mA\mQ\mQ^\intercal)+\trace(\mA-\mQ\mQ^\intercal\mA)] \\
&= \trace(\mA)
\end{align}
We now move onto the Variance bound.
First, recall the Conditional Variance Formula, which says that for any random variables $X$ and $Y$, $\Var[X] = \E[\Var[X \mid Y]] + \Var[\E[X \mid Y]]$.
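For concreteness, conditioning on the sketch $\mQ$ and combining the Conditional Variance Formula with the Hutchinson variance bound from the first section gives a sketch of the first step of the argument (using that $\E[\text{Hutch++}(\mA) \mid \mQ] = \trace(\mA)$ is constant, so its variance vanishes):
\begin{align}
\Var[\text{Hutch++}(\mA)]
&= \E\left[\Var[\text{Hutch++}(\mA) \mid \mQ]\right] + \Var\left[\E[\text{Hutch++}(\mA) \mid \mQ]\right] \\
&= \E\left[\Var[H_\ell((\eye-\mQ\mQ^\intercal)\mA) \mid \mQ]\right] + 0 \\
&\leq \frac{2}{\ell} \, \E\left[\normof{(\eye-\mQ\mQ^\intercal)\mA}_F^2\right]
\end{align}
So the variance is controlled entirely by the expected error of the low-rank approximation, which is exactly what \theoremref{tropp-error} bounds.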
The variance in \theoremref{thm-hutchpp-variance} still has two parameters in it,
which leave a bit of ambiguity for practitioners.
So, we quickly analyze the choice of
Formally, suppose we are allowed to compute exactly
- Computing $\mQ$ uses $k+p=2k+1$ products
- Computing $\trace(\mQ^\intercal\mA\mQ)$ uses $k+p=2k+1$ products
- Computing $H_\ell((\eye-\mQ\mQ^\intercal)\mA)$ uses $\ell$ products

So, we have $m=2(2k+1)+\ell=4k+2+\ell$. We can then verify that $k=\frac{m-2}{8}$ and $\ell=\frac{m}{2}-1$ minimize the variance. Notably, this is equivalent to setting $\ell=4k$, which looks more intuitive. This produces a final variance of $\Var[\text{Hutch++}(\mA)]\leq\frac{16}{(m-2)^2}\trace^2(\mA)$.
One additional approach to implementing Hutch++ for PSD matrices uses the Nyström method, where we take the approximation
\[
\tilde\mA_k = (\mA\mQ)(\mQ^\intercal\mA\mQ)^{-1}(\mA\mQ)^{\intercal}
\]
Note that we can compute
\begin{algorithm}{NYS-Hutch++}
input: Matrix-Vector Oracle access to PSD matrix $\mA\in\bbR^{d\times d}$, number of queries $m$
output: Approximation to $\trace(\mA)$
- Sample $\mS\in\bbR^{d \times \frac{m}{4}}$ and $\mG\in\bbR^{d \times \frac{m}{2}}$ with i.i.d. $\{+1,-1\}$ entries.
- Compute an orthonormal basis $\mQ\in\bbR^{d \times \frac{m}{4}}$ for the span of $\mA\mS$
- Compute $\mY=\mA\mQ$ and $\mR=\mY^\intercal\mG$
- Return $\trace((\mQ^\intercal\mY)^{-1}(\mY^\intercal\mY)) + \frac{2}{m}\left(\trace(\mG^\intercal\mA\mG) - \trace(\mR^\intercal(\mQ^\intercal\mY)^{-1}\mR) \right)$
\end{algorithm}
Or, in MATLAB:

```matlab
function trace_est=simple_nystrom_hutchplusplus(A, num_queries)
  S = 2*randi(2,size(A,1),ceil(num_queries/4))-3;
  G = 2*randi(2,size(A,1),floor(num_queries/2))-3;
  [Q,~] = qr(A*S,0);
  Y = A*Q;
  R = Y' * G;
  QYinv = pinv(Q' * Y);
  trace_est = trace(QYinv * (Y'*Y)) + 1/size(G,2)*(trace(G'*A*G) - trace(R'*QYinv*R));
end
```
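For readers working outside MATLAB, the same routine ports directly to Python. A sketch assuming numpy (this port is ours, not code from the paper):

```python
import numpy as np

def simple_nystrom_hutchplusplus(A, num_queries, rng=None):
    """Sketch of NYS-Hutch++ for a PSD matrix A, ported from the MATLAB listing."""
    rng = np.random.default_rng() if rng is None else rng
    d = A.shape[0]
    S = rng.choice([-1.0, 1.0], size=(d, num_queries // 4))
    G = rng.choice([-1.0, 1.0], size=(d, num_queries // 2))
    Q, _ = np.linalg.qr(A @ S)        # orthonormal basis for range(A S)
    Y = A @ Q
    R = Y.T @ G
    QY_inv = np.linalg.pinv(Q.T @ Y)  # pseudo-inverse for numerical stability
    # Trace of the Nystrom part, plus Hutchinson's estimator on the residual.
    return np.trace(QY_inv @ (Y.T @ Y)) + (
        np.trace(G.T @ (A @ G)) - np.trace(R.T @ QY_inv @ R)
    ) / G.shape[1]
```

On a PSD matrix whose rank is at most $m/4$, the Nyström term should recover the trace essentially exactly, with the residual term nearly zero.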
- We provide code on GitHub, written in MATLAB.
- Akshay Agrawal also has an implementation on GitHub using PyTorch, though we haven't verified this code ourselves.
- Email me if you want to add your implementation to the list!
If you have read Hutch++, then quiz yourself by trying to answer some intuitive high-level questions:
- Why is the constant-factor approximation on our low-rank approximation $|A-\tilde{A}_k|_F \leq 2 |A-A_k|_F$ sufficient? Why don't we need a relative error approximation like $|A-\tilde{A}_k|_F \leq (1+\varepsilon) |A-A_k|_F$? \begin{right} Give an intuitive answer, not a mathematical one. \end{right}
[^highprobability]: We analyze variance in this page for convenience, but all of these confidence intervals hold with high probability (i.e. with logarithmic dependence on the failure probability), and this is widely analyzed in the formal publications.
- \biblabel{gilbert2007one}{Gilbert et al. (2007)} Gilbert, Strauss, Tropp, and Vershynin. One sketch for all: fast algorithms for compressed sensing. STOC 2007.
- \biblabel{halko2011finding}{Halko et al. (2011)} Halko, Martinsson, and Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 53, no. 2, 2011.
- \biblabel{avron2011randomized}{Avron, Toledo (2011)} Avron and Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM (JACM) 58, no. 2, 2011.
\theoremscripts