Merge pull request #507 from aai-institute/feature/506-nystroem-pcg
Feature/506 nystroem pcg
mdbenito authored Mar 20, 2024
2 parents d22cbb4 + d3a163a commit 8a5399e
Showing 12 changed files with 818 additions and 196 deletions.
29 changes: 19 additions & 10 deletions CHANGELOG.md
@@ -4,16 +4,23 @@

### Added

- Implement new method: `NystroemSketchInfluence`
[PR #504](https://github.com/aai-institute/pyDVL/pull/504)
- Add property `model_dtype` to instances of type `TorchInfluenceFunctionModel`
- Implement a preconditioned block variant of conjugate gradient
[PR #507](https://github.com/aai-institute/pyDVL/pull/507)

### Fixed

- Bug in `LissaInfluence` when not using CPU device
  [PR #495](https://github.com/aai-institute/pyDVL/pull/495)
- Memory issue with `CgInfluence` and `ArnoldiInfluence`
  [PR #498](https://github.com/aai-institute/pyDVL/pull/498)
- Raise a specific error message with install instructions when trying to load
  `pydvl.utils.cache.memcached` without `pymemcache` installed.
  If `pymemcache` is available, all symbols from `pydvl.utils.cache.memcached`
  are available through `pydvl.utils.cache`
  [PR #509](https://github.com/aai-institute/pyDVL/pull/509)

### Miscellaneous

@@ -35,9 +42,10 @@
### Fixed

- Bug in using `DaskInfluenceCalculator` with `TorchNumpyConverter`
  for single dimensional arrays
  [PR #485](https://github.com/aai-institute/pyDVL/pull/485)
- Fix `to` methods of `TorchInfluenceFunctionModel` implementations
  [PR #487](https://github.com/aai-institute/pyDVL/pull/487)
- Fixed bug with checking for converged values in semivalues
[PR #341](https://github.com/appliedAI-Initiative/pyDVL/pull/341)

@@ -56,7 +64,8 @@
- New influence function interface `InfluenceFunctionModel`
- Data parallel computation with `DaskInfluenceCalculator`
[PR #26](https://github.com/aai-institute/pyDVL/issues/26)
- Sequential batch-wise computation and write to disk with
`SequentialInfluenceCalculator`
[PR #377](https://github.com/aai-institute/pyDVL/issues/377)
- Adapt notebooks to new influence abstractions
[PR #430](https://github.com/aai-institute/pyDVL/issues/430)
30 changes: 30 additions & 0 deletions docs/assets/pydvl.bib
@@ -13,6 +13,23 @@ @article{agarwal_secondorder_2017
langid = {english}
}

@article{bekas_estimator_2007,
title = {An Estimator for the Diagonal of a Matrix},
author = {Bekas, C. and Kokiopoulou, E. and Saad, Y.},
date = {2007-11-01},
journaltitle = {Applied Numerical Mathematics},
shortjournal = {Applied Numerical Mathematics},
series = {Numerical {{Algorithms}}, {{Parallelism}} and {{Applications}} (2)},
volume = {57},
number = {11},
pages = {1214--1229},
issn = {0168-9274},
doi = {10.1016/j.apnum.2007.01.003},
url = {https://www.sciencedirect.com/science/article/pii/S0168927407000244},
urldate = {2024-03-19},
abstract = {A number of applications require to compute an approximation of the diagonal of a matrix when this matrix is not explicitly available but matrix–vector products with it are easy to evaluate. In some cases, it is the trace of the matrix rather than the diagonal that is needed. This paper describes methods for estimating diagonals and traces of matrices in these situations. The goal is to obtain a good estimate of the diagonal by applying only a small number of matrix–vector products, using selected vectors. We begin by considering the use of random test vectors and then explore special vectors obtained from Hadamard matrices. The methods are tested in the context of computational materials science to estimate the diagonal of the density matrix which holds the charge densities. Numerical experiments indicate that the diagonal estimator may offer an alternative method that in some cases can greatly reduce computational costs in electronic structures calculations.}
}

@article{benmerzoug_re_2023,
title = {[{{Re}}] {{If}} You like {{Shapley}}, Then You'll Love the Core},
author = {Benmerzoug, Anes and Delgado, Miguel de Benito},
@@ -345,6 +362,19 @@ @inproceedings{schoch_csshapley_2022
keywords = {notion}
}

@book{trefethen_numerical_1997,
title = {Numerical {{Linear Algebra}}},
author = {Trefethen, Lloyd N. and Bau, III, David},
date = {1997-01},
publisher = {{Society for Industrial and Applied Mathematics}},
location = {Philadelphia, PA},
doi = {10.1137/1.9780898719574},
url = {https://epubs.siam.org/doi/book/10.1137/1.9780898719574},
urldate = {2024-03-19},
isbn = {978-0-89871-361-9 978-0-89871-957-4},
langid = {english}
}

@inproceedings{wang_data_2022,
title = {Data {{Banzhaf}}: {{A Robust Data Valuation Framework}} for {{Machine Learning}}},
shorttitle = {Data {{Banzhaf}}},
90 changes: 71 additions & 19 deletions docs/influence/influence_function_model.md
@@ -1,7 +1,9 @@
In almost every practical application it is not possible to construct, let
alone invert, the complete Hessian in memory. pyDVL offers several
implementations of the interface [InfluenceFunctionModel
][pydvl.influence.base_influence_function_model.InfluenceFunctionModel],
which do not compute the full Hessian (in contrast to [DirectInfluence
][pydvl.influence.torch.influence_function_model.DirectInfluence]).


### Conjugate Gradient
Expand All @@ -11,26 +13,47 @@ method that does not require the explicit inversion of the Hessian. Instead, it
only requires the calculation of Hessian-vector products, making it a good
choice for large datasets or models with many parameters. It is nevertheless
much slower to converge than the direct inversion method and not as accurate.

More info on the theory of conjugate gradient can be found on
[Wikipedia](https://en.wikipedia.org/wiki/Conjugate_gradient_method), or in
textbooks such as [@trefethen_numerical_1997, Lecture 38].
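
For reference, a minimal (unpreconditioned) conjugate gradient loop for
$(H + \lambda I) x = b$ can be sketched in a few lines. This is an
illustration only, not pyDVL's implementation; `hvp` stands for any callable
computing Hessian-vector products, and the function name is ours:

```python
import torch

def conjugate_gradient(hvp, b, reg=0.0, maxiter=100, rtol=1e-7):
    # Solve (H + reg * I) x = b for a symmetric positive definite H,
    # touching H only through Hessian-vector products.
    x = torch.zeros_like(b)
    r = b.clone()                 # residual b - (H + reg*I) x for x = 0
    p = r.clone()                 # search direction
    rs_old = r.dot(r)
    for _ in range(maxiter):
        ap = hvp(p) + reg * p
        alpha = rs_old / p.dot(ap)
        x = x + alpha * p
        r = r - alpha * ap
        rs_new = r.dot(r)
        if torch.sqrt(rs_new) <= rtol * torch.linalg.norm(b):
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```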

pyDVL also implements a stable block variant of the conjugate
gradient method, defined in [@ji_breakdownfree_2017], which solves several
right-hand sides simultaneously.

Optionally, the user can provide a pre-conditioner to improve convergence, such
as a [Jacobi pre-conditioner
][pydvl.influence.torch.pre_conditioner.JacobiPreConditioner], which
is a simple [diagonal pre-conditioner](
https://en.wikipedia.org/wiki/Preconditioner#Jacobi_(or_diagonal)_preconditioner)
based on Hutchinson's diagonal estimator [@bekas_estimator_2007],
or a [Nyström approximation based pre-conditioner
][pydvl.influence.torch.pre_conditioner.NystroemPreConditioner],
described in [@frangella_randomized_2023].

```python
from pydvl.influence.torch import CgInfluence
from pydvl.influence.torch.pre_conditioner import NystroemPreConditioner

if_model = CgInfluence(
model,
loss,
hessian_regularization=0.0,
x0=None,
rtol=1e-7,
atol=1e-7,
maxiter=None,
use_block_cg=True,
pre_conditioner=NystroemPreConditioner(rank=10)
)
if_model.fit(train_loader)
```

The additional optional parameters `rtol`, `atol`, `maxiter`, `use_block_cg`
and `pre_conditioner` are, respectively, the relative tolerance, the absolute
tolerance, the maximum number of iterations, a flag indicating whether to use
the block variant of CG, and an optional pre-conditioner.
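
For intuition, the diagonal estimate behind the [JacobiPreConditioner
][pydvl.influence.torch.pre_conditioner.JacobiPreConditioner] can be sketched
with Hutchinson's estimator in plain PyTorch. This is an illustration only,
not pyDVL's implementation; `hvp` denotes any Hessian-vector product callable
and the function name is ours:

```python
import torch

def hutchinson_diagonal(hvp, dim, num_samples=100, dtype=torch.float64):
    # Estimate diag(H) from Hessian-vector products only: for Rademacher
    # vectors v with i.i.d. +/-1 entries, E[v * (H v)] equals diag(H)
    # [@bekas_estimator_2007].
    estimate = torch.zeros(dim, dtype=dtype)
    for _ in range(num_samples):
        v = torch.sign(torch.randn(dim, dtype=dtype))  # random +/-1 entries
        estimate += v * hvp(v)
    return estimate / num_samples
```

A Jacobi pre-conditioner then rescales the CG residuals by the reciprocal of
this (regularized) diagonal estimate.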


### Linear time Stochastic Second-Order Approximation (LiSSA)
Expand Down Expand Up @@ -62,6 +85,7 @@ if_model = LissaInfluence(
h0=None,
rtol=1e-4,
)
if_model.fit(train_loader)
```

with the additional optional parameters `maxiter`, `dampen`, `scale`, `h0`, and
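
To see where `scale` and `dampen` enter, the LiSSA update can be sketched as a
simple fixed-point iteration. This is an illustration only, not pyDVL's
implementation; `hvp` stands for any Hessian-vector product callable and the
function name is ours:

```python
import torch

def lissa_inverse_hvp(hvp, v, scale=10.0, dampen=0.0, maxiter=1000, rtol=1e-4):
    # Iterate x_{k+1} = v + x_k - (H x_k + dampen * x_k) / scale. If the
    # iteration contracts, its fixed point is scale * (H + dampen * I)^{-1} v.
    x = v.clone()
    for _ in range(maxiter):
        x_new = v + x - (hvp(x) + dampen * x) / scale
        if torch.linalg.norm(x_new - x) <= rtol * torch.linalg.norm(x):
            x = x_new
            break
        x = x_new
    return x / scale  # undo the scaling to obtain (H + dampen * I)^{-1} v
```
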
@@ -94,25 +118,50 @@ if_model = ArnoldiInfluence(
rank_estimate=10,
tol=1e-6,
)
if_model.fit(train_loader)
```

### Eigenvalue Corrected K-FAC

K-FAC, short for Kronecker-Factored Approximate Curvature, is a method that
approximates the Fisher information matrix [FIM](https://en.wikipedia.org/wiki/Fisher_information) of a model.
It is possible to show that for classification models with appropriate loss
functions the FIM is equal to the Hessian of the model’s loss over the dataset.
In this restricted but nonetheless important context K-FAC offers an efficient
way to approximate the Hessian and hence the influence scores.
For more info and details refer to the original paper [@martens_optimizing_2015].

The K-FAC method is implemented in the class [EkfacInfluence
][pydvl.influence.torch.influence_function_model.EkfacInfluence].
The following code snippet shows how to use the K-FAC method to calculate the
influence function of a model. Note that, in contrast to the other methods for
influence function calculation, K-FAC does not require the loss function as an
input. This is because the current implementation is only applicable to
classification models with a cross entropy loss function.

```python
from pydvl.influence.torch import EkfacInfluence
if_model = EkfacInfluence(
model,
hessian_regularization=0.0,
)
if_model.fit(train_loader)
```

Upon initialization, the K-FAC method will parse the model and extract which
layers require grad and which do not. Then it will only calculate the influence
scores for the layers that require grad. The current implementation of the
K-FAC method is only available for linear layers, and therefore, if the model
contains non-linear layers that require gradients, the K-FAC method will raise
a `NotImplementedLayerRepresentationException`.
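
When a model mixes linear layers with unsupported ones, one pragmatic
workaround is to disable gradients for everything that is not a
`torch.nn.Linear` before fitting, so that only supported layers enter the
computation. The following is a sketch in plain PyTorch, not an official
pyDVL utility:

```python
import torch

# Restrict gradient computation to linear layers so that EkfacInfluence only
# sees layers it can represent.
for module in model.modules():
    if not isinstance(module, torch.nn.Linear):
        for param in module.parameters(recurse=False):
            param.requires_grad_(False)
```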

A further improvement of the K-FAC method is the Eigenvalue Corrected
K-FAC (EKFAC) method [@george_fast_2018], which further re-fits the
eigenvalues of the Hessian, thus providing a more accurate approximation.
On top of the K-FAC method, the EKFAC method is implemented by setting
`update_diagonal=True` when initialising [EkfacInfluence
][pydvl.influence.torch.influence_function_model.EkfacInfluence].
The following code snippet shows how to use the EKFAC method to calculate the
influence function of a model.

```python
from pydvl.influence.torch import EkfacInfluence
@@ -129,10 +178,12 @@ if_model.fit(train_loader)
This approximation is based on a Nyström low-rank approximation of the form

\begin{align*}
H_{\text{nys}} &= (H\Omega)(\Omega^TH\Omega)^{\dagger}(H\Omega)^T \\\
&= U \Lambda U^T,
\end{align*}

where $(\cdot)^{\dagger}$ denotes the [Moore-Penrose inverse](
https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse),
in combination with the [Sherman–Morrison–Woodbury formula](
https://en.wikipedia.org/wiki/Woodbury_matrix_identity) to calculate the
action of its inverse:
@@ -142,8 +193,8 @@ action of its inverse:
\frac{1}{\lambda}(I−UU^T)x,
\end{equation*}

see also [@hataya_nystrom_2023] and [@frangella_randomized_2023]. The essential
parameter is the rank of the approximation.

```python
from pydvl.influence.torch import NystroemSketchInfluence
@@ -156,6 +207,7 @@ if_model = NystroemSketchInfluence(
if_model.fit(train_loader)
```
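
To make the algebra above concrete, the randomized Nyström factorization and
the action of the regularized inverse can be sketched directly against a
callable computing products of the Hessian with a matrix of probe vectors.
This is an illustration only, not pyDVL's internal implementation, and the
function names are ours:

```python
import math
import torch

def randomized_nystroem(hessian_mat_prod, dim, rank, dtype=torch.float64):
    # Sketch H with a Gaussian test matrix Omega and return U, lam with
    # H_nys = U @ diag(lam) @ U.T, using the shifted, Cholesky-based
    # construction of [@frangella_randomized_2023].
    omega = torch.randn(dim, rank, dtype=dtype)
    y = hessian_mat_prod(omega)                    # Y = H @ Omega
    shift = (
        math.sqrt(dim)
        * torch.finfo(dtype).eps
        * torch.linalg.matrix_norm(y, ord=2)       # stabilizing shift
    )
    y_shifted = y + shift * omega
    core = omega.T @ y_shifted                     # Omega^T (Y + shift * Omega)
    c = torch.linalg.cholesky(0.5 * (core + core.T))
    b = torch.linalg.solve_triangular(c, y_shifted.T, upper=False).T
    u, sigma, _ = torch.linalg.svd(b, full_matrices=False)   # H_nys = B B^T
    lam = torch.clamp(sigma**2 - shift, min=0.0)   # remove the shift again
    return u, lam

def nystroem_inverse_action(u, lam, reg, x):
    # Action of (H_nys + reg * I)^{-1} on x, cf. the formula in the text.
    ut_x = u.T @ x
    return u @ (ut_x / (lam + reg)) + (x - u @ ut_x) / reg
```

The essential parameter, as stated above, is the rank of the approximation;
the regularization $\lambda$ plays the role of `reg` in this sketch.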

These implementations represent the calculation logic on in-memory tensors.
To scale up to large collections of data, we map these influence function
models over these collections. For a detailed discussion see the
documentation page [Scaling Computation](scaling_computation.md).
134 changes: 40 additions & 94 deletions notebooks/influence_wine.ipynb


1 change: 1 addition & 0 deletions src/pydvl/influence/torch/__init__.py
@@ -6,3 +6,4 @@
LissaInfluence,
NystroemSketchInfluence,
)
from .pre_conditioner import JacobiPreConditioner, NystroemPreConditioner
78 changes: 56 additions & 22 deletions src/pydvl/influence/torch/functional.py
@@ -48,8 +48,11 @@
"create_per_sample_mixed_derivative_function",
"model_hessian_low_rank",
"LowRankProductRepresentation",
"randomized_nystroem_approximation",
"model_hessian_nystroem_approximation",
]


logger = logging.getLogger(__name__)


@@ -169,8 +172,10 @@ def create_empirical_loss_function(
on a given dataset. If we denote the model parameters with \( \theta \),
the resulting function approximates:
\[
f(\theta) = \frac{1}{N}\sum_{i=1}^N
\operatorname{loss}(y_i, \operatorname{model}(\theta, x_i))
\]
for a loss function $\operatorname{loss}$ and a model $\operatorname{model}$
with model parameters $\theta$, where $N$ is the number of all elements provided
Expand Down Expand Up @@ -848,19 +853,31 @@ def randomized_nystroem_approximation(
) -> LowRankProductRepresentation:
r"""
Given a matrix vector product function (representing a symmetric positive definite
matrix $A$ ), computes a random Nyström low rank approximation of
$A$ in factored form, i.e.
$$ A_{\text{nys}} = (A \Omega)(\Omega^T A \Omega)^{\dagger}(A \Omega)^T
= U \Sigma U^T $$
where $\Omega$ is a standard normal random matrix.
Args:
mat_mat_prod: A callable representing the matrix vector product
input_dim: dimension of the input for the matrix vector product
input_type: data type of the inputs
rank: rank of the approximation
shift_func: optional function for computing the stabilizing shift in the
construction of the randomized nystroem approximation, defaults to
$$ \sqrt{\operatorname{\text{input_dim}}} \cdot
\varepsilon(\operatorname{\text{input_type}}) \cdot \|A\Omega\|_2,$$
where $\varepsilon(\operatorname{\text{input_type}})$ is the value of the
machine precision corresponding to the data type.
mat_vec_device: device where the matrix vector product has to be executed
Returns:
object containing $U$ and $\Sigma$
"""

if shift_func is None:
@@ -924,14 +941,31 @@ def model_hessian_nystroem_approximation(
rank: int,
shift_func: Optional[Callable[[torch.Tensor], torch.Tensor]] = None,
) -> LowRankProductRepresentation:
"""
r"""
Given a model, loss and a data_loader, computes a random Nyström low rank approximation of
the corresponding Hessian matrix in factored form, i.e.
$$ H_{\text{nys}} = (H \Omega)(\Omega^T H \Omega)^{+}(H \Omega)^T
= U \Sigma U^T $$
Args:
model: A PyTorch model instance. The Hessian will be calculated with respect to
this model's parameters.
loss: A callable that computes the loss.
data_loader: A DataLoader instance that provides the model's training data.
Used in calculating the Hessian-vector products.
rank: rank of the approximation
shift_func: optional function for computing the stabilizing shift in the
construction of the randomized nystroem approximation, defaults to
$$ \sqrt{\operatorname{\text{input_dim}}} \cdot
\varepsilon(\operatorname{\text{input_type}}) \cdot \|A\Omega\|_2,$$
where $\varepsilon(\operatorname{\text{input_type}})$ is the value of the
machine precision corresponding to the data type.
Returns:
object containing $U$ and $\Sigma$
"""

model_hvp = create_hvp_function(