Merge pull request #507 from aai-institute/feature/506-nystroem-pcg
Feature/506 nystroem pcg
mdbenito authored Mar 20, 2024
2 parents d22cbb4 + d3a163a commit 8a5399e
Showing 12 changed files with 818 additions and 196 deletions.
29 changes: 19 additions & 10 deletions CHANGELOG.md
@@ -4,16 +4,23 @@

### Added

- Implement new method: `NystroemSketchInfluence`
[PR #504](https://github.com/aai-institute/pyDVL/pull/504)
- Add property `model_dtype` to instances of type `TorchInfluenceFunctionModel`
- Implement a preconditioned block variant of conjugate gradient
[PR #507](https://github.com/aai-institute/pyDVL/pull/507)

### Fixed

- Bug in `LissaInfluence` when not using CPU device
  [PR #495](https://github.com/aai-institute/pyDVL/pull/495)
- Memory issue with `CgInfluence` and `ArnoldiInfluence`
  [PR #498](https://github.com/aai-institute/pyDVL/pull/498)
- Raise a specific error message with install instructions when trying to load
  `pydvl.utils.cache.memcached` without `pymemcache` installed.
  If `pymemcache` is available, all symbols from `pydvl.utils.cache.memcached`
  are available through `pydvl.utils.cache`
  [PR #509](https://github.com/aai-institute/pyDVL/pull/509)

### Miscellaneous

@@ -35,9 +42,10 @@
### Fixed

- Bug in using `DaskInfluenceCalculator` with `TorchNumpyConverter`
  for single dimensional arrays
  [PR #485](https://github.com/aai-institute/pyDVL/pull/485)
- Fix `to` methods of `TorchInfluenceFunctionModel` implementations
  [PR #487](https://github.com/aai-institute/pyDVL/pull/487)
- Fixed bug with checking for converged values in semivalues
[PR #341](https://github.com/appliedAI-Initiative/pyDVL/pull/341)

@@ -56,7 +64,8 @@
- New influence function interface `InfluenceFunctionModel`
- Data parallel computation with `DaskInfluenceCalculator`
[PR #26](https://github.com/aai-institute/pyDVL/issues/26)
- Sequential batch-wise computation and write to disk with
`SequentialInfluenceCalculator`
[PR #377](https://github.com/aai-institute/pyDVL/issues/377)
- Adapt notebooks to new influence abstractions
[PR #430](https://github.com/aai-institute/pyDVL/issues/430)
30 changes: 30 additions & 0 deletions docs/assets/pydvl.bib
@@ -13,6 +13,23 @@ @article{agarwal_secondorder_2017
langid = {english}
}

@article{bekas_estimator_2007,
title = {An Estimator for the Diagonal of a Matrix},
author = {Bekas, C. and Kokiopoulou, E. and Saad, Y.},
date = {2007-11-01},
journaltitle = {Applied Numerical Mathematics},
shortjournal = {Applied Numerical Mathematics},
series = {Numerical {{Algorithms}}, {{Parallelism}} and {{Applications}} (2)},
volume = {57},
number = {11},
pages = {1214--1229},
issn = {0168-9274},
doi = {10.1016/j.apnum.2007.01.003},
url = {https://www.sciencedirect.com/science/article/pii/S0168927407000244},
urldate = {2024-03-19},
abstract = {A number of applications require to compute an approximation of the diagonal of a matrix when this matrix is not explicitly available but matrix–vector products with it are easy to evaluate. In some cases, it is the trace of the matrix rather than the diagonal that is needed. This paper describes methods for estimating diagonals and traces of matrices in these situations. The goal is to obtain a good estimate of the diagonal by applying only a small number of matrix–vector products, using selected vectors. We begin by considering the use of random test vectors and then explore special vectors obtained from Hadamard matrices. The methods are tested in the context of computational materials science to estimate the diagonal of the density matrix which holds the charge densities. Numerical experiments indicate that the diagonal estimator may offer an alternative method that in some cases can greatly reduce computational costs in electronic structures calculations.}
}

@article{benmerzoug_re_2023,
title = {[{{Re}}] {{If}} You like {{Shapley}}, Then You'll Love the Core},
author = {Benmerzoug, Anes and Delgado, Miguel de Benito},
@@ -345,6 +362,19 @@ @inproceedings{schoch_csshapley_2022
keywords = {notion}
}

@book{trefethen_numerical_1997,
title = {Numerical {{Linear Algebra}}},
author = {Trefethen, Lloyd N. and Bau, III, David},
date = {1997-01},
publisher = {{Society for Industrial and Applied Mathematics}},
location = {Philadelphia, PA},
doi = {10.1137/1.9780898719574},
url = {https://epubs.siam.org/doi/book/10.1137/1.9780898719574},
urldate = {2024-03-19},
isbn = {978-0-89871-361-9 978-0-89871-957-4},
langid = {english}
}

@inproceedings{wang_data_2022,
title = {Data {{Banzhaf}}: {{A Robust Data Valuation Framework}} for {{Machine Learning}}},
shorttitle = {Data {{Banzhaf}}},
90 changes: 71 additions & 19 deletions docs/influence/influence_function_model.md
@@ -1,7 +1,9 @@
In almost every practical application it is not possible to construct, let
alone invert, the complete Hessian in memory. pyDVL offers several
implementations of the interface [InfluenceFunctionModel
][pydvl.influence.base_influence_function_model.InfluenceFunctionModel],
which do not compute the full Hessian (in contrast to [DirectInfluence
][pydvl.influence.torch.influence_function_model.DirectInfluence]).


### Conjugate Gradient
Expand All @@ -11,26 +13,47 @@ method that does not require the explicit inversion of the Hessian. Instead, it
only requires the calculation of Hessian-vector products, making it a good
choice for large datasets or models with many parameters. It is nevertheless
much slower to converge than the direct inversion method and not as accurate.

More info on the theory of conjugate gradient can be found on
[Wikipedia](https://en.wikipedia.org/wiki/Conjugate_gradient_method), or in
textbooks such as [@trefethen_numerical_1997, Lecture 38].
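
For reference, a minimal (unpreconditioned) conjugate gradient loop for
$(H + \lambda I) x = b$ can be sketched in a few lines. This is an
illustration only, not pyDVL's implementation; `hvp` stands for any callable
computing Hessian-vector products, and the function name is ours:

```python
import torch

def conjugate_gradient(hvp, b, reg=0.0, maxiter=100, rtol=1e-7):
    # Solve (H + reg * I) x = b for a symmetric positive definite H,
    # touching H only through Hessian-vector products.
    x = torch.zeros_like(b)
    r = b.clone()                 # residual b - (H + reg*I) x for x = 0
    p = r.clone()                 # search direction
    rs_old = r.dot(r)
    for _ in range(maxiter):
        ap = hvp(p) + reg * p
        alpha = rs_old / p.dot(ap)
        x = x + alpha * p
        r = r - alpha * ap
        rs_new = r.dot(r)
        if torch.sqrt(rs_new) <= rtol * torch.linalg.norm(b):
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```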

pyDVL also implements a stable block variant of the conjugate
gradient method, defined in [@ji_breakdownfree_2017], which solves several
right-hand sides simultaneously.

Optionally, the user can provide a pre-conditioner to improve convergence, such
as a [Jacobi pre-conditioner
][pydvl.influence.torch.pre_conditioner.JacobiPreConditioner], which
is a simple [diagonal pre-conditioner](
https://en.wikipedia.org/wiki/Preconditioner#Jacobi_(or_diagonal)_preconditioner)
based on Hutchinson's diagonal estimator [@bekas_estimator_2007],
or a [Nyström approximation based pre-conditioner
][pydvl.influence.torch.pre_conditioner.NystroemPreConditioner],
described in [@frangella_randomized_2023].

```python
from pydvl.influence.torch import CgInfluence
from pydvl.influence.torch.pre_conditioner import NystroemPreConditioner

if_model = CgInfluence(
model,
loss,
hessian_regularization=0.0,
x0=None,
rtol=1e-7,
atol=1e-7,
maxiter=None,
use_block_cg=True,
pre_conditioner=NystroemPreConditioner(rank=10)
)
if_model.fit(train_loader)
```

The additional optional parameters `rtol`, `atol`, `maxiter`, `use_block_cg`
and `pre_conditioner` are, respectively, the relative tolerance, the absolute
tolerance, the maximum number of iterations, a flag indicating whether to use
the block variant of CG, and an optional pre-conditioner.
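
For intuition, the diagonal estimate behind the [JacobiPreConditioner
][pydvl.influence.torch.pre_conditioner.JacobiPreConditioner] can be sketched
with Hutchinson's estimator in plain PyTorch. This is an illustration only,
not pyDVL's implementation; `hvp` denotes any Hessian-vector product callable
and the function name is ours:

```python
import torch

def hutchinson_diagonal(hvp, dim, num_samples=100, dtype=torch.float64):
    # Estimate diag(H) from Hessian-vector products only: for Rademacher
    # vectors v with i.i.d. +/-1 entries, E[v * (H v)] equals diag(H)
    # [@bekas_estimator_2007].
    estimate = torch.zeros(dim, dtype=dtype)
    for _ in range(num_samples):
        v = torch.sign(torch.randn(dim, dtype=dtype))  # random +/-1 entries
        estimate += v * hvp(v)
    return estimate / num_samples
```

A Jacobi pre-conditioner then rescales the CG residuals by the reciprocal of
this (regularized) diagonal estimate.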


### Linear time Stochastic Second-Order Approximation (LiSSA)
Expand Down Expand Up @@ -62,6 +85,7 @@ if_model = LissaInfluence(
h0=None,
rtol=1e-4,
)
if_model.fit(train_loader)
```

with the additional optional parameters `maxiter`, `dampen`, `scale`, `h0`, and
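
To see where `scale` and `dampen` enter, the LiSSA update can be sketched as a
simple fixed-point iteration. This is an illustration only, not pyDVL's
implementation; `hvp` stands for any Hessian-vector product callable and the
function name is ours:

```python
import torch

def lissa_inverse_hvp(hvp, v, scale=10.0, dampen=0.0, maxiter=1000, rtol=1e-4):
    # Iterate x_{k+1} = v + x_k - (H x_k + dampen * x_k) / scale. If the
    # iteration contracts, its fixed point is scale * (H + dampen * I)^{-1} v.
    x = v.clone()
    for _ in range(maxiter):
        x_new = v + x - (hvp(x) + dampen * x) / scale
        if torch.linalg.norm(x_new - x) <= rtol * torch.linalg.norm(x):
            x = x_new
            break
        x = x_new
    return x / scale  # undo the scaling to obtain (H + dampen * I)^{-1} v
```
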
@@ -94,25 +118,50 @@ if_model = ArnoldiInfluence(
rank_estimate=10,
tol=1e-6,
)
if_model.fit(train_loader)
```

### Eigenvalue Corrected K-FAC

K-FAC, short for Kronecker-Factored Approximate Curvature, is a method that
approximates the Fisher information matrix [FIM](https://en.wikipedia.org/wiki/Fisher_information) of a model.
It is possible to show that for classification models with appropriate loss
functions the FIM is equal to the Hessian of the model’s loss over the dataset.
In this restricted but nonetheless important context K-FAC offers an efficient
way to approximate the Hessian and hence the influence scores.
For more info and details refer to the original paper [@martens_optimizing_2015].

The K-FAC method is implemented in the class [EkfacInfluence
][pydvl.influence.torch.influence_function_model.EkfacInfluence].
The following code snippet shows how to use the K-FAC method to calculate the
influence function of a model. Note that, in contrast to the other methods for
influence function calculation, K-FAC does not require the loss function as an
input. This is because the current implementation is only applicable to
classification models with a cross entropy loss function.

```python
from pydvl.influence.torch import EkfacInfluence
if_model = EkfacInfluence(
model,
hessian_regularization=0.0,
)
if_model.fit(train_loader)
```

Upon initialization, the K-FAC method will parse the model and extract which
layers require grad and which do not. Then it will only calculate the influence
scores for the layers that require grad. The current implementation of the
K-FAC method is only available for linear layers, and therefore, if the model
contains non-linear layers that require gradients, the K-FAC method will raise
a `NotImplementedLayerRepresentationException`.
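
When a model mixes linear layers with unsupported ones, one pragmatic
workaround is to disable gradients for everything that is not a
`torch.nn.Linear` before fitting, so that only supported layers enter the
computation. The following is a sketch in plain PyTorch, not an official
pyDVL utility:

```python
import torch

# Restrict gradient computation to linear layers so that EkfacInfluence only
# sees layers it can represent.
for module in model.modules():
    if not isinstance(module, torch.nn.Linear):
        for param in module.parameters(recurse=False):
            param.requires_grad_(False)
```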

A further improvement of the K-FAC method is the Eigenvalue Corrected
K-FAC (EKFAC) method [@george_fast_2018], which further re-fits the
eigenvalues of the Hessian, thus providing a more accurate approximation.
On top of the K-FAC method, the EKFAC method is implemented by setting
`update_diagonal=True` when initialising [EkfacInfluence
][pydvl.influence.torch.influence_function_model.EkfacInfluence].
The following code snippet shows how to use the EKFAC method to calculate the
influence function of a model.

```python
from pydvl.influence.torch import EkfacInfluence
@@ -129,10 +178,12 @@ if_model.fit(train_loader)
This approximation is based on a Nyström low-rank approximation of the form

\begin{align*}
H_{\text{nys}} &= (H\Omega)(\Omega^TH\Omega)^{\dagger}(H\Omega)^T \\\
&= U \Lambda U^T,
\end{align*}

where $(\cdot)^{\dagger}$ denotes the [Moore-Penrose inverse](
https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse),
in combination with the [Sherman–Morrison–Woodbury formula](
https://en.wikipedia.org/wiki/Woodbury_matrix_identity) to calculate the
action of its inverse:
@@ -142,8 +193,8 @@ action of its inverse:
\frac{1}{\lambda}(I−UU^T)x,
\end{equation*}

see also [@hataya_nystrom_2023] and [@frangella_randomized_2023]. The essential
parameter is the rank of the approximation.

```python
from pydvl.influence.torch import NystroemSketchInfluence
@@ -156,6 +207,7 @@ if_model = NystroemSketchInfluence(
if_model.fit(train_loader)
```
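
To make the algebra above concrete, the randomized Nyström factorization and
the action of the regularized inverse can be sketched directly against a
callable computing products of the Hessian with a matrix of probe vectors.
This is an illustration only, not pyDVL's internal implementation, and the
function names are ours:

```python
import math
import torch

def randomized_nystroem(hessian_mat_prod, dim, rank, dtype=torch.float64):
    # Sketch H with a Gaussian test matrix Omega and return U, lam with
    # H_nys = U @ diag(lam) @ U.T, using the shifted, Cholesky-based
    # construction of [@frangella_randomized_2023].
    omega = torch.randn(dim, rank, dtype=dtype)
    y = hessian_mat_prod(omega)                    # Y = H @ Omega
    shift = (
        math.sqrt(dim)
        * torch.finfo(dtype).eps
        * torch.linalg.matrix_norm(y, ord=2)       # stabilizing shift
    )
    y_shifted = y + shift * omega
    core = omega.T @ y_shifted                     # Omega^T (Y + shift * Omega)
    c = torch.linalg.cholesky(0.5 * (core + core.T))
    b = torch.linalg.solve_triangular(c, y_shifted.T, upper=False).T
    u, sigma, _ = torch.linalg.svd(b, full_matrices=False)   # H_nys = B B^T
    lam = torch.clamp(sigma**2 - shift, min=0.0)   # remove the shift again
    return u, lam

def nystroem_inverse_action(u, lam, reg, x):
    # Action of (H_nys + reg * I)^{-1} on x, cf. the formula in the text.
    ut_x = u.T @ x
    return u @ (ut_x / (lam + reg)) + (x - u @ ut_x) / reg
```

The essential parameter, as stated above, is the rank of the approximation;
the regularization $\lambda$ plays the role of `reg` in this sketch.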

These implementations represent the calculation logic on in-memory tensors.
To scale up to large collections of data, we map these influence function
models over these collections. For a detailed discussion see the
documentation page [Scaling Computation](scaling_computation.md).
134 changes: 40 additions & 94 deletions notebooks/influence_wine.ipynb


1 change: 1 addition & 0 deletions src/pydvl/influence/torch/__init__.py
@@ -6,3 +6,4 @@
LissaInfluence,
NystroemSketchInfluence,
)
from .pre_conditioner import JacobiPreConditioner, NystroemPreConditioner
78 changes: 56 additions & 22 deletions src/pydvl/influence/torch/functional.py
@@ -48,8 +48,11 @@
"create_per_sample_mixed_derivative_function",
"model_hessian_low_rank",
"LowRankProductRepresentation",
"randomized_nystroem_approximation",
"model_hessian_nystroem_approximation",
]


logger = logging.getLogger(__name__)


@@ -169,8 +172,10 @@ def create_empirical_loss_function(
on a given dataset. If we denote the model parameters with \( \theta \),
the resulting function approximates:
\[
f(\theta) = \frac{1}{N}\sum_{i=1}^N
\operatorname{loss}(y_i, \operatorname{model}(\theta, x_i))
\]
for a loss function $\operatorname{loss}$ and a model $\operatorname{model}$
with model parameters $\theta$, where $N$ is the number of all elements provided
Expand Down Expand Up @@ -848,19 +853,31 @@ def randomized_nystroem_approximation(
) -> LowRankProductRepresentation:
r"""
Given a matrix vector product function (representing a symmetric positive definite
matrix $A$ ), computes a random Nyström low rank approximation of
$A$ in factored form, i.e.
$$ A_{\text{nys}} = (A \Omega)(\Omega^T A \Omega)^{\dagger}(A \Omega)^T
= U \Sigma U^T $$
where $\Omega$ is a standard normal random matrix.
Args:
mat_mat_prod: A callable representing the matrix vector product
input_dim: dimension of the input for the matrix vector product
input_type: data type of the inputs
rank: rank of the approximation
shift_func: optional function for computing the stabilizing shift in the
construction of the randomized nystroem approximation, defaults to
$$ \sqrt{\operatorname{\text{input_dim}}} \cdot
\varepsilon(\operatorname{\text{input_type}}) \cdot \|A\Omega\|_2,$$
where $\varepsilon(\operatorname{\text{input_type}})$ is the value of the
machine precision corresponding to the data type.
mat_vec_device: device where the matrix vector product has to be executed
Returns:
object containing $U$ and $\Sigma$
"""

if shift_func is None:
@@ -924,14 +941,31 @@ def model_hessian_nystroem_approximation(
rank: int,
shift_func: Optional[Callable[[torch.Tensor], torch.Tensor]] = None,
) -> LowRankProductRepresentation:
"""
r"""
Given a model, loss and a data_loader, computes a random Nyström low rank approximation of
the corresponding Hessian matrix in factored form, i.e.
$$ H_{\text{nys}} = (H \Omega)(\Omega^T H \Omega)^{+}(H \Omega)^T
= U \Sigma U^T $$
Args:
model: A PyTorch model instance. The Hessian will be calculated with respect to
this model's parameters.
loss: A callable that computes the loss.
data_loader: A DataLoader instance that provides the model's training data.
Used in calculating the Hessian-vector products.
rank: rank of the approximation
shift_func: optional function for computing the stabilizing shift in the
construction of the randomized nystroem approximation, defaults to
$$ \sqrt{\operatorname{\text{input_dim}}} \cdot
\varepsilon(\operatorname{\text{input_type}}) \cdot \|A\Omega\|_2,$$
where $\varepsilon(\operatorname{\text{input_type}})$ is the value of the
machine precision corresponding to the data type.
Returns:
object containing $U$ and $\Sigma$
"""

model_hvp = create_hvp_function(