
Feature/506 nystroem pcg #507

Merged
merged 22 commits on Mar 20, 2024
Changes from all commits
Commits
22 commits
60ad4e9
Extend functionality of CgInfluence:
Mar 4, 2024
8fd7ae9
Fix docstrings, extend documentation
Mar 6, 2024
0a89149
Use block conjugate gradient in notebook
Mar 6, 2024
6b1a8e8
Merge remote-tracking branch 'origin/develop' into feature/506-nystro…
Mar 11, 2024
0f6f5ff
Update CHANGELOG.md
Mar 11, 2024
fb6ff7d
Merge remote-tracking branch 'origin/develop' into feature/506-nystro…
schroedk Mar 12, 2024
43bbea5
Update entry in bibfile to reference the published version instead of …
schroedk Mar 12, 2024
b3731c5
Mention randomized Nyström preconditioning and stable block conjugate…
schroedk Mar 12, 2024
9f1545d
Mention block variant and pre-conditioning in documentation of cg method
schroedk Mar 12, 2024
828c0f8
Remove unused import in influence_wine notebook
schroedk Mar 19, 2024
f3343c5
Reformatting CHANGELOG.md to keep line limit
schroedk Mar 19, 2024
7e3e1da
Merge remote-tracking branch 'origin/develop' into feature/506-nystro…
schroedk Mar 19, 2024
4d2bad0
Update bibtex reference file
schroedk Mar 20, 2024
0780903
Refactor pre-conditioners:
schroedk Mar 20, 2024
dc366c7
Refactor CgInfluence:
schroedk Mar 20, 2024
ef96931
Import pre-conditioners in init of influence.torch
schroedk Mar 20, 2024
fe384de
Improve documentation of functions for computing a Nyström approximation
schroedk Mar 20, 2024
af4c407
Add a test for solving for a single right hand side with the block var…
schroedk Mar 20, 2024
31f33e5
Improve documentation of influence function models, in particular CgI…
schroedk Mar 20, 2024
11acf2e
Merge remote-tracking branch 'origin/develop' into feature/506-nystro…
schroedk Mar 20, 2024
7615d29
Merge remote-tracking branch 'origin/develop' into feature/506-nystro…
schroedk Mar 20, 2024
d3a163a
Merge remote-tracking branch 'origin/develop' into feature/506-nystro…
schroedk Mar 20, 2024
29 changes: 19 additions & 10 deletions CHANGELOG.md
@@ -4,16 +4,23 @@

### Added

- Implement new method: `NystroemSketchInfluence` [PR #504](https://github.com/aai-institute/pyDVL/pull/504)
- Implement new method: `NystroemSketchInfluence`
[PR #504](https://github.com/aai-institute/pyDVL/pull/504)
- Add property `model_dtype` to instances of type `TorchInfluenceFunctionModel`
- Implement a preconditioned block variant of conjugate gradient
[PR #507](https://github.com/aai-institute/pyDVL/pull/507)

### Fixed

- Bug in `LissaInfluence`, when not using CPU device [PR #495](https://github.com/aai-institute/pyDVL/pull/495)
- Memory issue with `CgInfluence` and `ArnoldiInfluence`[PR #498](https://github.com/aai-institute/pyDVL/pull/498)
- Raising specific error message with install instruction, when trying to load `pydvl.utils.cache.memcached`
without `pymemcache` installed. If `pymemcache` is available, all symbols from
`pydvl.utils.cache.memcached` are available through `pydvl.utils.cache`[PR #509](https://github.com/aai-institute/pyDVL/pull/509)
- Bug in `LissaInfluence`, when not using CPU device
[PR #495](https://github.com/aai-institute/pyDVL/pull/495)
- Memory issue with `CgInfluence` and `ArnoldiInfluence`
[PR #498](https://github.com/aai-institute/pyDVL/pull/498)
- Raising a specific error message with install instructions when trying to
load `pydvl.utils.cache.memcached` without `pymemcache` installed.
If `pymemcache` is available, all symbols from `pydvl.utils.cache.memcached`
are available through `pydvl.utils.cache`
[PR #509](https://github.com/aai-institute/pyDVL/pull/509)

### Miscellaneous

@@ -35,9 +42,10 @@
### Fixed

- Bug in using `DaskInfluenceCalcualator` with `TorchnumpyConverter`
for single dimensional arrays [PR #485](https://github.com/aai-institute/pyDVL/pull/485)
- Fix implementations of `to` methods of `TorchInfluenceFunctionModel` implementations
[PR #487](https://github.com/aai-institute/pyDVL/pull/487)
for single dimensional arrays
[PR #485](https://github.com/aai-institute/pyDVL/pull/485)
- Fix implementations of `to` methods of `TorchInfluenceFunctionModel`
implementations [PR #487](https://github.com/aai-institute/pyDVL/pull/487)
- Fixed bug with checking for converged values in semivalues
[PR #341](https://github.com/appliedAI-Initiative/pyDVL/pull/341)

@@ -56,7 +64,8 @@
- New influence function interface `InfluenceFunctionModel`
- Data parallel computation with `DaskInfluenceCalculator`
[PR #26](https://github.com/aai-institute/pyDVL/issues/26)
- Sequential batch-wise computation and write to disk with `SequentialInfluenceCalculator`
- Sequential batch-wise computation and write to disk with
`SequentialInfluenceCalculator`
[PR #377](https://github.com/aai-institute/pyDVL/issues/377)
- Adapt notebooks to new influence abstractions
[PR #430](https://github.com/aai-institute/pyDVL/issues/430)
30 changes: 30 additions & 0 deletions docs/assets/pydvl.bib
@@ -13,6 +13,23 @@ @article{agarwal_secondorder_2017
langid = {english}
}

@article{bekas_estimator_2007,
title = {An Estimator for the Diagonal of a Matrix},
author = {Bekas, C. and Kokiopoulou, E. and Saad, Y.},
date = {2007-11-01},
journaltitle = {Applied Numerical Mathematics},
shortjournal = {Applied Numerical Mathematics},
series = {Numerical {{Algorithms}}, {{Parallelism}} and {{Applications}} (2)},
volume = {57},
number = {11},
pages = {1214--1229},
issn = {0168-9274},
doi = {10.1016/j.apnum.2007.01.003},
url = {https://www.sciencedirect.com/science/article/pii/S0168927407000244},
urldate = {2024-03-19},
abstract = {A number of applications require to compute an approximation of the diagonal of a matrix when this matrix is not explicitly available but matrix–vector products with it are easy to evaluate. In some cases, it is the trace of the matrix rather than the diagonal that is needed. This paper describes methods for estimating diagonals and traces of matrices in these situations. The goal is to obtain a good estimate of the diagonal by applying only a small number of matrix–vector products, using selected vectors. We begin by considering the use of random test vectors and then explore special vectors obtained from Hadamard matrices. The methods are tested in the context of computational materials science to estimate the diagonal of the density matrix which holds the charge densities. Numerical experiments indicate that the diagonal estimator may offer an alternative method that in some cases can greatly reduce computational costs in electronic structures calculations.}
}

@article{benmerzoug_re_2023,
title = {[{{Re}}] {{If}} You like {{Shapley}}, Then You'll Love the Core},
author = {Benmerzoug, Anes and Delgado, Miguel de Benito},
Expand Down Expand Up @@ -345,6 +362,19 @@ @inproceedings{schoch_csshapley_2022
keywords = {notion}
}

@book{trefethen_numerical_1997,
title = {Numerical {{Linear Algebra}}},
author = {Trefethen, Lloyd N. and Bau, Iii, David},
date = {1997-01},
publisher = {{Society for Industrial and Applied Mathematics}},
location = {Philadelphia, PA},
doi = {10.1137/1.9780898719574},
url = {https://epubs.siam.org/doi/book/10.1137/1.9780898719574},
urldate = {2024-03-19},
isbn = {978-0-89871-361-9 978-0-89871-957-4},
langid = {english}
}

@inproceedings{wang_data_2022,
title = {Data {{Banzhaf}}: {{A Robust Data Valuation Framework}} for {{Machine Learning}}},
shorttitle = {Data {{Banzhaf}}},
90 changes: 71 additions & 19 deletions docs/influence/influence_function_model.md
@@ -1,7 +1,9 @@
In almost every practical application it is not possible to construct, even less
invert the complete Hessian in memory. pyDVL offers several implementations of the interface
[InfluenceFunctionModel][pydvl.influence.base_influence_function_model.InfluenceFunctionModel], which do not compute
the full Hessian (in contrast to [DirectInfluence][pydvl.influence.torch.influence_function_model.DirectInfluence]).
invert the complete Hessian in memory. pyDVL offers several implementations of
the interface [InfluenceFunctionModel
][pydvl.influence.base_influence_function_model.InfluenceFunctionModel],
which do not compute the full Hessian (in contrast to [DirectInfluence
][pydvl.influence.torch.influence_function_model.DirectInfluence]).


### Conjugate Gradient
@@ -11,26 +13,47 @@ method that does not require the explicit inversion of the Hessian. Instead, it
only requires the calculation of Hessian-vector products, making it a good
choice for large datasets or models with many parameters. It is nevertheless
much slower to converge than the direct inversion method and not as accurate.

More info on the theory of conjugate gradient can be found on
[Wikipedia](https://en.wikipedia.org/wiki/Conjugate_gradient_method).
[Wikipedia](https://en.wikipedia.org/wiki/Conjugate_gradient_method), or in
text books such as [@trefethen_numerical_1997, Lecture 38].
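
As a rough, self-contained sketch (not pyDVL's implementation), plain conjugate
gradient only ever touches the Hessian through a Hessian-vector product; the
`hvp` callable below is an assumed helper for illustration.

```python
import torch

def cg_solve(hvp, b, maxiter=100, rtol=1e-7):
    """Solve H x = b, given only hvp(v) = H @ v for symmetric positive definite H."""
    x = torch.zeros_like(b)
    r = b.clone()              # residual b - H x for the initial guess x = 0
    p = r.clone()              # current search direction
    rs_old = r.dot(r)
    for _ in range(maxiter):
        hp = hvp(p)
        alpha = rs_old / p.dot(hp)
        x = x + alpha * p
        r = r - alpha * hp
        rs_new = r.dot(r)
        if rs_new.sqrt() <= rtol * b.norm():
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```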

pyDVL also implements a stable block variant of the conjugate
gradient method, defined in [@ji_breakdownfree_2017], which solves several
right hand sides simultaneously.

Optionally, the user can provide a pre-conditioner to improve convergence, such
as a [Jacobi pre-conditioner
][pydvl.influence.torch.pre_conditioner.JacobiPreConditioner], which
is a simple [diagonal pre-conditioner](
https://en.wikipedia.org/wiki/Preconditioner#Jacobi_(or_diagonal)_preconditioner)
based on Hutchinson's diagonal estimator [@bekas_estimator_2007],
or a [Nyström approximation based pre-conditioner
][pydvl.influence.torch.pre_conditioner.NystroemPreConditioner],
described in [@frangella_randomized_2023].

```python
from pydvl.influence.torch import CgInfluence
from pydvl.influence.torch.pre_conditioner import NystroemPreConditioner

if_model = CgInfluence(
model,
loss,
hessian_regularization=0.0,
x0=None,
rtol=1e-7,
atol=1e-7,
maxiter=None,
use_block_cg=True,
pre_conditioner=NystroemPreConditioner(rank=10)
)
if_model.fit(train_loader)
```

The additional optional parameters `x0`, `rtol`, `atol`, and `maxiter` are
respectively the initial guess for the solution, the relative
tolerance, the absolute tolerance, and the maximum number of iterations.
The additional optional parameters `rtol`, `atol`, `maxiter`, `use_block_cg` and
`pre_conditioner` are, respectively, the relative tolerance, the absolute
tolerance, the maximum number of iterations, a flag indicating whether to use
the block variant of conjugate gradient, and an optional pre-conditioner.
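
For intuition, here is a minimal sketch of Hutchinson's diagonal estimator
[@bekas_estimator_2007] that underlies the Jacobi pre-conditioner: the diagonal
of the Hessian is estimated from Hessian-vector products with random sign
vectors. The `hvp` helper is an assumption for this illustration and not part
of pyDVL's API; in pyDVL one simply passes a `JacobiPreConditioner` instance
via the `pre_conditioner` argument, analogously to the `NystroemPreConditioner`
in the snippet above.

```python
import torch

def estimate_diagonal(hvp, dim, num_samples=32, dtype=torch.float32):
    """Approximate diag(H) using only products v -> H @ v."""
    diag = torch.zeros(dim, dtype=dtype)
    for _ in range(num_samples):
        v = (torch.randint(0, 2, (dim,)) * 2 - 1).to(dtype)  # Rademacher +/-1 entries
        diag += v * hvp(v)  # E[v_i * (H v)_i] = H_ii for Rademacher vectors
    return diag / num_samples
```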


### Linear time Stochastic Second-Order Approximation (LiSSA)
@@ -62,6 +85,7 @@ if_model = LissaInfluence(
h0=None,
rtol=1e-4,
)
if_model.fit(train_loader)
```

with the additional optional parameters `maxiter`, `dampen`, `scale`, `h0`, and
@@ -94,25 +118,50 @@ if_model = ArnoldiInfluence(
rank_estimate=10,
tol=1e-6,
)
if_model.fit(train_loader)
```

### Eigenvalue Corrected K-FAC

K-FAC, short for Kronecker-Factored Approximate Curvature, is a method that approximates the Fisher information matrix [FIM](https://en.wikipedia.org/wiki/Fisher_information) of a model. It is possible to show that for classification models with appropriate loss functions the FIM is equal to the Hessian of the model’s loss over the dataset. In this restricted but nonetheless important context K-FAC offers an efficient way to approximate the Hessian and hence the influence scores.
K-FAC, short for Kronecker-Factored Approximate Curvature, is a method that
approximates the Fisher information matrix [FIM](https://en.wikipedia.org/wiki/Fisher_information) of a model.
It is possible to show that for classification models with appropriate loss
functions the FIM is equal to the Hessian of the model’s loss over the dataset.
In this restricted but nonetheless important context K-FAC offers an efficient
way to approximate the Hessian and hence the influence scores.
For more info and details refer to the original paper [@martens_optimizing_2015].
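
As a rough sketch of the underlying idea (using the standard notation of the
K-FAC literature rather than pyDVL-specific names), the Fisher block of a
linear layer $l$ is approximated by a Kronecker product of two small covariance
matrices,

\begin{equation*}
F_l \approx A_{l-1} \otimes G_l, \qquad
A_{l-1} = \mathbb{E}\left[a_{l-1} a_{l-1}^T\right], \qquad
G_l = \mathbb{E}\left[g_l g_l^T\right],
\end{equation*}

where $a_{l-1}$ denotes the layer's inputs and $g_l$ the gradients of the loss
with respect to its pre-activations. Since
$(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}$, inverting the approximation only
requires inverting the two much smaller factors.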

The K-FAC method is implemented in the class [EkfacInfluence](pydvl/influence/torch/influence_function_model.py). The following code snippet shows how to use the K-FAC method to calculate the influence function of a model. Note that, in contrast to the other methods for influence function calculation, K-FAC does not require the loss function as an input. This is because the current implementation is only applicable to classification models with a cross entropy loss function.
The K-FAC method is implemented in the class [EkfacInfluence
][pydvl.influence.torch.influence_function_model.EkfacInfluence].
The following code snippet shows how to use the K-FAC method to calculate the
influence function of a model. Note that, in contrast to the other methods for
influence function calculation, K-FAC does not require the loss function as an
input. This is because the current implementation is only applicable to
classification models with a cross entropy loss function.

```python
from pydvl.influence.torch import EkfacInfluence
if_model = EkfacInfluence(
model,
hessian_regularization=0.0,
)
if_model.fit(train_loader)
```
Upon initialization, the K-FAC method will parse the model and extract which layers require grad and which do not. Then it will only calculate the influence scores for the layers that require grad. The current implementation of the K-FAC method is only available for linear layers, and therefore if the model contains non-linear layers that require gradient the K-FAC method will raise a NotImplementedLayerRepresentationException.

A further improvement of the K-FAC method is the Eigenvalue Corrected K-FAC (EKFAC) method [@george_fast_2018], which allows to further re-fit the eigenvalues of the Hessian, thus providing a more accurate approximation. On top of the K-FAC method, the EKFAC method is implemented by setting `update_diagonal=True` when initialising [EkfacInfluence](pydvl/influence/torch/influence_function_model.py). The following code snippet shows how to use the EKFAC method to calculate the influence function of a model.
Upon initialization, the K-FAC method will parse the model and extract which
layers require grad and which do not. Then it will only calculate the influence
scores for the layers that require grad. The current implementation of the
K-FAC method is only available for linear layers; therefore, if the model
contains non-linear layers that require gradients, the K-FAC method will raise
a `NotImplementedLayerRepresentationException`.

A further improvement of the K-FAC method is the Eigenvalue Corrected
K-FAC (EKFAC) method [@george_fast_2018], which allows further re-fitting of
the eigenvalues of the Hessian, thus providing a more accurate approximation.
On top of the K-FAC method, the EKFAC method is implemented by setting
`update_diagonal=True` when initialising [EkfacInfluence
][pydvl.influence.torch.influence_function_model.EkfacInfluence].
The following code snippet shows how to use the EKFAC method to calculate the
influence function of a model.

```python
from pydvl.influence.torch import EkfacInfluence
@@ -129,10 +178,12 @@ if_model.fit(train_loader)
This approximation is based on a Nyström low-rank approximation of the form

\begin{align*}
H_{\text{nys}} &= (H\Omega)(\Omega^TH\Omega)^{+}(H\Omega)^T \\\
&= U \Lambda U^T
H_{\text{nys}} &= (H\Omega)(\Omega^TH\Omega)^{\dagger}(H\Omega)^T \\\
&= U \Lambda U^T,
\end{align*}

where $(\cdot)^{\dagger}$ denotes the [Moore-Penrose inverse](
https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse),
in combination with the [Sherman–Morrison–Woodbury formula](
https://en.wikipedia.org/wiki/Woodbury_matrix_identity) to calculate the
action of its inverse:
@@ -142,8 +193,8 @@ action of its inverse:
\frac{1}{\lambda}(I−UU^T)x,
\end{equation*}

see also [@hataya_nystrom_2023] and [@frangella_randomized_2021]. The essential parameter is the rank of the
approximation.
see also [@hataya_nystrom_2023] and [@frangella_randomized_2023]. The essential
parameter is the rank of the approximation.

```python
from pydvl.influence.torch import NystroemSketchInfluence
@@ -156,6 +207,7 @@ if_model = NystroemSketchInfluence(
if_model.fit(train_loader)
```
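
For illustration, the following is a minimal, un-stabilized sketch of how such
a factored Nyström approximation can be assembled from Hessian-vector products
alone; the `hvp` helper and the function name are assumptions for this example,
while pyDVL's actual, numerically stabilized construction lives in
`randomized_nystroem_approximation` in `pydvl.influence.torch.functional`.

```python
import torch

def nystroem_sketch(hvp, dim, rank):
    """Return U, Lambda such that H_nys = U @ diag(Lambda) @ U.T."""
    omega = torch.randn(dim, rank)                                    # random test matrix
    y = torch.stack([hvp(omega[:, i]) for i in range(rank)], dim=1)   # Y = H @ Omega
    core = omega.T @ y                                                # Omega^T H Omega
    eigvals, eigvecs = torch.linalg.eigh(core)
    # pseudo-inverse square root of the small core matrix
    inv_sqrt = eigvecs @ torch.diag(eigvals.clamp(min=1e-12).rsqrt()) @ eigvecs.T
    b = y @ inv_sqrt                                                  # then H_nys = B @ B.T
    u, s, _ = torch.linalg.svd(b, full_matrices=False)
    return u, s ** 2
```

The rank trades accuracy for cost: a larger rank captures more of the Hessian's
spectrum but requires more Hessian-vector products.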

These implementations represent the calculation logic on in memory tensors. To scale up to large collection
of data, we map these influence function models over these collections. For a detailed discussion see the
These implementations represent the calculation logic on in-memory tensors.
To scale up to large collections of data, we map these influence function
models over these collections. For a detailed discussion see the
documentation page [Scaling Computation](scaling_computation.md).
134 changes: 40 additions & 94 deletions notebooks/influence_wine.ipynb

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions src/pydvl/influence/torch/__init__.py
@@ -6,3 +6,4 @@
LissaInfluence,
NystroemSketchInfluence,
)
from .pre_conditioner import JacobiPreConditioner, NystroemPreConditioner
78 changes: 56 additions & 22 deletions src/pydvl/influence/torch/functional.py
@@ -48,8 +48,11 @@
"create_per_sample_mixed_derivative_function",
"model_hessian_low_rank",
"LowRankProductRepresentation",
"randomized_nystroem_approximation",
"model_hessian_nystroem_approximation",
]


logger = logging.getLogger(__name__)


@@ -169,8 +172,10 @@ def create_empirical_loss_function(
on a given dataset. If we denote the model parameters with \( \theta \),
the resulting function approximates:

\[ f(\theta) = \frac{1}{N}\sum_{i=1}^N
\operatorname{loss}(y_i, \operatorname{model}(\theta, x_i)) \]
\[
f(\theta) = \frac{1}{N}\sum_{i=1}^N
\operatorname{loss}(y_i, \operatorname{model}(\theta, x_i))
\]

for a loss function $\operatorname{loss}$ and a model $\operatorname{model}$
with model parameters $\theta$, where $N$ is the number of all elements provided
@@ -848,19 +853,31 @@ def randomized_nystroem_approximation(
) -> LowRankProductRepresentation:
r"""
Given a matrix vector product function (representing a symmetric positive definite
matrix \(A\) ), computes a random Nyström low rank approximation of
\(A\) in factored form, i.e.

\[ A_{\text{nys}} = (A \Omega)(\Omega^T A \Omega)^{\Cross}(A \Omega)^T
= U \Sigma U^T\]

:param mat_mat_prod: A callable representing the matrix vector product
:param input_dim: dimension of the input for the matrix vector product
:param input_type:
:param rank: rank of the approximation
:param shift_func:
:param mat_vec_device: device where the matrix vector product has to be executed
:return: object containing, \(U\) and \(\Sigma\)
matrix $A$ ), computes a random Nyström low rank approximation of
$A$ in factored form, i.e.

$$ A_{\text{nys}} = (A \Omega)(\Omega^T A \Omega)^{\dagger}(A \Omega)^T
= U \Sigma U^T $$

where $\Omega$ is a standard normal random matrix.

Args:
mat_mat_prod: A callable representing the matrix vector product
input_dim: dimension of the input for the matrix vector product
input_type: data_type of inputs
rank: rank of the approximation
shift_func: optional function for computing the stabilizing shift in the
construction of the randomized nystroem approximation, defaults to

$$ \sqrt{\operatorname{\text{input_dim}}} \cdot
\varepsilon(\operatorname{\text{input_type}}) \cdot \|A\Omega\|_2,$$

where $\varepsilon(\operatorname{\text{input_type}})$ is the value of the
machine precision corresponding to the data type.
mat_vec_device: device where the matrix vector product has to be executed

Returns:
object containing $U$ and $\Sigma$
"""

if shift_func is None:
@@ -924,14 +941,31 @@ def model_hessian_nystroem_approximation(
rank: int,
shift_func: Optional[Callable[[torch.Tensor], torch.Tensor]] = None,
) -> LowRankProductRepresentation:
"""
r"""
Given a model, loss and a data_loader, computes a random Nyström low rank approximation of
the corresponding Hessian matrix in factored form, i.e.

:param model:
:param loss:
:param data_loader:
:param rank:
:param shift_func:
:return:
$$ H_{\text{nys}} = (H \Omega)(\Omega^T H \Omega)^{+}(H \Omega)^T
= U \Sigma U^T $$

Args:
model: A PyTorch model instance. The Hessian will be calculated with respect to
this model's parameters.
loss : A callable that computes the loss.
data_loader: A DataLoader instance that provides the model's training data.
Used in calculating the Hessian-vector products.
rank: rank of the approximation
shift_func: optional function for computing the stabilizing shift in the
construction of the randomized nystroem approximation, defaults to

$$ \sqrt{\operatorname{\text{input_dim}}} \cdot
\varepsilon(\operatorname{\text{input_type}}) \cdot \|A\Omega\|_2,$$

where $\varepsilon(\operatorname{\text{input_type}})$ is the value of the
machine precision corresponding to the data type.

Returns:
object containing $U$ and $\Sigma$
"""

model_hvp = create_hvp_function(