Merge branch 'release/v0.9.1'
schroedk committed Apr 22, 2024
2 parents 786458c + f5bc6c9 commit 123d01f
Showing 11 changed files with 169 additions and 180 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.9.0
current_version = 0.9.1
commit = False
tag = False
allow_dirty = False
10 changes: 9 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,14 @@
# Changelog

## 0.9.0 🆕 New methods, better docs and bugfixes 📚🐞
## Unreleased

### Fixed

- `FutureWarning` for `ParallelConfig` was raised constantly, even without
instantiating the object
[PR #562](https://github.com/aai-institute/pyDVL/pull/562)

## 0.9.0 - 🆕 New methods, better docs and bugfixes 📚🐞

### Added

6 changes: 3 additions & 3 deletions CITATION.cff
@@ -27,6 +27,6 @@ keywords:
- Banzhaf index
license: LGPL-3.0
commit: 0e929ae121820b0014bf245da1b21032186768cb
version: v0.7.0
doi: 10.5281/zenodo.8311583
date-released: '2023-09-02'
version: v0.9.0
doi: 10.5281/zenodo.10966754
date-released: '2024-04-12'
267 changes: 109 additions & 158 deletions README.md
@@ -16,10 +16,8 @@
<a href="https://zenodo.org/badge/latestdoi/354117916"><img src="https://zenodo.org/badge/354117916.svg" alt="DOI"></a>
</p>

**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.

Refer to the [Methods](https://pydvl.org/devel/getting-started/methods/)
page of our documentation for a list of all implemented methods.
**pyDVL** collects algorithms for **Data Valuation** and **Influence Function**
computation. Here is the list of [all methods implemented](https://pydvl.org/devel/getting-started/methods/).

**Data Valuation** for machine learning is the task of assigning a scalar
to each element of a training set which reflects its contribution to the final
@@ -29,7 +27,7 @@ pyDVL focuses on model-dependent methods.

<div align="center" style="text-align:center;">
<img
width="70%"
width="60%"
align="center"
style="display: block; margin-left: auto; margin-right: auto;"
src="https://pydvl.org/devel/value/img/mclc-best-removal-10k-natural.svg"
@@ -48,7 +46,7 @@ of training samples over individual test points.

<div align="center" style="text-align:center;">
<img
width="70%"
width="60%"
align="center"
style="display: block; margin-left: auto; margin-right: auto;"
src="https://pydvl.org/devel/examples/img/influence_functions_example.png"
@@ -82,180 +80,133 @@ $ pip install pyDVL[influence]
```

For more instructions and information refer to [Installing pyDVL
](https://pydvl.org/stable/getting-started/#installation) in the
documentation.
](https://pydvl.org/stable/getting-started/#installation) in the documentation.

# Usage

In the following subsections, we will showcase the usage of pyDVL
for Data Valuation and Influence Functions using simple examples.

For more instructions and information refer to [Getting
Started](https://pydvl.org/stable/getting-started/first-steps/) in
the documentation.
We provide several examples for data valuation
(e.g. [Shapley Data Valuation](https://pydvl.org/stable/examples/shapley_basic_spotify/))
and for influence functions
(e.g. [Influence Functions for Neural Networks](https://pydvl.org/stable/examples/influence_imagenet/))
with details on the algorithms and their applications.
Please read [Getting
Started](https://pydvl.org/stable/getting-started/first-steps/) in the
documentation for more instructions. We provide several examples for data
valuation and for influence functions in our [Example
Gallery](https://pydvl.org/stable/examples/).

## Influence Functions

For influence computation, follow these steps:

1. Import the necessary packages (the exact packages depend on your specific use case).

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from pydvl.influence.torch import DirectInfluence
from pydvl.influence.torch.util import NestedTorchCatAggregator, TorchNumpyConverter
from pydvl.influence import SequentialInfluenceCalculator
```

1. Import the necessary packages (the exact ones depend on your specific use case).
2. Create PyTorch data loaders for your train and test splits.

```python
input_dim = (5, 5, 5)
output_dim = 3
train_x = torch.rand((10, *input_dim))
train_y = torch.rand((10, output_dim))
test_x = torch.rand((5, *input_dim))
test_y = torch.rand((5, output_dim))

train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
```

3. Instantiate your neural network model.

```python
nn_architecture = nn.Sequential(
nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
nn.Flatten(),
nn.Linear(27, 3),
3. Instantiate your neural network model and define your loss function.
4. Instantiate an `InfluenceFunctionModel` and fit it to the training data.
5. For small input data, you can call the `influences()` method on the fitted
instance. The result is a tensor of shape `(training samples, test samples)`
that contains at index `(i, j)` the influence of training sample `i` on
test sample `j`.
6. For larger datasets, wrap the model into a "calculator" and call methods on
it. This splits the computation into smaller chunks and allows for lazy
evaluation and out-of-core computation.

The larger the absolute value of a training sample's influence on a test
sample, the more strongly it affects that test sample for the chosen model
and data loaders. The sign of the influence determines whether it is
useful (positive) or harmful (negative).

> **Note** pyDVL currently only supports PyTorch for Influence Functions. We
> plan to add support for Jax next.
```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from pydvl.influence import SequentialInfluenceCalculator
from pydvl.influence.torch import DirectInfluence
from pydvl.influence.torch.util import (
NestedTorchCatAggregator,
TorchNumpyConverter,
)
```

4. Define your loss:

```python
loss = nn.MSELoss()
```

5. Instantiate an `InfluenceFunctionModel` and fit it to the training data

```python
infl_model = DirectInfluence(nn_architecture, loss, hessian_regularization=0.01)
infl_model = infl_model.fit(train_data_loader)
```
input_dim = (5, 5, 5)
output_dim = 3
train_x, train_y = torch.rand((10, *input_dim)), torch.rand((10, output_dim))
test_x, test_y = torch.rand((5, *input_dim)), torch.rand((5, output_dim))
train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
model = nn.Sequential(
nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
nn.Flatten(),
nn.Linear(27, 3),
)
loss = nn.MSELoss()

6. For small input data, call the influence method on the fitted instance.

```python
influences = infl_model.influences(test_x, test_y, train_x, train_y)
```
The result is a tensor of shape `(training samples x test samples)`
that contains at index `(i, j)` the influence of training sample `i` on
test sample `j`.
infl_model = DirectInfluence(model, loss, hessian_regularization=0.01)
infl_model = infl_model.fit(train_data_loader)

7. For larger data, wrap the model into a
calculator and call methods on the calculator.
```python
infl_calc = SequentialInfluenceCalculator(infl_model)

# Lazy object providing arrays batch-wise in a sequential manner
lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)
# For small datasets, instantiate the full influence matrix:
influences = infl_model.influences(test_x, test_y, train_x, train_y)

# Trigger computation and pull results to memory
influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())
# For larger datasets, use the Influence calculators:
infl_calc = SequentialInfluenceCalculator(infl_model)

# Trigger computation and write results batch-wise to disk
lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
```

# Lazy object providing arrays batch-wise in a sequential manner
lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)

The higher the absolute value of the influence of a training sample
on a test sample, the more influential it is for the chosen test sample, model
and data loaders. The sign of the influence determines whether it is
useful (positive) or harmful (negative).
# Trigger computation and pull results to memory
influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())

> **Note** pyDVL currently only supports PyTorch for Influence Functions.
> We are planning to add support for Jax and perhaps TensorFlow or even Keras.
# Trigger computation and write results batch-wise to disk
lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
```
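
As a quick check of the output, one can rank training points by their average
influence over the test set. The following is a minimal sketch, not part of
pyDVL's documented API: it only assumes the in-memory `influences` tensor
computed above, with shape `(training samples, test samples)` as described.

```python
import torch

# Minimal sketch, assuming the `influences` tensor from the example above,
# of shape (training samples, test samples).
mean_influence = influences.mean(dim=1)  # average influence of each training sample
ranking = torch.argsort(mean_influence)  # ascending: most harmful (negative) first

print("Potentially harmful training samples:", ranking[:3].tolist())
print("Most helpful training samples:", ranking[-3:].tolist())
```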

## Data Valuation

The steps required to compute data values for your samples are:

1. Import the necessary packages (the exact packages depend on your specific use case).

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from pydvl.utils import Dataset, Scorer, Utility
from pydvl.value import (
compute_shapley_values,
ShapleyMode,
MaxUpdates,
)
```

1. Import the necessary packages (the exact ones will depend on your specific
use case).
2. Create a `Dataset` object with your train and test splits.

```python
data = Dataset.from_sklearn(
load_breast_cancer(),
train_size=10,
stratify_by_target=True,
random_state=16,
)
```

3. Create an instance of a `SupervisedModel` (basically any sklearn compatible
predictor).

```python
model = LogisticRegression()
```

4. Create a `Utility` object to wrap the Dataset, the model and a scoring
function.

```python
u = Utility(
model,
data,
Scorer("accuracy", default=0.0)
)
```

5. Use one of the methods defined in the library to compute the values.
In our example, we will use *Permutation Monte Carlo Shapley*,
an approximate method for computing Data Shapley values.

```python
values = compute_shapley_values(
u,
mode=ShapleyMode.PermutationMontecarlo,
done=MaxUpdates(100),
seed=16,
progress=True
)
```
The result is a variable of type `ValuationResult` that contains
the indices and their values as well as other attributes.

The higher the value for an index, the more important it is for the chosen
model, dataset and scorer.

6. (Optional) Convert the valuation result to a dataframe and analyze and visualize the values.

```python
df = values.to_dataframe(column="data_value")
```
predictor), and wrap it in a `Utility` object together with the data and a
scoring function.
4. Use one of the methods defined in the library to compute the values. In the
example below, we will use *Permutation Monte Carlo Shapley*, an approximate
method for computing Data Shapley values. The result is a variable of type
`ValuationResult` that contains the indices and their values as well as other
attributes.
5. Convert the valuation result to a dataframe, and analyze and visualize the
values.

The higher the value for an index, the more important it is for the chosen
model, dataset and scorer. Conversely, low-value points could be mislabelled
or out-of-distribution, and dropping them can improve the model's performance.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Scorer, Utility
from pydvl.value import (MaxUpdates, RelativeTruncation,
permutation_montecarlo_shapley)

data = Dataset.from_sklearn(
load_breast_cancer(),
train_size=10,
stratify_by_target=True,
random_state=16,
)
model = LogisticRegression()
u = Utility(
model,
data,
Scorer("accuracy", default=0.0)
)
values = permutation_montecarlo_shapley(
u,
truncation=RelativeTruncation(u, 0.05),
done=MaxUpdates(1000),
seed=16,
progress=True
)
df = values.to_dataframe(column="data_value")
```
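
Building on the interpretation above, a natural follow-up is to drop the
lowest-valued points and check whether the utility improves. This is a hedged
sketch, not part of the README's example: it assumes the `u`, `data` and
`values` objects defined above, and that the `ValuationResult` exposes
`values` and `indices` arrays (as in pyDVL 0.9).

```python
import numpy as np

# Hedged sketch, assuming `u`, `data` and `values` from the example above.
order = np.argsort(values.values)    # ascending by estimated value
lowest = values.indices[order[:2]]   # the two least valuable training points
remaining = np.setdiff1d(data.indices, lowest)

print(f"Utility on all training points: {u(frozenset(data.indices)):.3f}")
print(f"Utility without the low-value points: {u(frozenset(remaining)):.3f}")
```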

# Contributing

2 changes: 1 addition & 1 deletion requirements-notebooks.txt
@@ -3,5 +3,5 @@ distributed==2023.4.0
pillow==10.3.0
torch==2.0.1
torchvision==0.15.2
transformers==4.36.0
transformers==4.38.0
zarr==2.16.1
2 changes: 1 addition & 1 deletion setup.py
@@ -12,7 +12,7 @@
package_data={"pydvl": ["py.typed"]},
packages=find_packages(where="src"),
include_package_data=True,
version="0.9.0",
version="0.9.1",
description="The Python Data Valuation Library",
install_requires=[
line
2 changes: 1 addition & 1 deletion src/pydvl/__init__.py
@@ -7,4 +7,4 @@
The two main modules you will want to look at are [value][pydvl.value] and
[influence][pydvl.influence].
"""
__version__ = "0.9.0"
__version__ = "0.9.1"