Merge branch 'release/v0.9.1'
schroedk committed Apr 22, 2024
2 parents 786458c + f5bc6c9 commit 123d01f
Showing 11 changed files with 169 additions and 180 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 0.9.0
current_version = 0.9.1
commit = False
tag = False
allow_dirty = False
10 changes: 9 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,14 @@
# Changelog

## 0.9.0 🆕 New methods, better docs and bugfixes 📚🐞
## Unreleased

### Fixed

- `FutureWarning` for `ParallelConfig` was raised constantly, even without
instantiating the object
[PR #562](https://github.com/aai-institute/pyDVL/pull/562)

## 0.9.0 - 🆕 New methods, better docs and bugfixes 📚🐞

### Added

6 changes: 3 additions & 3 deletions CITATION.cff
@@ -27,6 +27,6 @@ keywords:
- Banzhaf index
license: LGPL-3.0
commit: 0e929ae121820b0014bf245da1b21032186768cb
version: v0.7.0
doi: 10.5281/zenodo.8311583
date-released: '2023-09-02'
version: v0.9.0
doi: 10.5281/zenodo.10966754
date-released: '2024-04-12'
267 changes: 109 additions & 158 deletions README.md
@@ -16,10 +16,8 @@
<a href="https://zenodo.org/badge/latestdoi/354117916"><img src="https://zenodo.org/badge/354117916.svg" alt="DOI"></a>
</p>

**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.

Refer to the [Methods](https://pydvl.org/devel/getting-started/methods/)
page of our documentation for a list of all implemented methods.
**pyDVL** collects algorithms for **Data Valuation** and **Influence Function**
computation. Here is the list of [all methods implemented](https://pydvl.org/devel/getting-started/methods/).

**Data Valuation** for machine learning is the task of assigning a scalar
to each element of a training set which reflects its contribution to the final
@@ -29,7 +27,7 @@ pyDVL focuses on model-dependent methods.

<div align="center" style="text-align:center;">
<img
width="70%"
width="60%"
align="center"
style="display: block; margin-left: auto; margin-right: auto;"
src="https://pydvl.org/devel/value/img/mclc-best-removal-10k-natural.svg"
@@ -48,7 +46,7 @@ of training samples over individual test points.

<div align="center" style="text-align:center;">
<img
width="70%"
width="60%"
align="center"
style="display: block; margin-left: auto; margin-right: auto;"
src="https://pydvl.org/devel/examples/img/influence_functions_example.png"
@@ -82,180 +80,133 @@ $ pip install pyDVL[influence]
```

For more instructions and information refer to [Installing pyDVL
](https://pydvl.org/stable/getting-started/#installation) in the
documentation.
](https://pydvl.org/stable/getting-started/#installation) in the documentation.

# Usage

In the following subsections, we will showcase the usage of pyDVL
for Data Valuation and Influence Functions using simple examples.

For more instructions and information refer to [Getting
Started](https://pydvl.org/stable/getting-started/first-steps/) in
the documentation.
We provide several examples for data valuation
(e.g. [Shapley Data Valuation](https://pydvl.org/stable/examples/shapley_basic_spotify/))
and for influence functions
(e.g. [Influence Functions for Neural Networks](https://pydvl.org/stable/examples/influence_imagenet/))
with details on the algorithms and their applications.
Please read [Getting
Started](https://pydvl.org/stable/getting-started/first-steps/) in the
documentation for more instructions. We provide several examples for data
valuation and for influence functions in our [Example
Gallery](https://pydvl.org/stable/examples/).

## Influence Functions

For influence computation, follow these steps:

1. Import the necessary packages (the exact packages depend on your specific use case).

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from pydvl.influence.torch import DirectInfluence
from pydvl.influence.torch.util import NestedTorchCatAggregator, TorchNumpyConverter
from pydvl.influence import SequentialInfluenceCalculator
```

1. Import the necessary packages (the exact ones depend on your specific use case).
2. Create PyTorch data loaders for your train and test splits.

```python
input_dim = (5, 5, 5)
output_dim = 3
train_x = torch.rand((10, *input_dim))
train_y = torch.rand((10, output_dim))
test_x = torch.rand((5, *input_dim))
test_y = torch.rand((5, output_dim))

train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
```

3. Instantiate your neural network model.

```python
nn_architecture = nn.Sequential(
nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
nn.Flatten(),
nn.Linear(27, 3),
3. Instantiate your neural network model and define your loss function.
4. Instantiate an `InfluenceFunctionModel` and fit it to the training data.
5. For small input data, you can call the `influences()` method on the fitted
instance. The result is a tensor of shape `(training samples, test samples)`
that contains at index `(i, j)` the influence of training sample `i` on
test sample `j`.
6. For larger datasets, wrap the model into a "calculator" and call methods on
it. This splits the computation into smaller chunks and allows for lazy
evaluation and out-of-core computation.

The larger the absolute value of a training sample's influence on a test
sample, the more strongly it affects that test sample for the chosen model
and data loaders. The sign of the influence determines whether it is
useful (positive) or harmful (negative).

> **Note** pyDVL currently only supports PyTorch for Influence Functions. We
> plan to add support for Jax next.
```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from pydvl.influence import SequentialInfluenceCalculator
from pydvl.influence.torch import DirectInfluence
from pydvl.influence.torch.util import (
NestedTorchCatAggregator,
TorchNumpyConverter,
)
```

4. Define your loss:

```python
loss = nn.MSELoss()
```

5. Instantiate an `InfluenceFunctionModel` and fit it to the training data

```python
infl_model = DirectInfluence(nn_architecture, loss, hessian_regularization=0.01)
infl_model = infl_model.fit(train_data_loader)
```
input_dim = (5, 5, 5)
output_dim = 3
train_x, train_y = torch.rand((10, *input_dim)), torch.rand((10, output_dim))
test_x, test_y = torch.rand((5, *input_dim)), torch.rand((5, output_dim))
train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
model = nn.Sequential(
nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
nn.Flatten(),
nn.Linear(27, 3),
)
loss = nn.MSELoss()

6. For small input data, call the influence method on the fitted instance.

```python
influences = infl_model.influences(test_x, test_y, train_x, train_y)
```
The result is a tensor of shape `(training samples x test samples)`
that contains at index `(i, j)` the influence of training sample `i` on
test sample `j`.
infl_model = DirectInfluence(model, loss, hessian_regularization=0.01)
infl_model = infl_model.fit(train_data_loader)

7. For larger data, wrap the model into a
calculator and call methods on the calculator.
```python
infl_calc = SequentialInfluenceCalculator(infl_model)

# Lazy object providing arrays batch-wise in a sequential manner
lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)
# For small datasets, instantiate the full influence matrix:
influences = infl_model.influences(test_x, test_y, train_x, train_y)

# Trigger computation and pull results to memory
influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())
# For larger datasets, use the Influence calculators:
infl_calc = SequentialInfluenceCalculator(infl_model)

# Trigger computation and write results batch-wise to disk
lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
```

# Lazy object providing arrays batch-wise in a sequential manner
lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)

The higher the absolute value of the influence of a training sample
on a test sample, the more influential it is for the chosen test sample, model
and data loaders. The sign of the influence determines whether it is
useful (positive) or harmful (negative).
# Trigger computation and pull results to memory
influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())

> **Note** pyDVL currently only supports PyTorch for Influence Functions.
> We are planning to add support for Jax and perhaps TensorFlow or even Keras.
# Trigger computation and write results batch-wise to disk
lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
```
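
As a quick check of the output, one can rank training points by their average
influence over the test set. The following is a minimal sketch, not part of
pyDVL's documented API: it only assumes the in-memory `influences` tensor
computed above, with shape `(training samples, test samples)` as described.

```python
import torch

# Minimal sketch, assuming the `influences` tensor from the example above,
# of shape (training samples, test samples).
mean_influence = influences.mean(dim=1)  # average influence of each training sample
ranking = torch.argsort(mean_influence)  # ascending: most harmful (negative) first

print("Potentially harmful training samples:", ranking[:3].tolist())
print("Most helpful training samples:", ranking[-3:].tolist())
```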

## Data Valuation

The steps required to compute data values for your samples are:

1. Import the necessary packages (the exact packages depend on your specific use case).

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from pydvl.utils import Dataset, Scorer, Utility
from pydvl.value import (
compute_shapley_values,
ShapleyMode,
MaxUpdates,
)
```

1. Import the necessary packages (the exact ones will depend on your specific
use case).
2. Create a `Dataset` object with your train and test splits.

```python
data = Dataset.from_sklearn(
load_breast_cancer(),
train_size=10,
stratify_by_target=True,
random_state=16,
)
```

3. Create an instance of a `SupervisedModel` (basically any sklearn compatible
predictor).

```python
model = LogisticRegression()
```

4. Create a `Utility` object to wrap the Dataset, the model and a scoring
function.

```python
u = Utility(
model,
data,
Scorer("accuracy", default=0.0)
)
```

5. Use one of the methods defined in the library to compute the values.
In our example, we will use *Permutation Monte Carlo Shapley*,
an approximate method for computing Data Shapley values.

```python
values = compute_shapley_values(
u,
mode=ShapleyMode.PermutationMontecarlo,
done=MaxUpdates(100),
seed=16,
progress=True
)
```
The result is a variable of type `ValuationResult` that contains
the indices and their values as well as other attributes.

The higher the value for an index, the more important it is for the chosen
model, dataset and scorer.

6. (Optional) Convert the valuation result to a dataframe and analyze and visualize the values.

```python
df = values.to_dataframe(column="data_value")
```
predictor), and wrap it in a `Utility` object together with the data and a
scoring function.
4. Use one of the methods defined in the library to compute the values. In the
example below, we will use *Permutation Monte Carlo Shapley*, an approximate
method for computing Data Shapley values. The result is a variable of type
`ValuationResult` that contains the indices and their values as well as other
attributes.
5. Convert the valuation result to a dataframe, and analyze and visualize the
values.

The higher the value for an index, the more important it is for the chosen
model, dataset and scorer. Conversely, low-value points could be mislabelled
or out-of-distribution, and dropping them can improve the model's performance.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

from pydvl.utils import Dataset, Scorer, Utility
from pydvl.value import (MaxUpdates, RelativeTruncation,
permutation_montecarlo_shapley)

data = Dataset.from_sklearn(
load_breast_cancer(),
train_size=10,
stratify_by_target=True,
random_state=16,
)
model = LogisticRegression()
u = Utility(
model,
data,
Scorer("accuracy", default=0.0)
)
values = permutation_montecarlo_shapley(
u,
truncation=RelativeTruncation(u, 0.05),
done=MaxUpdates(1000),
seed=16,
progress=True
)
df = values.to_dataframe(column="data_value")
```
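
Building on the interpretation above, a natural follow-up is to drop the
lowest-valued points and check whether the utility improves. This is a hedged
sketch, not part of the README's example: it assumes the `u`, `data` and
`values` objects defined above, and that the `ValuationResult` exposes
`values` and `indices` arrays (as in pyDVL 0.9).

```python
import numpy as np

# Hedged sketch, assuming `u`, `data` and `values` from the example above.
order = np.argsort(values.values)    # ascending by estimated value
lowest = values.indices[order[:2]]   # the two least valuable training points
remaining = np.setdiff1d(data.indices, lowest)

print(f"Utility on all training points: {u(frozenset(data.indices)):.3f}")
print(f"Utility without the low-value points: {u(frozenset(remaining)):.3f}")
```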

# Contributing

2 changes: 1 addition & 1 deletion requirements-notebooks.txt
@@ -3,5 +3,5 @@ distributed==2023.4.0
pillow==10.3.0
torch==2.0.1
torchvision==0.15.2
transformers==4.36.0
transformers==4.38.0
zarr==2.16.1
2 changes: 1 addition & 1 deletion setup.py
@@ -12,7 +12,7 @@
package_data={"pydvl": ["py.typed"]},
packages=find_packages(where="src"),
include_package_data=True,
version="0.9.0",
version="0.9.1",
description="The Python Data Valuation Library",
install_requires=[
line
2 changes: 1 addition & 1 deletion src/pydvl/__init__.py
@@ -7,4 +7,4 @@
The two main modules you will want to look at are [value][pydvl.value] and
[influence][pydvl.influence].
"""
__version__ = "0.9.0"
__version__ = "0.9.1"