diff --git a/README.md b/README.md
index 2f7296018..97cf5f22d 100644
--- a/README.md
+++ b/README.md
@@ -16,10 +16,8 @@
-**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.
-
-Refer to the [Methods](https://pydvl.org/devel/getting-started/methods/)
-page of our documentation for a list of all implemented methods.
+**pyDVL** collects algorithms for **Data Valuation** and **Influence Function**
+computation. See the documentation for a list of [all implemented methods](https://pydvl.org/devel/getting-started/methods/).
**Data Valuation** for machine learning is the task of assigning a scalar
to each element of a training set which reflects its contribution to the final
@@ -29,7 +27,7 @@ pyDVL focuses on model-dependent methods.
**Note** pyDVL currently only supports PyTorch for Influence Functions. We plan
+> to add support for JAX next.
+
+```python
+import torch
+from torch import nn
+from torch.utils.data import DataLoader, TensorDataset
+
+from pydvl.influence import SequentialInfluenceCalculator
+from pydvl.influence.torch import DirectInfluence
+from pydvl.influence.torch.util import (
+    NestedTorchCatAggregator,
+    TorchNumpyConverter,
)
- ```
-
-4. Define your loss:
-
- ```python
- loss = nn.MSELoss()
- ```
-
-5. Instantiate an `InfluenceFunctionModel` and fit it to the training data
- ```python
- infl_model = DirectInfluence(nn_architecture, loss, hessian_regularization=0.01)
- infl_model = infl_model.fit(train_data_loader)
- ```
+input_dim = (5, 5, 5)
+output_dim = 3
+train_x, train_y = torch.rand((10, *input_dim)), torch.rand((10, output_dim))
+test_x, test_y = torch.rand((5, *input_dim)), torch.rand((5, output_dim))
+train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
+test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
+model = nn.Sequential(
+    nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
+    nn.Flatten(),
+    nn.Linear(27, 3),
+)
+loss = nn.MSELoss()
-6. For small input data call influence method on the fitted instance.
-
- ```python
- influences = infl_model.influences(test_x, test_y, train_x, train_y)
- ```
- The result is a tensor of shape `(training samples x test samples)`
- that contains at index `(i, j`) the influence of training sample `i` on
- test sample `j`.
+infl_model = DirectInfluence(model, loss, hessian_regularization=0.01)
+infl_model = infl_model.fit(train_data_loader)
-7. For larger data, wrap the model into a
- calculator and call methods on the calculator.
- ```python
- infl_calc = SequentialInfluenceCalculator(infl_model)
-
- # Lazy object providing arrays batch-wise in a sequential manner
- lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)
+# For small datasets, instantiate the full influence matrix:
+influences = infl_model.influences(test_x, test_y, train_x, train_y)
- # Trigger computation and pull results to memory
- influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())
+# For larger datasets, use the influence calculators:
+infl_calc = SequentialInfluenceCalculator(infl_model)
- # Trigger computation and write results batch-wise to disk
- lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
- ```
-
+# Lazy object providing arrays batch-wise in a sequential manner
+lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)
- The higher the absolute value of the influence of a training sample
- on a test sample, the more influential it is for the chosen test sample, model
- and data loaders. The sign of the influence determines whether it is
- useful (positive) or harmful (negative).
+# Trigger computation and pull results to memory
+influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())
-> **Note** pyDVL currently only support PyTorch for Influence Functions.
-> We are planning to add support for Jax and perhaps TensorFlow or even Keras.
+# Trigger computation and write results batch-wise to disk
+lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
+```
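
As a rough illustration of how an influence matrix might be read, the sketch
below uses a small hand-written matrix in plain Python rather than pyDVL
output. It assumes that entry `(i, j)` holds the influence of training sample
`i` on test sample `j`, and that positive entries are helpful and negative ones
harmful. The matrix values and the helper `most_influential` are hypothetical,
and the orientation of the real tensor should be checked against the pyDVL
documentation.

```python
# Hypothetical influence matrix: rows are training samples, columns are test
# samples. The numbers are made up for illustration, not produced by pyDVL.
influences = [
    [0.8, -0.1],
    [-0.5, 0.3],
    [0.1, -0.9],
]


def most_influential(matrix, test_index):
    """Return (most helpful, most harmful) training indices for one test sample."""
    column = [row[test_index] for row in matrix]
    helpful = max(range(len(column)), key=column.__getitem__)
    harmful = min(range(len(column)), key=column.__getitem__)
    return helpful, harmful


for j in range(len(influences[0])):
    helpful, harmful = most_influential(influences, j)
    print(f"test sample {j}: most helpful train={helpful}, most harmful train={harmful}")
```

With a real result, the same column-wise reading would apply to the tensor
returned by the library, provided the orientation assumption holds.
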
## Data Valuation
The steps required to compute data values for your samples are:
-1. Import the necessary packages (The exact packages depend on your specific use case).
-
- ```python
- import matplotlib.pyplot as plt
- from sklearn.datasets import load_breast_cancer
- from sklearn.linear_model import LogisticRegression
- from pydvl.utils import Dataset, Scorer, Utility
- from pydvl.value import (
- compute_shapley_values,
- ShapleyMode,
- MaxUpdates,
- )
- ```
-
+1. Import the necessary packages (the exact ones depend on your specific
+   use case).
2. Create a `Dataset` object with your train and test splits.
-
- ```python
- data = Dataset.from_sklearn(
- load_breast_cancer(),
- train_size=10,
- stratify_by_target=True,
- random_state=16,
- )
- ```
-
3. Create an instance of a `SupervisedModel` (basically any sklearn compatible
- predictor).
-
- ```python
- model = LogisticRegression()
- ```
-
-4. Create a `Utility` object to wrap the Dataset, the model and a scoring
- function.
-
- ```python
- u = Utility(
- model,
- data,
- Scorer("accuracy", default=0.0)
- )
- ```
-
-5. Use one of the methods defined in the library to compute the values.
- In our example, we will use *Permutation Montecarlo Shapley*,
- an approximate method for computing Data Shapley values.
-
- ```python
- values = compute_shapley_values(
- u,
- mode=ShapleyMode.PermutationMontecarlo,
- done=MaxUpdates(100),
- seed=16,
- progress=True
- )
- ```
- The result is a variable of type `ValuationResult` that contains
- the indices and their values as well as other attributes.
-
- The higher the value for an index, the more important it is for the chosen
- model, dataset and scorer.
-
-6. (Optional) Convert the valuation result to a dataframe and analyze and visualize the values.
-
- ```python
- df = values.to_dataframe(column="data_value")
- ```
+   predictor), and wrap it in a `Utility` object together with the data and a
+   scoring function.
+4. Use one of the methods defined in the library to compute the values. The
+   example below uses *Permutation Monte Carlo Shapley*, an approximate
+   method for computing Data Shapley values. The result is a
+   `ValuationResult` object that contains the indices and their values, as
+   well as other attributes.
+5. Convert the valuation result to a dataframe, then analyze and visualize the
+   values.
+
+The higher the value for an index, the more important it is for the chosen
+model, dataset and scorer. Conversely, low-value points may be mislabelled or
+out-of-distribution, and dropping them can improve the model's performance.
+
+```python
+from sklearn.datasets import load_breast_cancer
+from sklearn.linear_model import LogisticRegression
+
+from pydvl.utils import Dataset, Scorer, Utility
+from pydvl.value import (
+    MaxUpdates,
+    RelativeTruncation,
+    permutation_montecarlo_shapley,
+)
+
+data = Dataset.from_sklearn(
+    load_breast_cancer(),
+    train_size=10,
+    stratify_by_target=True,
+    random_state=16,
+)
+model = LogisticRegression()
+# Wrap the model, the data and the scoring function in a Utility
+u = Utility(
+    model,
+    data,
+    Scorer("accuracy", default=0.0)
+)
+# Approximate Data Shapley values by sampling permutations, with truncation
+values = permutation_montecarlo_shapley(
+    u,
+    truncation=RelativeTruncation(u, 0.05),
+    done=MaxUpdates(1000),
+    seed=16,
+    progress=True
+)
+# Convert the result to a dataframe for analysis and plotting
+df = values.to_dataframe(column="data_value")
+```
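
To give an intuition for what permutation sampling does, here is a minimal
pure-Python sketch of the idea. It is not pyDVL's implementation: it omits
truncation and stopping criteria, and the names `permutation_shapley`,
`toy_utility` and `weights` are hypothetical. Each point's value is estimated
as its average marginal contribution to the utility over random orderings of
the data. The toy utility is additive, so the exact Shapley value of each point
is simply its weight, which makes the estimate easy to check.

```python
import random


def permutation_shapley(indices, utility, n_permutations, seed=0):
    """Average each point's marginal contribution over random permutations."""
    rng = random.Random(seed)
    totals = {i: 0.0 for i in indices}
    for _ in range(n_permutations):
        permutation = list(indices)
        rng.shuffle(permutation)
        subset = []
        previous = utility(subset)  # utility of the empty set
        for i in permutation:
            subset.append(i)
            current = utility(subset)
            totals[i] += current - previous  # marginal contribution of i
            previous = current
    return {i: total / n_permutations for i, total in totals.items()}


# Hypothetical additive utility: each point contributes a fixed amount, so the
# exact Shapley value of a point equals its weight.
weights = {0: 0.5, 1: 0.25, 2: -0.25}


def toy_utility(subset):
    return sum(weights[i] for i in subset)


values = permutation_shapley(list(weights), toy_utility, n_permutations=50)
lowest = min(values, key=values.get)
print(values, "lowest-value point:", lowest)
```

In this toy setting the point with negative weight receives the lowest value,
which mirrors how low-value points can flag candidates for inspection or
removal.
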
# Contributing