diff --git a/README.md b/README.md
index 2f7296018..97cf5f22d 100644
--- a/README.md
+++ b/README.md
@@ -16,10 +16,8 @@ DOI

-**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.
-
-Refer to the [Methods](https://pydvl.org/devel/getting-started/methods/)
-page of our documentation for a list of all implemented methods.
+**pyDVL** collects algorithms for **Data Valuation** and **Influence Function**
+computation. Here is the list of [all methods implemented](https://pydvl.org/devel/getting-started/methods/).
 **Data Valuation** for machine learning is the task of assigning a scalar to each element of a training set which reflects its contribution to the final
@@ -29,7 +27,7 @@ pyDVL focuses on model-dependent methods.
**Note** pyDVL currently only support PyTorch for Influence Functions. We plan
+> to add support for Jax next.
+
+```python
+import torch
+from torch import nn
+from torch.utils.data import DataLoader, TensorDataset
+
+from pydvl.influence import SequentialInfluenceCalculator
+from pydvl.influence.torch import DirectInfluence
+from pydvl.influence.torch.util import (
+ NestedTorchCatAggregator,
+ TorchNumpyConverter,
 )
- ```
-
-4. Define your loss:
-
- ```python
- loss = nn.MSELoss()
- ```
-
-5. Instantiate an `InfluenceFunctionModel` and fit it to the training data
- ```python
- infl_model = DirectInfluence(nn_architecture, loss, hessian_regularization=0.01)
- infl_model = infl_model.fit(train_data_loader)
- ```
+input_dim = (5, 5, 5)
+output_dim = 3
+train_x, train_y = torch.rand((10, *input_dim)), torch.rand((10, output_dim))
+test_x, test_y = torch.rand((5, *input_dim)), torch.rand((5, output_dim))
+train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
+test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
+model = nn.Sequential(
+ nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
+ nn.Flatten(),
+ nn.Linear(27, 3),
+ )
+loss = nn.MSELoss()
-6. For small input data call influence method on the fitted instance.
-
- ```python
- influences = infl_model.influences(test_x, test_y, train_x, train_y)
- ```
- The result is a tensor of shape `(training samples x test samples)`
- that contains at index `(i, j`) the influence of training sample `i` on
- test sample `j`.
+infl_model = DirectInfluence(model, loss, hessian_regularization=0.01)
+infl_model = infl_model.fit(train_data_loader)
-7. For larger data, wrap the model into a
- calculator and call methods on the calculator.
- ```python
- infl_calc = SequentialInfluenceCalculator(infl_model)
-
- # Lazy object providing arrays batch-wise in a sequential manner
- lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)
+# For small datasets, instantiate the full influence matrix:
+influences = infl_model.influences(test_x, test_y, train_x, train_y)
- # Trigger computation and pull results to memory
- influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())
+# For larger datasets, use the Influence calculators:
+infl_calc = SequentialInfluenceCalculator(infl_model)
- # Trigger computation and write results batch-wise to disk
- lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
- ```
-
+# Lazy object providing arrays batch-wise in a sequential manner
+lazy_influences = infl_calc.influences(test_data_loader, train_data_loader)
- The higher the absolute value of the influence of a training sample
- on a test sample, the more influential it is for the chosen test sample, model
- and data loaders. The sign of the influence determines whether it is
- useful (positive) or harmful (negative).
+# Trigger computation and pull results to memory
+influences = lazy_influences.compute(aggregator=NestedTorchCatAggregator())
-> **Note** pyDVL currently only support PyTorch for Influence Functions.
-> We are planning to add support for Jax and perhaps TensorFlow or even Keras.
+# Trigger computation and write results batch-wise to disk
+lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
+```
 ## Data Valuation
 The steps required to compute data values for your samples are:
-1. Import the necessary packages (The exact packages depend on your specific use case).
-
- ```python
- import matplotlib.pyplot as plt
- from sklearn.datasets import load_breast_cancer
- from sklearn.linear_model import LogisticRegression
- from pydvl.utils import Dataset, Scorer, Utility
- from pydvl.value import (
- compute_shapley_values,
- ShapleyMode,
- MaxUpdates,
- )
- ```
-
+1. Import the necessary packages (the exact ones will depend on your specific
+ use case).
 2. Create a `Dataset` object with your train and test splits.
-
- ```python
- data = Dataset.from_sklearn(
- load_breast_cancer(),
- train_size=10,
- stratify_by_target=True,
- random_state=16,
- )
- ```
-
 3. Create an instance of a `SupervisedModel` (basically any sklearn compatible
- predictor).
-
- ```python
- model = LogisticRegression()
- ```
-
-4. Create a `Utility` object to wrap the Dataset, the model and a scoring
- function.
-
- ```python
- u = Utility(
- model,
- data,
- Scorer("accuracy", default=0.0)
- )
- ```
-
-5. Use one of the methods defined in the library to compute the values.
- In our example, we will use *Permutation Montecarlo Shapley*,
- an approximate method for computing Data Shapley values.
-
- ```python
- values = compute_shapley_values(
- u,
- mode=ShapleyMode.PermutationMontecarlo,
- done=MaxUpdates(100),
- seed=16,
- progress=True
- )
- ```
- The result is a variable of type `ValuationResult` that contains
- the indices and their values as well as other attributes.
-
- The higher the value for an index, the more important it is for the chosen
- model, dataset and scorer.
-
-6. (Optional) Convert the valuation result to a dataframe and analyze and visualize the values.
-
- ```python
- df = values.to_dataframe(column="data_value")
- ```
+ predictor), and wrap it in a `Utility` object together with the data and a
+ scoring function.
+4. Use one of the methods defined in the library to compute the values. In the
+ example below, we will use *Permutation Montecarlo Shapley*, an approximate
+ method for computing Data Shapley values. The result is a variable of type
+ `ValuationResult` that contains the indices and their values as well as other
+ attributes.
+5. Convert the valuation result to a dataframe, and analyze and visualize the
+ values.
+
+The higher the value for an index, the more important it is for the chosen
+model, dataset and scorer. Reciprocally, low-value points could be mislabelled,
+or out-of-distribution, and dropping them can improve the model's performance.
+
+```python
+from sklearn.datasets import load_breast_cancer
+from sklearn.linear_model import LogisticRegression
+
+from pydvl.utils import Dataset, Scorer, Utility
+from pydvl.value import (MaxUpdates, RelativeTruncation,
+ permutation_montecarlo_shapley)
+
+data = Dataset.from_sklearn(
+ load_breast_cancer(),
+ train_size=10,
+ stratify_by_target=True,
+ random_state=16,
+ )
+model = LogisticRegression()
+u = Utility(
+ model,
+ data,
+ Scorer("accuracy", default=0.0)
+ )
+values = permutation_montecarlo_shapley(
+ u,
+ truncation=RelativeTruncation(u, 0.05),
+ done=MaxUpdates(1000),
+ seed=16,
+ progress=True
+ )
+df = values.to_dataframe(column="data_value")
+```
 # Contributing