
A bit of cleanup #432

Merged Sep 18, 2023 (13 commits)
13 changes: 0 additions & 13 deletions .readthedocs.yaml

This file was deleted.

11 changes: 6 additions & 5 deletions CONTRIBUTING.md
@@ -35,15 +35,15 @@ library. E.g. with venv:
```shell script
python -m venv ./venv
. venv/bin/activate # `venv\Scripts\activate` in windows
-pip install -r requirements-dev.txt
+pip install -r requirements-dev.txt -r requirements-docs.txt
```

With conda:

```shell script
conda create -n pydvl python=3.8
conda activate pydvl
-pip install -r requirements-dev.txt
+pip install -r requirements-dev.txt -r requirements-docs.txt
```

A very convenient way of working with your library during development is to
@@ -54,11 +54,12 @@ pip install -e .
```

In order to build the documentation locally (which is done as part of the tox
-suite) you will need [pandoc](https://pandoc.org/). Under Ubuntu it can be
-installed with:
+suite) [pandoc](https://pandoc.org/) is required. Except for OSX, it should be installed
+automatically as a dependency with `requirements-docs.txt`. Under OSX you can
+install pandoc (you'll need at least version 2.11) with:

```shell script
-sudo apt-get update -yq && apt-get install -yq pandoc
+brew install pandoc
```

Remember to mark all autogenerated directories as excluded in your IDE. In
2 changes: 1 addition & 1 deletion README.md
@@ -111,7 +111,7 @@ documentation.

For influence computation, follow these steps:

-1. Wrap your model and loss in a `TorchTwiceDifferential` object
+1. Wrap your model and loss in a `TorchTwiceDifferentiable` object
2. Compute influence factors by providing training data and inversion method

Using the conjugate gradient algorithm, this would look like:
2 changes: 0 additions & 2 deletions apt-cache/.gitignore

This file was deleted.

1 change: 0 additions & 1 deletion build_scripts/copy_changelog.py
@@ -1,6 +1,5 @@
import logging
import os
import shutil
from pathlib import Path

import mkdocs.plugins
1 change: 0 additions & 1 deletion build_scripts/copy_notebooks.py
@@ -1,6 +1,5 @@
import logging
import os
import shutil
from pathlib import Path

import mkdocs.plugins
24 changes: 13 additions & 11 deletions docs/value/index.md
@@ -15,21 +15,23 @@ alias:
training set which reflects its contribution to the final performance of some
model trained on it. Some methods attempt to be model-agnostic, but in most
cases the model is an integral part of the method. In these cases, this number
-not an intrinsic property of the element of interest, but typically a function
-of three factors:
+is not an intrinsic property of the element of interest, but typically a
+function of three factors:

-1. The dataset $D$, or more generally, the distribution it was sampled
-from (with this we mean that *value* would ideally be the (expected)
-contribution of a data point to any random set $D$ sampled from the same
-distribution).
+1. The dataset $D$, or more generally, the distribution it was sampled from: In
+some cases one only cares about values wrt. a given data set, in others
+value would ideally be the (expected) contribution of a data point to any
+random set $D$ sampled from the same distribution. pyDVL implements methods
+of the first kind.

-2. The algorithm $\mathcal{A}$ mapping the data $D$ to some estimator $f$
-in a model class $\mathcal{F}$. E.g. MSE minimization to find the parameters
-of a linear model.
+2. The algorithm $\mathcal{A}$ mapping the data $D$ to some estimator $f$ in a
+model class $\mathcal{F}$. E.g. MSE minimization to find the parameters of a
+linear model.

3. The performance metric of interest $u$ for the problem. When value depends on
-a model, it must be measured in some way which uses it. E.g. the $R^2$ score or
-the negative MSE over a test set.
+a model, it must be measured in some way which uses it. E.g. the $R^2$ score
+or the negative MSE over a test set. This metric will be computed over a
+held-out valuation set.
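
Purely as notation for the three factors above, one can summarize the dependence of the value of a training point $z \in D$ as

$$v(z) = v(z; D, \mathcal{A}, u),$$

making explicit that value is only defined relative to a dataset, a learning algorithm and a performance metric.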

pyDVL collects algorithms for the computation of data values in this sense,
mostly those derived from cooperative game theory. The methods can be found in
2 changes: 0 additions & 2 deletions notebooks/support/torch.py
@@ -20,8 +20,6 @@

logger = logging.getLogger(__name__)

-from numpy.typing import NDArray

MODEL_PATH = Path().resolve().parent / "data" / "models"


3 changes: 2 additions & 1 deletion requirements-docs.txt
@@ -15,5 +15,6 @@ mkdocs-material
mkdocs-section-index
mkdocs-macros-plugin
neoteroi-mkdocs # Needed for card grid on home page
-pypandoc
+pypandoc; sys_platform == 'darwin'
+pypandoc_binary; sys_platform != 'darwin'
GitPython
2 changes: 1 addition & 1 deletion src/pydvl/utils/caching.py
@@ -181,7 +181,7 @@ def memcached(
not cached.
allow_repeated_evaluations: If `True`, repeated calls to a function
with the same arguments will be allowed and outputs averaged until the
-running standard deviation of the mean stabilises below
+running standard deviation of the mean stabilizes below
`rtol_stderr * mean`.
rtol_stderr: relative tolerance for repeated evaluations. More precisely,
[memcached()][pydvl.utils.caching.memcached] will stop evaluating the function once the
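
The stopping rule that `allow_repeated_evaluations` and `rtol_stderr` describe can be sketched in plain Python. This is a simplified stand-in for illustration only, not pyDVL's actual implementation; all names here are made up:

```python
import statistics

def average_until_stable(f, rtol_stderr=0.1, min_evals=3, max_evals=1000):
    """Call f() repeatedly, averaging the outputs, until the standard error
    of the running mean falls below rtol_stderr * |mean|."""
    values = []
    while len(values) < max_evals:
        values.append(f())
        if len(values) >= min_evals:
            mean = statistics.fmean(values)
            # Standard error of the mean: sample stdev / sqrt(n)
            stderr = statistics.stdev(values) / len(values) ** 0.5
            if stderr < rtol_stderr * abs(mean):
                return mean
    return statistics.fmean(values)
```

For a deterministic function the standard error is zero, so the loop stops as soon as `min_evals` calls have been made.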
11 changes: 4 additions & 7 deletions src/pydvl/utils/numeric.py
@@ -71,17 +71,16 @@ def num_samples_permutation_hoeffding(eps: float, delta: float, u_range: float)


def random_subset(
-s: NDArray[T],
-q: float = 0.5,
-seed: Optional[Seed] = None,
+s: NDArray[T], q: float = 0.5, seed: Optional[Seed] = None
) -> NDArray[T]:
"""Returns one subset at random from ``s``.

Args:
s: set to sample from
q: Sampling probability for elements. The default 0.5 yields a
uniform distribution over the power set of s.
-seed: Either an instance of a numpy random number generator or a seed for it.
+seed: Either an instance of a numpy random number generator or a seed
+for it.

Returns:
The subset
@@ -135,9 +134,7 @@ def random_powerset(


def random_subset_of_size(
-s: NDArray[T],
-size: int,
-seed: Optional[Seed] = None,
+s: NDArray[T], size: int, seed: Optional[Seed] = None
) -> NDArray[T]:
"""Samples a random subset of given size uniformly from the powerset
of `s`.
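
The Bernoulli sampling that `random_subset` documents can be sketched in a few lines. This is a simplified re-implementation for illustration, not the library code:

```python
import numpy as np

def random_subset_sketch(s, q=0.5, seed=None):
    """Include each element of s independently with probability q.
    With q=0.5 every subset of s is equally likely, i.e. the result is
    uniform over the power set of s."""
    rng = np.random.default_rng(seed)
    return s[rng.uniform(size=len(s)) < q]
```

Passing the same seed twice yields the same subset, which is what the `seed` parameter above enables.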
4 changes: 4 additions & 0 deletions src/pydvl/utils/score.py
@@ -2,6 +2,10 @@
This module provides a [Scorer][pydvl.utils.score.Scorer] class that wraps
scoring functions with additional information.

Scorers are the fundamental building block of many data valuation methods. They
are typically used by the [Utility][pydvl.utils.utility.Utility] class to
evaluate the quality of a model when trained on subsets of the training data.

Scorers can be constructed in the same way as in scikit-learn: either from
known strings or from a callable. Greater values must be better. If they are not,
a negated version can be used, see scikit-learn's
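
The idea of a scorer wrapper can be sketched independently of the library. The names and fields below are illustrative, not pyDVL's actual `Scorer` API:

```python
class SimpleScorer:
    """Wrap a callable scoring function together with a default value
    (used e.g. when the model could not be fit) and a numerical range."""

    def __init__(self, scoring, default=0.0, range=(-float("inf"), float("inf"))):
        self._scoring = scoring
        self.default = default
        self.range = range

    def __call__(self, model, x, y) -> float:
        return float(self._scoring(model, x, y))

# Greater must be better, so wrap the *negative* MSE rather than the MSE.
def neg_mse(model, x, y):
    preds = model.predict(x)
    return -sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)
```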
24 changes: 24 additions & 0 deletions src/pydvl/utils/types.py
@@ -46,12 +46,35 @@ class SupervisedModel(Protocol):
"""

def fit(self, x: NDArray, y: NDArray):
"""Fit the model to the data

Args:
x: Independent variables
y: Dependent variable
"""
pass

def predict(self, x: NDArray) -> NDArray:
"""Compute predictions for the input

Args:
x: Independent variables for which to compute predictions

Returns:
Predictions for the input
"""
pass

def score(self, x: NDArray, y: NDArray) -> float:
"""Compute the score of the model given test data

Args:
x: Independent variables
y: Dependent variable

Returns:
The score of the model on `(x, y)`
"""
pass
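
For illustration, here is a minimal class satisfying the three-method protocol documented above. It is a toy model, not part of pyDVL:

```python
import numpy as np
from numpy.typing import NDArray

class MeanRegressor:
    """Predicts the mean of the training targets, ignoring the inputs."""

    def fit(self, x: NDArray, y: NDArray):
        self._mean = float(np.mean(y))

    def predict(self, x: NDArray) -> NDArray:
        return np.full(len(x), self._mean)

    def score(self, x: NDArray, y: NDArray) -> float:
        # Negative mean squared error, so that greater is better.
        return -float(np.mean((self.predict(x) - y) ** 2))
```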


@@ -77,6 +100,7 @@ def __call__(cls, *args, **kwargs):
)

def create(cls, *args: Any, **kwargs: Any):
"""Create an instance of the class"""
return super().__call__(*args, **kwargs)


48 changes: 23 additions & 25 deletions src/pydvl/utils/utility.py
@@ -6,20 +6,21 @@
[Utility][pydvl.utils.utility.Utility] holds information about model,
data and scoring function (the latter being what one usually understands
under *utility* in the general definition of Shapley value).
-It is automatically cached across machines.
+It is automatically cached across machines when the
+[cache is configured][setting-up-the-cache] and it is enabled upon construction.

[DataUtilityLearning][pydvl.utils.utility.DataUtilityLearning] adds support
for learning the scoring function to avoid repeated re-training
of the model to compute the score.

-This module also contains Utility classes for toy games that are used
+This module also contains derived `Utility` classes for toy games that are used
for testing and for demonstration purposes.

## References

[^1]: <a name="wang_improving_2022"></a>Wang, T., Yang, Y. and Jia, R., 2021.
-[Improving cooperative game theory-based data valuation via data utility learning](https://arxiv.org/abs/2107.06336).
-arXiv preprint arXiv:2107.06336.
+[Improving cooperative game theory-based data valuation via data utility
+learning](https://arxiv.org/abs/2107.06336). arXiv preprint arXiv:2107.06336.

"""
import logging
@@ -49,27 +50,24 @@ class Utility:

An instance of `Utility` holds the triple of model, dataset and scoring
function which determines the value of data points. This is used for the
-computation of
-[all game-theoretic values][game-theoretical-methods]
-like [Shapley values][pydvl.value.shapley] and
-[the Least Core][pydvl.value.least_core].
+computation of [all game-theoretic values][game-theoretical-methods] like
+[Shapley values][pydvl.value.shapley] and [the Least
+Core][pydvl.value.least_core].

-The Utility expect the model to fulfill
-the [SupervisedModel][pydvl.utils.types.SupervisedModel] interface i.e.
+The Utility expects the model to fulfill the
+[SupervisedModel][pydvl.utils.types.SupervisedModel] interface, i.e.
to have `fit()`, `predict()`, and `score()` methods.

When calling the utility, the model will be
-[cloned](https://scikit-learn.org/stable/modules/generated/sklearn.base
-.clone.html)
+[cloned](https://scikit-learn.org/stable/modules/generated/sklearn.base.clone.html)
if it is a Sci-Kit Learn model, otherwise a copy is created using
-`deepcopy()` from the builtin [copy](https://docs.python.org/3/
-library/copy.html) module.
+[copy.deepcopy][].
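
The copying behavior described in that paragraph can be sketched as follows. This is an approximation of the logic for illustration, not the exact library code:

```python
import copy

def clone_model(model):
    """Return an unfitted clone for scikit-learn estimators,
    and a deep copy for everything else."""
    try:
        from sklearn.base import BaseEstimator, clone
        if isinstance(model, BaseEstimator):
            return clone(model)  # fresh estimator with the same hyperparameters
    except ImportError:
        pass  # scikit-learn not installed: fall back to deepcopy
    return copy.deepcopy(model)
```

The deep copy guarantees that fitting the copy never mutates the caller's model.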

-Since evaluating the scoring function requires retraining the model
-and that can be time-consuming, this class wraps it and caches
-the results of each execution. Caching is available both locally
-and across nodes, but must always be enabled for your
-project first, see [Setting up the cache][setting-up-the-cache].
+Since evaluating the scoring function requires retraining the model and that
+can be time-consuming, this class wraps it and caches the results of each
+execution. Caching is available both locally and across nodes, but must
+always be enabled for your project first, see [Setting up the
+cache][setting-up-the-cache].

Attributes:
model: The supervised model.
@@ -86,13 +84,13 @@ class Utility:
or [GroupedDataset][pydvl.utils.dataset.GroupedDataset] instance.
scorer: A scoring object. If None, the `score()` method of the model
will be used. See [score][pydvl.utils.score] for ways to create
-and compose scorers, in particular how to set default values and ranges.
-For convenience, a string can be passed, which will be used to construct
-a [Scorer][pydvl.utils.score.Scorer].
+and compose scorers, in particular how to set default values and
+ranges. For convenience, a string can be passed, which will be used
+to construct a [Scorer][pydvl.utils.score.Scorer].
default_score: As a convenience when no `scorer` object is passed
-(where a default value can be provided), this argument also allows to set
-the default score for models that have not been fit, e.g. when too little
-data is passed, or errors arise.
+(where a default value can be provided), this argument also allows
+to set the default score for models that have not been fit, e.g.
+when too little data is passed, or errors arise.
score_range: As with `default_score`, this is a convenience argument for
when no `scorer` argument is provided, to set the numerical range
of the score function. Some Monte Carlo methods can use this to
2 changes: 1 addition & 1 deletion src/pydvl/value/result.py
@@ -471,7 +471,7 @@ def __repr__(self) -> str:
f"values={np.array_str(self.values, precision=4, suppress_small=True)},"
f"indices={np.array_str(self.indices)},"
f"names={np.array_str(self.names)},"
-f"counts={np.array_str(self.counts)},"
+f"counts={np.array_str(self.counts)}"
)
for k, v in self._extra_values.items():
repr_string += f", {k}={v}"
6 changes: 3 additions & 3 deletions src/pydvl/value/semivalues.py
@@ -74,15 +74,15 @@
## References

[^1]: <a name="ghorbani_data_2019"></a>Ghorbani, A., Zou, J., 2019.
-[Data Shapley: Equitable Valuation of Data for Machine Learning](http://proceedings.mlr.press/v97/ghorbani19c.html).
+[Data Shapley: Equitable Valuation of Data for Machine Learning](https://proceedings.mlr.press/v97/ghorbani19c.html).
In: Proceedings of the 36th International Conference on Machine Learning, PMLR, pp. 2242–2251.

[^2]: <a name="kwon_beta_2022"></a>Kwon, Y. and Zou, J., 2022.
-[Beta Shapley: A Unified and Noise-reduced Data Valuation Framework for Machine Learning](http://arxiv.org/abs/2110.14049).
+[Beta Shapley: A Unified and Noise-reduced Data Valuation Framework for Machine Learning](https://arxiv.org/abs/2110.14049).
In: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022, Vol. 151. PMLR, Valencia, Spain.

[^3]: <a name="wang_data_2022"></a>Wang, J.T. and Jia, R., 2022.
-[Data Banzhaf: A Robust Data Valuation Framework for Machine Learning](http://arxiv.org/abs/2205.15466).
+[Data Banzhaf: A Robust Data Valuation Framework for Machine Learning](https://arxiv.org/abs/2205.15466).
ArXiv preprint arXiv:2205.15466.
"""
from __future__ import annotations
2 changes: 1 addition & 1 deletion src/pydvl/value/shapley/common.py
@@ -14,7 +14,7 @@
permutation_exact_shapley,
)
from pydvl.value.shapley.owen import OwenAlgorithm, owen_sampling_shapley
-from pydvl.value.shapley.truncated import NoTruncation, truncated_montecarlo_shapley
+from pydvl.value.shapley.truncated import NoTruncation
from pydvl.value.shapley.types import ShapleyMode
from pydvl.value.stopping import MaxUpdates, StoppingCriterion

2 changes: 1 addition & 1 deletion src/pydvl/value/shapley/montecarlo.py
@@ -36,7 +36,7 @@
## References

[^1]: <a name="ghorbani_data_2019"></a>Ghorbani, A., Zou, J., 2019.
-[Data Shapley: Equitable Valuation of Data for Machine Learning](http://proceedings.mlr.press/v97/ghorbani19c.html).
+[Data Shapley: Equitable Valuation of Data for Machine Learning](https://proceedings.mlr.press/v97/ghorbani19c.html).
In: Proceedings of the 36th International Conference on Machine Learning, PMLR, pp. 2242–2251.

"""
2 changes: 1 addition & 1 deletion src/pydvl/value/shapley/truncated.py
@@ -2,7 +2,7 @@
## References

[^1]: <a name="ghorbani_data_2019"></a>Ghorbani, A., Zou, J., 2019.
-[Data Shapley: Equitable Valuation of Data for Machine Learning](http://proceedings.mlr.press/v97/ghorbani19c.html).
+[Data Shapley: Equitable Valuation of Data for Machine Learning](https://proceedings.mlr.press/v97/ghorbani19c.html).
In: Proceedings of the 36th International Conference on Machine Learning, PMLR, pp. 2242–2251.

"""
2 changes: 1 addition & 1 deletion src/pydvl/value/stopping.py
@@ -34,7 +34,7 @@
## References

[^1]: <a name="ghorbani_data_2019"></a>Ghorbani, A., Zou, J., 2019.
-[Data Shapley: Equitable Valuation of Data for Machine Learning](http://proceedings.mlr.press/v97/ghorbani19c.html).
+[Data Shapley: Equitable Valuation of Data for Machine Learning](https://proceedings.mlr.press/v97/ghorbani19c.html).
In: Proceedings of the 36th International Conference on Machine Learning, PMLR, pp. 2242–2251.
"""

1 change: 0 additions & 1 deletion tests/influence/test_torch_differentiable.py
@@ -4,7 +4,6 @@
module.
"""

-import itertools
from typing import List, Tuple

import numpy as np
1 change: 0 additions & 1 deletion tests/utils/test_numeric.py
@@ -1,6 +1,5 @@
import numpy as np
import pytest
-from numpy._typing import NDArray

from pydvl.utils.numeric import (
powerset,
5 changes: 1 addition & 4 deletions tests/utils/test_parallel.py
@@ -118,10 +118,7 @@ def test_chunkification(parallel_config, data, n_chunks, expected_chunks):
map_reduce_job = MapReduceJob([], map_func=lambda x: x, config=parallel_config)
chunks = list(map_reduce_job._chunkify(data, n_chunks))
for x, y in zip(chunks, expected_chunks):
-if not isinstance(x, np.ndarray):
-assert x == y
-else:
-assert (x == y).all()
+assert np.all(x == y)


def test_map_reduce_job_partial_map_and_reduce_func(parallel_config):
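
The two-branch assertion collapses into one because `np.all` reduces plain boolean comparisons and elementwise array comparisons alike. A quick illustration:

```python
import numpy as np

# On a scalar comparison, np.all simply passes the boolean through...
assert np.all(3 == 3)
# ...and on an elementwise array comparison it reduces over all entries,
# so a single assertion covers both branches of the old code.
assert np.all(np.array([1, 2]) == np.array([1, 2]))
assert not np.all(np.array([1, 2]) == np.array([1, 3]))
```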