Cleanup Parallel Backend #529

Merged · 21 commits · Mar 26, 2024

Commits
44158ca
Remove unused parallel module inside utils package
AnesBenmerzoug Mar 20, 2024
4375fdf
Set lowest joblib version requirement to 1.3.0
AnesBenmerzoug Mar 20, 2024
4569be3
Simplify parallel backend configuration and deprecate ParallelConfig …
AnesBenmerzoug Mar 20, 2024
02ee559
Add more details to parallelization section in documentation
AnesBenmerzoug Mar 20, 2024
33e6194
Docstring fixes
AnesBenmerzoug Mar 20, 2024
a863d40
Move parallelization and caching docs to separate advanced usage page
AnesBenmerzoug Mar 20, 2024
58332c0
Move parallel tests to separate directory, make sure to initialize ray
AnesBenmerzoug Mar 21, 2024
4c68ef6
Add joblib only parallel_config fixture to test_caching module
AnesBenmerzoug Mar 21, 2024
e5a5046
Improve error message when failing to connect to memcached during tes…
AnesBenmerzoug Mar 21, 2024
2ec85d2
Set time_threshold to 0 for test_parallel_jobs test
AnesBenmerzoug Mar 21, 2024
0ec4afa
Skip hoeffding bound test for combinatorial shapley
AnesBenmerzoug Mar 21, 2024
90dcf30
Update test durations
AnesBenmerzoug Mar 22, 2024
b85c2d0
Merge branch 'develop' into cleanup/parallel-backend
AnesBenmerzoug Mar 23, 2024
2d6c3a4
Add ray docs inventory
AnesBenmerzoug Mar 23, 2024
fc9b5fc
Apply PR comments' suggestions
AnesBenmerzoug Mar 23, 2024
5c8ad12
Skip hoeffding bound test
AnesBenmerzoug Mar 23, 2024
f132490
More cleanup and improvements
AnesBenmerzoug Mar 23, 2024
997e593
Add comment above ParallelConfig to make it easier to find and remove…
AnesBenmerzoug Mar 26, 2024
2e7c003
Reference to scaling computation page for influences from advanced us…
AnesBenmerzoug Mar 26, 2024
af3c3ff
Make section titles consistent
AnesBenmerzoug Mar 26, 2024
dc7f738
Merge branch 'develop' into cleanup/parallel-backend
AnesBenmerzoug Mar 26, 2024
1,222 changes: 678 additions & 544 deletions .test_durations


217 changes: 217 additions & 0 deletions docs/getting-started/advanced-usage.md
@@ -0,0 +1,217 @@
---
title: Advanced usage
alias:
name: advanced-usage
text: Advanced usage
---

# Advanced usage

Besides the dos and don'ts of data valuation itself, which are the subject of
the examples and the documentation of each method, there are two main things to
keep in mind when using pyDVL, namely parallelization and caching.

## Parallelization { #setting-up-parallelization }

pyDVL uses parallelization to scale and speed up computations. It does so
using one of Dask, Ray or Joblib. The first is used in
the [influence][pydvl.influence] package whereas the other two
are used in the [value][pydvl.value] package.

### Data Valuation

For data valuation, pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
parallelization (within one machine) and supports using
[Ray](https://ray.io) for distributed parallelization (across multiple machines).

The former works out of the box but for the latter you will need to install
additional dependencies (see [Extras][installation-extras])
and to provide a running cluster (or run ray in local mode).

!!! info

    As of v0.9.0 pyDVL does not allow requesting resources per task sent to the
    cluster, so you will need to make sure that each worker has enough resources
    to handle the tasks it receives. A data valuation task using game-theoretic
    methods will typically make a copy of the whole model and dataset to each
    worker, even if the re-training only happens on a subset of the data. This
    means that you should make sure that each worker has enough memory to handle
    the whole dataset.
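
Since each worker receives a full copy of the model and dataset, one rough way
to gauge the required per-worker memory is to measure the pickled size of the
objects the utility wraps. This is only a sketch with hypothetical stand-in
objects, not pyDVL code, and the pickled payload is a lower bound on actual
in-memory usage after deserialization:

```python
import pickle

# Hypothetical stand-ins for the model and dataset wrapped by a Utility.
model = {"weights": [0.0] * 10_000}
dataset = {"X": [[0.0] * 10 for _ in range(1_000)], "y": [0] * 1_000}

# The pickled payload is a rough lower bound on what each worker receives;
# memory usage after deserialization is typically larger.
payload_bytes = len(pickle.dumps((model, dataset)))
print(f"approx. per-worker payload: {payload_bytes / 1024:.1f} KiB")
```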

#### Joblib

Please follow the instructions in Joblib's documentation
for all possible configuration options that you can pass to the
[parallel_config][joblib.parallel_config] context manager.

To compute exact Shapley values using joblib's `loky` backend with verbosity
set to `100`, you would use:

```python
import joblib
from pydvl.parallel import ParallelConfig
from pydvl.value.shapley import combinatorial_exact_shapley
from pydvl.utils.utility import Utility

config = ParallelConfig(backend="joblib")
u = Utility(...)

with joblib.parallel_config(backend="loky", verbose=100):
    combinatorial_exact_shapley(u, config=config)
```

#### Ray

Please follow the instructions in Ray's documentation to
[set up a remote cluster](https://docs.ray.io/en/latest/cluster/key-concepts.html).
Alternatively, you can use a local cluster, in which case you don't have to set
anything up.

Before starting a computation, you should initialize ray by calling
[`ray.init`][ray.init] with the appropriate parameters.

To set up and start a local ray cluster with 4 CPUs, you would use:

```python
import ray

ray.init(num_cpus=4)
```

Whereas for a remote ray cluster you would use:

```python
import ray

address = "<Hypothetical Ray Cluster IP Address>"
ray.init(address=address)
```

To compute exact Shapley values with the ray parallel backend, you would use:

```python
import ray
from pydvl.parallel import ParallelConfig
from pydvl.value.shapley import combinatorial_exact_shapley
from pydvl.utils.utility import Utility

ray.init()
config = ParallelConfig(backend="ray")
u = Utility(...)
combinatorial_exact_shapley(u, config=config)
```

### Influence Functions

#### Dask

For influence functions, pyDVL uses [Dask](https://docs.dask.org/en/stable/)
to scale out the computation.

```python
import torch
from pydvl.influence import DaskInfluenceCalculator
from pydvl.influence.torch import CgInfluence
from pydvl.influence.torch.util import TorchNumpyConverter
from distributed import Client

infl_model = CgInfluence(...)

client = Client(n_workers=4, threads_per_worker=1)
infl_calc = DaskInfluenceCalculator(
    infl_model,
    TorchNumpyConverter(device=torch.device("cpu")),
    client
)
# da_influences is a dask.array.Array
da_influences = infl_calc.influences(...)

# trigger computation and write chunks to disk in parallel
da_influences.to_zarr("path/or/url")
```

Refer to the documentation of the
[DaskInfluenceCalculator][pydvl.influence.influence_calculator.DaskInfluenceCalculator]
for more details.

## Caching { #getting-started-cache }

pyDVL can cache (memoize) the computation of the utility function to speed up
some data valuation computations. Caching is disabled by default. When enabled,
it takes into account the data indices passed as arguments and the utility
function wrapped in the [Utility][pydvl.utils.utility.Utility] object. This
means that care must be taken when reusing the same utility function with
different data; see the documentation for the
[caching package][pydvl.utils.caching] for more information.
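
Conceptually, the memoization scheme keys cached values on the utility's
identity together with the index set it was called with, which is exactly why
reusing one identifier with different underlying data is unsafe. A minimal
dictionary-based sketch of this idea (not pyDVL's actual implementation):

```python
from typing import Callable, Dict, FrozenSet, Tuple

def memoized(utility_name: str, utility: Callable[[FrozenSet[int]], float]):
    """Wrap a utility so repeated calls with the same index set hit a cache.

    The key combines an identifier for the utility with the indices, which is
    why reusing one identifier with different underlying data is unsafe.
    """
    cache: Dict[Tuple[str, FrozenSet[int]], float] = {}

    def wrapper(indices: FrozenSet[int]) -> float:
        key = (utility_name, indices)
        if key not in cache:
            cache[key] = utility(indices)
        return cache[key]

    return wrapper

calls = []

def u(indices):
    calls.append(indices)  # record actual (non-cached) evaluations
    return float(len(indices))

cached_u = memoized("my-utility", u)
first = cached_u(frozenset({1, 2}))
second = cached_u(frozenset({1, 2}))  # served from the cache
```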

In general, caching won't play a major role in the computation of Shapley values
because the probability of sampling the same subset twice, and hence needing
the same utility function computation, is very low. However, it can be very
useful when comparing methods that use the same utility function, or when
running multiple experiments with the same data.
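
The rarity of repeated subsets can be made concrete with a birthday-style
bound. Assuming, for simplicity, uniform sampling over subsets (pyDVL's actual
samplers differ), the expected number of repeated pairs among `m` draws from
`2**n` subsets is `m * (m - 1) / 2**(n + 1)`, which is well below one for even
moderate `n`:

```python
n = 30            # number of data points
m = 10_000        # utility evaluations in a run
num_subsets = 2 ** n

# Birthday bound: expected number of colliding pairs among m uniform draws.
expected_collisions = m * (m - 1) / (2 * num_subsets)
print(f"expected repeated-subset pairs: {expected_collisions:.4f}")
```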

pyDVL supports 3 different caching backends:

- [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
  an in-memory cache backend that uses a dictionary to store and retrieve
  cached values. This is used to share cached values between threads
  in a single process.
- [DiskCacheBackend][pydvl.utils.caching.disk.DiskCacheBackend]:
  a disk-based cache backend that uses pickled values written to and read from
  disk. This is used to share cached values between processes in a single
  machine.
- [MemcachedCacheBackend][pydvl.utils.caching.memcached.MemcachedCacheBackend]:
  a [Memcached](https://memcached.org/)-based cache backend that uses pickled
  values written to and read from a Memcached server. This is used to share
  cached values between processes across multiple machines.

??? info "Memcached extras"

    The Memcached backend requires optional dependencies.
    See [Extras][installation-extras] for more information.

As an example, here's how one would use the disk-based cache backend
with a utility:

```python
from pydvl.utils.caching.disk import DiskCacheBackend
from pydvl.utils.utility import Utility

cache_backend = DiskCacheBackend()
u = Utility(..., cache_backend=cache_backend)
```

Please refer to the documentation and examples of each backend class for more details.

!!! tip "When is the cache really necessary?"

    Crucially, semi-value computations with the
    [PermutationSampler][pydvl.value.sampler.PermutationSampler] require caching
    to be enabled, or they will take twice as long as the direct implementation
    in [compute_shapley_values][pydvl.value.shapley.compute_shapley_values].

!!! tip "Using the cache"

    Continue reading about the cache in the documentation
    for the [caching package][pydvl.utils.caching].

### Setting up the Memcached cache { #setting-up-memcached }

[Memcached](https://memcached.org/) is an in-memory key-value store accessible
over the network. pyDVL can use it to cache the computation of the utility function
and speed up some computations (in particular, semi-value computations with the
[PermutationSampler][pydvl.value.sampler.PermutationSampler] but other methods
may benefit as well).

You can either install it as a package or run it inside a docker container (the
simplest option). For installation instructions, refer to the
[Getting started](https://github.com/memcached/memcached/wiki#getting-started)
section in memcached's wiki. Then you can run it with:

```shell
memcached -u user
```

To run memcached inside a container in daemon mode instead, use:

```shell
docker container run -d --rm -p 11211:11211 memcached:latest
```
107 changes: 5 additions & 102 deletions docs/getting-started/first-steps.md
@@ -11,7 +11,7 @@ alias:
Make sure you have read [[getting-started#installation]] before using the library.
In particular read about which extra dependencies you may need.

## Main concepts
## Main Concepts

pyDVL aims to be a repository of production-ready, reference implementations of
algorithms for data valuation and influence functions. Even though we only
@@ -22,7 +22,7 @@ should be enough to get you started.
computation and related methods.
* [[influence-function]] for instructions on how to compute influence functions.

## Running the examples
## Running the Examples

If you are somewhat familiar with the concepts of data valuation, you can start
by browsing our worked-out examples illustrating pyDVL's capabilities either:
@@ -34,104 +34,7 @@ by browsing our worked-out examples illustrating pyDVL's capabilities either:
have to install jupyter first manually since it's not a dependency of the
library.

## Advanced usage
## Advanced Usage

Besides the dos and don'ts of data valuation itself, which are the subject of
the examples and the documentation of each method, there are two main things to
keep in mind when using pyDVL.

### Caching { #getting-started-cache }

PyDVL can cache (memoize) the computation of the utility function
and speed up some computations for data valuation.
It is however disabled by default.
When it is enabled it takes into account the data indices passed as argument
and the utility function wrapped into the
[Utility][pydvl.utils.utility.Utility] object. This means that
care must be taken when reusing the same utility function with different data,
see the documentation for the [caching package][pydvl.utils.caching] for more
information.

In general, caching won't play a major role in the computation of Shapley values
because the probability of sampling the same subset twice, and hence needing
the same utility function computation, is very low. However, it can be very
useful when comparing methods that use the same utility function, or when
running multiple experiments with the same data.

pyDVL supports 3 different caching backends:

- [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
an in-memory cache backend that uses a dictionary to store and retrieve
cached values. This is used to share cached values between threads
in a single process.
- [DiskCacheBackend][pydvl.utils.caching.disk.DiskCacheBackend]:
a disk-based cache backend that uses pickled values written to and read from disk.
This is used to share cached values between processes in a single machine.
- [MemcachedCacheBackend][pydvl.utils.caching.memcached.MemcachedCacheBackend]:
a [Memcached](https://memcached.org/)-based cache backend that uses pickled
values written to and read from a Memcached server. This is used to share
cached values between processes across multiple machines. Note that this
backend requires optional dependencies, see [Extras][installation-extras].

!!! tip "When is the cache really necessary?"
Crucially, semi-value computations with the
[PermutationSampler][pydvl.value.sampler.PermutationSampler] require caching
to be enabled, or they will take twice as long as the direct implementation
in [compute_shapley_values][pydvl.value.shapley.compute_shapley_values].

!!! tip "Using the cache"
Continue reading about the cache in the documentation
for the [caching package][pydvl.utils.caching].

#### Setting up the Memcached cache { #setting-up-memcached }

[Memcached](https://memcached.org/) is an in-memory key-value store accessible
over the network. pyDVL can use it to cache the computation of the utility function
and speed up some computations (in particular, semi-value computations with the
[PermutationSampler][pydvl.value.sampler.PermutationSampler] but other methods
may benefit as well).

You can either install it as a package or run it inside a docker container (the
simplest). For installation instructions, refer to the [Getting
started](https://github.com/memcached/memcached/wiki#getting-started) section in
memcached's wiki. Then you can run it with:

```shell
memcached -u user
```

To run memcached inside a container in daemon mode instead, use:

```shell
docker container run -d --rm -p 11211:11211 memcached:latest
```

### Parallelization { #setting-up-parallelization }

pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
parallelization (within one machine) and supports using
[Ray](https://ray.io) for distributed parallelization (across multiple machines).

The former works out of the box but for the latter you will need to install
additional dependencies (see [Extras][installation-extras] )
and to provide a running cluster (or run ray in local mode).

As of v0.8.1 pyDVL does not allow requesting resources per task sent to the
cluster, so you will need to make sure that each worker has enough resources to
handle the tasks it receives. A data valuation task using game-theoretic methods
will typically make a copy of the whole model and dataset to each worker, even
if the re-training only happens on a subset of the data. This means that you
should make sure that each worker has enough memory to handle the whole dataset.

#### Ray

Please follow the instructions in Ray's documentation to set up a cluster.
Once you have a running cluster, you can use it by passing the address
of the head node to parallel methods via [ParallelConfig][pydvl.parallel.config.ParallelConfig].

For a local ray cluster you would use:

```python
from pydvl.parallel.config import ParallelConfig
config = ParallelConfig(backend="ray")
```
Refer to the [[advanced-usage]] page for explanations on how to enable
and use parallelization and caching.
2 changes: 1 addition & 1 deletion docs/getting-started/index.md
@@ -104,4 +104,4 @@ pip install pyDVL[memcached]

This installs [pymemcache](https://github.com/pinterest/pymemcache)
additionally. Be aware that you still have to start a memcached server manually.
See [Setting up the Memcached cache](first-steps/#setting-up-memcached).
See [Setting up the Memcached cache][setting-up-memcached].
2 changes: 1 addition & 1 deletion docs/value/index.md
@@ -224,7 +224,7 @@ from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.datasets import load_iris

dataset = Dataset.from_sklearn(load_iris())
u = Utility(LogisticRegression(), dataset, enable_cache=False)
u = Utility(LogisticRegression(), dataset)
training_budget = 3
wrapped_u = DataUtilityLearning(u, training_budget, LinearRegression())

3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -14,6 +14,7 @@ nav:
- Applications: getting-started/applications.md
- Benchmarking: getting-started/benchmarking.md
- Methods: getting-started/methods.md
- Advanced usage: getting-started/advanced-usage.md
- Data Valuation:
- value/index.md
- Shapley values: value/shapley.md
@@ -73,7 +74,6 @@ plugins:
canonical_version: stable
- section-index
- alias:
use_relative_link: true
verbose: true
- gen-files:
scripts:
Expand Down Expand Up @@ -109,6 +109,7 @@ plugins:
- https://joblib.readthedocs.io/en/stable/objects.inv
- https://docs.dask.org/en/latest/objects.inv
- https://distributed.dask.org/en/latest/objects.inv
- https://docs.ray.io/en/latest/objects.inv
paths: [ src ] # search packages in the src folder
options:
heading_level: 1
2 changes: 1 addition & 1 deletion requirements.txt
@@ -4,7 +4,7 @@ pandas>=1.3
scikit-learn
scipy>=1.7.0
cvxpy>=1.3.0
joblib
joblib>=1.3.0
cloudpickle
tqdm
matplotlib