Cleanup Parallel Backend #529

Merged · 21 commits · Mar 26, 2024

Commits
44158ca
Remove unused parallel module inside utils package
AnesBenmerzoug Mar 20, 2024
4375fdf
Set lowest joblib version requirement to 1.3.0
AnesBenmerzoug Mar 20, 2024
4569be3
Simplify parallel backend configuration and deprecate ParallelConfig …
AnesBenmerzoug Mar 20, 2024
02ee559
Add more details to parallelization section in documentation
AnesBenmerzoug Mar 20, 2024
33e6194
Docstring fixes
AnesBenmerzoug Mar 20, 2024
a863d40
Move parallelization and caching docs to separate advanced usage page
AnesBenmerzoug Mar 20, 2024
58332c0
Move parallel tests to separate directory, make sure to initialize ray
AnesBenmerzoug Mar 21, 2024
4c68ef6
Add joblib only parallel_config fixture to test_caching module
AnesBenmerzoug Mar 21, 2024
e5a5046
Improve error message when failing to connect to memcached during tes…
AnesBenmerzoug Mar 21, 2024
2ec85d2
Set time_threshold to 0 for test_parallel_jobs test
AnesBenmerzoug Mar 21, 2024
0ec4afa
Skip hoeffding bound test for combinatorial shapley
AnesBenmerzoug Mar 21, 2024
90dcf30
Update test durations
AnesBenmerzoug Mar 22, 2024
b85c2d0
Merge branch 'develop' into cleanup/parallel-backend
AnesBenmerzoug Mar 23, 2024
2d6c3a4
Add ray docs inventory
AnesBenmerzoug Mar 23, 2024
fc9b5fc
Apply PR comments' suggestions
AnesBenmerzoug Mar 23, 2024
5c8ad12
Skip hoeffding bound test
AnesBenmerzoug Mar 23, 2024
f132490
More cleanup and improvements
AnesBenmerzoug Mar 23, 2024
997e593
Add comment above ParallelConfig to make it easier to find and remove…
AnesBenmerzoug Mar 26, 2024
2e7c003
Reference to scaling computation page for influences from advanced us…
AnesBenmerzoug Mar 26, 2024
af3c3ff
Make section titles consistent
AnesBenmerzoug Mar 26, 2024
dc7f738
Merge branch 'develop' into cleanup/parallel-backend
AnesBenmerzoug Mar 26, 2024
1,222 changes: 678 additions & 544 deletions .test_durations


217 changes: 217 additions & 0 deletions docs/getting-started/advanced-usage.md
@@ -0,0 +1,217 @@
---
title: Advanced usage
alias:
name: advanced-usage
text: Advanced usage
---

# Advanced usage

Besides the dos and don'ts of data valuation itself, which are the subject of
the examples and the documentation of each method, there are two main things to
keep in mind when using pyDVL, namely parallelization and caching.

## Parallelization { #setting-up-parallelization }

pyDVL uses parallelization to scale and speed up computations. It does so
using one of Dask, Ray or Joblib. The first is used in
the [influence][pydvl.influence] package whereas the other two
are used in the [value][pydvl.value] package.

### Data Valuation

For data valuation, pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
parallelization (within one machine) and supports using
[Ray](https://ray.io) for distributed parallelization (across multiple machines).

The former works out of the box but for the latter you will need to install
additional dependencies (see [Extras][installation-extras])
and to provide a running cluster (or run ray in local mode).

!!! info

    As of v0.9.0 pyDVL does not allow requesting resources per task sent to the
    cluster, so you will need to make sure that each worker has enough resources
    to handle the tasks it receives. A data valuation task using game-theoretic
    methods will typically make a copy of the whole model and dataset to each
    worker, even if the re-training only happens on a subset of the data. This
    means that you should make sure that each worker has enough memory to handle
    the whole dataset.
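
Since each worker receives a full copy of the model and dataset, one rough way
to gauge the required per-worker memory is to measure the pickled size of the
objects the utility wraps. This is only a sketch with hypothetical stand-in
objects, not pyDVL code, and the pickled payload is a lower bound on actual
in-memory usage after deserialization:

```python
import pickle

# Hypothetical stand-ins for the model and dataset wrapped by a Utility.
model = {"weights": [0.0] * 10_000}
dataset = {"X": [[0.0] * 10 for _ in range(1_000)], "y": [0] * 1_000}

# The pickled payload is a rough lower bound on what each worker receives;
# memory usage after deserialization is typically larger.
payload_bytes = len(pickle.dumps((model, dataset)))
print(f"approx. per-worker payload: {payload_bytes / 1024:.1f} KiB")
```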

#### Joblib

Please follow the instructions in Joblib's documentation
for all possible configuration options that you can pass to the
[parallel_config][joblib.parallel_config] context manager.

To compute exact Shapley values using joblib's `loky` backend with verbosity
set to `100`, you would use:

```python
import joblib
from pydvl.parallel import ParallelConfig
from pydvl.value.shapley import combinatorial_exact_shapley
from pydvl.utils.utility import Utility

config = ParallelConfig(backend="joblib")
u = Utility(...)

with joblib.parallel_config(backend="loky", verbose=100):
    combinatorial_exact_shapley(u, config=config)
```

#### Ray

Please follow the instructions in Ray's documentation to
[set up a remote cluster](https://docs.ray.io/en/latest/cluster/key-concepts.html).
Alternatively, you can use a local cluster, in which case you don't have to set
anything up.

Before starting a computation, you should initialize ray by calling
[`ray.init`][ray.init] with the appropriate parameters.

To set up and start a local ray cluster with 4 CPUs, you would use:

```python
import ray

ray.init(num_cpus=4)
```

Whereas for a remote ray cluster you would use:

```python
import ray

address = "<Hypothetical Ray Cluster IP Address>"
ray.init(address=address)
```

To compute exact Shapley values with the ray parallel backend, you would use:

```python
import ray
from pydvl.parallel import ParallelConfig
from pydvl.value.shapley import combinatorial_exact_shapley
from pydvl.utils.utility import Utility

ray.init()
config = ParallelConfig(backend="ray")
u = Utility(...)
combinatorial_exact_shapley(u, config=config)
```

### Influence Functions

#### Dask

For influence functions, pyDVL uses [Dask](https://docs.dask.org/en/stable/)
to scale out the computation.

```python
import torch
from pydvl.influence import DaskInfluenceCalculator
from pydvl.influence.torch import CgInfluence
from pydvl.influence.torch.util import TorchNumpyConverter
from distributed import Client

infl_model = CgInfluence(...)

client = Client(n_workers=4, threads_per_worker=1)
infl_calc = DaskInfluenceCalculator(
    infl_model,
    TorchNumpyConverter(device=torch.device("cpu")),
    client
)
# da_influences is a dask.array.Array
da_influences = infl_calc.influences(...)

# trigger computation and write chunks to disk in parallel
da_influences.to_zarr("path/or/url")
```

Refer to the documentation of the
[DaskInfluenceCalculator][pydvl.influence.influence_calculator.DaskInfluenceCalculator]
for more details.

## Caching { #getting-started-cache }

pyDVL can cache (memoize) the computation of the utility function to speed up
some data valuation computations. Caching is disabled by default. When enabled,
it takes into account the data indices passed as arguments and the utility
function wrapped in the [Utility][pydvl.utils.utility.Utility] object. This
means that care must be taken when reusing the same utility function with
different data; see the documentation for the
[caching package][pydvl.utils.caching] for more information.
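
Conceptually, the memoization scheme keys cached values on the utility's
identity together with the index set it was called with, which is exactly why
reusing one identifier with different underlying data is unsafe. A minimal
dictionary-based sketch of this idea (not pyDVL's actual implementation):

```python
from typing import Callable, Dict, FrozenSet, Tuple

def memoized(utility_name: str, utility: Callable[[FrozenSet[int]], float]):
    """Wrap a utility so repeated calls with the same index set hit a cache.

    The key combines an identifier for the utility with the indices, which is
    why reusing one identifier with different underlying data is unsafe.
    """
    cache: Dict[Tuple[str, FrozenSet[int]], float] = {}

    def wrapper(indices: FrozenSet[int]) -> float:
        key = (utility_name, indices)
        if key not in cache:
            cache[key] = utility(indices)
        return cache[key]

    return wrapper

calls = []

def u(indices):
    calls.append(indices)  # record actual (non-cached) evaluations
    return float(len(indices))

cached_u = memoized("my-utility", u)
first = cached_u(frozenset({1, 2}))
second = cached_u(frozenset({1, 2}))  # served from the cache
```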

In general, caching won't play a major role in the computation of Shapley values
because the probability of sampling the same subset twice, and hence needing
the same utility function computation, is very low. However, it can be very
useful when comparing methods that use the same utility function, or when
running multiple experiments with the same data.
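
The rarity of repeated subsets can be made concrete with a birthday-style
bound. Assuming, for simplicity, uniform sampling over subsets (pyDVL's actual
samplers differ), the expected number of repeated pairs among `m` draws from
`2**n` subsets is `m * (m - 1) / 2**(n + 1)`, which is well below one for even
moderate `n`:

```python
n = 30            # number of data points
m = 10_000        # utility evaluations in a run
num_subsets = 2 ** n

# Birthday bound: expected number of colliding pairs among m uniform draws.
expected_collisions = m * (m - 1) / (2 * num_subsets)
print(f"expected repeated-subset pairs: {expected_collisions:.4f}")
```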

pyDVL supports 3 different caching backends:

- [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
  an in-memory cache backend that uses a dictionary to store and retrieve
  cached values. This is used to share cached values between threads
  in a single process.
- [DiskCacheBackend][pydvl.utils.caching.disk.DiskCacheBackend]:
  a disk-based cache backend that uses pickled values written to and read from
  disk. This is used to share cached values between processes in a single
  machine.
- [MemcachedCacheBackend][pydvl.utils.caching.memcached.MemcachedCacheBackend]:
  a [Memcached](https://memcached.org/)-based cache backend that uses pickled
  values written to and read from a Memcached server. This is used to share
  cached values between processes across multiple machines.

??? info "Memcached extras"

    The Memcached backend requires optional dependencies.
    See [Extras][installation-extras] for more information.

As an example, here's how one would use the disk-based cache backend
with a utility:

```python
from pydvl.utils.caching.disk import DiskCacheBackend
from pydvl.utils.utility import Utility

cache_backend = DiskCacheBackend()
u = Utility(..., cache_backend=cache_backend)
```

Please refer to the documentation and examples of each backend class for more details.

!!! tip "When is the cache really necessary?"

    Crucially, semi-value computations with the
    [PermutationSampler][pydvl.value.sampler.PermutationSampler] require caching
    to be enabled, or they will take twice as long as the direct implementation
    in [compute_shapley_values][pydvl.value.shapley.compute_shapley_values].

!!! tip "Using the cache"

    Continue reading about the cache in the documentation
    for the [caching package][pydvl.utils.caching].

### Setting up the Memcached cache { #setting-up-memcached }

[Memcached](https://memcached.org/) is an in-memory key-value store accessible
over the network. pyDVL can use it to cache the computation of the utility function
and speed up some computations (in particular, semi-value computations with the
[PermutationSampler][pydvl.value.sampler.PermutationSampler] but other methods
may benefit as well).

You can either install it as a package or run it inside a docker container (the
simplest option). For installation instructions, refer to the
[Getting started](https://github.com/memcached/memcached/wiki#getting-started)
section in memcached's wiki. Then you can run it with:

```shell
memcached -u user
```

To run memcached inside a container in daemon mode instead, use:

```shell
docker container run -d --rm -p 11211:11211 memcached:latest
```
107 changes: 5 additions & 102 deletions docs/getting-started/first-steps.md
@@ -11,7 +11,7 @@ alias:
Make sure you have read [[getting-started#installation]] before using the library.
In particular read about which extra dependencies you may need.

## Main concepts
## Main Concepts

pyDVL aims to be a repository of production-ready, reference implementations of
algorithms for data valuation and influence functions. Even though we only
@@ -22,7 +22,7 @@ should be enough to get you started.
computation and related methods.
* [[influence-function]] for instructions on how to compute influence functions.

## Running the examples
## Running the Examples

If you are somewhat familiar with the concepts of data valuation, you can start
by browsing our worked-out examples illustrating pyDVL's capabilities either:
@@ -34,104 +34,7 @@ by browsing our worked-out examples illustrating pyDVL's capabilities either:
have to install jupyter first manually since it's not a dependency of the
library.

## Advanced usage
## Advanced Usage

Besides the dos and don'ts of data valuation itself, which are the subject of
the examples and the documentation of each method, there are two main things to
keep in mind when using pyDVL.

### Caching { #getting-started-cache }

PyDVL can cache (memoize) the computation of the utility function
and speed up some computations for data valuation.
It is however disabled by default.
When it is enabled it takes into account the data indices passed as argument
and the utility function wrapped into the
[Utility][pydvl.utils.utility.Utility] object. This means that
care must be taken when reusing the same utility function with different data,
see the documentation for the [caching package][pydvl.utils.caching] for more
information.

In general, caching won't play a major role in the computation of Shapley values
because the probability of sampling the same subset twice, and hence needing
the same utility function computation, is very low. However, it can be very
useful when comparing methods that use the same utility function, or when
running multiple experiments with the same data.

pyDVL supports 3 different caching backends:

- [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
an in-memory cache backend that uses a dictionary to store and retrieve
cached values. This is used to share cached values between threads
in a single process.
- [DiskCacheBackend][pydvl.utils.caching.disk.DiskCacheBackend]:
a disk-based cache backend that uses pickled values written to and read from disk.
This is used to share cached values between processes in a single machine.
- [MemcachedCacheBackend][pydvl.utils.caching.memcached.MemcachedCacheBackend]:
a [Memcached](https://memcached.org/)-based cache backend that uses pickled
values written to and read from a Memcached server. This is used to share
cached values between processes across multiple machines. Note that this
backend requires optional dependencies, see [Extras][installation-extras].

!!! tip "When is the cache really necessary?"
Crucially, semi-value computations with the
[PermutationSampler][pydvl.value.sampler.PermutationSampler] require caching
to be enabled, or they will take twice as long as the direct implementation
in [compute_shapley_values][pydvl.value.shapley.compute_shapley_values].

!!! tip "Using the cache"
Continue reading about the cache in the documentation
for the [caching package][pydvl.utils.caching].

#### Setting up the Memcached cache { #setting-up-memcached }

[Memcached](https://memcached.org/) is an in-memory key-value store accessible
over the network. pyDVL can use it to cache the computation of the utility function
and speed up some computations (in particular, semi-value computations with the
[PermutationSampler][pydvl.value.sampler.PermutationSampler] but other methods
may benefit as well).

You can either install it as a package or run it inside a docker container (the
simplest). For installation instructions, refer to the [Getting
started](https://github.com/memcached/memcached/wiki#getting-started) section in
memcached's wiki. Then you can run it with:

```shell
memcached -u user
```

To run memcached inside a container in daemon mode instead, use:

```shell
docker container run -d --rm -p 11211:11211 memcached:latest
```

### Parallelization { #setting-up-parallelization }

pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
parallelization (within one machine) and supports using
[Ray](https://ray.io) for distributed parallelization (across multiple machines).

The former works out of the box but for the latter you will need to install
additional dependencies (see [Extras][installation-extras] )
and to provide a running cluster (or run ray in local mode).

As of v0.8.1 pyDVL does not allow requesting resources per task sent to the
cluster, so you will need to make sure that each worker has enough resources to
handle the tasks it receives. A data valuation task using game-theoretic methods
will typically make a copy of the whole model and dataset to each worker, even
if the re-training only happens on a subset of the data. This means that you
should make sure that each worker has enough memory to handle the whole dataset.

#### Ray

Please follow the instructions in Ray's documentation to set up a cluster.
Once you have a running cluster, you can use it by passing the address
of the head node to parallel methods via [ParallelConfig][pydvl.parallel.config.ParallelConfig].

For a local ray cluster you would use:

```python
from pydvl.parallel.config import ParallelConfig
config = ParallelConfig(backend="ray")
```
Refer to the [[advanced-usage]] page for explanations on how to enable
and use parallelization and caching.
2 changes: 1 addition & 1 deletion docs/getting-started/index.md
@@ -104,4 +104,4 @@ pip install pyDVL[memcached]

This installs [pymemcache](https://github.com/pinterest/pymemcache)
additionally. Be aware that you still have to start a memcached server manually.
See [Setting up the Memcached cache](first-steps/#setting-up-memcached).
See [Setting up the Memcached cache][setting-up-memcached].
2 changes: 1 addition & 1 deletion docs/value/index.md
@@ -224,7 +224,7 @@ from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.datasets import load_iris

dataset = Dataset.from_sklearn(load_iris())
u = Utility(LogisticRegression(), dataset, enable_cache=False)
u = Utility(LogisticRegression(), dataset)
training_budget = 3
wrapped_u = DataUtilityLearning(u, training_budget, LinearRegression())

3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -14,6 +14,7 @@ nav:
- Applications: getting-started/applications.md
- Benchmarking: getting-started/benchmarking.md
- Methods: getting-started/methods.md
- Advanced usage: getting-started/advanced-usage.md
- Data Valuation:
- value/index.md
- Shapley values: value/shapley.md
@@ -73,7 +74,6 @@ plugins:
canonical_version: stable
- section-index
- alias:
use_relative_link: true
verbose: true
- gen-files:
scripts:
Expand Down Expand Up @@ -109,6 +109,7 @@ plugins:
- https://joblib.readthedocs.io/en/stable/objects.inv
- https://docs.dask.org/en/latest/objects.inv
- https://distributed.dask.org/en/latest/objects.inv
- https://docs.ray.io/en/latest/objects.inv
paths: [ src ] # search packages in the src folder
options:
heading_level: 1
2 changes: 1 addition & 1 deletion requirements.txt
@@ -4,7 +4,7 @@ pandas>=1.3
scikit-learn
scipy>=1.7.0
cvxpy>=1.3.0
joblib
joblib>=1.3.0
cloudpickle
tqdm
matplotlib