Commit

Merge branch 'develop' into 259-implement-class-wise-shapley
Markus Semmler committed Sep 22, 2023
2 parents 6deaea3 + 43690b0 commit 1691281
Showing 37 changed files with 1,007 additions and 502 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -139,6 +139,7 @@ pylint.html
# Saved data
runs/
data/models/
*.pkl

# Docs
docs_build
13 changes: 0 additions & 13 deletions .readthedocs.yaml

This file was deleted.

8 changes: 7 additions & 1 deletion CHANGELOG.md
@@ -3,7 +3,8 @@
## Unreleased

- Implementation of Data-OOB by @BastienZim
[PR #426](https://github.com/aai-institute/pyDVL/pull/426)
[PR #426](https://github.com/aai-institute/pyDVL/pull/426),
[PR #431](https://github.com/aai-institute/pyDVL/pull/431)
- Refactoring of parallel module. Old imports will stop working in v0.9.0
[PR #421](https://github.com/aai-institute/pyDVL/pull/421)

@@ -26,6 +27,10 @@ randomness.
`pydvl.value.semivalues`. Introduced new type `Seed` and conversion function
`ensure_seed_sequence`.
[PR #396](https://github.com/aai-institute/pyDVL/pull/396)
- Added `batch_size` parameter to `compute_banzhaf_semivalues`,
`compute_beta_shapley_semivalues`, `compute_shapley_semivalues` and
`compute_generic_semivalues`.
[PR #428](https://github.com/aai-institute/pyDVL/pull/428)

### Changed

@@ -247,3 +252,4 @@ It contains:
- Parallelization of computations with Ray
- Documentation
- Notebooks containing examples of different use cases

13 changes: 7 additions & 6 deletions CONTRIBUTING.md
@@ -35,15 +35,15 @@ library. E.g. with venv:
```shell script
python -m venv ./venv
. venv/bin/activate # `venv\Scripts\activate` in windows
pip install -r requirements-dev.txt
pip install -r requirements-dev.txt -r requirements-docs.txt
```

With conda:

```shell script
conda create -n pydvl python=3.8
conda activate pydvl
pip install -r requirements-dev.txt
pip install -r requirements-dev.txt -r requirements-docs.txt
```

A very convenient way of working with your library during development is to
@@ -54,11 +54,12 @@

```shell script
pip install -e .
```

In order to build the documentation locally (which is done as part of the tox
suite) you will need [pandoc](https://pandoc.org/). Under Ubuntu it can be
installed with:
suite) [pandoc](https://pandoc.org/) is required. Except for OSX, it should be installed
automatically as a dependency with `requirements-docs.txt`. Under OSX you can
install pandoc (you'll need at least version 2.11) with:

```shell script
sudo apt-get update -yq && apt-get install -yq pandoc
brew install pandoc
```

Remember to mark all autogenerated directories as excluded in your IDE. In
@@ -151,7 +152,7 @@ cells which are then hidden in the documentation.

In order to do this, cells are marked with tags understood by the mkdocs
plugin [`mkdocs-jupyter`](https://github.com/danielfrg/mkdocs-jupyter#readme),
namely adding the following to the relevant cells:
namely adding the following to the metadata of the relevant cells:

```yaml
"tags": [
```
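For orientation, a complete cell metadata block carrying such a tag could look like the following. This is an illustrative sketch only: the tag name `hide` is a placeholder, and the tags actually understood depend on how `mkdocs-jupyter` is configured for this project.

```yaml
{
  "cell_type": "code",
  "metadata": {
    "tags": [
      "hide"
    ]
  },
  "source": []
}
```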
2 changes: 1 addition & 1 deletion README.md
@@ -111,7 +111,7 @@ documentation.

For influence computation, follow these steps:

1. Wrap your model and loss in a `TorchTwiceDifferential` object
1. Wrap your model and loss in a `TorchTwiceDifferentiable` object
2. Compute influence factors by providing training data and inversion method

Using the conjugate gradient algorithm, this would look like:
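The example code itself is elided in this view, but as background on what the conjugate gradient inversion method does: computing influence factors requires solving a linear system $Hx = g$ with the Hessian $H$, and CG does this using only Hessian-vector products, never forming $H^{-1}$. A minimal NumPy sketch of the idea (illustrative only, not pyDVL's API):

```python
import numpy as np

def conjugate_gradient(hvp, g, tol=1e-10, max_iter=100):
    """Solve H x = g for symmetric positive-definite H, given only
    Hessian-vector products hvp(v) = H @ v."""
    x = np.zeros_like(g)
    r = g - hvp(x)        # residual
    p = r.copy()          # search direction
    rs = r @ r
    for _ in range(max_iter):
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy example with an explicit SPD matrix standing in for the Hessian
H = np.array([[4.0, 1.0], [1.0, 3.0]])
g = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: H @ v, g)
```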
2 changes: 0 additions & 2 deletions apt-cache/.gitignore

This file was deleted.

1 change: 0 additions & 1 deletion build_scripts/copy_changelog.py
@@ -1,6 +1,5 @@
import logging
import os
import shutil
from pathlib import Path

import mkdocs.plugins
1 change: 0 additions & 1 deletion build_scripts/copy_notebooks.py
@@ -1,6 +1,5 @@
import logging
import os
import shutil
from pathlib import Path

import mkdocs.plugins
13 changes: 6 additions & 7 deletions docs/css/extra.css
@@ -49,7 +49,12 @@ a.autorefs-external:hover::after {
}

.md-typeset h2 {
font-size: 1.7em;
font-size: 1.3em;
font-weight: 300;
}

.md-typeset h3 {
font-size: 1.1em;
font-weight: 300;
}

@@ -77,12 +82,6 @@ a.autorefs-external:hover::after {
user-select: none;
}

/* Nicer style of headers in generated API */
h2 code {
font-size: large!important;
background-color: inherit!important;
}

/* Remove cell input and output prompt */
.jp-InputArea-prompt, .jp-OutputArea-prompt {
display: none !important;
24 changes: 13 additions & 11 deletions docs/value/index.md
@@ -15,21 +15,23 @@ alias:
training set which reflects its contribution to the final performance of some
model trained on it. Some methods attempt to be model-agnostic, but in most
cases the model is an integral part of the method. In these cases, this number
not an intrinsic property of the element of interest, but typically a function
of three factors:
is not an intrinsic property of the element of interest, but typically a
function of three factors:

1. The dataset $D$, or more generally, the distribution it was sampled
from (with this we mean that *value* would ideally be the (expected)
contribution of a data point to any random set $D$ sampled from the same
distribution).
1. The dataset $D$, or more generally, the distribution it was sampled from: in
   some cases one only cares about values w.r.t. a given dataset; in others,
   value would ideally be the (expected) contribution of a data point to any
   random set $D$ sampled from the same distribution. pyDVL implements methods
   of the first kind.

2. The algorithm $\mathcal{A}$ mapping the data $D$ to some estimator $f$
in a model class $\mathcal{F}$. E.g. MSE minimization to find the parameters
of a linear model.
2. The algorithm $\mathcal{A}$ mapping the data $D$ to some estimator $f$ in a
model class $\mathcal{F}$. E.g. MSE minimization to find the parameters of a
linear model.

3. The performance metric of interest $u$ for the problem. When value depends on
a model, it must be measured in some way which uses it. E.g. the $R^2$ score or
the negative MSE over a test set.
a model, it must be measured in some way which uses it. E.g. the $R^2$ score
or the negative MSE over a test set. This metric will be computed over a
held-out valuation set.

pyDVL collects algorithms for the computation of data values in this sense,
mostly those derived from cooperative game theory. The methods can be found in
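The three factors above can be made concrete with a toy utility function. In this sketch (illustrative only, not pyDVL's interface), the algorithm $\mathcal{A}$ is least-squares fitting, and the metric $u$ is the negative MSE computed on a held-out valuation set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: (x_train, y_train) is the dataset D, (x_val, y_val) a held-out
# valuation set on which the performance metric is computed.
true_w = np.array([1.0, -2.0, 0.5])
x_train = rng.normal(size=(20, 3))
y_train = x_train @ true_w + 0.1 * rng.normal(size=20)
x_val = rng.normal(size=(50, 3))
y_val = x_val @ true_w

def utility(indices):
    """u(S): fit least squares (the algorithm A) on the subset S of D and
    return the negative MSE (the metric u) over the valuation set."""
    indices = list(indices)
    if len(indices) < 3:  # too few points to fit 3 parameters
        return -np.mean(y_val ** 2)
    w, *_ = np.linalg.lstsq(x_train[indices], y_train[indices], rcond=None)
    return -np.mean((x_val @ w - y_val) ** 2)

full = utility(range(20))  # utility of the full training set
```

Game-theoretic values such as Shapley then aggregate `utility(S)` over many subsets `S` to assign each point its contribution.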