Documentation fix and cleanup #521

Merged 9 commits on Mar 19, 2024
Changes from all commits
2 changes: 2 additions & 0 deletions .github/workflows/main.yaml
@@ -101,6 +101,8 @@ jobs:
group: publish
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Setup Python 3.8
uses: ./.github/actions/python
with:
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -19,6 +19,7 @@

- Bump versions of CI actions to avoid warnings [PR #502](https://github.com/aai-institute/pyDVL/pull/502)
- Add Python Version 3.11 to supported versions [PR #510](https://github.com/aai-institute/pyDVL/pull/510)
- Documentation improvements and cleanup [PR #521](https://github.com/aai-institute/pyDVL/pull/521)

## 0.8.1 - 🆕 🏗 New method and notebook, Games with exact shapley values, bug fixes and cleanup

73 changes: 5 additions & 68 deletions README.md
@@ -18,6 +18,9 @@

**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.

Refer to the [Methods](https://pydvl.org/stable/methods/)
page of our documentation for a list of all implemented methods.

**Data Valuation** for machine learning is the task of assigning a scalar
to each element of a training set which reflects its contribution to the final
performance or outcome of some model trained on it. Some concepts of
@@ -29,7 +32,7 @@ pyDVL focuses on model-dependent methods.
width="70%"
align="center"
style="display: block; margin-left: auto; margin-right: auto;"
src="docs/value/img/mclc-best-removal-10k-natural.svg"
src="https://raw.githubusercontent.com/aai-institute/pyDVL/develop/docs/value/img/mclc-best-removal-10k-natural.svg"
alt="best sample removal"
/>
<p align="center" style="text-align:center;">
@@ -48,7 +51,7 @@ of training samples over individual test points.
width="70%"
align="center"
style="display: block; margin-left: auto; margin-right: auto;"
src="docs/assets/influence_functions_example.png"
src="https://raw.githubusercontent.com/aai-institute/pyDVL/develop/docs/assets/influence_functions_example.png"
alt="best sample removal"
/>
<p align="center" style="text-align:center;">
@@ -256,72 +259,6 @@ Please open new issues for bugs, feature requests and extensions. You can read
about the structure of the project, the toolchain and workflow in the [guide for
contributions](CONTRIBUTING.md).

# Papers

We currently implement the following papers:

## Data Valuation

- Castro, Javier, Daniel Gómez, and Juan Tejada. [Polynomial Calculation of the
Shapley Value Based on Sampling](https://doi.org/10.1016/j.cor.2008.04.004).
Computers & Operations Research, Selected papers presented at the Tenth
International Symposium on Locational Decisions (ISOLDE X), 36, no. 5 (May 1,
2009): 1726–30.
- Ghorbani, Amirata, and James Zou. [Data Shapley: Equitable Valuation of Data
for Machine Learning](http://proceedings.mlr.press/v97/ghorbani19c.html). In
International Conference on Machine Learning, 2242–51. PMLR, 2019.
- Wang, Tianhao, Yu Yang, and Ruoxi Jia.
[Improving Cooperative Game Theory-Based Data Valuation via Data Utility
Learning](https://doi.org/10.48550/arXiv.2107.06336). arXiv, 2022.
- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo
Li, Ce Zhang, Costas Spanos, and Dawn Song. [Efficient Task-Specific Data
Valuation for Nearest Neighbor Algorithms](https://doi.org/10.14778/3342263.3342637).
Proceedings of the VLDB Endowment 12, no. 11 (1 July 2019): 1610–23.
- Okhrati, Ramin, and Aldo Lipani. [A Multilinear Sampling Algorithm to Estimate
Shapley Values](https://doi.org/10.1109/ICPR48806.2021.9412511). In 25th
International Conference on Pattern Recognition (ICPR 2020), 7992–99. IEEE,
2021.
- Yan, T., and Procaccia, A. D. [If You Like Shapley Then You’ll Love the
Core](https://ojs.aaai.org/index.php/AAAI/article/view/16721). Proceedings of
the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759.
- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve
Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos. [Towards Efficient
Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html).
In 22nd International Conference on Artificial Intelligence and Statistics,
1167–76. PMLR, 2019.
- Wang, Jiachen T., and Ruoxi Jia. [Data Banzhaf: A Robust Data Valuation
Framework for Machine Learning](https://doi.org/10.48550/arXiv.2205.15466).
arXiv, October 22, 2022.
- Kwon, Yongchan, and James Zou. [Beta Shapley: A Unified and Noise-Reduced Data
Valuation Framework for Machine Learning](http://arxiv.org/abs/2110.14049).
In Proceedings of the 25th International Conference on Artificial Intelligence
and Statistics (AISTATS) 2022, Vol. 151. Valencia, Spain: PMLR, 2022.
- Kwon, Yongchan, and James Zou. [Data-OOB: Out-of-Bag Estimate as a Simple and
Efficient Data Value](https://proceedings.mlr.press/v202/kwon23e.html). In
Proceedings of the 40th International Conference on Machine Learning, 18135–52.
PMLR, 2023.
- Schoch, Stephanie, Haifeng Xu, and Yangfeng Ji. [CS-Shapley: Class-Wise
Shapley Values for Data Valuation in
Classification](https://openreview.net/forum?id=KTOcrOR5mQ9). In Proc. of the
Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS).
New Orleans, Louisiana, USA, 2022.

## Influence Functions

- Koh, Pang Wei, and Percy Liang. [Understanding Black-Box Predictions via
Influence Functions](http://proceedings.mlr.press/v70/koh17a.html). In
Proceedings of the 34th International Conference on Machine Learning,
70:1885–94. Sydney, Australia: PMLR, 2017.
- Naman Agarwal, Brian Bullins, and Elad Hazan, [Second-Order Stochastic Optimization
for Machine Learning in Linear Time](https://www.jmlr.org/papers/v18/16-491.html),
Journal of Machine Learning Research 18 (2017): 1-40.
- Schioppa, Andrea, Polina Zablotskaia, David Vilar, and Artem Sokolov.
[Scaling Up Influence Functions](http://arxiv.org/abs/2112.03052).
In Proceedings of the AAAI-22. arXiv, 2021.
- James Martens, Roger Grosse, [Optimizing Neural Networks with Kronecker-factored Approximate Curvature](https://arxiv.org/abs/1503.05671), International Conference on Machine Learning (ICML), 2015.
- George, Thomas, César Laurent, Xavier Bouthillier, Nicolas Ballas, Pascal Vincent, [Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis](https://arxiv.org/abs/1806.03884), Advances in Neural Information Processing Systems 31,2018.
- Hataya, Ryuichiro and Yamada, Makoto, [Nystrom Method for Accurate and Scalable Implicit Differentiation](https://proceedings.mlr.press/v206/hataya23a.html), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, 2023

# License

pyDVL is distributed under
31 changes: 25 additions & 6 deletions build_scripts/modify_binder_link.py
@@ -13,6 +13,7 @@
from pathlib import Path
from typing import TYPE_CHECKING, Literal, Optional

from bs4 import BeautifulSoup
from git import Repo
from mkdocs.plugins import Config, event_priority

@@ -43,23 +44,41 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool) ->


@event_priority(-50)
def on_page_markdown(
markdown: str, page: "Page", config: Config, files: "Files"
def on_page_content(
html: str, page: "Page", config: Config, files: "Files"
) -> Optional[str]:
if "examples" not in page.url:
return
logger.info(
f"Replacing binder link with link to notebook in repository for notebooks in {page.url}"
)

repo_name = config["repo_name"]
root_dir = Path(config["docs_dir"]).parent
notebooks_dir = root_dir / "notebooks"
notebook_filename = Path(page.file.src_path).name
file_path = (notebooks_dir / notebook_filename).relative_to(root_dir)

soup = BeautifulSoup(html, features="html.parser")
binder_anchor = None
for a in soup.find_all("a", href=True, limit=5):
if BINDER_BASE_URL in a["href"]:
binder_anchor = a
break
if binder_anchor is None:
logger.warning(f"Binder link was not found in notebook {file_path}")
return

url_path = f"%2Ftree%2F{file_path}"
binder_url = f"{BINDER_BASE_URL}/gh/{repo_name}/{branch_name}?urlpath={url_path}"
binder_link = f"{BINDER_LOGO_WITHOUT_CAPTION}({binder_url})"
logger.info(f"New binder url: {binder_url}")
logger.info(f"Using regex: {BINDER_LINK_PATTERN}")
markdown = re.sub(BINDER_LINK_PATTERN, binder_link, markdown)
return markdown

binder_anchor["href"] = binder_url
binder_img = binder_anchor.find("img")
binder_img["style"] = "margin: auto; display: block; width: 7rem"
binder_img_caption = binder_anchor.find("figcaption")
binder_img_caption.decompose()

html = soup.prettify()

return html
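The rewritten hook operates on rendered HTML (`on_page_content`) rather than on markdown, locating the Binder badge with BeautifulSoup instead of a regex substitution. A self-contained sketch of the same idea follows; the constant value, function name, and markup are illustrative, not taken from the actual plugin or notebooks:

```python
from bs4 import BeautifulSoup

BINDER_BASE_URL = "https://mybinder.org/v2"  # assumed value of the constant


def rewrite_binder_link(html: str, binder_url: str) -> str:
    """Repoint the first Binder anchor at binder_url and drop its caption."""
    soup = BeautifulSoup(html, features="html.parser")
    anchor = next(
        (a for a in soup.find_all("a", href=True) if BINDER_BASE_URL in a["href"]),
        None,
    )
    if anchor is None:
        return html  # no badge on this page: leave the HTML untouched
    anchor["href"] = binder_url
    caption = anchor.find("figcaption")
    if caption is not None:
        caption.decompose()  # remove the badge caption, as the hook does
    return str(soup)


page = (
    '<a href="https://mybinder.org/v2/gh/org/repo/old">'
    '<img src="badge.svg"/><figcaption>launch</figcaption></a>'
)
result = rewrite_binder_link(page, "https://mybinder.org/v2/gh/org/repo/new")
```

Parsing the page instead of pattern-matching the markdown makes the hook robust to changes in how the badge is rendered, at the cost of a BeautifulSoup dependency (added to `requirements-docs.txt` in this PR).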
39 changes: 29 additions & 10 deletions docs/assets/pydvl.bib
@@ -47,18 +47,21 @@ @article{castro_polynomial_2009
keywords = {notion}
}

@online{frangella_randomized_2021,
@article{frangella_randomized_2023,
title = {Randomized {{Nyström Preconditioning}}},
author = {Frangella, Zachary and Tropp, Joel A. and Udell, Madeleine},
date = {2021-12-17},
eprint = {2110.02820},
eprinttype = {arxiv},
eprintclass = {cs, math},
doi = {10.48550/arXiv.2110.02820},
url = {https://arxiv.org/abs/2110.02820},
urldate = {2023-06-04},
abstract = {This paper introduces the Nystr\textbackslash "om PCG algorithm for solving a symmetric positive-definite linear system. The algorithm applies the randomized Nystr\textbackslash "om method to form a low-rank approximation of the matrix, which leads to an efficient preconditioner that can be deployed with the conjugate gradient algorithm. Theoretical analysis shows that preconditioned system has constant condition number as soon as the rank of the approximation is comparable with the number of effective degrees of freedom in the matrix. The paper also develops adaptive methods that provably achieve similar performance without knowledge of the effective dimension. Numerical tests show that Nystr\textbackslash "om PCG can rapidly solve large linear systems that arise in data analysis problems, and it surpasses several competing methods from the literature.},
pubstate = {preprint}
date = {2023-06-30},
journaltitle = {SIAM Journal on Matrix Analysis and Applications},
shortjournal = {SIAM J. Matrix Anal. Appl.},
volume = {44},
number = {2},
pages = {718--752},
publisher = {{Society for Industrial and Applied Mathematics}},
issn = {0895-4798},
doi = {10.1137/21M1466244},
url = {https://epubs.siam.org/doi/abs/10.1137/21M1466244},
urldate = {2024-03-12},
abstract = {Randomized methods are becoming increasingly popular in numerical linear algebra. However, few attempts have been made to use them in developing preconditioners. Our interest lies in solving large-scale sparse symmetric positive definite linear systems of equations, where the system matrix is preordered to doubly bordered block diagonal form (for example, using a nested dissection ordering). We investigate the use of randomized methods to construct high-quality preconditioners. In particular, we propose a new and efficient approach that employs Nyström's method for computing low rank approximations to develop robust algebraic two-level preconditioners. Construction of the new preconditioners involves iteratively solving a smaller but denser symmetric positive definite Schur complement system with multiple right-hand sides. Numerical experiments on problems coming from a range of application areas demonstrate that this inner system can be solved cheaply using block conjugate gradients and that using a large convergence tolerance to limit the cost does not adversely affect the quality of the resulting Nyström--Schur two-level preconditioner.}
}

@inproceedings{george_fast_2018,
@@ -342,6 +345,22 @@ @inproceedings{schoch_csshapley_2022
keywords = {notion}
}

@inproceedings{wang_data_2022,
title = {Data {{Banzhaf}}: {{A Robust Data Valuation Framework}} for {{Machine Learning}}},
shorttitle = {Data {{Banzhaf}}},
booktitle = {Proceedings of {{The}} 26th {{International Conference}} on {{Artificial Intelligence}} and {{Statistics}}},
author = {Wang, Jiachen T. and Jia, Ruoxi},
date = {2023-04-11},
pages = {6388--6421},
publisher = {PMLR},
issn = {2640-3498},
url = {https://proceedings.mlr.press/v206/wang23e.html},
urldate = {2024-02-15},
abstract = {Data valuation has wide use cases in machine learning, including improving data quality and creating economic incentives for data sharing. This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we introduce the concept of safety margin, which measures the robustness of a data value notion. We show that the Banzhaf value, a famous value notion that originated from cooperative game theory literature, achieves the largest safety margin among all semivalues (a class of value notions that satisfy crucial properties entailed by ML applications and include the famous Shapley value and Leave-one-out error). We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the other semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.},
eventtitle = {International {{Conference}} on {{Artificial Intelligence}} and {{Statistics}}},
langid = {english}
}

@inproceedings{wang_improving_2022,
title = {Improving {{Cooperative Game Theory-based Data Valuation}} via {{Data Utility Learning}}},
author = {Wang, Tianhao and Yang, Yu and Jia, Ruoxi},
3 changes: 3 additions & 0 deletions docs/index.md
@@ -11,6 +11,9 @@ distributed caching of results.
If you're a first time user of pyDVL, we recommend you to go through the
[[installation]] and [[first-steps]] guides in the Getting Started section.

If you're looking for a list of implemented methods refer to the
[[methods]] page.

<div class="grid cards" markdown>

- :fontawesome-solid-toolbox:{ .lg .middle } __Installation__
65 changes: 65 additions & 0 deletions docs/methods.md
@@ -0,0 +1,65 @@
---
title: Methods
alias:
name: methods
text: Methods
---

We currently implement the following methods:

## Data Valuation

- [**LOO**][pydvl.value.loo.compute_loo].

- [**Permutation Shapley**][pydvl.value.shapley.montecarlo.permutation_montecarlo_shapley]
(also called **ApproxShapley**) [@castro_polynomial_2009].

- [**TMCS**][pydvl.value.shapley.compute_shapley_values]
[@ghorbani_data_2019].

- [**Data Banzhaf**][pydvl.value.semivalues.compute_banzhaf_semivalues]
[@wang_data_2022].

- [**Beta Shapley**][pydvl.value.semivalues.compute_beta_shapley_semivalues]
[@kwon_beta_2022].

- [**CS-Shapley**][pydvl.value.shapley.classwise.compute_classwise_shapley_values]
[@schoch_csshapley_2022].

- [**Least Core**][pydvl.value.least_core.montecarlo.montecarlo_least_core]
[@yan_if_2021].

- [**Owen Sampling**][pydvl.value.shapley.owen.owen_sampling_shapley]
[@okhrati_multilinear_2021].

- [**Data Utility Learning**][pydvl.utils.utility.DataUtilityLearning]
[@wang_improving_2022].

- [**kNN-Shapley**][pydvl.value.shapley.knn.knn_shapley]
[@jia_efficient_2019a].

- [**Group Testing**][pydvl.value.shapley.gt.group_testing_shapley]
[@jia_efficient_2019]

- [**Data-OOB**][pydvl.value.oob.compute_data_oob]
[@kwon_dataoob_2023].

## Influence Functions

- [**CG Influence**][pydvl.influence.torch.CgInfluence].
[@koh_understanding_2017].

- [**Direct Influence**][pydvl.influence.torch.DirectInfluence]
[@koh_understanding_2017].

- [**LiSSA**][pydvl.influence.torch.LissaInfluence]
[@agarwal_secondorder_2017].

- [**Arnoldi Influence**][pydvl.influence.torch.ArnoldiInfluence]
[@schioppa_scaling_2021].

- [**EKFAC Influence**][pydvl.influence.torch.EkfacInfluence]
[@george_fast_2018;@martens_optimizing_2015].

- [**Nyström Influence**][pydvl.influence.torch.NystroemSketchInfluence]
[@hataya_nystrom_2023].
3 changes: 3 additions & 0 deletions docs_includes/abbreviations.md
@@ -19,3 +19,6 @@
*[SV]: Shapley Value
*[TMCS]: Truncated Monte Carlo Shapley
*[WAD]: Weighted Accuracy Drop
*[OOB]: Out-of-Bag
*[CG]: Conjugate Gradient
*[EKFAC]: Eigenvalue-corrected Kronecker Factorization
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -8,6 +8,7 @@ remote_branch: gh-pages

nav:
- Home: index.md
- Methods: methods.md
- Getting Started:
- Installation: getting-started/installation.md
- First steps: getting-started/first-steps.md
@@ -205,6 +206,7 @@ markdown_extensions:
- abbr
- admonition
- attr_list
- def_list
- footnotes
- markdown_captions
- md_in_html
6 changes: 4 additions & 2 deletions requirements-docs.txt
@@ -1,10 +1,10 @@
mike
markdown-captions
mkdocs==1.5.3
mkdocstrings[python]>=0.18
mkdocstrings[python]>=0.24
mkdocs-alias-plugin>=0.6.0
mkdocs-autorefs
mkdocs-bibtex
mkdocs-bibtex>=2.14.1
mkdocs-gen-files
mkdocs-git-revision-date-localized-plugin
mkdocs-glightbox
@@ -17,3 +17,5 @@ mkdocs-macros-plugin
pypandoc; sys_platform == 'darwin'
pypandoc_binary; sys_platform != 'darwin'
GitPython
# Use for the binder link hook
beautifulsoup4