From d971af7385ff7024a9c3037a97e1f646ef7b4951 Mon Sep 17 00:00:00 2001 From: Anes Benmerzoug Date: Mon, 18 Mar 2024 12:46:36 +0100 Subject: [PATCH 1/9] Use absolute URLs for images in readme --- README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index d1f562033..4e854f197 100644 --- a/README.md +++ b/README.md @@ -29,7 +29,7 @@ pyDVL focuses on model-dependent methods. width="70%" align="center" style="display: block; margin-left: auto; margin-right: auto;" - src="docs/value/img/mclc-best-removal-10k-natural.svg" + src="https://raw.githubusercontent.com/aai-institute/pyDVL/develop/docs/value/img/mclc-best-removal-10k-natural.svg" alt="best sample removal" />

@@ -48,7 +48,7 @@ of training samples over individual test points. width="70%" align="center" style="display: block; margin-left: auto; margin-right: auto;" - src="docs/assets/influence_functions_example.png" + src="https://raw.githubusercontent.com/aai-institute/pyDVL/develop/docs/assets/influence_functions_example.png" alt="best sample removal" />

From 553785033001e5bf2ba1e93fa7997d7b074dfebb Mon Sep 17 00:00:00 2001 From: Anes Benmerzoug Date: Mon, 18 Mar 2024 12:54:43 +0100 Subject: [PATCH 2/9] Set fetch-depth to 0 for CI jobs that build docs This is needed in order to get correct last update timestamps --- .github/workflows/main.yaml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.github/workflows/main.yaml b/.github/workflows/main.yaml index 0c897375e..266068ec9 100644 --- a/.github/workflows/main.yaml +++ b/.github/workflows/main.yaml @@ -101,6 +101,8 @@ jobs: group: publish steps: - uses: actions/checkout@v4 + with: + fetch-depth: 0 - name: Setup Python 3.8 uses: ./.github/actions/python with: From 80dadc8ce819650d814122be05b13659398c3e7d Mon Sep 17 00:00:00 2001 From: Anes Benmerzoug Date: Mon, 18 Mar 2024 13:03:04 +0100 Subject: [PATCH 3/9] Increase lower bound of mkdocstrings version --- requirements-docs.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements-docs.txt b/requirements-docs.txt index ca554638c..66ec4b40b 100644 --- a/requirements-docs.txt +++ b/requirements-docs.txt @@ -1,7 +1,7 @@ mike markdown-captions mkdocs==1.5.3 -mkdocstrings[python]>=0.18 +mkdocstrings[python]>=0.24 mkdocs-alias-plugin>=0.6.0 mkdocs-autorefs mkdocs-bibtex From aa8cd70f6623fe2dfbcc9e9d83cf8c3dd4ad6f26 Mon Sep 17 00:00:00 2001 From: Anes Benmerzoug Date: Mon, 18 Mar 2024 14:58:10 +0100 Subject: [PATCH 4/9] Increase lower bound of mkdocs-bibtex version --- requirements-docs.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements-docs.txt b/requirements-docs.txt index 66ec4b40b..f91a92d5f 100644 --- a/requirements-docs.txt +++ b/requirements-docs.txt @@ -4,7 +4,7 @@ mkdocs==1.5.3 mkdocstrings[python]>=0.24 mkdocs-alias-plugin>=0.6.0 mkdocs-autorefs -mkdocs-bibtex +mkdocs-bibtex>=2.14.1 mkdocs-gen-files mkdocs-git-revision-date-localized-plugin mkdocs-glightbox From 2a71a923059605031c54fccc786913e6e79e74ba Mon Sep 17 00:00:00 2001 From: Anes Benmerzoug
Date: Mon, 18 Mar 2024 14:58:25 +0100 Subject: [PATCH 5/9] Add a few more abbreviations --- docs_includes/abbreviations.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/docs_includes/abbreviations.md b/docs_includes/abbreviations.md index a89425885..b5e837007 100644 --- a/docs_includes/abbreviations.md +++ b/docs_includes/abbreviations.md @@ -19,3 +19,6 @@ *[SV]: Shapley Value *[TMCS]: Truncated Monte Carlo Shapley *[WAD]: Weighted Accuracy Drop +*[OOB]: Out-of-Bag +*[CG]: Conjugate Gradient +*[EKFAC]: Eigenvalue-corrected Kronecker Factorization From 30d95d273e7a32f9560a73b0dc7a6816cd03a626 Mon Sep 17 00:00:00 2001 From: Anes Benmerzoug Date: Mon, 18 Mar 2024 15:03:32 +0100 Subject: [PATCH 6/9] Move papers from readme to documentation --- README.md | 69 +++---------------------------------------------- docs/index.md | 3 +++ docs/methods.md | 65 ++++++++++++++++++++++++++++++++++++++++++++++ mkdocs.yml | 2 ++ 4 files changed, 73 insertions(+), 66 deletions(-) create mode 100644 docs/methods.md diff --git a/README.md b/README.md index 4e854f197..27f965f84 100644 --- a/README.md +++ b/README.md @@ -18,6 +18,9 @@ **pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation. +Refer to the [Methods](https://pydvl.org/stable/methods/) +page of our documentation for a list of all implemented methods. + **Data Valuation** for machine learning is the task of assigning a scalar to each element of a training set which reflects its contribution to the final performance or outcome of some model trained on it. Some concepts of @@ -256,72 +259,6 @@ Please open new issues for bugs, feature requests and extensions. You can read about the structure of the project, the toolchain and workflow in the [guide for contributions](CONTRIBUTING.md). -# Papers - -We currently implement the following papers: - -## Data Valuation - -- Castro, Javier, Daniel Gómez, and Juan Tejada. 
[Polynomial Calculation of the - Shapley Value Based on Sampling](https://doi.org/10.1016/j.cor.2008.04.004). - Computers & Operations Research, Selected papers presented at the Tenth - International Symposium on Locational Decisions (ISOLDE X), 36, no. 5 (May 1, - 2009): 1726–30. -- Ghorbani, Amirata, and James Zou. [Data Shapley: Equitable Valuation of Data - for Machine Learning](http://proceedings.mlr.press/v97/ghorbani19c.html). In - International Conference on Machine Learning, 2242–51. PMLR, 2019. -- Wang, Tianhao, Yu Yang, and Ruoxi Jia. - [Improving Cooperative Game Theory-Based Data Valuation via Data Utility - Learning](https://doi.org/10.48550/arXiv.2107.06336). arXiv, 2022. -- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo - Li, Ce Zhang, Costas Spanos, and Dawn Song. [Efficient Task-Specific Data - Valuation for Nearest Neighbor Algorithms](https://doi.org/10.14778/3342263.3342637). - Proceedings of the VLDB Endowment 12, no. 11 (1 July 2019): 1610–23. -- Okhrati, Ramin, and Aldo Lipani. [A Multilinear Sampling Algorithm to Estimate - Shapley Values](https://doi.org/10.1109/ICPR48806.2021.9412511). In 25th - International Conference on Pattern Recognition (ICPR 2020), 7992–99. IEEE, - 2021. -- Yan, T., and Procaccia, A. D. [If You Like Shapley Then You’ll Love the - Core](https://ojs.aaai.org/index.php/AAAI/article/view/16721). Proceedings of - the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759. -- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve - Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos. [Towards Efficient - Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html). - In 22nd International Conference on Artificial Intelligence and Statistics, - 1167–76. PMLR, 2019. -- Wang, Jiachen T., and Ruoxi Jia. [Data Banzhaf: A Robust Data Valuation - Framework for Machine Learning](https://doi.org/10.48550/arXiv.2205.15466). 
- arXiv, October 22, 2022. -- Kwon, Yongchan, and James Zou. [Beta Shapley: A Unified and Noise-Reduced Data - Valuation Framework for Machine Learning](http://arxiv.org/abs/2110.14049). - In Proceedings of the 25th International Conference on Artificial Intelligence - and Statistics (AISTATS) 2022, Vol. 151. Valencia, Spain: PMLR, 2022. -- Kwon, Yongchan, and James Zou. [Data-OOB: Out-of-Bag Estimate as a Simple and - Efficient Data Value](https://proceedings.mlr.press/v202/kwon23e.html). In - Proceedings of the 40th International Conference on Machine Learning, 18135–52. - PMLR, 2023. -- Schoch, Stephanie, Haifeng Xu, and Yangfeng Ji. [CS-Shapley: Class-Wise - Shapley Values for Data Valuation in - Classification](https://openreview.net/forum?id=KTOcrOR5mQ9). In Proc. of the - Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS). - New Orleans, Louisiana, USA, 2022. - -## Influence Functions - -- Koh, Pang Wei, and Percy Liang. [Understanding Black-Box Predictions via - Influence Functions](http://proceedings.mlr.press/v70/koh17a.html). In - Proceedings of the 34th International Conference on Machine Learning, - 70:1885–94. Sydney, Australia: PMLR, 2017. -- Naman Agarwal, Brian Bullins, and Elad Hazan, [Second-Order Stochastic Optimization - for Machine Learning in Linear Time](https://www.jmlr.org/papers/v18/16-491.html), - Journal of Machine Learning Research 18 (2017): 1-40. -- Schioppa, Andrea, Polina Zablotskaia, David Vilar, and Artem Sokolov. - [Scaling Up Influence Functions](http://arxiv.org/abs/2112.03052). - In Proceedings of the AAAI-22. arXiv, 2021. -- James Martens, Roger Grosse, [Optimizing Neural Networks with Kronecker-factored Approximate Curvature](https://arxiv.org/abs/1503.05671), International Conference on Machine Learning (ICML), 2015. 
-- George, Thomas, César Laurent, Xavier Bouthillier, Nicolas Ballas, Pascal Vincent, [Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis](https://arxiv.org/abs/1806.03884), Advances in Neural Information Processing Systems 31,2018. -- Hataya, Ryuichiro and Yamada, Makoto, [Nystrom Method for Accurate and Scalable Implicit Differentiation](https://proceedings.mlr.press/v206/hataya23a.html), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, 2023 - # License pyDVL is distributed under diff --git a/docs/index.md b/docs/index.md index c77b2c980..0fc92ba09 100644 --- a/docs/index.md +++ b/docs/index.md @@ -11,6 +11,9 @@ distributed caching of results. If you're a first time user of pyDVL, we recommend you to go through the [[installation]] and [[first-steps]] guides in the Getting Started section. +If you're looking for a list of implemented methods refer to the +[[methods]] page. +

- :fontawesome-solid-toolbox:{ .lg .middle } __Installation__ diff --git a/docs/methods.md b/docs/methods.md new file mode 100644 index 000000000..bea2f92d6 --- /dev/null +++ b/docs/methods.md @@ -0,0 +1,65 @@ +--- +title: Methods +alias: + name: methods + text: Methods +--- + +We currently implement the following methods: + +## Data Valuation + +- [**LOO**][pydvl.value.loo.compute_loo]. + +- [**Permutation Shapley**][pydvl.value.shapley.montecarlo.permutation_montecarlo_shapley] + (also called **ApproxShapley**) [@castro_polynomial_2009]. + +- [**TMCS**][pydvl.value.shapley.compute_shapley_values] + [@ghorbani_data_2019]. + +- [**Data Banzhaf**][pydvl.value.semivalues.compute_banzhaf_semivalues] + [@wang_data_2022]. + +- [**Beta Shapley**][pydvl.value.semivalues.compute_beta_shapley_semivalues] + [@kwon_beta_2022]. + +- [**CS-Shapley**][pydvl.value.shapley.classwise.compute_classwise_shapley_values] + [@schoch_csshapley_2022]. + +- [**Least Core**][pydvl.value.least_core.montecarlo.montecarlo_least_core] + [@yan_if_2021]. + +- [**Owen Sampling**][pydvl.value.shapley.owen.owen_sampling_shapley] + [@okhrati_multilinear_2021]. + +- [**Data Utility Learning**][pydvl.utils.utility.DataUtilityLearning] + [@wang_improving_2022]. + +- [**kNN-Shapley**][pydvl.value.shapley.knn.knn_shapley] + [@jia_efficient_2019a]. + +- [**Group Testing**][pydvl.value.shapley.gt.group_testing_shapley] + [@jia_efficient_2019]. + +- [**Data-OOB**][pydvl.value.oob.compute_data_oob] + [@kwon_dataoob_2023]. + +## Influence Functions + +- [**CG Influence**][pydvl.influence.torch.CgInfluence] + [@koh_understanding_2017]. + +- [**Direct Influence**][pydvl.influence.torch.DirectInfluence] + [@koh_understanding_2017]. + +- [**LiSSA**][pydvl.influence.torch.LissaInfluence] + [@agarwal_secondorder_2017]. + +- [**Arnoldi Influence**][pydvl.influence.torch.ArnoldiInfluence] + [@schioppa_scaling_2021].
+ +- [**EKFAC Influence**][pydvl.influence.torch.EkfacInfluence] + [@george_fast_2018;@martens_optimizing_2015]. + +- [**Nyström Influence**][pydvl.influence.torch.NystroemSketchInfluence] + [@hataya_nystrom_2023]. diff --git a/mkdocs.yml b/mkdocs.yml index 7ae01a1cb..cb51559ed 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -8,6 +8,7 @@ remote_branch: gh-pages nav: - Home: index.md + - Methods: methods.md - Getting Started: - Installation: getting-started/installation.md - First steps: getting-started/first-steps.md @@ -205,6 +206,7 @@ markdown_extensions: - abbr - admonition - attr_list + - def_list - footnotes - markdown_captions - md_in_html From a1847af23289141faccbc228b36c43886e6e0136 Mon Sep 17 00:00:00 2001 From: Anes Benmerzoug Date: Mon, 18 Mar 2024 15:18:09 +0100 Subject: [PATCH 7/9] Add missing data banzhaf reference to bibliography --- docs/assets/pydvl.bib | 39 +++++++++++++++++++++++++++++---------- 1 file changed, 29 insertions(+), 10 deletions(-) diff --git a/docs/assets/pydvl.bib b/docs/assets/pydvl.bib index dacf52e53..c9583ce7d 100644 --- a/docs/assets/pydvl.bib +++ b/docs/assets/pydvl.bib @@ -47,18 +47,21 @@ @article{castro_polynomial_2009 keywords = {notion} } -@online{frangella_randomized_2021, +@article{frangella_randomized_2023, title = {Randomized {{Nyström Preconditioning}}}, author = {Frangella, Zachary and Tropp, Joel A. and Udell, Madeleine}, - date = {2021-12-17}, - eprint = {2110.02820}, - eprinttype = {arxiv}, - eprintclass = {cs, math}, - doi = {10.48550/arXiv.2110.02820}, - url = {https://arxiv.org/abs/2110.02820}, - urldate = {2023-06-04}, - abstract = {This paper introduces the Nystr\textbackslash "om PCG algorithm for solving a symmetric positive-definite linear system. The algorithm applies the randomized Nystr\textbackslash "om method to form a low-rank approximation of the matrix, which leads to an efficient preconditioner that can be deployed with the conjugate gradient algorithm. 
Theoretical analysis shows that preconditioned system has constant condition number as soon as the rank of the approximation is comparable with the number of effective degrees of freedom in the matrix. The paper also develops adaptive methods that provably achieve similar performance without knowledge of the effective dimension. Numerical tests show that Nystr\textbackslash "om PCG can rapidly solve large linear systems that arise in data analysis problems, and it surpasses several competing methods from the literature.}, - pubstate = {preprint} + date = {2023-06-30}, + journaltitle = {SIAM Journal on Matrix Analysis and Applications}, + shortjournal = {SIAM J. Matrix Anal. Appl.}, + volume = {44}, + number = {2}, + pages = {718--752}, + publisher = {{Society for Industrial and Applied Mathematics}}, + issn = {0895-4798}, + doi = {10.1137/21M1466244}, + url = {https://epubs.siam.org/doi/abs/10.1137/21M1466244}, + urldate = {2024-03-12}, + abstract = {Randomized methods are becoming increasingly popular in numerical linear algebra. However, few attempts have been made to use them in developing preconditioners. Our interest lies in solving large-scale sparse symmetric positive definite linear systems of equations, where the system matrix is preordered to doubly bordered block diagonal form (for example, using a nested dissection ordering). We investigate the use of randomized methods to construct high-quality preconditioners. In particular, we propose a new and efficient approach that employs Nyström's method for computing low rank approximations to develop robust algebraic two-level preconditioners. Construction of the new preconditioners involves iteratively solving a smaller but denser symmetric positive definite Schur complement system with multiple right-hand sides. 
Numerical experiments on problems coming from a range of application areas demonstrate that this inner system can be solved cheaply using block conjugate gradients and that using a large convergence tolerance to limit the cost does not adversely affect the quality of the resulting Nyström--Schur two-level preconditioner.} } @inproceedings{george_fast_2018, @@ -342,6 +345,22 @@ @inproceedings{schoch_csshapley_2022 keywords = {notion} } +@inproceedings{wang_data_2022, + title = {Data {{Banzhaf}}: {{A Robust Data Valuation Framework}} for {{Machine Learning}}}, + shorttitle = {Data {{Banzhaf}}}, + booktitle = {Proceedings of {{The}} 26th {{International Conference}} on {{Artificial Intelligence}} and {{Statistics}}}, + author = {Wang, Jiachen T. and Jia, Ruoxi}, + date = {2023-04-11}, + pages = {6388--6421}, + publisher = {PMLR}, + issn = {2640-3498}, + url = {https://proceedings.mlr.press/v206/wang23e.html}, + urldate = {2024-02-15}, + abstract = {Data valuation has wide use cases in machine learning, including improving data quality and creating economic incentives for data sharing. This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we introduce the concept of safety margin, which measures the robustness of a data value notion. We show that the Banzhaf value, a famous value notion that originated from cooperative game theory literature, achieves the largest safety margin among all semivalues (a class of value notions that satisfy crucial properties entailed by ML applications and include the famous Shapley value and Leave-one-out error). We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. 
Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the other semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.}, + eventtitle = {International {{Conference}} on {{Artificial Intelligence}} and {{Statistics}}}, + langid = {english} +} + @inproceedings{wang_improving_2022, title = {Improving {{Cooperative Game Theory-based Data Valuation}} via {{Data Utility Learning}}}, author = {Wang, Tianhao and Yang, Yu and Jia, Ruoxi}, From 0fee9649da364b7e9bf4962437624de7420a50ac Mon Sep 17 00:00:00 2001 From: Anes Benmerzoug Date: Mon, 18 Mar 2024 15:45:13 +0100 Subject: [PATCH 8/9] Improve mkdocs binder link hook --- build_scripts/modify_binder_link.py | 31 +++++++++++++++++++++++------ requirements-docs.txt | 2 ++ 2 files changed, 27 insertions(+), 6 deletions(-) diff --git a/build_scripts/modify_binder_link.py b/build_scripts/modify_binder_link.py index a01da10b5..eb09ea02b 100644 --- a/build_scripts/modify_binder_link.py +++ b/build_scripts/modify_binder_link.py @@ -13,6 +13,7 @@ from pathlib import Path from typing import TYPE_CHECKING, Literal, Optional +from bs4 import BeautifulSoup from git import Repo from mkdocs.plugins import Config, event_priority @@ -43,23 +44,41 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool) -> @event_priority(-50) -def on_page_markdown( - markdown: str, page: "Page", config: Config, files: "Files" +def on_page_content( + html: str, page: "Page", config: Config, files: "Files" ) -> Optional[str]: if "examples" not in page.url: return logger.info( f"Replacing binder link with link to notebook in repository for notebooks in {page.url}" ) + repo_name = config["repo_name"] 
root_dir = Path(config["docs_dir"]).parent notebooks_dir = root_dir / "notebooks" notebook_filename = Path(page.file.src_path).name file_path = (notebooks_dir / notebook_filename).relative_to(root_dir) + + soup = BeautifulSoup(html, features="html.parser") + binder_anchor = None + for a in soup.find_all("a", href=True, limit=5): + if BINDER_BASE_URL in a["href"]: + binder_anchor = a + break + if binder_anchor is None: + logger.warning(f"Binder link was not found in notebook {file_path}") + return + url_path = f"%2Ftree%2F{file_path}" binder_url = f"{BINDER_BASE_URL}/gh/{repo_name}/{branch_name}?urlpath={url_path}" - binder_link = f"{BINDER_LOGO_WITHOUT_CAPTION}({binder_url})" logger.info(f"New binder url: {binder_url}") - logger.info(f"Using regex: {BINDER_LINK_PATTERN}") - markdown = re.sub(BINDER_LINK_PATTERN, binder_link, markdown) - return markdown + + binder_anchor["href"] = binder_url + binder_img = binder_anchor.find("img") + binder_img["style"] = "margin: auto; display: block; width: 7rem" + binder_img_caption = binder_anchor.find("figcaption") + binder_img_caption.decompose() + + html = soup.prettify() + + return html diff --git a/requirements-docs.txt b/requirements-docs.txt index f91a92d5f..9e24cf80d 100644 --- a/requirements-docs.txt +++ b/requirements-docs.txt @@ -17,3 +17,5 @@ mkdocs-macros-plugin pypandoc; sys_platform == 'darwin' pypandoc_binary; sys_platform != 'darwin' GitPython +# Used for the binder link hook +beautifulsoup4 From a68a5f332a0f165bc2b26d4b0e56005f297619ab Mon Sep 17 00:00:00 2001 From: Anes Benmerzoug Date: Mon, 18 Mar 2024 15:50:16 +0100 Subject: [PATCH 9/9] Update changelog --- CHANGELOG.md | 1 + 1 file changed, 1 insertion(+) diff --git a/CHANGELOG.md b/CHANGELOG.md index c9ad58374..7bcbde1da 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -19,6 +19,7 @@ - Bump versions of CI actions to avoid warnings [PR #502](https://github.com/aai-institute/pyDVL/pull/502) - Add Python Version 3.11 to supported versions [PR
#510](https://github.com/aai-institute/pyDVL/pull/510) +- Documentation improvements and cleanup [PR #521](https://github.com/aai-institute/pyDVL/pull/521) ## 0.8.1 - 🆕 🏗 New method and notebook, Games with exact shapley values, bug fixes and cleanup
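
Note on PATCH 8/9: the Binder URL that `on_page_content` writes into the anchor's `href` can be sketched in isolation as below. This is a minimal sketch, not the hook itself. The value of `BINDER_BASE_URL` is not shown in the patch, so the one used here is an assumption, and `build_binder_url` and the notebook filename in the usage note are hypothetical names introduced only for illustration.

```python
from pathlib import PurePosixPath

# Assumption: the real constant lives in build_scripts/modify_binder_link.py
# and is not visible in the patch; this value is a placeholder.
BINDER_BASE_URL = "https://mybinder.org/v2"


def build_binder_url(repo_name: str, branch_name: str, notebook_filename: str) -> str:
    """Hypothetical helper mirroring the URL construction in on_page_content."""
    # The hook resolves the notebook relative to the repository root,
    # which yields a path of the form notebooks/<file>.
    file_path = PurePosixPath("notebooks") / notebook_filename
    # As in the patch, only the leading "/tree/" is percent-encoded;
    # the separator inside file_path is left as a plain slash.
    url_path = f"%2Ftree%2F{file_path}"
    return f"{BINDER_BASE_URL}/gh/{repo_name}/{branch_name}?urlpath={url_path}"
```

For example, `build_binder_url("aai-institute/pyDVL", "develop", "example.ipynb")` yields a URL whose query string ends in `urlpath=%2Ftree%2Fnotebooks/example.ipynb`. Keeping this step as a pure function would let it be unit-tested without parsing any HTML.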