Documentation fix and cleanup #521

Merged 9 commits on Mar 19, 2024
Changes from all commits
2 changes: 2 additions & 0 deletions .github/workflows/main.yaml
@@ -101,6 +101,8 @@ jobs:
group: publish
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Setup Python 3.8
uses: ./.github/actions/python
with:
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -19,6 +19,7 @@

- Bump versions of CI actions to avoid warnings [PR #502](https://github.com/aai-institute/pyDVL/pull/502)
- Add Python Version 3.11 to supported versions [PR #510](https://github.com/aai-institute/pyDVL/pull/510)
- Documentation improvements and cleanup [PR #521](https://github.com/aai-institute/pyDVL/pull/521)

## 0.8.1 - 🆕 🏗 New method and notebook, Games with exact shapley values, bug fixes and cleanup

73 changes: 5 additions & 68 deletions README.md
@@ -18,6 +18,9 @@

**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.

Refer to the [Methods](https://pydvl.org/stable/methods/)
page of our documentation for a list of all implemented methods.

**Data Valuation** for machine learning is the task of assigning a scalar
to each element of a training set which reflects its contribution to the final
performance or outcome of some model trained on it. Some concepts of
@@ -29,7 +32,7 @@ pyDVL focuses on model-dependent methods.
width="70%"
align="center"
style="display: block; margin-left: auto; margin-right: auto;"
src="docs/value/img/mclc-best-removal-10k-natural.svg"
src="https://raw.githubusercontent.com/aai-institute/pyDVL/develop/docs/value/img/mclc-best-removal-10k-natural.svg"
alt="best sample removal"
/>
<p align="center" style="text-align:center;">
@@ -48,7 +51,7 @@ of training samples over individual test points.
width="70%"
align="center"
style="display: block; margin-left: auto; margin-right: auto;"
src="docs/assets/influence_functions_example.png"
src="https://raw.githubusercontent.com/aai-institute/pyDVL/develop/docs/assets/influence_functions_example.png"
alt="best sample removal"
/>
<p align="center" style="text-align:center;">
@@ -256,72 +259,6 @@ Please open new issues for bugs, feature requests and extensions. You can read
about the structure of the project, the toolchain and workflow in the [guide for
contributions](CONTRIBUTING.md).

# Papers

We currently implement the following papers:

## Data Valuation

- Castro, Javier, Daniel Gómez, and Juan Tejada. [Polynomial Calculation of the
Shapley Value Based on Sampling](https://doi.org/10.1016/j.cor.2008.04.004).
Computers & Operations Research, Selected papers presented at the Tenth
International Symposium on Locational Decisions (ISOLDE X), 36, no. 5 (May 1,
2009): 1726–30.
- Ghorbani, Amirata, and James Zou. [Data Shapley: Equitable Valuation of Data
for Machine Learning](http://proceedings.mlr.press/v97/ghorbani19c.html). In
International Conference on Machine Learning, 2242–51. PMLR, 2019.
- Wang, Tianhao, Yu Yang, and Ruoxi Jia.
[Improving Cooperative Game Theory-Based Data Valuation via Data Utility
Learning](https://doi.org/10.48550/arXiv.2107.06336). arXiv, 2022.
- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo
Li, Ce Zhang, Costas Spanos, and Dawn Song. [Efficient Task-Specific Data
Valuation for Nearest Neighbor Algorithms](https://doi.org/10.14778/3342263.3342637).
Proceedings of the VLDB Endowment 12, no. 11 (1 July 2019): 1610–23.
- Okhrati, Ramin, and Aldo Lipani. [A Multilinear Sampling Algorithm to Estimate
Shapley Values](https://doi.org/10.1109/ICPR48806.2021.9412511). In 25th
International Conference on Pattern Recognition (ICPR 2020), 7992–99. IEEE,
2021.
- Yan, T., and Procaccia, A. D. [If You Like Shapley Then You’ll Love the
Core](https://ojs.aaai.org/index.php/AAAI/article/view/16721). Proceedings of
the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759.
- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve
Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos. [Towards Efficient
Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html).
In 22nd International Conference on Artificial Intelligence and Statistics,
1167–76. PMLR, 2019.
- Wang, Jiachen T., and Ruoxi Jia. [Data Banzhaf: A Robust Data Valuation
Framework for Machine Learning](https://doi.org/10.48550/arXiv.2205.15466).
arXiv, October 22, 2022.
- Kwon, Yongchan, and James Zou. [Beta Shapley: A Unified and Noise-Reduced Data
Valuation Framework for Machine Learning](http://arxiv.org/abs/2110.14049).
In Proceedings of the 25th International Conference on Artificial Intelligence
and Statistics (AISTATS) 2022, Vol. 151. Valencia, Spain: PMLR, 2022.
- Kwon, Yongchan, and James Zou. [Data-OOB: Out-of-Bag Estimate as a Simple and
Efficient Data Value](https://proceedings.mlr.press/v202/kwon23e.html). In
Proceedings of the 40th International Conference on Machine Learning, 18135–52.
PMLR, 2023.
- Schoch, Stephanie, Haifeng Xu, and Yangfeng Ji. [CS-Shapley: Class-Wise
Shapley Values for Data Valuation in
Classification](https://openreview.net/forum?id=KTOcrOR5mQ9). In Proc. of the
Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS).
New Orleans, Louisiana, USA, 2022.

## Influence Functions

- Koh, Pang Wei, and Percy Liang. [Understanding Black-Box Predictions via
Influence Functions](http://proceedings.mlr.press/v70/koh17a.html). In
Proceedings of the 34th International Conference on Machine Learning,
70:1885–94. Sydney, Australia: PMLR, 2017.
- Naman Agarwal, Brian Bullins, and Elad Hazan, [Second-Order Stochastic Optimization
for Machine Learning in Linear Time](https://www.jmlr.org/papers/v18/16-491.html),
Journal of Machine Learning Research 18 (2017): 1-40.
- Schioppa, Andrea, Polina Zablotskaia, David Vilar, and Artem Sokolov.
[Scaling Up Influence Functions](http://arxiv.org/abs/2112.03052).
In Proceedings of the AAAI-22. arXiv, 2021.
- James Martens, Roger Grosse, [Optimizing Neural Networks with Kronecker-factored Approximate Curvature](https://arxiv.org/abs/1503.05671), International Conference on Machine Learning (ICML), 2015.
- George, Thomas, César Laurent, Xavier Bouthillier, Nicolas Ballas, Pascal Vincent, [Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis](https://arxiv.org/abs/1806.03884), Advances in Neural Information Processing Systems 31,2018.
- Hataya, Ryuichiro and Yamada, Makoto, [Nystrom Method for Accurate and Scalable Implicit Differentiation](https://proceedings.mlr.press/v206/hataya23a.html), Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, 2023

# License

pyDVL is distributed under
31 changes: 25 additions & 6 deletions build_scripts/modify_binder_link.py
@@ -13,6 +13,7 @@
from pathlib import Path
from typing import TYPE_CHECKING, Literal, Optional

from bs4 import BeautifulSoup
from git import Repo
from mkdocs.plugins import Config, event_priority

@@ -43,23 +44,41 @@ def on_startup(command: Literal["build", "gh-deploy", "serve"], dirty: bool) ->


@event_priority(-50)
def on_page_markdown(
markdown: str, page: "Page", config: Config, files: "Files"
def on_page_content(
html: str, page: "Page", config: Config, files: "Files"
) -> Optional[str]:
if "examples" not in page.url:
return
logger.info(
f"Replacing binder link with link to notebook in repository for notebooks in {page.url}"
)

repo_name = config["repo_name"]
root_dir = Path(config["docs_dir"]).parent
notebooks_dir = root_dir / "notebooks"
notebook_filename = Path(page.file.src_path).name
file_path = (notebooks_dir / notebook_filename).relative_to(root_dir)

soup = BeautifulSoup(html, features="html.parser")
binder_anchor = None
for a in soup.find_all("a", href=True, limit=5):
if BINDER_BASE_URL in a["href"]:
binder_anchor = a
break
if binder_anchor is None:
logger.warning(f"Binder link was not found in notebook {file_path}")
return

url_path = f"%2Ftree%2F{file_path}"
binder_url = f"{BINDER_BASE_URL}/gh/{repo_name}/{branch_name}?urlpath={url_path}"
binder_link = f"{BINDER_LOGO_WITHOUT_CAPTION}({binder_url})"
logger.info(f"New binder url: {binder_url}")
logger.info(f"Using regex: {BINDER_LINK_PATTERN}")
markdown = re.sub(BINDER_LINK_PATTERN, binder_link, markdown)
return markdown

binder_anchor["href"] = binder_url
binder_img = binder_anchor.find("img")
binder_img["style"] = "margin: auto; display: block; width: 7rem"
binder_img_caption = binder_anchor.find("figcaption")
binder_img_caption.decompose()

html = soup.prettify()

return html
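The rewritten hook operates on rendered HTML (`on_page_content`) rather than on markdown, locating the Binder badge with BeautifulSoup instead of a regex substitution. A self-contained sketch of the same idea follows; the constant value, function name, and markup are illustrative, not taken from the actual plugin or notebooks:

```python
from bs4 import BeautifulSoup

BINDER_BASE_URL = "https://mybinder.org/v2"  # assumed value of the constant


def rewrite_binder_link(html: str, binder_url: str) -> str:
    """Repoint the first Binder anchor at binder_url and drop its caption."""
    soup = BeautifulSoup(html, features="html.parser")
    anchor = next(
        (a for a in soup.find_all("a", href=True) if BINDER_BASE_URL in a["href"]),
        None,
    )
    if anchor is None:
        return html  # no badge on this page: leave the HTML untouched
    anchor["href"] = binder_url
    caption = anchor.find("figcaption")
    if caption is not None:
        caption.decompose()  # remove the badge caption, as the hook does
    return str(soup)


page = (
    '<a href="https://mybinder.org/v2/gh/org/repo/old">'
    '<img src="badge.svg"/><figcaption>launch</figcaption></a>'
)
result = rewrite_binder_link(page, "https://mybinder.org/v2/gh/org/repo/new")
```

Parsing the page instead of pattern-matching the markdown makes the hook robust to changes in how the badge is rendered, at the cost of a BeautifulSoup dependency (added to `requirements-docs.txt` in this PR).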
39 changes: 29 additions & 10 deletions docs/assets/pydvl.bib
@@ -47,18 +47,21 @@ @article{castro_polynomial_2009
keywords = {notion}
}

@online{frangella_randomized_2021,
@article{frangella_randomized_2023,
title = {Randomized {{Nyström Preconditioning}}},
author = {Frangella, Zachary and Tropp, Joel A. and Udell, Madeleine},
date = {2021-12-17},
eprint = {2110.02820},
eprinttype = {arxiv},
eprintclass = {cs, math},
doi = {10.48550/arXiv.2110.02820},
url = {https://arxiv.org/abs/2110.02820},
urldate = {2023-06-04},
abstract = {This paper introduces the Nystr\textbackslash "om PCG algorithm for solving a symmetric positive-definite linear system. The algorithm applies the randomized Nystr\textbackslash "om method to form a low-rank approximation of the matrix, which leads to an efficient preconditioner that can be deployed with the conjugate gradient algorithm. Theoretical analysis shows that preconditioned system has constant condition number as soon as the rank of the approximation is comparable with the number of effective degrees of freedom in the matrix. The paper also develops adaptive methods that provably achieve similar performance without knowledge of the effective dimension. Numerical tests show that Nystr\textbackslash "om PCG can rapidly solve large linear systems that arise in data analysis problems, and it surpasses several competing methods from the literature.},
pubstate = {preprint}
date = {2023-06-30},
journaltitle = {SIAM Journal on Matrix Analysis and Applications},
shortjournal = {SIAM J. Matrix Anal. Appl.},
volume = {44},
number = {2},
pages = {718--752},
publisher = {{Society for Industrial and Applied Mathematics}},
issn = {0895-4798},
doi = {10.1137/21M1466244},
url = {https://epubs.siam.org/doi/abs/10.1137/21M1466244},
urldate = {2024-03-12},
abstract = {Randomized methods are becoming increasingly popular in numerical linear algebra. However, few attempts have been made to use them in developing preconditioners. Our interest lies in solving large-scale sparse symmetric positive definite linear systems of equations, where the system matrix is preordered to doubly bordered block diagonal form (for example, using a nested dissection ordering). We investigate the use of randomized methods to construct high-quality preconditioners. In particular, we propose a new and efficient approach that employs Nyström's method for computing low rank approximations to develop robust algebraic two-level preconditioners. Construction of the new preconditioners involves iteratively solving a smaller but denser symmetric positive definite Schur complement system with multiple right-hand sides. Numerical experiments on problems coming from a range of application areas demonstrate that this inner system can be solved cheaply using block conjugate gradients and that using a large convergence tolerance to limit the cost does not adversely affect the quality of the resulting Nyström--Schur two-level preconditioner.}
}

@inproceedings{george_fast_2018,
@@ -342,6 +345,22 @@ @inproceedings{schoch_csshapley_2022
keywords = {notion}
}

@inproceedings{wang_data_2022,
title = {Data {{Banzhaf}}: {{A Robust Data Valuation Framework}} for {{Machine Learning}}},
shorttitle = {Data {{Banzhaf}}},
booktitle = {Proceedings of {{The}} 26th {{International Conference}} on {{Artificial Intelligence}} and {{Statistics}}},
author = {Wang, Jiachen T. and Jia, Ruoxi},
date = {2023-04-11},
pages = {6388--6421},
publisher = {PMLR},
issn = {2640-3498},
url = {https://proceedings.mlr.press/v206/wang23e.html},
urldate = {2024-02-15},
abstract = {Data valuation has wide use cases in machine learning, including improving data quality and creating economic incentives for data sharing. This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we introduce the concept of safety margin, which measures the robustness of a data value notion. We show that the Banzhaf value, a famous value notion that originated from cooperative game theory literature, achieves the largest safety margin among all semivalues (a class of value notions that satisfy crucial properties entailed by ML applications and include the famous Shapley value and Leave-one-out error). We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the other semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.},
eventtitle = {International {{Conference}} on {{Artificial Intelligence}} and {{Statistics}}},
langid = {english}
}

@inproceedings{wang_improving_2022,
title = {Improving {{Cooperative Game Theory-based Data Valuation}} via {{Data Utility Learning}}},
author = {Wang, Tianhao and Yang, Yu and Jia, Ruoxi},
3 changes: 3 additions & 0 deletions docs/index.md
@@ -11,6 +11,9 @@ distributed caching of results.
If you're a first time user of pyDVL, we recommend you to go through the
[[installation]] and [[first-steps]] guides in the Getting Started section.

If you're looking for a list of implemented methods refer to the
[[methods]] page.

<div class="grid cards" markdown>

- :fontawesome-solid-toolbox:{ .lg .middle } __Installation__
65 changes: 65 additions & 0 deletions docs/methods.md
@@ -0,0 +1,65 @@
---
title: Methods
alias:
name: methods
text: Methods
---

We currently implement the following methods:

## Data Valuation

- [**LOO**][pydvl.value.loo.compute_loo].

- [**Permutation Shapley**][pydvl.value.shapley.montecarlo.permutation_montecarlo_shapley]
(also called **ApproxShapley**) [@castro_polynomial_2009].

- [**TMCS**][pydvl.value.shapley.compute_shapley_values]
[@ghorbani_data_2019].

- [**Data Banzhaf**][pydvl.value.semivalues.compute_banzhaf_semivalues]
[@wang_data_2022].

- [**Beta Shapley**][pydvl.value.semivalues.compute_beta_shapley_semivalues]
[@kwon_beta_2022].

- [**CS-Shapley**][pydvl.value.shapley.classwise.compute_classwise_shapley_values]
[@schoch_csshapley_2022].

- [**Least Core**][pydvl.value.least_core.montecarlo.montecarlo_least_core]
[@yan_if_2021].

- [**Owen Sampling**][pydvl.value.shapley.owen.owen_sampling_shapley]
[@okhrati_multilinear_2021].

- [**Data Utility Learning**][pydvl.utils.utility.DataUtilityLearning]
[@wang_improving_2022].

- [**kNN-Shapley**][pydvl.value.shapley.knn.knn_shapley]
[@jia_efficient_2019a].

- [**Group Testing**][pydvl.value.shapley.gt.group_testing_shapley]
[@jia_efficient_2019]

- [**Data-OOB**][pydvl.value.oob.compute_data_oob]
[@kwon_dataoob_2023].

## Influence Functions

- [**CG Influence**][pydvl.influence.torch.CgInfluence].
[@koh_understanding_2017].

- [**Direct Influence**][pydvl.influence.torch.DirectInfluence]
[@koh_understanding_2017].

- [**LiSSA**][pydvl.influence.torch.LissaInfluence]
[@agarwal_secondorder_2017].

- [**Arnoldi Influence**][pydvl.influence.torch.ArnoldiInfluence]
[@schioppa_scaling_2021].

- [**EKFAC Influence**][pydvl.influence.torch.EkfacInfluence]
[@george_fast_2018;@martens_optimizing_2015].

- [**Nyström Influence**][pydvl.influence.torch.NystroemSketchInfluence]
[@hataya_nystrom_2023].
3 changes: 3 additions & 0 deletions docs_includes/abbreviations.md
@@ -19,3 +19,6 @@
*[SV]: Shapley Value
*[TMCS]: Truncated Monte Carlo Shapley
*[WAD]: Weighted Accuracy Drop
*[OOB]: Out-of-Bag
*[CG]: Conjugate Gradient
*[EKFAC]: Eigenvalue-corrected Kronecker Factorization
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -8,6 +8,7 @@ remote_branch: gh-pages

nav:
- Home: index.md
- Methods: methods.md
- Getting Started:
- Installation: getting-started/installation.md
- First steps: getting-started/first-steps.md
@@ -205,6 +206,7 @@ markdown_extensions:
- abbr
- admonition
- attr_list
- def_list
- footnotes
- markdown_captions
- md_in_html
6 changes: 4 additions & 2 deletions requirements-docs.txt
@@ -1,10 +1,10 @@
mike
markdown-captions
mkdocs==1.5.3
mkdocstrings[python]>=0.18
mkdocstrings[python]>=0.24
mkdocs-alias-plugin>=0.6.0
mkdocs-autorefs
mkdocs-bibtex
mkdocs-bibtex>=2.14.1
mkdocs-gen-files
mkdocs-git-revision-date-localized-plugin
mkdocs-glightbox
@@ -17,3 +17,5 @@ mkdocs-macros-plugin
pypandoc; sys_platform == 'darwin'
pypandoc_binary; sys_platform != 'darwin'
GitPython
# Use for the binder link hook
beautifulsoup4