MSR method for Banzhaf #520

Merged
merged 47 commits into from
Apr 12, 2024
837cbeb
WIP: MSR method for Banzhaf
mdbenito May 18, 2023
800a36a
wip: adapt to new structure of semivalue computation
jakobkruse1 Mar 19, 2024
f579644
linting
jakobkruse1 Mar 19, 2024
699e9cb
wip: refactor to future processor
jakobkruse1 Mar 21, 2024
1f913c8
linting
jakobkruse1 Mar 21, 2024
f657b88
fix type hints
jakobkruse1 Mar 21, 2024
3564ea6
fix type hints
jakobkruse1 Mar 21, 2024
c97821a
fix some bugs to achieve correct banzhaf semivalues
jakobkruse1 Mar 26, 2024
7e7286c
chore and documentation tasks
jakobkruse1 Mar 27, 2024
a34f3aa
Merge branch 'develop' into feature/msr-banzhaf
jakobkruse1 Mar 27, 2024
bfe5637
test msr banzhaf methods
jakobkruse1 Mar 27, 2024
4542263
fix batch size update unintentional change
jakobkruse1 Mar 27, 2024
22c5bdb
update changelog with banzhaf MSR
jakobkruse1 Mar 27, 2024
8f8e187
update list type hint
jakobkruse1 Mar 27, 2024
3a27dfd
test additional functionality added
jakobkruse1 Mar 28, 2024
ea47b24
update documentation with MSR banzhaf
jakobkruse1 Mar 28, 2024
0c83af4
fix documentation links
jakobkruse1 Mar 28, 2024
ec343af
disable social plugin in mkdocs to repair ci
jakobkruse1 Apr 2, 2024
8ba78c3
docs updates from review
jakobkruse1 Apr 2, 2024
de2f2c8
Merge branch 'develop' into feature/msr-banzhaf
jakobkruse1 Apr 2, 2024
c2adbdc
check order of elements in test
jakobkruse1 Apr 2, 2024
57aa335
rename bibliography entry for wang_data_2023
jakobkruse1 Apr 3, 2024
737c0d3
minor changes in stopping formatting
jakobkruse1 Apr 3, 2024
ac2a71d
rewrite banzhaf notebook
jakobkruse1 Apr 4, 2024
7934056
run banzhaf notebook to get outputs for documentation
jakobkruse1 Apr 4, 2024
c6eb063
linter for banzhaf notebook
jakobkruse1 Apr 4, 2024
57eb45a
reenable mkdocs social plugin
jakobkruse1 Apr 4, 2024
0d2c4e0
add consistency check to msr notebook
jakobkruse1 Apr 5, 2024
30807c6
Merge branch 'develop' into feature/msr-banzhaf
jakobkruse1 Apr 5, 2024
be76dc5
speedup test in ci by using another utility
jakobkruse1 Apr 5, 2024
3b7dbb6
reduced runtime of notebook even more
jakobkruse1 Apr 8, 2024
bccb3ca
adapt to new parallel backend structure
jakobkruse1 Apr 8, 2024
637074b
finish notebook for msr banzhaf
jakobkruse1 Apr 8, 2024
7140215
Merge branch 'develop' into feature/msr-banzhaf
jakobkruse1 Apr 8, 2024
9c1509b
Merge branch 'develop' into feature/msr-banzhaf
jakobkruse1 Apr 8, 2024
dca88a7
Some tweaks to notebook text and supporting code
mdbenito Apr 8, 2024
2a395ee
Add number of updates to results dataframe
mdbenito Apr 8, 2024
cb50644
Add notebook to gallery
mdbenito Apr 11, 2024
23532ce
Fix imports and other minor stuff
mdbenito Apr 11, 2024
fdad917
run notebook complete
jakobkruse1 Apr 12, 2024
61fb081
satisfy linters
jakobkruse1 Apr 12, 2024
90ad81b
fix bug in least core caused by new updates column in results
jakobkruse1 Apr 12, 2024
1f7d7f2
Rename rank stability and homogeneize interface
mdbenito Apr 12, 2024
8f25b53
Require explicit stopping criteria (impossible to provide good defaults)
mdbenito Apr 12, 2024
b3b2ec2
Missing exports
mdbenito Apr 12, 2024
f202f88
Further cleanup of the notebook (hiding cells, text changes)
mdbenito Apr 12, 2024
426b867
[skip ci] Update CHANGELOG.md
mdbenito Apr 12, 2024
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,8 @@

- New method: `NystroemSketchInfluence`
[PR #504](https://github.com/aai-institute/pyDVL/pull/504)
- New method: `MSR Banzhaf Semivalues`
[PR #520](https://github.com/aai-institute/pyDVL/pull/520)
- New preconditioned block variant of conjugate gradient
[PR #507](https://github.com/aai-institute/pyDVL/pull/507)
- Improvements to documentation: fixes, links, text, example gallery, LFS and
2 changes: 1 addition & 1 deletion docs/assets/pydvl.bib
@@ -450,7 +450,7 @@ @book{trefethen_numerical_1997
langid = {english}
}

@inproceedings{wang_data_2022,
@inproceedings{wang_data_2023,
title = {Data {{Banzhaf}}: {{A Robust Data Valuation Framework}} for {{Machine Learning}}},
shorttitle = {Data {{Banzhaf}}},
booktitle = {Proceedings of {{The}} 26th {{International Conference}} on {{Artificial Intelligence}} and {{Statistics}}},
36 changes: 33 additions & 3 deletions docs/value/semi-values.md
@@ -21,7 +21,7 @@ the set $D_{-i}^{(k)}$ contains all subsets of $D$ of size $k$ that do not
include sample $x_i$, $S_{+i}$ is the set $S$ with $x_i$ added, and $u$ is the
utility function.

Two instances of this are **Banzhaf indices** [@wang_data_2022],
Two instances of this are **Banzhaf indices** [@wang_data_2023],
and **Beta Shapley** [@kwon_beta_2022], with better numerical and
rank stability in certain situations.

@@ -84,7 +84,7 @@ any choice of weight function $w$, one can always construct a utility with
higher variance where $w$ is greater. Therefore, in a worst-case sense, the best
one can do is to pick a constant weight.

The authors of [@wang_data_2022] show that Banzhaf indices are more robust to
The authors of [@wang_data_2023] show that Banzhaf indices are more robust to
variance in the utility function than Shapley and Beta Shapley values. They are
available in pyDVL through
[compute_banzhaf_semivalues][pydvl.value.semivalues.compute_banzhaf_semivalues]:
@@ -98,6 +98,36 @@ values = compute_banzhaf_semivalues(
)
```

### Banzhaf semi-values with MSR sampling
Wang et al. propose a more sample-efficient method for computing Banzhaf
semivalues in their paper *Data Banzhaf: A Robust Data Valuation Framework
for Machine Learning* [@wang_data_2023]. This method updates all semivalues
per evaluation of the utility (i.e. per model trained) based on whether a
specific data point was included in the data subset or not. The expression
for computing the semivalues is

$$\hat{\phi}_{MSR}(i) = \frac{1}{|\mathbf{S}_{\ni i}|} \sum_{S \in
\mathbf{S}_{\ni i}} U(S) - \frac{1}{|\mathbf{S}_{\not{\ni} i}|}
\sum_{S \in \mathbf{S}_{\not{\ni} i}} U(S)$$

where $\mathbf{S}_{\ni i}$ are the subsets that contain the index $i$ and
$\mathbf{S}_{\not{\ni} i}$ are the subsets not containing the index $i$.
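
As a minimal sketch of the estimator above (not pyDVL's implementation), the
following NumPy code samples subsets uniformly, evaluates the utility once per
subset, and reuses every evaluation to update the estimate for all indices.
The function name `msr_banzhaf_sketch`, its parameters, and the toy additive
utility are hypothetical and only serve the illustration.

```python
import numpy as np


def msr_banzhaf_sketch(utility, n_points, n_samples, seed=0):
    """Illustrative MSR estimate of Banzhaf values (hypothetical helper).

    `utility` maps an array of indices to a float.
    """
    rng = np.random.default_rng(seed)
    # membership[j, i] is True when sampled subset j contains index i.
    # Including each index independently with probability 1/2 draws
    # subsets uniformly from the powerset.
    membership = rng.random((n_samples, n_points)) < 0.5
    # One utility evaluation per subset, shared by all indices.
    utilities = np.array([utility(np.flatnonzero(row)) for row in membership])

    n_in = membership.sum(axis=0)  # |S_{∋i}| for every i
    n_out = n_samples - n_in  # |S_{∌i}| for every i
    sum_in = membership.astype(float).T @ utilities
    sum_out = utilities.sum() - sum_in

    # Leave NaN where an index was always (or never) sampled.
    mean_in = np.divide(
        sum_in, n_in, out=np.full(n_points, np.nan), where=n_in > 0
    )
    mean_out = np.divide(
        sum_out, n_out, out=np.full(n_points, np.nan), where=n_out > 0
    )
    return mean_in - mean_out


# Toy additive utility: the exact Banzhaf value of index i is weights[i].
weights = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
estimates = msr_banzhaf_sketch(
    lambda idx: weights[idx].sum(), n_points=5, n_samples=20_000
)
```

With the additive toy utility, `estimates` should approach `weights` as
`n_samples` grows, since every marginal contribution of index $i$ equals
`weights[i]`.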

The function implementing this method is
[compute_msr_banzhaf_semivalues][pydvl.value.semivalues.compute_msr_banzhaf_semivalues].
```python
from pydvl.value import compute_msr_banzhaf_semivalues, RankStability, Utility

utility = Utility(model, data)
values = compute_msr_banzhaf_semivalues(
u=utility, done=RankStability(rtol=0.001),
)
```
For further details on how to use this method and a comparison of its sample
efficiency, we suggest taking a look at the example notebook
[msr_banzhaf_spotify](../../examples/msr_banzhaf_spotify).


## General semi-values

As explained above, both Beta Shapley and Banzhaf indices are special cases of
@@ -130,7 +160,7 @@ values = compute_generic_semivalues(
Allowing any coefficient can help when experimenting with models which are more
sensitive to changes in training set size. However, Data Banzhaf indices are
proven to be the most robust to variance in the utility function, in the sense
of rank stability, across a range of models and datasets [@wang_data_2022].
of rank stability, across a range of models and datasets [@wang_data_2023].

!!! warning "Careful with permutation sampling"
This generic implementation of semi-values allowing for any combination of
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -34,6 +34,7 @@ nav:
- Data utility learning: examples/shapley_utility_learning.ipynb
- Least Core: examples/least_core_basic.ipynb
- Data OOB: examples/data_oob.ipynb
- Banzhaf Semivalues: examples/msr_banzhaf_spotify.ipynb
- Influence Function:
- For CNNs: examples/influence_imagenet.ipynb
- For mislabeled data: examples/influence_synthetic.ipynb
@@ -140,6 +141,7 @@ plugins:
type: iso_date
fallback_to_build_date: true
- social:
enabled: false
cards: !ENV [CI, True] # only build in CI

theme: