aai-institute · mdbenito · Apr 12, 2024 · May 18, 2023 · Mar 19, 2024 · Mar 19, 2024
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,8 @@
 
 - New method: `NystroemSketchInfluence`
   [PR #504](https://github.com/aai-institute/pyDVL/pull/504)
+- New method: `MSR Banzhaf Semivalues`
+  [PR #520](https://github.com/aai-institute/pyDVL/pull/520)
 - New preconditioned block variant of conjugate gradient
   [PR #507](https://github.com/aai-institute/pyDVL/pull/507)
 - Improvements to documentation: fixes, links, text, example gallery, LFS and

diff --git a/docs/assets/pydvl.bib b/docs/assets/pydvl.bib
@@ -450,7 +450,7 @@ @book{trefethen_numerical_1997
   langid = {english}
 }
 
-@inproceedings{wang_data_2022,
+@inproceedings{wang_data_2023,
   title = {Data {{Banzhaf}}: {{A Robust Data Valuation Framework}} for {{Machine Learning}}},
   shorttitle = {Data {{Banzhaf}}},
   booktitle = {Proceedings of {{The}} 26th {{International Conference}} on {{Artificial Intelligence}} and {{Statistics}}},

diff --git a/docs/value/semi-values.md b/docs/value/semi-values.md
@@ -21,7 +21,7 @@ the set $D_{-i}^{(k)}$ contains all subsets of $D$ of size $k$ that do not
 include sample $x_i$, $S_{+i}$ is the set $S$ with $x_i$ added, and $u$ is the
 utility function.
 
-Two instances of this are **Banzhaf indices** [@wang_data_2022],
+Two instances of this are **Banzhaf indices** [@wang_data_2023],
 and **Beta Shapley** [@kwon_beta_2022], with better numerical and
 rank stability in certain situations.
 
@@ -84,7 +84,7 @@ any choice of weight function $w$, one can always construct a utility with
 higher variance where $w$ is greater. Therefore, in a worst-case sense, the best
 one can do is to pick a constant weight.
 
-The authors of [@wang_data_2022] show that Banzhaf indices are more robust to
+The authors of [@wang_data_2023] show that Banzhaf indices are more robust to
 variance in the utility function than Shapley and Beta Shapley values. They are
 available in pyDVL through
 [compute_banzhaf_semivalues][pydvl.value.semivalues.compute_banzhaf_semivalues]:
@@ -98,6 +98,36 @@ values = compute_banzhaf_semivalues(
 )
 ```
 
+### Banzhaf semi-values with MSR sampling
+Wang et al. propose a more sample-efficient method for computing Banzhaf
+semivalues, Maximum Sample Reuse (MSR), in their paper *Data Banzhaf: A Robust
+Data Valuation Framework for Machine Learning* [@wang_data_2023]. Every
+evaluation of the utility (i.e. every model trained) updates the estimates of
+all semivalues at once, depending on whether each data point was included in
+the sampled subset or not. The estimator is
+
+$$\hat{\phi}_{MSR}(i) = \frac{1}{|\mathbf{S}_{\ni i}|} \sum_{S \in 
+\mathbf{S}_{\ni i}} U(S) - \frac{1}{|\mathbf{S}_{\not{\ni} i}|} 
+\sum_{S \in \mathbf{S}_{\not{\ni} i}} U(S)$$
+
+where $\mathbf{S}_{\ni i}$ is the collection of sampled subsets that contain the
+index $i$, $\mathbf{S}_{\not{\ni} i}$ is the collection of those that do not,
+and $U$ is the utility (written $u$ above).
+
+The function implementing this method is
+[compute_msr_banzhaf_semivalues][pydvl.value.semivalues.compute_msr_banzhaf_semivalues].
+```python
+from pydvl.value import compute_msr_banzhaf_semivalues, RankStability, Utility
+
+utility = Utility(model, data)
+values = compute_msr_banzhaf_semivalues(
+    u=utility, done=RankStability(rtol=0.001),
+)
+```
+For further details on how to use this method and a comparison of sample
+efficiency, see the example notebook
+[msr_banzhaf_spotify](../../examples/msr_banzhaf_spotify).
+
+
 ## General semi-values
 
 As explained above, both Beta Shapley and Banzhaf indices are special cases of
@@ -130,7 +160,7 @@ values = compute_generic_semivalues(
 Allowing any coefficient can help when experimenting with models which are more
 sensitive to changes in training set size. However, Data Banzhaf indices are
 proven to be the most robust to variance in the utility function, in the sense
-of rank stability, across a range of models and datasets [@wang_data_2022]. 
+of rank stability, across a range of models and datasets [@wang_data_2023]. 
 
 !!! warning "Careful with permutation sampling"
     This generic implementation of semi-values allowing for any combination of

diff --git a/mkdocs.yml b/mkdocs.yml
@@ -34,6 +34,7 @@ nav:
         - Data utility learning: examples/shapley_utility_learning.ipynb
         - Least Core: examples/least_core_basic.ipynb
         - Data OOB: examples/data_oob.ipynb
+        - Banzhaf Semivalues: examples/msr_banzhaf_spotify.ipynb
     - Influence Function:
       - For CNNs: examples/influence_imagenet.ipynb
       - For mislabeled data: examples/influence_synthetic.ipynb
@@ -140,6 +141,7 @@ plugins:
       type: iso_date
       fallback_to_build_date: true
   - social:
+      enabled: false
       cards: !ENV [CI, True] # only build in CI
 
 theme:
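
Note on the MSR estimator added in `docs/value/semi-values.md`: the formula there can be illustrated with a minimal NumPy sketch. This is not pyDVL's implementation. The function name `msr_banzhaf`, its parameters, and the `utility` callable are all hypothetical, and the sketch assumes subsets are drawn by including each point independently with probability 1/2.

```python
import numpy as np


def msr_banzhaf(utility, n, n_samples, seed=0):
    """Illustrative MSR estimate of Banzhaf values for n data points.

    `utility` is a hypothetical callable: it receives an integer array of
    indices (the sampled subset) and returns a float.
    """
    rng = np.random.default_rng(seed)
    sum_in = np.zeros(n)     # running sum of U(S) over subsets containing i
    sum_out = np.zeros(n)    # running sum of U(S) over subsets not containing i
    count_in = np.zeros(n)
    count_out = np.zeros(n)

    for _ in range(n_samples):
        # Assumption: each point is included independently with probability 1/2.
        mask = rng.random(n) < 0.5
        value = utility(np.flatnonzero(mask))
        # One utility evaluation updates the statistics of every index.
        sum_in[mask] += value
        count_in[mask] += 1
        sum_out[~mask] += value
        count_out[~mask] += 1

    # Entries are NaN for indices that were always or never sampled.
    with np.errstate(invalid="ignore", divide="ignore"):
        return sum_in / count_in - sum_out / count_out
```

For an additive utility such as `lambda idx: float(w[idx].sum())`, the Banzhaf value of point `i` is exactly `w[i]`, so the estimate should approach `w` as `n_samples` grows. That makes for a quick sanity check of the sketch.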