Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doc/glossary #537

Merged
merged 14 commits into from
Mar 26, 2024
6 changes: 4 additions & 2 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,10 @@

### Added

- Many improvements to the documentation: fixes, links, text, example gallery
and more. [PR #532](https://github.com/aai-institute/pyDVL/pull/532)
- Glossary of data valuation and influence terms in the docs.
[PR #537](https://github.com/aai-institute/pyDVL/pull/537
- Implement new method: `NystroemSketchInfluence`
[PR #504](https://github.com/aai-institute/pyDVL/pull/504)
- Add property `model_dtype` to instances of type `TorchInfluenceFunctionModel`
Expand All @@ -24,8 +28,6 @@

### Changed

- Improvements to documentation: fixes, links, text, example gallery and more
[PR #532](https://github.com/aai-institute/pyDVL/pull/532)
- Bump versions of CI actions to avoid warnings [PR #502](https://github.com/aai-institute/pyDVL/pull/502)
- Add Python Version 3.11 to supported versions [PR #510](https://github.com/aai-institute/pyDVL/pull/510)
- Documentation improvements and cleanup [PR #521](https://github.com/aai-institute/pyDVL/pull/521) [PR #522](https://github.com/aai-institute/pyDVL/pull/522)
Expand Down
280 changes: 280 additions & 0 deletions docs/getting-started/glossary.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,280 @@
# Glossary

This glossary is meant to provide only brief explanations of each term, helping
to clarify the concepts and techniques used in the library. For more detailed
information, please refer to the relevant literature or resources.

!!! warning
This glossary is still a work in progress. Pull requests are welcome!

Terms in data valuation and influence functions:

### Arnoldi Method

The Arnoldi method approximately computes eigenvalue, eigenvector pairs of
a symmetric matrix. For influence functions, it is used to approximate
the [iHVP][inverse-hessian-vector-product].
Introduced by [@schioppa_scaling_2021] in the context of influence functions.

* [Implementation (torch)
][pydvl.influence.torch.influence_function_model.ArnoldiInfluence]
* [Documentation (torch)][arnoldi]

### Block Conjugate Gradient

A blocked version of [CG][conjugate-gradient], which solves several linear
systems simultaneously. For Influence Functions, it is used to
approximate the [iHVP][inverse-hessian-vector-product].

* [Implementation (torch)
][pydvl.influence.torch.influence_function_model.CgInfluence]
* [Documentation (torch)][cg]

### Class-wise Shapley

Class-wise Shapley is a Shapley valuation method which introduces a utility
function that balances in-class, and out-of-class accuracy, with the goal of
favoring points that improve the model's performance on the class they belong
to. It is estimated to be particularly useful in imbalanced datasets, but more
research is needed to confirm this.
Introduced by [@schoch_csshapley_2022].

* [Implementation
][pydvl.value.shapley.classwise.compute_classwise_shapley_values]
* [Documentation][class-wise-shapley]

### Conjugate Gradient

CG is an algorithm for solving linear systems with a symmetric and
positive-definite coefficient matrix. For Influence Functions, it is used to
approximate the [iHVP][inverse-hessian-vector-product].

* [Implementation (torch)
][pydvl.influence.torch.influence_function_model.CgInfluence]
* [Documentation (torch)][cg]

### Data Utility Learning

Data Utility Learning is a method that uses an ML model to learn the utility
function. Essentially, it learns to predict the performance of a model when
trained on a given set of indices from the dataset. The cost of training this
model is quickly amortized by avoiding costly re-evaluations of the original
utility.
Introduced by [@wang_improving_2022].

* [Implementation][pydvl.utils.utility.DataUtilityLearning]
* [Documentation][creating-a-utility]

### Eigenvalue-corrected Kronecker-Factored Approximate Curvature

EKFAC builds on [K-FAC][kronecker-factored-approximate-curvature] by correcting
for the approximation errors in the eigenvalues of the blocks of the
Kronecker-factored approximate curvature matrix. This correction aims to refine
the accuracy of natural gradient approximations, thus potentially offering
better training efficiency and stability in neural networks.

* [Implementation (torch)
][pydvl.influence.torch.influence_function_model.EkfacInfluence]
* [Documentation (torch)][eigenvalue-corrected-k-fac]


### Group Testing

Group Testing is a strategy for identifying characteristics within groups of
items efficiently, by testing groups rather than individuals to quickly narrow
down the search for items with specific properties.
Introduced into data valuation by [@jia_efficient_2019a].

* [Implementation][pydvl.value.shapley.gt.group_testing_shapley]
* [Documentation][group-testing]

### Influence Function

The Influence Function measures the impact of a single data point on a
statistical estimator. In machine learning, it's used to understand how much a
particular data point affects the model's prediction.
Introduced into data valuation by [@koh_understanding_2017].

* [[influence-function|Documentation]]

### Inverse Hessian-vector product

iHVP is the operation of calculating the product of the inverse Hessian matrix
of a function and a vector, without explicitly constructing nor inverting the
full Hessian matrix first. This is essential for influence function computation.

### Kronecker-Factored Approximate Curvature

K-FAC is an optimization technique that approximates the Fisher Information
matrix's inverse efficiently. It uses the Kronecker product to factor the
matrix, significantly speeding up the computation of natural gradient updates
and potentially improving training efficiency.

### Least Core

The Least Core is a solution concept in cooperative game theory, referring to
the smallest set of payoffs to players that cannot be improved upon by any
coalition, ensuring stability in the allocation of value. In data valuation,
it implies solving a linear and a quadratic system whose constraints are
determined by the evaluations of the utility function on every subset of the
training data.
Introduced as data valuation method by [@yan_if_2021].

* [Implementation][pydvl.value.least_core.compute_least_core_values]
* [Documentation][least-core-values]

### Linear-time Stochastic Second-order Algorithm

LiSSA is an efficient algorithm for approximating the inverse Hessian-vector
product, enabling faster computations in large-scale machine learning
problems, particularly for second-order optimization.
For Influence Functions, it is used to
approximate the [iHVP][inverse-hessian-vector-product].
Introduced by [@agarwal_secondorder_2017].

* [Implementation (torch)
][pydvl.influence.torch.influence_function_model.LissaInfluence]
* [Documentation (torch)
][linear-time-stochastic-second-order-approximation-lissa]

### Leave-One-Out

LOO in the context of data valuation refers to the process of evaluating the
impact of removing individual data points on the model's performance. The
value of a training point is defined as the marginal change in the model's
performance when that point is removed from the training set.

* [Implementation][pydvl.value.loo.loo.compute_loo]
* [Documentation][leave-one-out-values]

### Monte Carlo Least Core

MCLC is a variation of the Least Core that uses a reduced amount of
constraints, sampled randomly from the powerset of the training data.
Introduced by [@yan_if_2021].

* [Implementation][pydvl.value.least_core.compute_least_core_values]
* [Documentation][monte-carlo-least-core]

### Monte Carlo Shapley

MCS estimates the Shapley Value using a Monte Carlo approximation to the sum
over subsets of the training set. This reduces computation to polynomial time
at the cost of accuracy, but this loss is typically irrelevant for downstream
applications in ML.
Introduced into data valuation by [@ghorbani_data_2019].

* [Implementation][pydvl.value.shapley.montecarlo]
* [Documentation][monte-carlo-combinatorial-shapley]

### Nyström Low-Rank Approximation

The Nyström approximation computes a low-rank approximation to a symmetric
positive-definite matrix via random projections. For influence functions,
it is used to approximate the [iHVP][inverse-hessian-vector-product].
Introduced as sketch and solve algorithm in [@hataya_nystrom_2023], and as
preconditioner for [PCG][preconditioned-conjugate-gradient] in
[@frangella_randomized_2023].

* [Implementation Sketch-and-Solve (torch)
][pydvl.influence.torch.influence_function_model.NystroemSketchInfluence]
* [Documentation Sketch-and-Solve (torch)][nystrom-sketch-and-solve]
* [Implementation Preconditioner (torch)
][pydvl.influence.torch.pre_conditioner.NystroemPreConditioner]

### Point removal task

A task in data valuation where the quality of a valuation method is measured
through the impact of incrementally removing data points on the model's
performance, where the points are removed in order of their value. See

* [Benchmarking tasks][benchmarking-tasks]

### Preconditioned Block Conjugate Gradient

A blocked version of [PCG][preconditioned-conjugate-gradient], which solves
several linear systems simultaneously. For Influence Functions, it is used to
approximate the [iHVP][inverse-hessian-vector-product].

* [Implementation CG (torch)
][pydvl.influence.torch.influence_function_model.CgInfluence]
* [Implementation Preconditioner (torch)
][pydvl.influence.torch.pre_conditioner]
* [Documentation (torch)][cg]

### Preconditioned Conjugate Gradient

A preconditioned version of [CG][conjugate-gradient] for improved
convergence, depending on the characteristics of the matrix and the
preconditioner. For Influence Functions, it is used to
approximate the [iHVP][inverse-hessian-vector-product].

* [Implementation CG (torch)
][pydvl.influence.torch.influence_function_model.CgInfluence]
* [Implementation Preconditioner (torch)
][pydvl.influence.torch.pre_conditioner]
* [Documentation (torch)][cg]

### Shapley Value

Shapley Value is a concept from cooperative game theory that allocates payouts
to players based on their contribution to the total payoff. In data valuation,
players are data points. The method assigns a value to each data point based
on a weighted average of its marginal contributions to the model's performance
when trained on each subset of the training set. This requires
$\mathcal{O}(2^{n-1})$ re-trainings of the model, which is infeasible for even
trivial data set sizes, so one resorts to approximations like TMCS.
Introduced into data valuation by [@ghorbani_data_2019].

* [Implementation][pydvl.value.shapley.naive]
* [Documentation][shapley-value]

### Truncated Monte Carlo Shapley

TMCS is an efficient approach to estimating the Shapley Value using a
truncated version of the Monte Carlo method, reducing computation time while
maintaining accuracy in large datasets.
Introduced by [@ghorbani_data_2019].

* [Implementation
][pydvl.value.shapley.montecarlo.permutation_montecarlo_shapley]
* [Documentation][permutation-shapley]

### Weighted Accuracy Drop

WAD is a metric to evaluate the impact of sequentially removing data points on
the performance of a machine learning model, weighted by their rank, i.e. by the
time at which they were removed.
Introduced by [@schoch_csshapley_2022].

---

## Other terms

### Coefficient of Variation

CV is a statistical measure of the dispersion of data points in a data series
around the mean, expressed as a percentage. It's used to compare the degree of
variation from one data series to another, even if the means are drastically
different.


### Constraint Satisfaction Problem

A CSP involves finding values for variables within specified constraints or
conditions, commonly used in scheduling, planning, and design problems where
solutions must satisfy a set of restrictions.

### Out-of-Bag

OOB refers to data samples in an ensemble learning context (like random forests)
that are not selected for training a specific model within the ensemble. These
OOB samples are used as a validation set to estimate the model's accuracy,
providing a convenient internal cross-validation mechanism.

### Machine Learning Reproducibility Challenge

The [MLRC](https://reproml.org/) is an initiative that encourages the
verification and replication of machine learning research findings, promoting
transparency and reliability in the field. Papers are published in
[Transactions on Machine Learning Research](https://jmlr.org/tmlr/) (TMLR).
29 changes: 18 additions & 11 deletions docs/value/the-core.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,8 +15,7 @@ The Core is another solution concept in cooperative game theory that attempts to
ensure stability in the sense that it provides the set of feasible payoffs that
cannot be improved upon by a sub-coalition. This can be interesting for some
applications of data valuation because it yields values consistent with training
on the whole dataset, avoiding the spurious selection of subsets. It was first
introduced in this field by [@yan_if_2021].
on the whole dataset, avoiding the spurious selection of subsets.

It satisfies the following 2 properties:

Expand All @@ -30,6 +29,9 @@ It satisfies the following 2 properties:
the amount that these agents could earn by forming a coalition on their own.
$\sum_{i \in S} v(i) \geq u(S), \forall S \subset D.$

The Core was first introduced into data valuation by [@yan_if_2021], in the
following form.

## Least Core values

Unfortunately, for many cooperative games the Core may be empty. By relaxing the
Expand Down Expand Up @@ -59,9 +61,12 @@ _egalitarian least core_.

## Exact Least Core

This first algorithm is just a verbatim implementation of the definition.
As such it returns as exact a value as the utility function allows
(see what this means in [Problems of Data Values][problems-of-data-values]).
This first algorithm is just a verbatim implementation of the definition, in
[compute_least_core_values][pydvl.value.least_core.compute_least_core_values].
It computes all constraints for the linear problem by evaluating the utility on
every subset of the training data, and returns as exact a value as the utility
function allows (see what this means in [Problems of Data
Values][problems-of-data-values]).

```python
from pydvl.value import compute_least_core_values
Expand All @@ -74,18 +79,20 @@ values = compute_least_core_values(utility, mode="exact")
Because the number of subsets $S \subseteq D \setminus \{i\}$ is
$2^{ | D | - 1 }$, one typically must resort to approximations.

The simplest approximation consists in using a fraction of all subsets for the
constraints. [@yan_if_2021] show that a quantity of order
$\mathcal{O}((n - \log \Delta ) / \delta^2)$ is enough to obtain a so-called
$\delta$-*approximate least core* with high probability. I.e. the following
property holds with probability $1-\Delta$ over the choice of subsets:
The simplest one consists in using a fraction of all subsets for the constraints.
[@yan_if_2021] show that a quantity of order $\mathcal{O}((n - \log \Delta ) /
\delta^2)$ is enough to obtain a so-called $\delta$-*approximate least core*
with high probability. I.e. the following property holds with probability
$1-\Delta$ over the choice of subsets:

$$
\mathbb{P}_{S\sim D}\left[\sum_{i\in S} v(i) + e^{*} \geq u(S)\right]
\geq 1 - \delta,
$$

where $e^{*}$ is the optimal least core subsidy.
where $e^{*}$ is the optimal least core subsidy. This approximation is
also implemented in
[compute_least_core_values][pydvl.value.least_core.compute_least_core_values]:

```python
from pydvl.value import compute_least_core_values
Expand Down
Loading
Loading