From 1a6a196a08326f1c149ec7b192bb1db1488fa0ae Mon Sep 17 00:00:00 2001 From: Miguel de Benito Delgado Date: Sat, 23 Mar 2024 12:00:25 +0100 Subject: [PATCH 01/13] Stub glossary --- docs/getting-started/glossary.md | 144 +++++++++++++++++++++++++++++++ mkdocs.yml | 1 + 2 files changed, 145 insertions(+) create mode 100644 docs/getting-started/glossary.md diff --git a/docs/getting-started/glossary.md b/docs/getting-started/glossary.md new file mode 100644 index 000000000..3d6151880 --- /dev/null +++ b/docs/getting-started/glossary.md @@ -0,0 +1,144 @@ +# Glossary + +This glossary is meant to provide only brief explanations of each term, helping +to clarify the concepts and techniques used in the library. For more detailed +information, please refer to the relevant literature or resources. + +## Valuation + +- **Class-wise Shapley:** + Class-wise Shapley is a Shapley valuation method which introduces a utility + function that balances in-class, and out-of-class accuracy, with the goal of + favoring points that improve the model's performance on the class they belong + to. It is estimated to be particularly useful in imbalanced datasets, but more + research is needed to confirm this. + Introduced by [@schoch_csshapley_2022]. + [Implementation][pydvl.value.shapley.classwise.compute_classwise_shapley_values]. + +- **Data Utility Learning:** + Data Utility Learning is a method that uses an ML model to learn the utility + function. Essentially, it learns to predict the performance of a model when + trained on a given set of indices from the dataset. The cost of training this + model is quickly amortized by avoiding costly re-evaluations of the original + utility. + Introduced by [@wang_improving_2022]. + [Implementation][pydvl.utils.utility.DataUtilityLearning]. + +- **Eigenvalue-corrected Kronecker-Factored Approximate Curvature**: + EKFAC builds on K-FAC by correcting for the approximation errors in the + eigenvalues of the blocks of the Kronecker-factored approximate curvature + matrix. This correction aims to refine the accuracy of natural gradient + approximations, thus potentially offering better training efficiency and + stability in neural networks. + [Implementation (torch)][pydvl.influence.torch.influence_function_model.EkfacInfluence]. + +- **Kronecker-Factored Approximate Curvature**: + K-FAC is an optimization technique that approximates the Fisher Information + matrix's inverse efficiently. It uses the Kronecker product to factor the + matrix, significantly speeding up the computation of natural gradient updates + and potentially improving training efficiency. + +- **Group Testing:** + Group Testing is a strategy for identifying characteristics within groups of + items efficiently, by testing groups rather than individuals to quickly narrow + down the search for items with specific properties. + Introduced into data valuation by [@jia_efficient_2019a]. + [Implementation][pydvl.value.shapley.gt.group_testing_shapley]. + +- **Influence Function:** + The Influence Function measures the impact of a single data point on a + statistical estimator. In machine learning, it's used to understand how much a + particular data point affects the model's prediction. + Introduced into data valuation by [@koh_understanding_2017]. + [Documentation][influence-function]. + +- **inverse Hessian-vector product:** + iHVP involves calculating the product of the inverse Hessian matrix of a + function and a vector, which is essential in optimization and in computing + influence functions efficiently. 
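For intuition, a minimal NumPy sketch of an iHVP on a toy quadratic problem (the matrix and vector below are made up, and this is not pyDVL's API): the product $H^{-1}v$ is obtained by solving $Hx = v$, never by forming the inverse explicitly.

```python
import numpy as np

# Toy symmetric positive-definite "Hessian" and a gradient-like vector.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 10))
H = A @ A.T + 10.0 * np.eye(10)
v = rng.standard_normal(10)

# iHVP: compute H^{-1} v by solving the linear system H x = v.
ihvp = np.linalg.solve(H, v)

# Same result as the explicit (and far more expensive) inverse.
assert np.allclose(ihvp, np.linalg.inv(H) @ v)
```

For neural networks the Hessian is never materialized; iterative solvers such as CG or LiSSA require only Hessian-vector products.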
+ +- **Least Core:** + The Least Core is a solution concept in cooperative game theory, referring to + the smallest set of payoffs to players that cannot be improved upon by any + coalition, ensuring stability in the allocation of value. + Introduced as data valuation method by [@yan_if_2021]. + [Implementation][pydvl.value.least_core.common.lc_solve_problem]. + +- **Linear-time Stochastic Second-order Algorithm:** + LiSSA is an efficient algorithm for approximating the inverse Hessian-vector + product, enabling faster computations in large-scale machine learning + problems, particularly for second-order optimization. + Introduced by [@agarwal_secondorder_2017]. + [Implementation (torch)][pydvl.influence.torch.influence_function_model.LissaInfluence]. + +- **Leave-One-Out:** + LOO in the context of data valuation refers to the process of evaluating the + impact of removing individual data points on the model's performance. The + value of a training point is defined as the marginal change in the model's + performance when that point is removed from the training set. + [Implementation][pydvl.value.loo.loo.compute_loo]. + +- **Monte Carlo Least Core:** + MCLC is a variation of the Least Core that uses a reduced amount of + constraints sampled randomly. + Introduced by [@yan_if_2021]. + [Implementation][pydvl.value.least_core.compute_least_core_values]. + +- **Monte Carlo Shapley:** + MCS estimates the Shapley Value using a Monte Carlo approximation to the sum + over subsets of the training set. This reduces computation to polynomial time + at the cost of accuracy, but this loss is typically irrelevant for downstream + applications in ML. + +- **Shapley Value:** + Shapley Value is a concept from cooperative game theory that allocates payouts + to players based on their contribution to the total payoff. In data valuation, + players are data points. The method assigns a value to each data point based + on a weighted average of its marginal contributions to the model's performance + when trained on each subset of the training set. This requires + $\mathcal{O}(2^{n-1}$ evaluations of the model, which is infeasible for even + trivial data set sizes, so one resorts to approximations like TMCS. + +- **Truncated Monte Carlo Shapley:** + TMCS is an efficient approach to estimating the Shapley Value using a + truncated version of the Monte Carlo method, reducing computation time while + maintaining accuracy in large datasets. + Introduced by [@ghorbani_data_2019]. + [Implementation][pydvl.value.shapley.montecarlo.permutation_montecarlo_shapley]. + +- **Weighted Accuracy Drop:** + WAD is a metric to evaluate the impact of sequentially removing data points on + the performance of a machine learning model, weighted by their rank, i.e. by the + time at which they were removed. + Introduced by [@schoch_csshapley_2022]. + +## Other + +- **Coefficient of Variation:** + CV is a statistical measure of the dispersion of data points in a data series + around the mean, expressed as a percentage. It's used to compare the degree of + variation from one data series to another, even if the means are drastically + different. + +- **Conjugate Gradient:** + CG is an algorithm for solving linear systems with a symmetric and + positive-definite coefficient matrix. In machine learning, it's typically used + for efficiently finding the minima of convex functions, when the direct + computation of the Hessian is computationally expensive or impractical. 
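To make the CG entry above concrete, here is a self-contained sketch of the textbook conjugate gradient iteration on a small SPD system (toy data, for illustration only; in the influence function setting the same iteration is typically run matrix-free, using Hessian-vector products instead of an explicit matrix).

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Small SPD system to exercise the solver.
rng = np.random.default_rng(42)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)
b = rng.standard_normal(5)
x = conjugate_gradient(A, b)
print(np.allclose(A @ x, b))  # True
```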
+ +- **Constraint Satisfaction Problem:** + A CSP involves finding values for variables within specified constraints or + conditions, commonly used in scheduling, planning, and design problems where + solutions must satisfy a set of restrictions. + +- **Out-of-Bag:** + OOB refers to data samples in an ensemble learning context (like random forests) + that are not selected for training a specific model within the ensemble. These + OOB samples are used as a validation set to estimate the model's accuracy, + providing a convenient internal cross-validation mechanism. + +- **Machine Learning Reproducibility Challenge:** + The [MLRC](https://reproml.org/) is an initiative that encourages the + verification and replication of machine learning research findings, promoting + transparency and reliability in the field. Papers are published in + [Transactions on Machine Learning Research](https://jmlr.org/tmlr/) (TMLR). diff --git a/mkdocs.yml b/mkdocs.yml index 54f025661..8b53292a5 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -14,6 +14,7 @@ nav: - Applications: getting-started/applications.md - Benchmarking: getting-started/benchmarking.md - Methods: getting-started/methods.md + - Glossary: getting-started/glossary.md - Data Valuation: - value/index.md - Shapley values: value/shapley.md From 5447aa4e669eada2c1daaef5670ee5009030706a Mon Sep 17 00:00:00 2001 From: Miguel de Benito Delgado Date: Sat, 23 Mar 2024 12:43:35 +0100 Subject: [PATCH 02/13] Sort abbreviations file --- docs_includes/abbreviations.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs_includes/abbreviations.md b/docs_includes/abbreviations.md index aa47c405c..86c4f6751 100644 --- a/docs_includes/abbreviations.md +++ b/docs_includes/abbreviations.md @@ -1,25 +1,25 @@ +*[CG]: Conjugate Gradient *[CSP]: Constraint Satisfaction Problem *[CV]: Coefficient of Variation *[CWS]: Class-wise Shapley *[DUL]: Data Utility Learning +*[EKFAC]: Eigenvalue-corrected Kronecker-Factored Approximate Curvature *[GT]: Group Testing *[IF]: Influence Function *[iHVP]: inverse Hessian-vector product +*[K-FAC]: Kronecker-Factored Approximate Curvature *[LC]: Least Core -*[LiSSA]: Linear-time Stochastic Second-order Algorithm *[LOO]: Leave-One-Out +*[LiSSA]: Linear-time Stochastic Second-order Algorithm *[MCLC]: Monte Carlo Least Core *[MCS]: Monte Carlo Shapley -*[ML]: Machine Learning *[MLP]: Multi-Layer Perceptron *[MLRC]: Machine Learning Reproducibility Challenge +*[ML]: Machine Learning *[MSE]: Mean Squared Error +*[OOB]: Out-of-Bag *[PCA]: Principal Component Analysis *[ROC]: Receiver Operating Characteristic *[SV]: Shapley Value *[TMCS]: Truncated Monte Carlo Shapley *[WAD]: Weighted Accuracy Drop -*[OOB]: Out-of-Bag -*[CG]: Conjugate Gradient -*[K-FAC]: Kronecker-Factored Approximate Curvature -*[EKFAC]: Eigenvalue-corrected Kronecker-Factored Approximate Curvature From ab3c4467dadd54d84fee906005d05b9f55fae2cd Mon Sep 17 00:00:00 2001 From: Miguel de Benito Delgado Date: Sat, 23 Mar 2024 13:01:43 +0100 Subject: [PATCH 03/13] Use sections to have permalinks. 
Add a new term --- docs/getting-started/glossary.md | 317 +++++++++++++++++-------------- 1 file changed, 179 insertions(+), 138 deletions(-) diff --git a/docs/getting-started/glossary.md b/docs/getting-started/glossary.md index 3d6151880..9d6b6a25f 100644 --- a/docs/getting-started/glossary.md +++ b/docs/getting-started/glossary.md @@ -4,141 +4,182 @@ This glossary is meant to provide only brief explanations of each term, helping to clarify the concepts and techniques used in the library. For more detailed information, please refer to the relevant literature or resources. -## Valuation - -- **Class-wise Shapley:** - Class-wise Shapley is a Shapley valuation method which introduces a utility - function that balances in-class, and out-of-class accuracy, with the goal of - favoring points that improve the model's performance on the class they belong - to. It is estimated to be particularly useful in imbalanced datasets, but more - research is needed to confirm this. - Introduced by [@schoch_csshapley_2022]. - [Implementation][pydvl.value.shapley.classwise.compute_classwise_shapley_values]. - -- **Data Utility Learning:** - Data Utility Learning is a method that uses an ML model to learn the utility - function. Essentially, it learns to predict the performance of a model when - trained on a given set of indices from the dataset. The cost of training this - model is quickly amortized by avoiding costly re-evaluations of the original - utility. - Introduced by [@wang_improving_2022]. - [Implementation][pydvl.utils.utility.DataUtilityLearning]. - -- **Eigenvalue-corrected Kronecker-Factored Approximate Curvature**: - EKFAC builds on K-FAC by correcting for the approximation errors in the - eigenvalues of the blocks of the Kronecker-factored approximate curvature - matrix. This correction aims to refine the accuracy of natural gradient - approximations, thus potentially offering better training efficiency and - stability in neural networks. - [Implementation (torch)][pydvl.influence.torch.influence_function_model.EkfacInfluence]. - -- **Kronecker-Factored Approximate Curvature**: - K-FAC is an optimization technique that approximates the Fisher Information - matrix's inverse efficiently. It uses the Kronecker product to factor the - matrix, significantly speeding up the computation of natural gradient updates - and potentially improving training efficiency. - -- **Group Testing:** - Group Testing is a strategy for identifying characteristics within groups of - items efficiently, by testing groups rather than individuals to quickly narrow - down the search for items with specific properties. - Introduced into data valuation by [@jia_efficient_2019a]. - [Implementation][pydvl.value.shapley.gt.group_testing_shapley]. - -- **Influence Function:** - The Influence Function measures the impact of a single data point on a - statistical estimator. In machine learning, it's used to understand how much a - particular data point affects the model's prediction. - Introduced into data valuation by [@koh_understanding_2017]. - [Documentation][influence-function]. - -- **inverse Hessian-vector product:** - iHVP involves calculating the product of the inverse Hessian matrix of a - function and a vector, which is essential in optimization and in computing - influence functions efficiently. 
- -- **Least Core:** - The Least Core is a solution concept in cooperative game theory, referring to - the smallest set of payoffs to players that cannot be improved upon by any - coalition, ensuring stability in the allocation of value. - Introduced as data valuation method by [@yan_if_2021]. - [Implementation][pydvl.value.least_core.common.lc_solve_problem]. - -- **Linear-time Stochastic Second-order Algorithm:** - LiSSA is an efficient algorithm for approximating the inverse Hessian-vector - product, enabling faster computations in large-scale machine learning - problems, particularly for second-order optimization. - Introduced by [@agarwal_secondorder_2017]. - [Implementation (torch)][pydvl.influence.torch.influence_function_model.LissaInfluence]. - -- **Leave-One-Out:** - LOO in the context of data valuation refers to the process of evaluating the - impact of removing individual data points on the model's performance. The - value of a training point is defined as the marginal change in the model's - performance when that point is removed from the training set. - [Implementation][pydvl.value.loo.loo.compute_loo]. - -- **Monte Carlo Least Core:** - MCLC is a variation of the Least Core that uses a reduced amount of - constraints sampled randomly. - Introduced by [@yan_if_2021]. - [Implementation][pydvl.value.least_core.compute_least_core_values]. - -- **Monte Carlo Shapley:** - MCS estimates the Shapley Value using a Monte Carlo approximation to the sum - over subsets of the training set. This reduces computation to polynomial time - at the cost of accuracy, but this loss is typically irrelevant for downstream - applications in ML. - -- **Shapley Value:** - Shapley Value is a concept from cooperative game theory that allocates payouts - to players based on their contribution to the total payoff. In data valuation, - players are data points. The method assigns a value to each data point based - on a weighted average of its marginal contributions to the model's performance - when trained on each subset of the training set. This requires - $\mathcal{O}(2^{n-1}$ evaluations of the model, which is infeasible for even - trivial data set sizes, so one resorts to approximations like TMCS. - -- **Truncated Monte Carlo Shapley:** - TMCS is an efficient approach to estimating the Shapley Value using a - truncated version of the Monte Carlo method, reducing computation time while - maintaining accuracy in large datasets. - Introduced by [@ghorbani_data_2019]. - [Implementation][pydvl.value.shapley.montecarlo.permutation_montecarlo_shapley]. - -- **Weighted Accuracy Drop:** - WAD is a metric to evaluate the impact of sequentially removing data points on - the performance of a machine learning model, weighted by their rank, i.e. by the - time at which they were removed. - Introduced by [@schoch_csshapley_2022]. - -## Other - -- **Coefficient of Variation:** - CV is a statistical measure of the dispersion of data points in a data series - around the mean, expressed as a percentage. It's used to compare the degree of - variation from one data series to another, even if the means are drastically - different. - -- **Conjugate Gradient:** - CG is an algorithm for solving linear systems with a symmetric and - positive-definite coefficient matrix. In machine learning, it's typically used - for efficiently finding the minima of convex functions, when the direct - computation of the Hessian is computationally expensive or impractical. 
- -- **Constraint Satisfaction Problem:** - A CSP involves finding values for variables within specified constraints or - conditions, commonly used in scheduling, planning, and design problems where - solutions must satisfy a set of restrictions. - -- **Out-of-Bag:** - OOB refers to data samples in an ensemble learning context (like random forests) - that are not selected for training a specific model within the ensemble. These - OOB samples are used as a validation set to estimate the model's accuracy, - providing a convenient internal cross-validation mechanism. - -- **Machine Learning Reproducibility Challenge:** - The [MLRC](https://reproml.org/) is an initiative that encourages the - verification and replication of machine learning research findings, promoting - transparency and reliability in the field. Papers are published in - [Transactions on Machine Learning Research](https://jmlr.org/tmlr/) (TMLR). +Terms in data valuation and influence functions: + +### Class-wise Shapley + +Class-wise Shapley is a Shapley valuation method which introduces a utility +function that balances in-class, and out-of-class accuracy, with the goal of +favoring points that improve the model's performance on the class they belong +to. It is estimated to be particularly useful in imbalanced datasets, but more +research is needed to confirm this. +Introduced by [@schoch_csshapley_2022]. +[Implementation][pydvl.value.shapley.classwise.compute_classwise_shapley_values]. + +### Data Utility Learning + +Data Utility Learning is a method that uses an ML model to learn the utility +function. Essentially, it learns to predict the performance of a model when +trained on a given set of indices from the dataset. The cost of training this +model is quickly amortized by avoiding costly re-evaluations of the original +utility. +Introduced by [@wang_improving_2022]. +[Implementation][pydvl.utils.utility.DataUtilityLearning]. + +### Eigenvalue-corrected Kronecker-Factored Approximate Curvature + +EKFAC builds on K-FAC by correcting for the approximation errors in the +eigenvalues of the blocks of the Kronecker-factored approximate curvature +matrix. This correction aims to refine the accuracy of natural gradient +approximations, thus potentially offering better training efficiency and +stability in neural networks. +[Implementation (torch)][pydvl.influence.torch.influence_function_model.EkfacInfluence]. + +### Kronecker-Factored Approximate Curvature + +K-FAC is an optimization technique that approximates the Fisher Information +matrix's inverse efficiently. It uses the Kronecker product to factor the +matrix, significantly speeding up the computation of natural gradient updates +and potentially improving training efficiency. + +### Group Testing + +Group Testing is a strategy for identifying characteristics within groups of +items efficiently, by testing groups rather than individuals to quickly narrow +down the search for items with specific properties. +Introduced into data valuation by [@jia_efficient_2019a]. +[Implementation][pydvl.value.shapley.gt.group_testing_shapley]. + +### Influence Function + +The Influence Function measures the impact of a single data point on a +statistical estimator. In machine learning, it's used to understand how much a +particular data point affects the model's prediction. +Introduced into data valuation by [@koh_understanding_2017]. +[[influence-function|Documentation]]. 
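For intuition, the up-weighting influence of [@koh_understanding_2017] is $\mathcal{I}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}})^\top H^{-1} \nabla_\theta L(z)$. Below is a self-contained sketch for ridge regression, where per-point gradients and the Hessian have closed forms (toy, made-up data; not pyDVL's API, and ignoring $1/n$ normalization conventions).

```python
import numpy as np

# Ridge regression on toy data.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
lam = 0.1
theta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)  # fitted parameters

# Hessian of the regularized squared loss, and per-point gradients.
H = X.T @ X + lam * np.eye(3)
i = 7                                    # training point under scrutiny
g_train = (X[i] @ theta - y[i]) * X[i]   # gradient of the loss at training point i
x_test, y_test = X[0], y[0]              # stand-in test point
g_test = (x_test @ theta - y_test) * x_test

# Influence of up-weighting training point i on the test point's loss.
influence = -g_test @ np.linalg.solve(H, g_train)
print(influence)
```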
+ +### inverse Hessian-vector product + +iHVP involves calculating the product of the inverse Hessian matrix of a +function and a vector, which is essential in optimization and in computing +influence functions efficiently. + +### Least Core + +The Least Core is a solution concept in cooperative game theory, referring to +the smallest set of payoffs to players that cannot be improved upon by any +coalition, ensuring stability in the allocation of value. In data valuation, +it implies solving a linear and a quadratic system whose constraints are +determined by the evaluations of the utility function on every subset of the +training data. +Introduced as data valuation method by [@yan_if_2021]. +[Implementation][pydvl.value.least_core.compute_least_core_values]. + +### Linear-time Stochastic Second-order Algorithm + +LiSSA is an efficient algorithm for approximating the inverse Hessian-vector +product, enabling faster computations in large-scale machine learning +problems, particularly for second-order optimization. +Introduced by [@agarwal_secondorder_2017]. +[Implementation (torch)][pydvl.influence.torch.influence_function_model.LissaInfluence]. + +### Leave-One-Out + +LOO in the context of data valuation refers to the process of evaluating the +impact of removing individual data points on the model's performance. The +value of a training point is defined as the marginal change in the model's +performance when that point is removed from the training set. +[Implementation][pydvl.value.loo.loo.compute_loo]. + +### Monte Carlo Least Core + +MCLC is a variation of the Least Core that uses a reduced amount of +constraints, sampled randomly from the powerset of the training data. +Introduced by [@yan_if_2021]. +[Implementation][pydvl.value.least_core.compute_least_core_values]. + +### Monte Carlo Shapley + +MCS estimates the Shapley Value using a Monte Carlo approximation to the sum +over subsets of the training set. This reduces computation to polynomial time +at the cost of accuracy, but this loss is typically irrelevant for downstream +applications in ML. +Introduced into data valuation by [@ghorbani_data_2019]. +[Implementation][pydvl.value.shapley.montecarlo]. +[[data-valuation|Documentation]]. + +### Point removal task + +A task in data valuation where the quality of a valuation method is measured +through the impact of incrementally removing data points on the model's +performance, where the points are removed in order of their value. See +[[benchmarks]]. + + +### Shapley Value + +Shapley Value is a concept from cooperative game theory that allocates payouts +to players based on their contribution to the total payoff. In data valuation, +players are data points. The method assigns a value to each data point based +on a weighted average of its marginal contributions to the model's performance +when trained on each subset of the training set. This requires +$\mathcal{O}(2^{n-1})$ evaluations of the model, which is infeasible for even +trivial data set sizes, so one resorts to approximations like TMCS. +Introduced into data valuation by [@ghorbani_data_2019]. +[Implementation][pydvl.value.shapley.naive]. +[[data-valuation|Documentation]]. + +### Truncated Monte Carlo Shapley + +TMCS is an efficient approach to estimating the Shapley Value using a +truncated version of the Monte Carlo method, reducing computation time while +maintaining accuracy in large datasets. +Introduced by [@ghorbani_data_2019]. +[Implementation][pydvl.value.shapley.montecarlo.permutation_montecarlo_shapley]. 
+[[data-valuation|Documentation]]. + +### Weighted Accuracy Drop + +WAD is a metric to evaluate the impact of sequentially removing data points on +the performance of a machine learning model, weighted by their rank, i.e. by the +time at which they were removed. +Introduced by [@schoch_csshapley_2022]. + +--- + +Other terms that might be useful: + + +### Coefficient of Variation + +CV is a statistical measure of the dispersion of data points in a data series +around the mean, expressed as a percentage. It's used to compare the degree of +variation from one data series to another, even if the means are drastically +different. + +### Conjugate Gradient + +CG is an algorithm for solving linear systems with a symmetric and +positive-definite coefficient matrix. In machine learning, it's typically used +for efficiently finding the minima of convex functions, when the direct +computation of the Hessian is computationally expensive or impractical. + +### Constraint Satisfaction Problem + +A CSP involves finding values for variables within specified constraints or +conditions, commonly used in scheduling, planning, and design problems where +solutions must satisfy a set of restrictions. + +### Out-of-Bag + +OOB refers to data samples in an ensemble learning context (like random forests) +that are not selected for training a specific model within the ensemble. These +OOB samples are used as a validation set to estimate the model's accuracy, +providing a convenient internal cross-validation mechanism. + +### Machine Learning Reproducibility Challenge + +The [MLRC](https://reproml.org/) is an initiative that encourages the +verification and replication of machine learning research findings, promoting +transparency and reliability in the field. Papers are published in +[Transactions on Machine Learning Research](https://jmlr.org/tmlr/) (TMLR). From b5d91567f16a3861f281335b6a0107710ede3c74 Mon Sep 17 00:00:00 2001 From: Miguel de Benito Delgado Date: Sat, 23 Mar 2024 14:09:00 +0100 Subject: [PATCH 04/13] wip warning in glossary --- docs/getting-started/glossary.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/getting-started/glossary.md b/docs/getting-started/glossary.md index 9d6b6a25f..02d91374e 100644 --- a/docs/getting-started/glossary.md +++ b/docs/getting-started/glossary.md @@ -4,6 +4,9 @@ This glossary is meant to provide only brief explanations of each term, helping to clarify the concepts and techniques used in the library. For more detailed information, please refer to the relevant literature or resources. +!!! warning + This glossary is still a work in progress. Pull requests are welcome! + Terms in data valuation and influence functions: ### Class-wise Shapley @@ -147,8 +150,7 @@ Introduced by [@schoch_csshapley_2022]. 
--- -Other terms that might be useful: - +## Other terms ### Coefficient of Variation From d20c738570ea00f06a79928164f18e592dd25ade Mon Sep 17 00:00:00 2001 From: Miguel de Benito Delgado Date: Sat, 23 Mar 2024 14:09:22 +0100 Subject: [PATCH 05/13] Links to implementations in doc for the core --- docs/value/the-core.md | 29 ++++++++++++++++++----------- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/docs/value/the-core.md b/docs/value/the-core.md index 67f67a269..b5370c59f 100644 --- a/docs/value/the-core.md +++ b/docs/value/the-core.md @@ -15,8 +15,7 @@ The Core is another solution concept in cooperative game theory that attempts to ensure stability in the sense that it provides the set of feasible payoffs that cannot be improved upon by a sub-coalition. This can be interesting for some applications of data valuation because it yields values consistent with training -on the whole dataset, avoiding the spurious selection of subsets. It was first -introduced in this field by [@yan_if_2021]. +on the whole dataset, avoiding the spurious selection of subsets. It satisfies the following 2 properties: @@ -30,6 +29,9 @@ It satisfies the following 2 properties: the amount that these agents could earn by forming a coalition on their own. $\sum_{i \in S} v(i) \geq u(S), \forall S \subset D.$ +The Core was first introduced into data valuation by [@yan_if_2021], in the +following form. + ## Least Core values Unfortunately, for many cooperative games the Core may be empty. By relaxing the @@ -59,9 +61,12 @@ _egalitarian least core_. ## Exact Least Core -This first algorithm is just a verbatim implementation of the definition. -As such it returns as exact a value as the utility function allows -(see what this means in [Problems of Data Values][problems-of-data-values]). +This first algorithm is just a verbatim implementation of the definition, in +[compute_least_core_values][pydvl.value.least_core.compute_least_core_values]. +It computes all constraints for the linear problem by evaluating the utility on +every subset of the training data, and returns as exact a value as the utility +function allows (see what this means in [Problems of Data +Values][problems-of-data-values]). ```python from pydvl.value import compute_least_core_values @@ -74,18 +79,20 @@ values = compute_least_core_values(utility, mode="exact") Because the number of subsets $S \subseteq D \setminus \{i\}$ is $2^{ | D | - 1 }$, one typically must resort to approximations. -The simplest approximation consists in using a fraction of all subsets for the -constraints. [@yan_if_2021] show that a quantity of order -$\mathcal{O}((n - \log \Delta ) / \delta^2)$ is enough to obtain a so-called -$\delta$-*approximate least core* with high probability. I.e. the following -property holds with probability $1-\Delta$ over the choice of subsets: +The simplest on consists in using a fraction of all subsets for the constraints. +[@yan_if_2021] show that a quantity of order $\mathcal{O}((n - \log \Delta ) / +\delta^2)$ is enough to obtain a so-called $\delta$-*approximate least core* +with high probability. I.e. the following property holds with probability +$1-\Delta$ over the choice of subsets: $$ \mathbb{P}_{S\sim D}\left[\sum_{i\in S} v(i) + e^{*} \geq u(S)\right] \geq 1 - \delta, $$ -where $e^{*}$ is the optimal least core subsidy. +where $e^{*}$ is the optimal least core subsidy. 
This approximation is +also implemented in +[compute_least_core_values][pydvl.value.least_core.compute_least_core_values]: ```python from pydvl.value import compute_least_core_values From d549291e9055266ecd4536796fbceaa225655aa7 Mon Sep 17 00:00:00 2001 From: Miguel de Benito Delgado Date: Sat, 23 Mar 2024 14:42:09 +0100 Subject: [PATCH 06/13] More glossary fixes --- docs/getting-started/glossary.md | 50 +++++++++++++++++--------------- 1 file changed, 26 insertions(+), 24 deletions(-) diff --git a/docs/getting-started/glossary.md b/docs/getting-started/glossary.md index 02d91374e..a937dacba 100644 --- a/docs/getting-started/glossary.md +++ b/docs/getting-started/glossary.md @@ -19,6 +19,14 @@ research is needed to confirm this. Introduced by [@schoch_csshapley_2022]. [Implementation][pydvl.value.shapley.classwise.compute_classwise_shapley_values]. +### Conjugate Gradient + +CG is an algorithm for solving linear systems with a symmetric and +positive-definite coefficient matrix. For Influence Functions, it is used to +approximate the [iHVP][inverse-hessian-vector-product]. +[Implementation (torch)][pydvl.influence.torch.influence_function_model.CgInfluence]. + + ### Data Utility Learning Data Utility Learning is a method that uses an ML model to learn the utility @@ -31,20 +39,13 @@ Introduced by [@wang_improving_2022]. ### Eigenvalue-corrected Kronecker-Factored Approximate Curvature -EKFAC builds on K-FAC by correcting for the approximation errors in the -eigenvalues of the blocks of the Kronecker-factored approximate curvature -matrix. This correction aims to refine the accuracy of natural gradient -approximations, thus potentially offering better training efficiency and -stability in neural networks. +EKFAC builds on [K-FAC][kronecker-factored-approximate-curvature] by correcting +for the approximation errors in the eigenvalues of the blocks of the +Kronecker-factored approximate curvature matrix. This correction aims to refine +the accuracy of natural gradient approximations, thus potentially offering +better training efficiency and stability in neural networks. [Implementation (torch)][pydvl.influence.torch.influence_function_model.EkfacInfluence]. -### Kronecker-Factored Approximate Curvature - -K-FAC is an optimization technique that approximates the Fisher Information -matrix's inverse efficiently. It uses the Kronecker product to factor the -matrix, significantly speeding up the computation of natural gradient updates -and potentially improving training efficiency. - ### Group Testing Group Testing is a strategy for identifying characteristics within groups of @@ -61,11 +62,18 @@ particular data point affects the model's prediction. Introduced into data valuation by [@koh_understanding_2017]. [[influence-function|Documentation]]. -### inverse Hessian-vector product +### Inverse Hessian-vector product + +iHVP is the operation of calculating the product of the inverse Hessian matrix +of a function and a vector, without explicitly constructing nor inverting the +full Hessian matrix first. This is essential for influence function computation. -iHVP involves calculating the product of the inverse Hessian matrix of a -function and a vector, which is essential in optimization and in computing -influence functions efficiently. +### Kronecker-Factored Approximate Curvature + +K-FAC is an optimization technique that approximates the Fisher Information +matrix's inverse efficiently. 
It uses the Kronecker product to factor the +matrix, significantly speeding up the computation of natural gradient updates +and potentially improving training efficiency. ### Least Core @@ -116,7 +124,7 @@ Introduced into data valuation by [@ghorbani_data_2019]. A task in data valuation where the quality of a valuation method is measured through the impact of incrementally removing data points on the model's performance, where the points are removed in order of their value. See -[[benchmarks]]. +[Benchmarking tasks][benchmarking-tasks]. ### Shapley Value @@ -126,7 +134,7 @@ to players based on their contribution to the total payoff. In data valuation, players are data points. The method assigns a value to each data point based on a weighted average of its marginal contributions to the model's performance when trained on each subset of the training set. This requires -$\mathcal{O}(2^{n-1})$ evaluations of the model, which is infeasible for even +$\mathcal{O}(2^{n-1})$ re-trainings of the model, which is infeasible for even trivial data set sizes, so one resorts to approximations like TMCS. Introduced into data valuation by [@ghorbani_data_2019]. [Implementation][pydvl.value.shapley.naive]. @@ -159,12 +167,6 @@ around the mean, expressed as a percentage. It's used to compare the degree of variation from one data series to another, even if the means are drastically different. -### Conjugate Gradient - -CG is an algorithm for solving linear systems with a symmetric and -positive-definite coefficient matrix. In machine learning, it's typically used -for efficiently finding the minima of convex functions, when the direct -computation of the Hessian is computationally expensive or impractical. ### Constraint Satisfaction Problem From 041febf165e443a20ab50ae3f23f47ece929f12a Mon Sep 17 00:00:00 2001 From: Miguel de Benito Delgado Date: Sat, 23 Mar 2024 14:49:48 +0100 Subject: [PATCH 07/13] Update CHANGELOG.md --- CHANGELOG.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 9d6540f3a..1c392bfca 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,10 @@ ### Added +- Many improvements to the documentation: fixes, links, text, example gallery + and more. [PR #532](https://github.com/aai-institute/pyDVL/pull/532) +- Glossary of data valuation terms in the docs. 
+ [PR #537](https://github.com/aai-institute/pyDVL/pull/537 - Implement new method: `NystroemSketchInfluence` [PR #504](https://github.com/aai-institute/pyDVL/pull/504) - Add property `model_dtype` to instances of type `TorchInfluenceFunctionModel` @@ -24,8 +28,6 @@ ### Changed -- Improvements to documentation: fixes, links, text, example gallery and more - [PR #532](https://github.com/aai-institute/pyDVL/pull/532) - Bump versions of CI actions to avoid warnings [PR #502](https://github.com/aai-institute/pyDVL/pull/502) - Add Python Version 3.11 to supported versions [PR #510](https://github.com/aai-institute/pyDVL/pull/510) - Documentation improvements and cleanup [PR #521](https://github.com/aai-institute/pyDVL/pull/521) [PR #522](https://github.com/aai-institute/pyDVL/pull/522) From b6b04d78087c8dbe55d61f879163c418a2727131 Mon Sep 17 00:00:00 2001 From: Miguel de Benito Delgado Date: Sun, 24 Mar 2024 10:00:47 +0100 Subject: [PATCH 08/13] Typo Co-authored-by: Anes Benmerzoug --- docs/value/the-core.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/value/the-core.md b/docs/value/the-core.md index b5370c59f..5a44a8cd1 100644 --- a/docs/value/the-core.md +++ b/docs/value/the-core.md @@ -79,7 +79,7 @@ values = compute_least_core_values(utility, mode="exact") Because the number of subsets $S \subseteq D \setminus \{i\}$ is $2^{ | D | - 1 }$, one typically must resort to approximations. -The simplest on consists in using a fraction of all subsets for the constraints. +The simplest one consists in using a fraction of all subsets for the constraints. [@yan_if_2021] show that a quantity of order $\mathcal{O}((n - \log \Delta ) / \delta^2)$ is enough to obtain a so-called $\delta$-*approximate least core* with high probability. I.e. 
the following property holds with probability From 1c05c117d31a9b2bb3640759976be449fe04d8fc Mon Sep 17 00:00:00 2001 From: Kristof Schroeder Date: Mon, 25 Mar 2024 14:39:27 +0100 Subject: [PATCH 09/13] Add abbreviations for influence computation --- docs_includes/abbreviations.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/docs_includes/abbreviations.md b/docs_includes/abbreviations.md index 86c4f6751..a584b73d8 100644 --- a/docs_includes/abbreviations.md +++ b/docs_includes/abbreviations.md @@ -1,3 +1,5 @@ +*[AM]: Arnoldi Method +*[BCG]: Block Conjugate Gradient *[CG]: Conjugate Gradient *[CSP]: Constraint Satisfaction Problem *[CV]: Coefficient of Variation @@ -17,8 +19,11 @@ *[MLRC]: Machine Learning Reproducibility Challenge *[ML]: Machine Learning *[MSE]: Mean Squared Error +*[NLRA]: Nyström Low-Rank Approximation *[OOB]: Out-of-Bag *[PCA]: Principal Component Analysis +*[PBCG]: Preconditioned Block Conjugate Gradient +*[PCG]: Preconditioned Conjugate Gradient *[ROC]: Receiver Operating Characteristic *[SV]: Shapley Value *[TMCS]: Truncated Monte Carlo Shapley From 45fd65041d3367bcb0bbb98d0a63d12827160f77 Mon Sep 17 00:00:00 2001 From: Kristof Schroeder Date: Mon, 25 Mar 2024 14:41:30 +0100 Subject: [PATCH 10/13] =?UTF-8?q?Add=20influence=20related=20concepts=20to?= =?UTF-8?q?=20glossary:=20*=20add=20Arnoldi,=20Nystr=C3=B6m,=20Block=20CG,?= =?UTF-8?q?=20Preconditioners=20*=20add=20documentation=20links=20to=20Lis?= =?UTF-8?q?sa=20and=20EKFAC?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/getting-started/glossary.md | 72 ++++++++++++++++++++++++++++++-- 1 file changed, 68 insertions(+), 4 deletions(-) diff --git a/docs/getting-started/glossary.md b/docs/getting-started/glossary.md index a937dacba..aa6a3e640 100644 --- a/docs/getting-started/glossary.md +++ b/docs/getting-started/glossary.md @@ -9,6 +9,25 @@ information, please refer to the relevant literature or resources. Terms in data valuation and influence functions: +### Arnoldi Method + +The Arnoldi method approximately computes eigenvalue, eigenvector pairs of +a symmetric matrix. For influence functions, it is used to approximate +the [iHVP][inverse-hessian-vector-product]. + +Introduced by [@schioppa_scaling_2021] in the context of influence functions. +[Implementation (torch) +][pydvl.influence.torch.influence_function_model.ArnoldiInfluence]. +[Documentation (torch)][arnoldi]. + +### Block Conjugate Gradient + +A blocked version of [CG][conjugate-gradient], which solves several linear +systems simultaneously. For Influence Functions, it is used to +approximate the [iHVP][inverse-hessian-vector-product]. +[Implementation (torch)][pydvl.influence.torch.influence_function_model.CgInfluence]. +[Documentation (torch)][cg] + ### Class-wise Shapley Class-wise Shapley is a Shapley valuation method which introduces a utility @@ -18,14 +37,16 @@ to. It is estimated to be particularly useful in imbalanced datasets, but more research is needed to confirm this. Introduced by [@schoch_csshapley_2022]. [Implementation][pydvl.value.shapley.classwise.compute_classwise_shapley_values]. +[Documentation][class-wise-shapley]. ### Conjugate Gradient CG is an algorithm for solving linear systems with a symmetric and positive-definite coefficient matrix. For Influence Functions, it is used to approximate the [iHVP][inverse-hessian-vector-product]. -[Implementation (torch)][pydvl.influence.torch.influence_function_model.CgInfluence]. 
- +[Implementation (torch) +][pydvl.influence.torch.influence_function_model.CgInfluence]. +[Documentation (torch)][cg] ### Data Utility Learning @@ -44,7 +65,10 @@ for the approximation errors in the eigenvalues of the blocks of the Kronecker-factored approximate curvature matrix. This correction aims to refine the accuracy of natural gradient approximations, thus potentially offering better training efficiency and stability in neural networks. -[Implementation (torch)][pydvl.influence.torch.influence_function_model.EkfacInfluence]. +[Implementation (torch) +][pydvl.influence.torch.influence_function_model.EkfacInfluence]. +[Documentation (torch)][eigenvalue-corrected-k-fac]. + ### Group Testing @@ -91,8 +115,13 @@ Introduced as data valuation method by [@yan_if_2021]. LiSSA is an efficient algorithm for approximating the inverse Hessian-vector product, enabling faster computations in large-scale machine learning problems, particularly for second-order optimization. +For Influence Functions, it is used to +approximate the [iHVP][inverse-hessian-vector-product]. Introduced by [@agarwal_secondorder_2017]. -[Implementation (torch)][pydvl.influence.torch.influence_function_model.LissaInfluence]. +[Implementation (torch) +][pydvl.influence.torch.influence_function_model.LissaInfluence]. +[Documentation (torch) +][linear-time-stochastic-second-order-approximation-lissa]. ### Leave-One-Out @@ -119,6 +148,21 @@ Introduced into data valuation by [@ghorbani_data_2019]. [Implementation][pydvl.value.shapley.montecarlo]. [[data-valuation|Documentation]]. +### Nyström Low-Rank Approximation + +The Nyström approximation computes a low-rank approximation to a symmetric +positive-definite matrix via random projections. For influence functions, +it is used to approximate the [iHVP][inverse-hessian-vector-product]. +Introduced as sketch and solve algorithm in [@hataya_nystrom_2023], and as +preconditioner for [PCG][preconditioned-conjugate-gradient] in +[@frangella_randomized_2023]. +[Implementation Sketch-and-Solve (torch) +][pydvl.influence.torch.influence_function_model.NystroemSketchInfluence]. +[Documentation Sketch-and-Solve (torch)][nystrom-sketch-and-solve]. +[Implementation Preconditioner (torch) +][pydvl.influence.torch.pre_conditioner.NystroemPreConditioner]. + + ### Point removal task A task in data valuation where the quality of a valuation method is measured @@ -126,6 +170,26 @@ through the impact of incrementally removing data points on the model's performance, where the points are removed in order of their value. See [Benchmarking tasks][benchmarking-tasks]. +### Preconditioned Block Conjugate Gradient + +A blocked version of [PCG][preconditioned-conjugate-gradient], which solves +several linear systems simultaneously. For Influence Functions, it is used to +approximate the [iHVP][inverse-hessian-vector-product]. +[Implementation CG (torch) +][pydvl.influence.torch.influence_function_model.CgInfluence] +[Implementation Preconditioner (torch)][pydvl.influence.torch.pre_conditioner] +[Documentation (torch)][cg] + +### Preconditioned Conjugate Gradient + +A preconditioned version of [CG][conjugate-gradient] for improved +convergence, depending on the characteristics of the matrix and the +preconditioner. For Influence Functions, it is used to +approximate the [iHVP][inverse-hessian-vector-product]. 
+[Implementation CG (torch) +][pydvl.influence.torch.influence_function_model.CgInfluence] +[Implementation Preconditioner (torch)][pydvl.influence.torch.pre_conditioner] +[Documentation (torch)][cg] ### Shapley Value From 25aa9c860f5e0da1a601fbd1ffe18e088f903564 Mon Sep 17 00:00:00 2001 From: Kristof Schroeder Date: Mon, 25 Mar 2024 15:02:18 +0100 Subject: [PATCH 11/13] Add documentation links to value related concepts --- docs/getting-started/glossary.md | 18 ++++++++++++------ 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/docs/getting-started/glossary.md b/docs/getting-started/glossary.md index aa6a3e640..8510c49ee 100644 --- a/docs/getting-started/glossary.md +++ b/docs/getting-started/glossary.md @@ -25,8 +25,9 @@ Introduced by [@schioppa_scaling_2021] in the context of influence functions. A blocked version of [CG][conjugate-gradient], which solves several linear systems simultaneously. For Influence Functions, it is used to approximate the [iHVP][inverse-hessian-vector-product]. -[Implementation (torch)][pydvl.influence.torch.influence_function_model.CgInfluence]. -[Documentation (torch)][cg] +[Implementation (torch) +][pydvl.influence.torch.influence_function_model.CgInfluence]. +[Documentation (torch)][cg]. ### Class-wise Shapley @@ -46,7 +47,7 @@ positive-definite coefficient matrix. For Influence Functions, it is used to approximate the [iHVP][inverse-hessian-vector-product]. [Implementation (torch) ][pydvl.influence.torch.influence_function_model.CgInfluence]. -[Documentation (torch)][cg] +[Documentation (torch)][cg]. ### Data Utility Learning @@ -57,6 +58,7 @@ model is quickly amortized by avoiding costly re-evaluations of the original utility. Introduced by [@wang_improving_2022]. [Implementation][pydvl.utils.utility.DataUtilityLearning]. +[Documentation][creating-a-utility]. ### Eigenvalue-corrected Kronecker-Factored Approximate Curvature @@ -77,6 +79,7 @@ items efficiently, by testing groups rather than individuals to quickly narrow down the search for items with specific properties. Introduced into data valuation by [@jia_efficient_2019a]. [Implementation][pydvl.value.shapley.gt.group_testing_shapley]. +[Documentation][group-testing]. ### Influence Function @@ -109,6 +112,7 @@ determined by the evaluations of the utility function on every subset of the training data. Introduced as data valuation method by [@yan_if_2021]. [Implementation][pydvl.value.least_core.compute_least_core_values]. +[Documentation][least-core-values]. ### Linear-time Stochastic Second-order Algorithm @@ -130,6 +134,7 @@ impact of removing individual data points on the model's performance. The value of a training point is defined as the marginal change in the model's performance when that point is removed from the training set. [Implementation][pydvl.value.loo.loo.compute_loo]. +[Documentation][leave-one-out-values]. ### Monte Carlo Least Core @@ -137,6 +142,7 @@ MCLC is a variation of the Least Core that uses a reduced amount of constraints, sampled randomly from the powerset of the training data. Introduced by [@yan_if_2021]. [Implementation][pydvl.value.least_core.compute_least_core_values]. +[Documentation][monte-carlo-least-core]. ### Monte Carlo Shapley @@ -146,7 +152,7 @@ at the cost of accuracy, but this loss is typically irrelevant for downstream applications in ML. Introduced into data valuation by [@ghorbani_data_2019]. [Implementation][pydvl.value.shapley.montecarlo]. -[[data-valuation|Documentation]]. +[Documentation][monte-carlo-combinatorial-shapley]. 
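As an illustration of the Monte Carlo idea, the following toy sketch averages marginal contributions over random permutations, which has the same expectation as the weighted sum over subsets in the Shapley definition (made-up additive utility; not pyDVL's API).

```python
import numpy as np

def montecarlo_shapley(utility, n, n_permutations=500, rng=None):
    """Toy permutation-sampling estimator of Shapley values."""
    rng = rng or np.random.default_rng(0)
    values = np.zeros(n)
    for _ in range(n_permutations):
        prev = utility([])                 # utility of the empty coalition
        coalition = []
        for i in rng.permutation(n):
            coalition.append(int(i))
            u = utility(coalition)
            values[i] += u - prev          # marginal contribution of point i
            prev = u
    return values / n_permutations

# Additive toy utility: point i contributes exactly i, so its Shapley value is i.
print(montecarlo_shapley(lambda s: float(sum(s)), n=5))  # -> [0. 1. 2. 3. 4.]
```

Truncating each inner loop once marginal contributions become negligible is essentially the idea behind TMCS.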
### Nyström Low-Rank Approximation @@ -202,7 +208,7 @@ $\mathcal{O}(2^{n-1})$ re-trainings of the model, which is infeasible for even trivial data set sizes, so one resorts to approximations like TMCS. Introduced into data valuation by [@ghorbani_data_2019]. [Implementation][pydvl.value.shapley.naive]. -[[data-valuation|Documentation]]. +[Documentation][shapley-value]. ### Truncated Monte Carlo Shapley @@ -211,7 +217,7 @@ truncated version of the Monte Carlo method, reducing computation time while maintaining accuracy in large datasets. Introduced by [@ghorbani_data_2019]. [Implementation][pydvl.value.shapley.montecarlo.permutation_montecarlo_shapley]. -[[data-valuation|Documentation]]. +[Documentation][permutation-shapley]. ### Weighted Accuracy Drop From 6ef444774b1affa5eaa0c2974a6b5d4b85a7364b Mon Sep 17 00:00:00 2001 From: Kristof Schroeder Date: Mon, 25 Mar 2024 21:15:47 +0100 Subject: [PATCH 12/13] Reformat links to documentation and implementation to appear in a bullet list in glossary.md --- docs/getting-started/glossary.md | 121 ++++++++++++++++++------------- 1 file changed, 71 insertions(+), 50 deletions(-) diff --git a/docs/getting-started/glossary.md b/docs/getting-started/glossary.md index 8510c49ee..1ee8bce4c 100644 --- a/docs/getting-started/glossary.md +++ b/docs/getting-started/glossary.md @@ -14,20 +14,21 @@ Terms in data valuation and influence functions: The Arnoldi method approximately computes eigenvalue, eigenvector pairs of a symmetric matrix. For influence functions, it is used to approximate the [iHVP][inverse-hessian-vector-product]. - Introduced by [@schioppa_scaling_2021] in the context of influence functions. -[Implementation (torch) -][pydvl.influence.torch.influence_function_model.ArnoldiInfluence]. -[Documentation (torch)][arnoldi]. + + * [Implementation (torch) + ][pydvl.influence.torch.influence_function_model.ArnoldiInfluence] + * [Documentation (torch)][arnoldi] ### Block Conjugate Gradient A blocked version of [CG][conjugate-gradient], which solves several linear systems simultaneously. For Influence Functions, it is used to approximate the [iHVP][inverse-hessian-vector-product]. -[Implementation (torch) -][pydvl.influence.torch.influence_function_model.CgInfluence]. -[Documentation (torch)][cg]. + + * [Implementation (torch) + ][pydvl.influence.torch.influence_function_model.CgInfluence] + * [Documentation (torch)][cg] ### Class-wise Shapley @@ -37,17 +38,20 @@ favoring points that improve the model's performance on the class they belong to. It is estimated to be particularly useful in imbalanced datasets, but more research is needed to confirm this. Introduced by [@schoch_csshapley_2022]. -[Implementation][pydvl.value.shapley.classwise.compute_classwise_shapley_values]. -[Documentation][class-wise-shapley]. + + * [Implementation + ][pydvl.value.shapley.classwise.compute_classwise_shapley_values] + * [Documentation][class-wise-shapley] ### Conjugate Gradient CG is an algorithm for solving linear systems with a symmetric and positive-definite coefficient matrix. For Influence Functions, it is used to approximate the [iHVP][inverse-hessian-vector-product]. -[Implementation (torch) -][pydvl.influence.torch.influence_function_model.CgInfluence]. -[Documentation (torch)][cg]. + + * [Implementation (torch) +][pydvl.influence.torch.influence_function_model.CgInfluence] + * [Documentation (torch)][cg] ### Data Utility Learning @@ -57,8 +61,9 @@ trained on a given set of indices from the dataset. 
The cost of training this model is quickly amortized by avoiding costly re-evaluations of the original utility. Introduced by [@wang_improving_2022]. -[Implementation][pydvl.utils.utility.DataUtilityLearning]. -[Documentation][creating-a-utility]. + + * [Implementation][pydvl.utils.utility.DataUtilityLearning] + * [Documentation][creating-a-utility] ### Eigenvalue-corrected Kronecker-Factored Approximate Curvature @@ -67,9 +72,10 @@ for the approximation errors in the eigenvalues of the blocks of the Kronecker-factored approximate curvature matrix. This correction aims to refine the accuracy of natural gradient approximations, thus potentially offering better training efficiency and stability in neural networks. -[Implementation (torch) -][pydvl.influence.torch.influence_function_model.EkfacInfluence]. -[Documentation (torch)][eigenvalue-corrected-k-fac]. + + * [Implementation (torch) + ][pydvl.influence.torch.influence_function_model.EkfacInfluence] + * [Documentation (torch)][eigenvalue-corrected-k-fac] ### Group Testing @@ -78,8 +84,9 @@ Group Testing is a strategy for identifying characteristics within groups of items efficiently, by testing groups rather than individuals to quickly narrow down the search for items with specific properties. Introduced into data valuation by [@jia_efficient_2019a]. -[Implementation][pydvl.value.shapley.gt.group_testing_shapley]. -[Documentation][group-testing]. + + * [Implementation][pydvl.value.shapley.gt.group_testing_shapley] + * [Documentation][group-testing] ### Influence Function @@ -87,7 +94,8 @@ The Influence Function measures the impact of a single data point on a statistical estimator. In machine learning, it's used to understand how much a particular data point affects the model's prediction. Introduced into data valuation by [@koh_understanding_2017]. -[[influence-function|Documentation]]. + + * [[influence-function|Documentation]] ### Inverse Hessian-vector product @@ -111,8 +119,9 @@ it implies solving a linear and a quadratic system whose constraints are determined by the evaluations of the utility function on every subset of the training data. Introduced as data valuation method by [@yan_if_2021]. -[Implementation][pydvl.value.least_core.compute_least_core_values]. -[Documentation][least-core-values]. + + * [Implementation][pydvl.value.least_core.compute_least_core_values] + * [Documentation][least-core-values] ### Linear-time Stochastic Second-order Algorithm @@ -122,10 +131,11 @@ problems, particularly for second-order optimization. For Influence Functions, it is used to approximate the [iHVP][inverse-hessian-vector-product]. Introduced by [@agarwal_secondorder_2017]. -[Implementation (torch) -][pydvl.influence.torch.influence_function_model.LissaInfluence]. -[Documentation (torch) -][linear-time-stochastic-second-order-approximation-lissa]. + + * [Implementation (torch) + ][pydvl.influence.torch.influence_function_model.LissaInfluence] + * [Documentation (torch) + ][linear-time-stochastic-second-order-approximation-lissa] ### Leave-One-Out @@ -133,16 +143,18 @@ LOO in the context of data valuation refers to the process of evaluating the impact of removing individual data points on the model's performance. The value of a training point is defined as the marginal change in the model's performance when that point is removed from the training set. -[Implementation][pydvl.value.loo.loo.compute_loo]. -[Documentation][leave-one-out-values]. 
+ + * [Implementation][pydvl.value.loo.loo.compute_loo] + * [Documentation][leave-one-out-values] ### Monte Carlo Least Core MCLC is a variation of the Least Core that uses a reduced amount of constraints, sampled randomly from the powerset of the training data. Introduced by [@yan_if_2021]. -[Implementation][pydvl.value.least_core.compute_least_core_values]. -[Documentation][monte-carlo-least-core]. + + * [Implementation][pydvl.value.least_core.compute_least_core_values] + * [Documentation][monte-carlo-least-core] ### Monte Carlo Shapley @@ -151,8 +163,9 @@ over subsets of the training set. This reduces computation to polynomial time at the cost of accuracy, but this loss is typically irrelevant for downstream applications in ML. Introduced into data valuation by [@ghorbani_data_2019]. -[Implementation][pydvl.value.shapley.montecarlo]. -[Documentation][monte-carlo-combinatorial-shapley]. + + * [Implementation][pydvl.value.shapley.montecarlo] + * [Documentation][monte-carlo-combinatorial-shapley] ### Nyström Low-Rank Approximation @@ -162,29 +175,32 @@ it is used to approximate the [iHVP][inverse-hessian-vector-product]. Introduced as sketch and solve algorithm in [@hataya_nystrom_2023], and as preconditioner for [PCG][preconditioned-conjugate-gradient] in [@frangella_randomized_2023]. -[Implementation Sketch-and-Solve (torch) -][pydvl.influence.torch.influence_function_model.NystroemSketchInfluence]. -[Documentation Sketch-and-Solve (torch)][nystrom-sketch-and-solve]. -[Implementation Preconditioner (torch) -][pydvl.influence.torch.pre_conditioner.NystroemPreConditioner]. + * [Implementation Sketch-and-Solve (torch) + ][pydvl.influence.torch.influence_function_model.NystroemSketchInfluence] + * [Documentation Sketch-and-Solve (torch)][nystrom-sketch-and-solve] + * [Implementation Preconditioner (torch) + ][pydvl.influence.torch.pre_conditioner.NystroemPreConditioner] ### Point removal task A task in data valuation where the quality of a valuation method is measured through the impact of incrementally removing data points on the model's performance, where the points are removed in order of their value. See -[Benchmarking tasks][benchmarking-tasks]. + + * [Benchmarking tasks][benchmarking-tasks] ### Preconditioned Block Conjugate Gradient A blocked version of [PCG][preconditioned-conjugate-gradient], which solves several linear systems simultaneously. For Influence Functions, it is used to approximate the [iHVP][inverse-hessian-vector-product]. -[Implementation CG (torch) -][pydvl.influence.torch.influence_function_model.CgInfluence] -[Implementation Preconditioner (torch)][pydvl.influence.torch.pre_conditioner] -[Documentation (torch)][cg] + + * [Implementation CG (torch) + ][pydvl.influence.torch.influence_function_model.CgInfluence] + * [Implementation Preconditioner (torch) + ][pydvl.influence.torch.pre_conditioner] + * [Documentation (torch)][cg] ### Preconditioned Conjugate Gradient @@ -192,10 +208,12 @@ A preconditioned version of [CG][conjugate-gradient] for improved convergence, depending on the characteristics of the matrix and the preconditioner. For Influence Functions, it is used to approximate the [iHVP][inverse-hessian-vector-product]. 
-[Implementation CG (torch) -][pydvl.influence.torch.influence_function_model.CgInfluence] -[Implementation Preconditioner (torch)][pydvl.influence.torch.pre_conditioner] -[Documentation (torch)][cg] + + * [Implementation CG (torch) + ][pydvl.influence.torch.influence_function_model.CgInfluence] + * [Implementation Preconditioner (torch) + ][pydvl.influence.torch.pre_conditioner] + * [Documentation (torch)][cg] ### Shapley Value @@ -207,8 +225,9 @@ when trained on each subset of the training set. This requires $\mathcal{O}(2^{n-1})$ re-trainings of the model, which is infeasible for even trivial data set sizes, so one resorts to approximations like TMCS. Introduced into data valuation by [@ghorbani_data_2019]. -[Implementation][pydvl.value.shapley.naive]. -[Documentation][shapley-value]. + + * [Implementation][pydvl.value.shapley.naive] + * [Documentation][shapley-value] ### Truncated Monte Carlo Shapley @@ -216,8 +235,10 @@ TMCS is an efficient approach to estimating the Shapley Value using a truncated version of the Monte Carlo method, reducing computation time while maintaining accuracy in large datasets. Introduced by [@ghorbani_data_2019]. -[Implementation][pydvl.value.shapley.montecarlo.permutation_montecarlo_shapley]. -[Documentation][permutation-shapley]. + + * [Implementation + ][pydvl.value.shapley.montecarlo.permutation_montecarlo_shapley] + * [Documentation][permutation-shapley] ### Weighted Accuracy Drop From 868574f1873f24a385aea3f78494bd71245a9284 Mon Sep 17 00:00:00 2001 From: Kristof Schroeder Date: Mon, 25 Mar 2024 21:18:02 +0100 Subject: [PATCH 13/13] Update CHANGELOG.md --- CHANGELOG.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 1c392bfca..97b365944 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,7 +6,7 @@ - Many improvements to the documentation: fixes, links, text, example gallery and more. [PR #532](https://github.com/aai-institute/pyDVL/pull/532) -- Glossary of data valuation terms in the docs. +- Glossary of data valuation and influence terms in the docs. [PR #537](https://github.com/aai-institute/pyDVL/pull/537 - Implement new method: `NystroemSketchInfluence` [PR #504](https://github.com/aai-institute/pyDVL/pull/504)