Quantifying the contribution of each training datapoint to an end model is useful in a number of settings:
- in active learning, knowing the value of our training examples can help guide us in collecting more data
- when compensating individuals for the data they contribute to a training dataset (e.g. search engine users contributing their browsing data or patients contributing their medical data)
- for explaining a model's predictions and debugging its behavior.
However, data valuation can be quite tricky. The first challenge lies in selecting a suitable criterion for quantifying a datapoint's value. Most criteria aim to measure the gain in model performance attributable to including the datapoint in the training dataset. A common approach, dubbed "leave-one-out", simply computes the difference in performance between a model trained on the full dataset and one trained on the full dataset minus one example. Recently, Ghorbani et al. and Jia et al. proposed a data valuation scheme based on the Shapley value, a classic solution in game theory for distributing rewards in cooperative games. Empirically, Data Shapley valuations are more effective in downstream applications (e.g. active learning) than "leave-one-out" valuations. Moreover, they have several intuitive properties not shared by other criteria. Computing Shapley values exactly is often expensive, so one line of research develops polynomial-time (PTIME) Shapley algorithms for simpler models and uses them as proxies, which can be effective in many scenarios (https://arxiv.org/pdf/1911.07128.pdf). DataScope extends this approach to end-to-end ML pipelines consisting of both feature extractors and ML models.
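To make the "leave-one-out" criterion concrete, here is a minimal sketch in plain Python. The utility function (validation accuracy of a 1-nearest-neighbour classifier), the names `knn_accuracy` and `leave_one_out_values`, and the toy data are all assumptions chosen for illustration; they are not taken from any of the cited papers or libraries.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, int]  # (feature, label); a hypothetical 1-D toy format


def knn_accuracy(train: Sequence[Point], val: Sequence[Point]) -> float:
    """Utility: validation accuracy of a 1-nearest-neighbour classifier."""
    if not train:
        return 0.0  # assumption: an empty training set has zero utility
    correct = 0
    for x, y in val:
        _, pred = min(train, key=lambda p: abs(p[0] - x))
        correct += pred == y
    return correct / len(val)


def leave_one_out_values(
    train: Sequence[Point],
    val: Sequence[Point],
    utility: Callable[[Sequence[Point], Sequence[Point]], float] = knn_accuracy,
) -> List[float]:
    """Value of point i = utility(full set) - utility(full set without i)."""
    train = list(train)
    full = utility(train, val)
    return [full - utility(train[:i] + train[i + 1:], val) for i in range(len(train))]


# Toy data: the last training point (0.2, 1) is deliberately mislabeled.
train = [(0.0, 0), (0.3, 0), (1.0, 1), (0.2, 1)]
val = [(0.05, 0), (0.22, 0), (0.9, 1), (1.2, 1)]

loo_values = leave_one_out_values(train, val)
print(loo_values)  # [0.25, 0.0, 0.5, -0.25]
```

In this toy setup the mislabeled point receives a negative value, because removing it raises validation accuracy. Note also that the second point receives a value of exactly zero even though it carries a correct label, since a nearby point makes it redundant; this kind of blind spot is one motivation for Shapley-based valuations.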
Computing exact valuations according to either the leave-one-out or the Shapley criterion requires retraining the model from scratch many times, which can be prohibitively expensive for large models. Thus, a second challenge lies in finding a good approximation for these measures. Influence functions provide an efficient estimate of the "leave-one-out" measure that only requires access to the model's gradients and Hessian-vector products. Shapley values can be estimated with Monte Carlo samples or, for models trained via stochastic gradient descent, a simple gradient-based approach.
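As a rough illustration of the Monte Carlo route, the sketch below estimates Shapley values by sampling random permutations of the training set and averaging each point's marginal contribution to the utility. This is plain permutation sampling, without the truncation or gradient-based shortcuts used in the literature. The utility function, the names `knn_accuracy` and `monte_carlo_shapley`, and the toy data are assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, int]  # (feature, label); a hypothetical 1-D toy format


def knn_accuracy(train: Sequence[Point], val: Sequence[Point]) -> float:
    """Utility: validation accuracy of a 1-nearest-neighbour classifier."""
    if not train:
        return 0.0  # assumption: an empty training set has zero utility
    correct = 0
    for x, y in val:
        _, pred = min(train, key=lambda p: abs(p[0] - x))
        correct += pred == y
    return correct / len(val)


def monte_carlo_shapley(
    train: Sequence[Point],
    val: Sequence[Point],
    utility: Callable[[Sequence[Point], Sequence[Point]], float] = knn_accuracy,
    num_permutations: int = 2000,
    seed: int = 0,
) -> List[float]:
    """Average each point's marginal contribution over random permutations."""
    rng = random.Random(seed)
    n = len(train)
    totals = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        subset: List[Point] = []
        prev = utility(subset, val)
        for i in order:
            # Add point i and credit it with the change in utility.
            subset.append(train[i])
            cur = utility(subset, val)
            totals[i] += cur - prev
            prev = cur
    return [t / num_permutations for t in totals]


# Toy data: the last training point (0.2, 1) is deliberately mislabeled.
train = [(0.0, 0), (0.3, 0), (1.0, 1), (0.2, 1)]
val = [(0.05, 0), (0.22, 0), (0.9, 1), (1.2, 1)]

shapley_values = monte_carlo_shapley(train, val)
print(shapley_values)
```

Each sampled permutation costs one utility evaluation per training point, so in practice the number of permutations trades accuracy against retraining cost. Because the marginal contributions within a permutation telescope, the estimates always sum to the utility of the full training set minus the utility of the empty set; on this toy data the mislabeled point should come out with the lowest estimate.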