Commit

Merge pull request #558 from aai-institute/feature/refactor-value
Feature/refactor value
schroedk authored Sep 9, 2024
2 parents 81f72f8 + 2c44436 commit 6bfc142
Showing 103 changed files with 14,265 additions and 1,025 deletions.
14 changes: 14 additions & 0 deletions .github/workflows/main.yaml
@@ -98,6 +98,20 @@ jobs:
      group_number: ${{ matrix.group_number }}
    needs: [code-quality]

  legacy-tests:
    strategy:
      fail-fast: false
      matrix:
        python_version: [ "3.11" ]
        group_number: [ 1, 2, 3, 4 ]
    name: Run Legacy tests - Python ${{ matrix.python_version }} - Group ${{ matrix.group_number }}
    uses: ./.github/workflows/run-legacy-tests-workflow.yaml
    with:
      python_version: ${{ matrix.python_version }}
      split_size: 4
      group_number: ${{ matrix.group_number }}
    needs: [ code-quality ]

  push-docs-and-release-testpypi:
    name: Push Docs and maybe release Package to TestPyPI
    runs-on: ubuntu-latest
54 changes: 54 additions & 0 deletions .github/workflows/run-legacy-tests-workflow.yaml
@@ -0,0 +1,54 @@
name: Run Legacy Tests

on:
  workflow_call:
    inputs:
      split_size:
        description: "Determines the number of groups into which the tests should be split"
        type: string
        default: 4
      group_number:
        description: "Determines which group of tests to run. Can be 1, 2, ..., split_size"
        type: string
        required: true
      python_version:
        description: "Determines which Python version to use"
        type: string
        required: true


env:
  PY_COLORS: 1

jobs:
  run-legacy-tests:
    runs-on: ubuntu-latest
    steps:
      - name: Free Disk Space (Ubuntu)
        uses: jlumbroso/free-disk-space@main
        with:
          large-packages: false
          docker-images: false
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Setup Python ${{ inputs.python_version }}
        uses: ./.github/actions/python
        with:
          python_version: ${{ inputs.python_version }}
      - name: Cache Tox Directory for Tests
        uses: actions/cache@v4
        with:
          key: tox-${{ runner.os }}-${{ github.ref }}-${{ hashFiles('tox.ini', 'requirements.txt') }}-${{ inputs.python_version }}
          path: .tox
      - name: Set up memcached
        uses: niden/actions-memcached@v7
      - name: Test Group ${{ inputs.group_number }}
        run: tox -e legacy-tests -- --slow-tests --splits ${{ inputs.split_size }} --group ${{ inputs.group_number }}
      - name: Upload coverage reports to Codecov
        uses: codecov/codecov-action@v4
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          files: ./coverage.xml
          env_vars: OS,PYTHON
          verbose: false
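
The `--splits` and `--group` arguments passed to tox in the workflow above resemble the flags of the `pytest-split` plugin, which partitions a test suite using recorded durations. As a rough illustration only, the Python sketch below balances tests across groups greedily. The names `split_by_duration` and `select_group` are hypothetical, the greedy strategy is an assumption rather than what the plugin or this repository actually does, and the sample durations are rounded from the `.notebook_test_durations` file in this commit.

```python
import heapq

# Illustrative data only: durations in seconds, rounded from this commit.
DURATIONS = {
    "notebooks/data_oob.ipynb": 13.2,
    "notebooks/influence_imagenet.ipynb": 17.3,
    "notebooks/influence_wine.ipynb": 11.6,
    "notebooks/least_core_basic.ipynb": 14.1,
    "notebooks/msr_banzhaf_digits.ipynb": 86.6,
    "notebooks/shapley_knn_flowers.ipynb": 6.8,
}


def split_by_duration(durations: dict[str, float], splits: int) -> list[list[str]]:
    """Assign tests to `splits` groups, always filling the lightest group next."""
    if splits < 1:
        raise ValueError("splits must be at least 1")
    # Min-heap of (accumulated seconds, group index): the lightest group is on top.
    heap = [(0.0, index) for index in range(splits)]
    groups: list[list[str]] = [[] for _ in range(splits)]
    # Placing the longest tests first keeps the final groups better balanced.
    for name, seconds in sorted(durations.items(), key=lambda item: -item[1]):
        total, index = heapq.heappop(heap)
        groups[index].append(name)
        heapq.heappush(heap, (total + seconds, index))
    return groups


def select_group(durations: dict[str, float], splits: int, group: int) -> list[str]:
    """Return the tests of one group; `group` is 1-based, as in the workflow input."""
    if not 1 <= group <= splits:
        raise ValueError("group must be in 1..splits")
    return split_by_duration(durations, splits)[group - 1]


if __name__ == "__main__":
    for number in range(1, 5):
        print(number, select_group(DURATIONS, 4, number))
```

Each matrix job would then run only the tests returned for its own `group_number`, so the four jobs together cover the suite exactly once.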
23 changes: 12 additions & 11 deletions .notebook_test_durations
@@ -1,12 +1,13 @@
{
"notebooks/data_oob.ipynb::": 14.514983271001256,
"notebooks/influence_imagenet.ipynb::": 15.937124550999215,
"notebooks/influence_sentiment_analysis.ipynb::": 26.479645616000198,
"notebooks/influence_synthetic.ipynb::": 6.61773010700017,
"notebooks/influence_wine.ipynb::": 16.312171267998565,
"notebooks/least_core_basic.ipynb::": 14.375480750999486,
"notebooks/msr_banzhaf_digits.ipynb::": 106.6507187110019,
"notebooks/shapley_basic_spotify.ipynb::": 15.657225806997303,
"notebooks/shapley_knn_flowers.ipynb::": 3.9943819290019746,
"notebooks/shapley_utility_learning.ipynb::": 25.939783253001224
}
"notebooks/data_oob.ipynb::": 13.150942041,
"notebooks/influence_imagenet.ipynb::": 17.281671249999995,
"notebooks/influence_sentiment_analysis.ipynb::": 19.578478917000005,
"notebooks/influence_synthetic.ipynb::": 7.191153166999996,
"notebooks/influence_wine.ipynb::": 11.610076332999995,
"notebooks/least_core_basic.ipynb::": 14.069404709000011,
"notebooks/least_core_basic_new.ipynb::": 24.492538208000013,
"notebooks/msr_banzhaf_digits.ipynb::": 86.62082037599998,
"notebooks/shapley_basic_spotify.ipynb::": 15.088616748999982,
"notebooks/shapley_knn_flowers.ipynb::": 6.810235208000023,
"notebooks/shapley_utility_learning.ipynb::": 24.370409832999997
}
2,206 changes: 1,581 additions & 625 deletions .test_durations

Large diffs are not rendered by default.

82 changes: 52 additions & 30 deletions CHANGELOG.md
@@ -4,8 +4,19 @@

### Added

- New method `InverseHarmonicMeanInfluence`, implementation for the paper
`DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and
- Refactor Classwise Shapley valuation with the interfaces and sampler architecture [PR #616](https://github.com/aai-institute/pyDVL/pull/616).
- Refactoring KNN Shapley values with the new sampler architecture [PR #610](https://github.com/aai-institute/pyDVL/pull/610).
- Refactoring MSR Banzhaf semivalues with the new sampler architecture.
[PR #605](https://github.com/aai-institute/pyDVL/pull/605)
- Refactoring group-testing Shapley values with new sampler architecture
[PR #602](https://github.com/aai-institute/pyDVL/pull/602)
- Refactoring of least-core data valuation methods with more supported sampling methods
and consistent interface.
[PR #580](https://github.com/aai-institute/pyDVL/pull/580)
- Refactoring of Owen Shapley valuation with new sampler architecture
[PR #597](https://github.com/aai-institute/pyDVL/pull/597)
- New method `InverseHarmonicMeanInfluence`, implementation for the paper
`DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and
Diffusion Models`
[PR #582](https://github.com/aai-institute/pyDVL/pull/582)
- Add new backend implementations for influence computation
@@ -16,7 +27,7 @@
[PR #591](https://github.com/aai-institute/pyDVL/pull/591)
- Extend `LissaInfluence` with block-diagonal and Gauss-Newton approximation
[PR #593](https://github.com/aai-institute/pyDVL/pull/593)
- Extend `NystroemSketchInfluence` with block-diagonal and Gauss-Newton
- Extend `NystroemSketchInfluence` with block-diagonal and Gauss-Newton
approximation
[PR #596](https://github.com/aai-institute/pyDVL/pull/596)
- Extend `ArnoldiInfluence` with block-diagonal and Gauss-Newton
@@ -30,9 +41,20 @@
- Replace `np.float_` with `np.float64` and `np.alltrue` with `np.all`,
as the old aliases are removed in NumPy 2.0
[PR #604](https://github.com/aai-institute/pyDVL/pull/604)

## Changed

- Fix a bug in `pydvl.utils.numeric.random_subset` where `1 - q` was used instead of `q`
as the probability of an element being sampled
[PR #597](https://github.com/aai-institute/pyDVL/pull/597)
- Fix a bug in the calculation of variance estimates for MSR Banzhaf
[PR #605](https://github.com/aai-institute/pyDVL/pull/605)
- Fix a bug in KNN Shapley values. See [Issue 613](https://github.com/aai-institute/pyDVL/issues/613) for details.
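
To make the `random_subset` fix listed above concrete, here is a minimal sketch of sampling each element independently with probability `q`. It uses only the standard library; the name `random_subset_sketch` and its signature are assumptions for illustration and are not pyDVL's implementation.

```python
import random
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def random_subset_sketch(items: Iterable[T], q: float, seed: Optional[int] = None) -> list[T]:
    """Keep each element independently with probability q."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so `rng.random() < q` holds with
    # probability q. Comparing with `>` instead would keep each element with
    # probability 1 - q, which is the kind of inversion the changelog describes.
    return [item for item in items if rng.random() < q]


if __name__ == "__main__":
    sample = random_subset_sketch(range(100_000), q=0.2, seed=0)
    print(len(sample) / 100_000)  # close to 0.2
```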


### Changed

- Use tighter bounds for the calculation of the minimal sample size that guarantees
an epsilon-delta approximation in group testing (Jia et al. 2023)
[PR #602](https://github.com/aai-institute/pyDVL/pull/602)
- **Breaking Changes**
- Rename parameter `hessian_regularization` of `DirectInfluence`
to `regularization` and change the type annotation to allow
@@ -42,7 +64,7 @@
to `regularization` and change the type annotation to allow
for block-wise regularization parameters
[PR #593](https://github.com/aai-institute/pyDVL/pull/593)
- Remove parameter `h0` from init of `LissaInfluence`
- Remove parameter `h0` from init of `LissaInfluence`
[PR #593](https://github.com/aai-institute/pyDVL/pull/593)
- Rename parameter `hessian_regularization` of `NystroemSketchInfluence`
to `regularization` and change the type annotation to allow
@@ -77,25 +99,25 @@
### Added

- Add progress bars to the computation of `LazyChunkSequence` and
`NestedLazyChunkSequence`
`NestedLazyChunkSequence`
[PR #567](https://github.com/aai-institute/pyDVL/pull/567)
- Add a device fixture for `pytest`, which depending on the availability and
- Add a device fixture for `pytest`, which depending on the availability and
user input (`pytest --with-cuda`) resolves to cuda device
[PR #574](https://github.com/aai-institute/pyDVL/pull/574)

### Fixed

- Fixed logging issue in decorator `log_duration`
[PR #567](https://github.com/aai-institute/pyDVL/pull/567)
- Fixed missing move of tensors to model device in `EkfacInfluence`
- Fixed missing move of tensors to model device in `EkfacInfluence`
implementation [PR #570](https://github.com/aai-institute/pyDVL/pull/570)
- Missing move to device of `preconditioner` in `CgInfluence` implementation
[PR #572](https://github.com/aai-institute/pyDVL/pull/572)
- Raise a more specific error message, when a `RuntimeError` occurs in
- Raise a more specific error message, when a `RuntimeError` occurs in
`torch.linalg.eigh`, so the user can check if it is related to a known
issue
[PR #578](https://github.com/aai-institute/pyDVL/pull/578)
- Fix an edge case (empty train data) in the test
- Fix an edge case (empty train data) in the test
`test_classwise_scorer_accuracies_manual_derivation`, which resulted
in undefined behavior (`np.nan` to `int` conversion with different results
depending on OS)
@@ -113,7 +135,7 @@

### Fixed

- `FutureWarning` for `ParallelConfig` constantly raised without actually
- `FutureWarning` for `ParallelConfig` constantly raised without actually
instantiating the object
[PR #562](https://github.com/aai-institute/pyDVL/pull/562)

@@ -129,7 +151,7 @@
- New preconditioned block variant of conjugate gradient
[PR #507](https://github.com/aai-institute/pyDVL/pull/507)
- Improvements to documentation: fixes, links, text, example gallery, LFS and
more [PR #532](https://github.com/aai-institute/pyDVL/pull/532),
more [PR #532](https://github.com/aai-institute/pyDVL/pull/532),
[PR #543](https://github.com/aai-institute/pyDVL/pull/543)
- Glossary of data valuation and influence terms in the documentation
[PR #537](https://github.com/aai-institute/pyDVL/pull/537)
@@ -142,11 +164,11 @@
[PR #495](https://github.com/aai-institute/pyDVL/pull/495)
- Memory issue with `CgInfluence` and `ArnoldiInfluence`
[PR #498](https://github.com/aai-institute/pyDVL/pull/498)
- Raising specific error message with install instruction, when trying to load
- Raising specific error message with install instruction, when trying to load
`pydvl.utils.cache.memcached` without `pymemcache` installed.
If `pymemcache` is available, all symbols from `pydvl.utils.cache.memcached`
If `pymemcache` is available, all symbols from `pydvl.utils.cache.memcached`
are available through `pydvl.utils.cache`
[PR #509](https://github.com/aai-institute/pyDVL/pull/509)
[PR #509](https://github.com/aai-institute/pyDVL/pull/509)
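
The optional-dependency behaviour described in the `pymemcache` entry above can be sketched as follows. This is a generic pattern under stated assumptions, not pyDVL's code: the helper `load_optional` and the wording of the install hint are hypothetical.

```python
import importlib
from types import ModuleType


def load_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as error:
        # Re-raise with an install instruction so users know how to fix it.
        raise ModuleNotFoundError(
            f"Cannot load '{module_name}'. "
            f"Install it with: pip install {install_hint}"
        ) from error


if __name__ == "__main__":
    json_module = load_optional("json", "example-package")
    print(json_module.dumps({"ok": True}))
```

A package can call such a helper at import time of the optional submodule and re-export its symbols only when the import succeeds.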

### Changed

@@ -175,9 +197,9 @@
### Fixed

- Bug in using `DaskInfluenceCalculator` with `TorchNumpyConverter`
for single dimensional arrays
for single dimensional arrays
[PR #485](https://github.com/aai-institute/pyDVL/pull/485)
- Fix implementations of `to` methods of `TorchInfluenceFunctionModel`
- Fix implementations of `to` methods of `TorchInfluenceFunctionModel`
implementations [PR #487](https://github.com/aai-institute/pyDVL/pull/487)
- Fixed bug with checking for converged values in semivalues
[PR #341](https://github.com/appliedAI-Initiative/pyDVL/pull/341)
@@ -197,15 +219,15 @@
- New influence function interface `InfluenceFunctionModel`
- Data parallel computation with `DaskInfluenceCalculator`
[PR #26](https://github.com/aai-institute/pyDVL/issues/26)
- Sequential batch-wise computation and write to disk with
`SequentialInfluenceCalculator`
- Sequential batch-wise computation and write to disk with
`SequentialInfluenceCalculator`
[PR #377](https://github.com/aai-institute/pyDVL/issues/377)
- Adapt notebooks to new influence abstractions
[PR #430](https://github.com/aai-institute/pyDVL/issues/430)

### Changed

- Refactor and simplify caching implementation
- Refactor and simplify caching implementation
[PR #458](https://github.com/aai-institute/pyDVL/pull/458)
- Simplify display of computation progress
[PR #466](https://github.com/aai-institute/pyDVL/pull/466)
@@ -230,8 +252,8 @@

- New method: Class-wise Shapley values
[PR #338](https://github.com/aai-institute/pyDVL/pull/338)
- New method: Data-OOB by @BastienZim
[PR #426](https://github.com/aai-institute/pyDVL/pull/426),
- New method: Data-OOB by @BastienZim
[PR #426](https://github.com/aai-institute/pyDVL/pull/426),
[PR #431](https://github.com/aai-institute/pyDVL/pull/431)
- Added `AntitheticPermutationSampler`
[PR #439](https://github.com/aai-institute/pyDVL/pull/439)
@@ -270,7 +292,7 @@ randomness.
- Added more abbreviations to documentation
[PR #415](https://github.com/aai-institute/pyDVL/pull/415)
- Added seed to functions from `pydvl.utils.numeric`, `pydvl.value.shapley` and
`pydvl.value.semivalues`. Introduced new type `Seed` and conversion function
`pydvl.value.semivalues`. Introduced new type `Seed` and conversion function
`ensure_seed_sequence`.
[PR #396](https://github.com/aai-institute/pyDVL/pull/396)
- Added `batch_size` parameter to `compute_banzhaf_semivalues`,
@@ -287,7 +309,7 @@ randomness.
[PR #352](https://github.com/aai-institute/pyDVL/pull/352)
- Made ray an optional dependency, relying on joblib as default parallel backend
[PR #408](https://github.com/aai-institute/pyDVL/pull/408)
- Decoupled `ray.init` from `ParallelConfig`
- Decoupled `ray.init` from `ParallelConfig`
[PR #373](https://github.com/aai-institute/pyDVL/pull/383)
- **Breaking Changes**
- Signature change: return information about Hessian inversion from
@@ -329,7 +351,7 @@ randomness.
(TMCS) starting too many processes and dying, plus other small changes
[PR #329](https://github.com/aai-institute/pyDVL/pull/329)
- Fix creation of GroupedDataset objects using the `from_arrays`
and `from_sklearn` class methods
and `from_sklearn` class methods
[PR #324](https://github.com/aai-institute/pyDVL/pull/334)
- Fix release job not triggering on CI when a new tag is pushed
[PR #331](https://github.com/aai-institute/pyDVL/pull/331)
@@ -386,13 +408,13 @@ randomness.
[PR #268](https://github.com/aai-institute/pyDVL/pull/268)
- Splitting of problem preparation and solution in Least-Core computation.
Umbrella function for LC methods.
[PR #257](https://github.com/aai-institute/pyDVL/pull/257)
[PR #257](https://github.com/aai-institute/pyDVL/pull/257)
- Operations on `ValuationResult` and `Status` and some cleanup
[PR #248](https://github.com/aai-institute/pyDVL/pull/248)
- **Bug fix and minor improvements**: Fixes bug in TMCS with remote Ray cluster,
raises an error for dummy sequential parallel backend with TMCS, clones model
inside `Utility` before fitting by default, with flag `clone_before_fit`
to disable it, catches all warnings in `Utility` when `show_warnings` is
inside `Utility` before fitting by default, with flag `clone_before_fit`
to disable it, catches all warnings in `Utility` when `show_warnings` is
`False`. Adds Miner and Gloves toy games utilities
[PR #247](https://github.com/aai-institute/pyDVL/pull/247)

@@ -402,7 +424,7 @@ randomness.
[PR #201](https://github.com/aai-institute/pyDVL/pull/201)
- Disabled caching of Utility values as well as repeated evaluations by default
[PR #211](https://github.com/aai-institute/pyDVL/pull/211)
- Test and officially support Python version 3.9 and 3.10
- Test and officially support Python version 3.9 and 3.10
[PR #208](https://github.com/aai-institute/pyDVL/pull/208)
- **Breaking change:** Introduces a class ValuationResult to gather and inspect
results from all valuation algorithms