Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add solution for explicit randomization in subprocesses. #396

Merged
merged 34 commits into from
Sep 2, 2023

Conversation

kosmitive
Copy link
Contributor

@kosmitive kosmitive commented Aug 8, 2023

Description

This PR closes #242, #392 and #398

Changes

  • Added seed to functions from pydvl.utils.numeric, pydvl.value.shapley and
    pydvl.value.semivalues.
  • Added new tests for testing reproducibility of modified functions.
  • Introduced new type Seed and conversion function ensure_seed_sequence.

Postponed

@kosmitive kosmitive linked an issue Aug 8, 2023 that may be closed by this pull request
@kosmitive kosmitive self-assigned this Aug 8, 2023
@AnesBenmerzoug
Copy link
Collaborator

This isn't the recommended way to implement seeded random number generation. We should favour creating rng objects (in the case of numpy) and a Random object (in the case of pure Python starting from version 3.9) and use those for any subsequent generation.

Here's how it's done in scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/2a2772a87b6c772dc3b8292bcffb990ce27515a8/sklearn/utils/validation.py#L1207

We could use that as inspiration, even though it's using the legacy RandomState class instead of the new Generator class.

This doesn't really close #242. To solve that we should expose a seed parameter in all algorithms that we can use to seed the different workers.

Refer to this and this for more details.

@kosmitive kosmitive force-pushed the 392-explicit-randomization-for-subprocesses branch from 454bea5 to 28288db Compare August 9, 2023 04:08
@kosmitive
Copy link
Contributor Author

kosmitive commented Aug 9, 2023

@AnesBenmerzoug

  1. Can we drop support for Python 3.8. Reason for that is that in numpy>=1.25, the Generator object has a spawn method. This method doesn't exist in previous versions. However Python 3.8 doesn't requires numpy<1.25. Not having this method makes the design for seeding more complex as one has to distinguish between SeedSequence and Generator.

  2. I need to adapt the tests as they also rely on seeding and fix bugs in the seeding before I request a review.

  3. Do we want to create another issue for pyTorch to keep this PR small?

@kosmitive kosmitive changed the title Add solution for explicit randomization in subprocesses. Draft: Add solution for explicit randomization in subprocesses. Aug 9, 2023
@AnesBenmerzoug
Copy link
Collaborator

@kosmitive

  1. No, we shouldn't drop Python 3.8 until it reaches its end-of-life (14 Oct 2024). The Generator's spawn is just a convenience that was added in numpy 1.25 and we shouldn't bump the minimum requirement just for that. We can just directly instantiate a SeedSequence object and call its spawn method.

  2. Yes, please do that so we can be sure that this change works. No need to run all tests twice though to check that seeding works.

  3. Yes, that makes sense. Anyway, the random seed handling is different in PyTorch.

@kosmitive
Copy link
Contributor Author

@kosmitive

1. No, we shouldn't drop Python 3.8 until it reaches its end-of-life (14 Oct 2024). The `Generator`'s `spawn` is just a convenience that was added in numpy 1.25 and we shouldn't bump the minimum requirement just for that. We can just directly instantiate a [SeedSequence](https://numpy.org/doc/stable/reference/random/bit_generators/generated/numpy.random.SeedSequence.html#numpy.random.SeedSequence) object and call its `spawn` method.

2. Yes, please do that so we can be sure that this change works. No need to run all tests twice though to check that seeding works.

3. Yes, that makes sense. Anyway, the random seed handling is different in PyTorch.
  1. Okay then I allow only to pass a SeedSequence prior to using parallel execution.

  2. I'll create an issue for further investigation.

@kosmitive kosmitive force-pushed the 392-explicit-randomization-for-subprocesses branch 7 times, most recently from 4a8559f to 116f62a Compare August 12, 2023 20:51
@kosmitive kosmitive changed the title Draft: Add solution for explicit randomization in subprocesses. Add solution for explicit randomization in subprocesses. Aug 12, 2023
Copy link
Collaborator

@mdbenito mdbenito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I have left a couple of comments, but I think we need to discuss the general design for this. It is important to distinguish the different concepts used and how they enter the picture at which stages.

CHANGELOG.md Outdated Show resolved Hide resolved
requirements.txt Outdated Show resolved Hide resolved
src/pydvl/utils/types.py Outdated Show resolved Hide resolved
src/pydvl/utils/types.py Outdated Show resolved Hide resolved
src/pydvl/utils/types.py Outdated Show resolved Hide resolved
src/pydvl/value/result.py Outdated Show resolved Hide resolved
@kosmitive
Copy link
Contributor Author

@mdbenito @AnesBenmerzoug Let's decide on one way. Do we want to go forward with a seed parameter approach or do we switch to a different approach. What are your thoughts?

@kosmitive kosmitive force-pushed the 392-explicit-randomization-for-subprocesses branch from 7bbd5e4 to f216562 Compare August 29, 2023 20:40
@mdbenito
Copy link
Collaborator

mdbenito commented Sep 1, 2023

@kosmitive This error indicates that you must have made some mistake merging develop into your branch, because the function semivalues was removed from that branch and its tests.

@@ -58,18 +83,10 @@ def maybe_add_argument(fun: Callable, new_arg: str) -> Callable:
Returns:
A new function accepting one more keyword argument.
"""
params = inspect.signature(fun).parameters
if new_arg in params.keys():
if new_arg in unroll_partial_fn_args(fun):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the logic behind this change?

Copy link
Contributor Author

@kosmitive kosmitive Sep 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once a partial function is passed to maybe_add_argument, the inner __call__ logic is neglected for these types and thus also if it is contained in the arguments. Furthermore partial has to be used, because otherwise the wrapped function can't be pickled (happens for seed but not for job_id). So recursive partial support is needed in the checking.

Copy link
Collaborator

@mdbenito mdbenito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More comments. Please double check all docstrings before requesting another review.

src/pydvl/utils/functional.py Outdated Show resolved Hide resolved
src/pydvl/utils/numeric.py Outdated Show resolved Hide resolved
src/pydvl/utils/types.py Outdated Show resolved Hide resolved
src/pydvl/utils/types.py Outdated Show resolved Hide resolved
src/pydvl/utils/types.py Outdated Show resolved Hide resolved
src/pydvl/value/shapley/gt.py Outdated Show resolved Hide resolved
src/pydvl/value/shapley/common.py Outdated Show resolved Hide resolved
src/pydvl/value/result.py Outdated Show resolved Hide resolved
src/pydvl/value/result.py Outdated Show resolved Hide resolved
src/pydvl/utils/types.py Outdated Show resolved Hide resolved
@kosmitive kosmitive force-pushed the 392-explicit-randomization-for-subprocesses branch 2 times, most recently from 9fbeb70 to a5931c7 Compare September 1, 2023 17:05
@kosmitive kosmitive force-pushed the 392-explicit-randomization-for-subprocesses branch from a5931c7 to 6dee93d Compare September 1, 2023 17:07
src/pydvl/utils/types.py Outdated Show resolved Hide resolved
@kosmitive kosmitive force-pushed the 392-explicit-randomization-for-subprocesses branch from 98c5683 to 3a053d3 Compare September 1, 2023 18:31
@kosmitive kosmitive force-pushed the 392-explicit-randomization-for-subprocesses branch from fa20d85 to f3e3ff4 Compare September 1, 2023 20:25
@kosmitive kosmitive force-pushed the 392-explicit-randomization-for-subprocesses branch from f3e3ff4 to cb546ea Compare September 1, 2023 20:27
Copy link
Collaborator

@mdbenito mdbenito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kosmitive I went ahead and pushed directly to your branch because I don't know how to explain better that you need to think of the readers when writing documentation. I also moved things around a bit and made a couple minor changes / simplifications.

@mdbenito mdbenito merged commit f8fa433 into develop Sep 2, 2023
0 of 6 checks passed
@mdbenito mdbenito deleted the 392-explicit-randomization-for-subprocesses branch September 2, 2023 10:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Explicit randomization for subprocesses Properly handle random seeds
3 participants