Add solution for explicit randomization in subprocesses. #396

Merged
merged 34 commits into from
Sep 2, 2023
Commits
40971ad
Add Seed type to `pydvl.utils.types`.
Aug 29, 2023
a798b39
Add seed parameter to :func:`pydvl.utils.numeric.random_subset`, :fun…
Aug 29, 2023
4e68acf
Add seed parameter to :func:`pydvl.utils.parallel.map_reduce.MapReduc…
Aug 29, 2023
b78dc14
Add seed parameter to :func:`pydvl.value.result.ValuationResult`.
Aug 29, 2023
7436189
Add class `pydvl.value.sampler.StochasticSampler` with seed property.…
Aug 30, 2023
1be1bb0
Merge function fn_accepts_param_name into maybe_add_argument and adap…
Aug 30, 2023
7b9c07d
Rename :func:`pydvl.utils.types.ensure_seed_seq` to :func:`pydvl.util…
Aug 30, 2023
fe03834
Add StochasticSampler mixin and add back the different typeof index i…
Aug 30, 2023
615fe34
Remove sets_are_equal function and replace by `set(a) == set(b)`.
Aug 30, 2023
df24a53
Change structure so that seed is set over the constructor for the sam…
Aug 30, 2023
778a7eb
Add seed parameter to all methods of `pydvl.value.shapley.common`. Ad…
Aug 31, 2023
86a2239
Add seed parameter to `pydvl.utils.numeric.random_matrix_with_conditi…
Aug 31, 2023
7ff27c0
Fix type hints as noted in https://github.com/appliedAI-Initiative/py…
Aug 31, 2023
4455ce9
Split test cases into reproducible and stochastic to match the other …
Aug 31, 2023
611cb0d
Remove constructors from samplers by using mixin formalism correctly.
Sep 1, 2023
12e90ab
Remove comment from functional.py
Sep 1, 2023
9186d4f
Extract separate method `call_fun_remove_arg`. Integrate backlog clas…
Sep 1, 2023
658b835
Merge develop and adapt new style for comments.
Sep 1, 2023
26d59bf
Remove backlog from semivalues.py and montecarlo.py.
Sep 1, 2023
4e889f7
Remove reproducibility tests from semivalues.py and deactivated affec…
Sep 1, 2023
c28ce5e
Deactivate test case and add TODO
Sep 1, 2023
6dee93d
Fix comments and typos.
Sep 1, 2023
3a053d3
Optimized function names and further extended one comment.
Sep 1, 2023
d3fb8ac
Add extended documentation.
Sep 1, 2023
cb546ea
Adapted CHANGELOG.md.
Sep 1, 2023
595105b
Merge branch 'develop' into 392-explicit-randomization-for-subprocesses
Sep 1, 2023
688c9d7
Fix indent in docstring.
Sep 1, 2023
b827439
Fix corner-case
mdbenito Sep 2, 2023
2ba97e7
Remove unnecessary function and merge tests
mdbenito Sep 2, 2023
087fdbb
Move maybe_add_argument to functional.py
mdbenito Sep 2, 2023
317f8b1
Renaming and simplifying docstrings
mdbenito Sep 2, 2023
f007ce8
Cosmetic
mdbenito Sep 2, 2023
1bb75e1
Nicer headers in API
mdbenito Sep 2, 2023
e38e1ff
Cleanup
mdbenito Sep 2, 2023
11 changes: 9 additions & 2 deletions CHANGELOG.md
@@ -1,9 +1,11 @@
# Changelog

## 0.7.0 - 📚 Documentation overhaul, new methods and bug fixes 💥
## 0.7.0 - 📚🆕 Documentation and IF overhaul, new methods and bug fixes 💥🐞

This is our first β release! We have worked hard to deliver improvements across
the board, with a focus on documentation and usability.
the board, with a focus on documentation and usability. We have also reworked
the internals of the `influence` module, improved parallelism and handling of
randomness.

### Added

@@ -13,8 +15,13 @@ the board, with a focus on documentation and usability.
[PR #406](https://github.com/aai-institute/pyDVL/pull/406)
- Added more abbreviations to documentation
[PR #415](https://github.com/aai-institute/pyDVL/pull/415)
- Added seed to functions from `pydvl.utils.numeric`, `pydvl.value.shapley` and
`pydvl.value.semivalues`. Introduced new type `Seed` and conversion function
`ensure_seed_sequence`.
[PR #396](https://github.com/aai-institute/pyDVL/pull/396)

### Changed

- Replaced sphinx with mkdocs for documentation. Major overhaul of documentation
[PR #352](https://github.com/aai-institute/pyDVL/pull/352)
- Made ray an optional dependency, relying on joblib as default parallel backend
6 changes: 6 additions & 0 deletions docs/css/extra.css
@@ -77,6 +77,12 @@ a.autorefs-external:hover::after {
user-select: none;
}

/* Nicer style of headers in generated API */
h2 code {
font-size: large!important;
background-color: inherit!important;
}

/* Remove cell input and output prompt */
.jp-InputArea-prompt, .jp-OutputArea-prompt {
display: none !important;
108 changes: 108 additions & 0 deletions src/pydvl/utils/functional.py
@@ -0,0 +1,108 @@
"""
Supporting utilities for manipulating arguments of functions.
"""

from __future__ import annotations

import inspect
from functools import partial
from typing import Callable, Set, Union

__all__ = ["maybe_add_argument"]


def _accept_additional_argument(*args, fun: Callable, arg: str, **kwargs):
"""Calls the given function with the given positional and keyword arguments,
removing `arg` from the keyword arguments.

Args:
args: Positional arguments to pass to the function.
fun: The function to call.
arg: The name of the argument to remove.
kwargs: Keyword arguments to pass to the function.

Returns:
The return value of the function.
"""
try:
del kwargs[arg]
except KeyError:
pass

return fun(*args, **kwargs)


def free_arguments(fun: Union[Callable, partial]) -> Set[str]:
"""Computes the set of free arguments for a function or
[functools.partial][] object.

All arguments of a function are considered free unless they are set by a
partial. For example, if `f = partial(g, a=1)`, then `a` is not a free
argument of `f`.

Args:
fun: A callable or a [partial object][].

Returns:
The set of free arguments of `fun`.

!!! tip "New in version 0.7.0"
"""
args_set_by_partial: Set[str] = set()

def _rec_unroll_partial_function_args(g: Union[Callable, partial]) -> Callable:
"""Stores arguments and recursively calls itself if `g` is a
[functools.partial][] object. In the end, returns the initially wrapped
function.

This handles the construct `partial(_accept_additional_argument, *args,
**kwargs)` that is used by `maybe_add_argument`.

Args:
g: A partial or a function to unroll.

Returns:
Initial wrapped function.
"""
nonlocal args_set_by_partial

if isinstance(g, partial) and g.func == _accept_additional_argument:
arg = g.keywords["arg"]
if arg in args_set_by_partial:
args_set_by_partial.remove(arg)
return _rec_unroll_partial_function_args(g.keywords["fun"])
elif isinstance(g, partial):
args_set_by_partial.update(g.keywords.keys())
args_set_by_partial.update(g.args)
return _rec_unroll_partial_function_args(g.func)
else:
return g

wrapped_fn = _rec_unroll_partial_function_args(fun)
sig = inspect.signature(wrapped_fn)
return args_set_by_partial | set(sig.parameters.keys())


def maybe_add_argument(fun: Callable, new_arg: str) -> Callable:
"""Wraps a function to accept the given keyword parameter if it doesn't
already.

If `fun` already takes a keyword parameter of name `new_arg`, then it is
returned as is. Otherwise, a wrapper is returned which merely ignores the
argument.

Args:
fun: The function to wrap
new_arg: The name of the argument that the new function will accept
(and ignore).

Returns:
A new function accepting one more keyword argument.

!!! tip "Changed in version 0.7.0"
Ability to work with partials.
"""
if new_arg in free_arguments(fun):
return fun

return partial(_accept_additional_argument, fun=fun, arg=new_arg)
41 changes: 31 additions & 10 deletions src/pydvl/utils/numeric.py
@@ -10,6 +10,8 @@
import numpy as np
from numpy.typing import NDArray

from pydvl.utils.types import Seed

__all__ = [
"running_moments",
"num_samples_permutation_hoeffding",
@@ -68,24 +70,32 @@ def num_samples_permutation_hoeffding(eps: float, delta: float, u_range: float)
return int(np.ceil(np.log(2 / delta) * 2 * u_range**2 / eps**2))


def random_subset(s: NDArray[T], q: float = 0.5) -> NDArray[T]:
"""Returns one subset at random from `s`.
def random_subset(
s: NDArray[T],
q: float = 0.5,
seed: Optional[Seed] = None,
) -> NDArray[T]:
"""Returns one subset at random from ``s``.

Args:
s: set to sample from
q: Sampling probability for elements. The default 0.5 yields a
uniform distribution over the power set of s.
seed: Either an instance of a numpy random number generator or a seed for it.

Returns:
The subset
"""
rng = np.random.default_rng()
rng = np.random.default_rng(seed)
selection = rng.uniform(size=len(s)) > q
return s[selection]


def random_powerset(
s: NDArray[T], n_samples: Optional[int] = None, q: float = 0.5
s: NDArray[T],
n_samples: Optional[int] = None,
q: float = 0.5,
seed: Optional[Seed] = None,
) -> Generator[NDArray[T], None, None]:
"""Samples subsets from the power set of the argument, without
pre-generating all subsets and in no order.
@@ -103,6 +113,7 @@ def random_powerset(
Defaults to `np.iinfo(np.int32).max`
q: Sampling probability for elements. The default 0.5 yields a
uniform distribution over the power set of s.
seed: Either an instance of a numpy random number generator or a seed for it.

Returns:
Samples from the power set of `s`.
@@ -114,21 +125,27 @@ def random_powerset(
if q < 0 or q > 1:
raise ValueError("Element sampling probability must be in [0,1]")

rng = np.random.default_rng(seed)
total = 1
if n_samples is None:
n_samples = np.iinfo(np.int32).max
while total <= n_samples:
yield random_subset(s, q)
yield random_subset(s, q, seed=rng)
total += 1


def random_subset_of_size(s: NDArray[T], size: int) -> NDArray[T]:
def random_subset_of_size(
s: NDArray[T],
size: int,
seed: Optional[Seed] = None,
) -> NDArray[T]:
"""Samples a random subset of given size uniformly from the powerset
of `s`.

Args:
s: Set to sample from
size: Size of the subset to generate
seed: Either an instance of a numpy random number generator or a seed for it.

Returns:
The subset
@@ -138,11 +155,13 @@ def random_subset_of_size(s: NDArray[T], size: int) -> NDArray[T]:
"""
if size > len(s):
raise ValueError("Cannot sample subset larger than set")
rng = np.random.default_rng()
rng = np.random.default_rng(seed)
return rng.choice(s, size=size, replace=False)


def random_matrix_with_condition_number(n: int, condition_number: float) -> NDArray:
def random_matrix_with_condition_number(
n: int, condition_number: float, seed: Optional[Seed] = None
) -> NDArray:
"""Constructs a square matrix with a given condition number.

Taken from:
@@ -156,6 +175,7 @@ def random_matrix_with_condition_number(n: int, condition_number: float) -> NDAr
Args:
n: size of the matrix
condition_number: duh
seed: Either an instance of a numpy random number generator or a seed for it.

Returns:
An (n,n) matrix with the requested condition number.
@@ -166,6 +186,7 @@ def random_matrix_with_condition_number(n: int, condition_number: float) -> NDAr
if condition_number <= 1:
raise ValueError("Condition number must be greater than 1")

rng = np.random.default_rng(seed)
log_condition_number = np.log(condition_number)
exp_vec = np.arange(
-log_condition_number / 4.0,
@@ -175,8 +196,8 @@ def random_matrix_with_condition_number(n: int, condition_number: float) -> NDAr
exp_vec = exp_vec[:n]
s: np.ndarray = np.exp(exp_vec)
S = np.diag(s)
U, _ = np.linalg.qr((np.random.rand(n, n) - 5.0) * 200)
V, _ = np.linalg.qr((np.random.rand(n, n) - 5.0) * 200)
U, _ = np.linalg.qr((rng.uniform(size=(n, n)) - 5.0) * 200)
V, _ = np.linalg.qr((rng.uniform(size=(n, n)) - 5.0) * 200)
P: np.ndarray = U.dot(S).dot(V.T)
P = P.dot(P.T)
return P
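
The `seed` parameters added in this file all rely on `np.random.default_rng(seed)`, which accepts an integer, `None`, or an existing `Generator` (returned unaltered in the last case). The sketch below shows what that means for callers. `random_subset_sketch` is a hypothetical stand-in that copies the body of `random_subset` from the diff so the example runs without pyDVL.

```python
import numpy as np


def random_subset_sketch(s, q=0.5, seed=None):
    """Stand-in mirroring `random_subset` above (not imported from pyDVL)."""
    rng = np.random.default_rng(seed)
    selection = rng.uniform(size=len(s)) > q
    return s[selection]


s = np.arange(10)

# An integer seed builds a fresh generator on every call, so the same
# subset is returned each time.
a = random_subset_sketch(s, seed=42)
b = random_subset_sketch(s, seed=42)

# A Generator is passed through unchanged, so successive calls share and
# advance one random stream. This is what `random_powerset` relies on when
# it forwards `seed=rng` to `random_subset`.
rng = np.random.default_rng(42)
same_object = np.random.default_rng(rng) is rng
first = random_subset_sketch(s, seed=rng)   # first draw of the stream
second = random_subset_sketch(s, seed=rng)  # continues the same stream
```

Passing an integer to every call inside a loop would repeat the same sample, which is why a single generator is created once and then forwarded.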
28 changes: 24 additions & 4 deletions src/pydvl/utils/parallel/map_reduce.py
@@ -6,14 +6,17 @@
This interface might be deprecated or changed in a future release before 1.0

"""
from functools import reduce
from itertools import accumulate, repeat
from typing import Any, Collection, Dict, Generic, List, Optional, TypeVar, Union

from joblib import Parallel, delayed
from numpy.random import SeedSequence
from numpy.typing import NDArray

from ..config import ParallelConfig
from ..types import MapFunction, ReduceFunction, maybe_add_argument
from ..functional import maybe_add_argument
from ..types import MapFunction, ReduceFunction, Seed, ensure_seed_sequence
from .backend import init_parallel_backend

__all__ = ["MapReduceJob"]
@@ -104,25 +107,42 @@ def __init__(
self.map_kwargs = map_kwargs if map_kwargs is not None else dict()
self.reduce_kwargs = reduce_kwargs if reduce_kwargs is not None else dict()

self._map_func = maybe_add_argument(map_func, "job_id")
self._map_func = reduce(maybe_add_argument, ["job_id", "seed"], map_func)
self._reduce_func = reduce_func

def __call__(
self,
seed: Optional[Union[Seed, SeedSequence]] = None,
) -> R:
"""
Runs the map-reduce job.

Args:
seed: Either an instance of a numpy random number generator or a seed for
it.

Returns:
The result of the reduce function.
"""
if self.config.backend == "joblib":
backend = "loky"
else:
backend = self.config.backend
# In joblib the levels are reversed.
# 0 means no logging and 50 means log everything to stdout
verbose = 50 - self.config.logging_level
seed_seq = ensure_seed_sequence(seed)
with Parallel(backend=backend, n_jobs=self.n_jobs, verbose=verbose) as parallel:
chunks = self._chunkify(self.inputs_, n_chunks=self.n_jobs)
map_results: List[R] = parallel(
delayed(self._map_func)(next_chunk, job_id=j, **self.map_kwargs)
for j, next_chunk in enumerate(chunks)
delayed(self._map_func)(
next_chunk, job_id=j, seed=seed, **self.map_kwargs
)
for j, (next_chunk, seed) in enumerate(
zip(chunks, seed_seq.spawn(len(chunks)))
)
)

reduce_results: R = self._reduce_func(map_results, **self.reduce_kwargs)
return reduce_results
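
The per-chunk seeding in `__call__` can be sketched without joblib as follows. This is an illustration under assumptions, not the PR's implementation: `to_seed_sequence` is a hypothetical, reduced stand-in for `ensure_seed_sequence` (whose body is not shown in this diff) that only handles `None`, integers and `SeedSequence`, and the map step runs in a plain loop instead of in subprocesses.

```python
from typing import List, Optional, Union

import numpy as np
from numpy.random import SeedSequence


def to_seed_sequence(seed: Optional[Union[int, SeedSequence]] = None) -> SeedSequence:
    """Hypothetical helper: wrap ints/None in a SeedSequence, pass others through."""
    if isinstance(seed, SeedSequence):
        return seed
    return SeedSequence(seed)


def noisy_sum(chunk: np.ndarray, seed: SeedSequence) -> float:
    """Example map function: each job builds its own generator from its seed."""
    rng = np.random.default_rng(seed)
    return float(np.sum(chunk) + rng.normal())


def run_chunks(chunks: List[np.ndarray], seed=None) -> List[float]:
    """Spawn one child seed per chunk, as the zip over `seed_seq.spawn` does above."""
    seed_seq = to_seed_sequence(seed)
    child_seeds = seed_seq.spawn(len(chunks))
    return [noisy_sum(c, s) for c, s in zip(chunks, child_seeds)]


chunks = [np.arange(3), np.arange(3)]
run_1 = run_chunks(chunks, seed=123)
run_2 = run_chunks(chunks, seed=123)
```

Two runs with the same integer seed give identical results, while the two identical chunks within one run receive different child seeds and therefore different noise. Note that `SeedSequence.spawn` records how many children it has produced, so reusing the same `SeedSequence` object for a second run yields new children rather than repeating the first ones.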
