Improve MapReduceJob and replace Futures interface (in e.g. TMCS) with joblib #387
Comments
@mdbenito opinions? I know I took a long while to write the […]
Hello, I'm the contributor who proposed joblib/joblib#1485; I'd be interested to hear more feedback about real-world use cases for this kind of feature in joblib. Would you mind giving more context on which features you would like to see in joblib that could help with those use cases? Regarding joblib/joblib#1485, there are some limitations that might prevent it from being extended or adapted much further than what the example already shows:
Beware, the dask backend does not support returning results as a generator. Do you think an […]
Hello @fcharras, thanks for taking an interest in this issue. The pattern we use in pyDVL is a MapReduce-like pattern in which we want to be able to submit several jobs (Map) that may take a while to finish, then combine the results (Reduce) and check a stopping criterion. If the criterion is satisfied we do not submit more jobs, and we ideally terminate any currently running jobs, because they may take a long time to finish and we do not want to wait for that. The pattern implemented in joblib/joblib#1485 could fit our use case:
I didn't know that other backends didn't support returning the results as a generator out of the box; that's kind of a showstopper for this issue. We currently use the […]. We are considering fully switching to joblib because it would simplify our code and remove the need to handle the configuration for the different backends ourselves. If there were a builtin […]
TY for the added information. Indeed, as of 1.3, other backends do not support returning results as a generator out of the box. For patterns such as joblib/joblib#1485, or the map-reduce you're describing, the only risk (to my knowledge) is the one already documented in the PR example: a callback thread can't be asked to wait for the completion of pending jobs in other workers. It can be worked around by pre-running enough tasks at the start and ensuring that each cycle of the feedback loop creates as many tasks as it consumes results. Since it's implicit, maybe joblib shouldn't encourage this kind of pattern; maybe an […]
What about something like this?

```python
from contextlib import contextmanager

import joblib


@contextmanager
def Parallel(*args, **kwargs):
    # Thin wrapper around joblib.Parallel that shuts down the workers when
    # the context is exited, instead of keeping them alive for reuse.
    # Note that it relies on Parallel's private _backend attribute.
    with joblib.Parallel(*args, **kwargs) as parallel:
        yield parallel
        parallel._backend.abort_everything(ensure_ready=False)
```
Agreed. A central dispatcher is also simpler for handling randomness (although we already have the seed sequences). Unless there is some use case I'm forgetting, we can think of dropping mapreduce.
@AnesBenmerzoug FYI, support for the dask backend was just merged and will be available from the next release on. For support with the ray backend I've also opened a PR at the ray project, ray-project/ray#41028; the merge depends on whether the ray maintainers notice / are interested in the feature. But the diff is trivial; in fact, users can unlock the feature already by adding one attribute to the RayBackend class. I think with those two on top, all major backends are now addressed?
This is outdated due to the changes in #558; will close.
Thanks to a new feature in joblib 1.3.0, namely the option to return the results as a generator instead of waiting for all of them to finish, I think that we can get rid of the concurrent.futures interface and replace it with joblib.
The main loop for truncated_montecarlo_shapley can be rewritten using joblib as sketched below. This should be almost the same as the version with the concurrent.futures interface, but with far less custom code.
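A minimal sketch of what this could look like with joblib >= 1.3; the names process_sample, stopping_criterion and the parameters are illustrative placeholders, not pyDVL's actual API:

```python
from joblib import Parallel, delayed


def process_sample(seed: int):
    """Placeholder for one Monte Carlo permutation sample (the Map step)."""
    ...


def truncated_montecarlo_loop(stopping_criterion, n_jobs: int = 4, max_samples: int = 10_000):
    results = []
    with Parallel(n_jobs=n_jobs, return_as="generator") as parallel:
        # Results are yielded as soon as they are ready instead of all at once.
        for result in parallel(
            delayed(process_sample)(seed) for seed in range(max_samples)
        ):
            results.append(result)  # the Reduce step
            if stopping_criterion(results):
                # Stop consuming results; no further tasks are dispatched.
                break
    return results
```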
We can rewrite the MapReduceJob class to use this pattern instead and have the default stopping criterion be a dummy one. We could take some inspiration from this recent open PR in joblib to implement a similar, but not quite the same, pattern.
Configuration
Users could configure the backend using joblib's context managers parallel_backend or the newer and better parallel_config (see the sketch below). For ray, the user would initialize it directly if they need to; otherwise it will be done automatically.
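A rough sketch of the configuration, assuming joblib >= 1.3 and Ray's joblib integration (ray.util.joblib.register_ray); the exact calls are illustrative rather than pyDVL's final API:

```python
from joblib import parallel_config

# Default Loky backend, limited to 8 processes.
with parallel_config(backend="loky", n_jobs=8):
    ...  # run the Parallel / MapReduceJob calls here

# Ray backend: register it with joblib first. Calling ray.init() explicitly
# is only needed e.g. to connect to an existing cluster; otherwise Ray is
# initialized automatically with default settings on first use.
import ray
from ray.util.joblib import register_ray

register_ray()
ray.init(address="auto")  # optional, e.g. to join a running cluster
with parallel_config(backend="ray"):
    ...
```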
This would resolve #360 (handled by joblib's parallel_config function) and allow us to resolve #385.
Worker Signalling Mechanism
To make this change even more worthwhile, it should also resolve #363.
For this we can use an Event synchronization variable. I think this is a better approach than relying on yet another service (e.g. Redis or a queue) because it is simpler.
Here's how we could do it for the different backends:
For joblib, we can either use the Event from the multiprocessing package or the one that comes with Loky (the default joblib backend); a sketch of the multiprocessing route follows.
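An untested sketch of the multiprocessing route; run_sample is a placeholder, and a Manager-based Event is used because its proxy can be pickled and sent to the Loky worker processes:

```python
from multiprocessing import Manager

from joblib import Parallel, delayed


def run_sample(stop_event, seed):
    # Workers poll the shared event and return early once it is set.
    if stop_event.is_set():
        return None
    ...  # actual computation
    return seed


with Manager() as manager:
    stop_event = manager.Event()  # picklable proxy, shared with the workers
    with Parallel(n_jobs=4, return_as="generator") as parallel:
        for result in parallel(
            delayed(run_sample)(stop_event, s) for s in range(1_000)
        ):
            if result == 42:  # placeholder stopping criterion
                stop_event.set()  # tell still-running workers to finish early
                break
```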
For ray, there is no Event class and we would have to implement one using an Actor:
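Something along these lines, as a minimal illustrative sketch (EventActor is a made-up name, not part of ray's API):

```python
import ray


@ray.remote
class EventActor:
    """Minimal Event-like actor: a shared boolean flag."""

    def __init__(self):
        self._flag = False

    def set(self):
        self._flag = True

    def is_set(self) -> bool:
        return self._flag


# Usage from the driver or from tasks holding the actor handle:
# event = EventActor.remote()
# ray.get(event.set.remote())
# ray.get(event.is_set.remote())  # -> True
```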
For dask, there is an Event class that we can use directly; a small sketch follows.
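A short sketch using dask.distributed's Event, which is coordinated by the scheduler and can be looked up by name from any worker:

```python
from dask.distributed import Client, Event

client = Client()  # start or connect to a dask cluster

stop_event = Event("stop-computation")
stop_event.set()  # e.g. once the stopping criterion is met

# Inside a task running on a worker, the event is found by name:
Event("stop-computation").is_set()  # -> True
```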
Cancelling Running Jobs
Unfortunately, Loky, the default joblib backend, does not terminate workers when we exit the Parallel context manager, because it tries to reuse worker processes. We may not want that, because the individual tasks may take a long time to finish and cause a memory leak. Other backends do not suffer from this issue.
There are two ways to resolve this:

The first is to subclass the LokyBackend and make its terminate method actually stop the worker processes (this idea comes from a Stack Overflow answer). We can then register this backend and set it as the default using joblib's register_parallel_backend, as in the sketch below.
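An untested sketch of that idea; it relies on joblib internals (LokyBackend and its abort_everything method), and the class name TerminatingLokyBackend is made up:

```python
from joblib import register_parallel_backend
from joblib._parallel_backends import LokyBackend


class TerminatingLokyBackend(LokyBackend):
    """Loky backend whose terminate() actually shuts the workers down
    instead of keeping them alive for reuse."""

    def terminate(self):
        if self._workers is not None:
            # Kill the worker processes and do not restart the executor.
            self.abort_everything(ensure_ready=False)
        super().terminate()


# Register it and make it the default backend for Parallel calls.
register_parallel_backend("terminating_loky", TerminatingLokyBackend, make_default=True)
```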
The second is to explicitly call the LokyBackend's abort_everything method once the stopping criterion is met.

EDIT: Added more details and split it into different sections.