Replies: 1 comment
@aadeshINL FYI.
Overview By Josh:
For many situations, RAVEN can take multiple days to finish an input if not running in parallel. In the past, RAVEN used ppserver for running in parallel, but that software is no longer maintained, so RAVEN was switched to ray. Unfortunately, when many RAVEN inputs run in parallel with ray, each ray remote task that finishes leaves behind a 'ray::IDLE' process (visible with top or ps on the node). These can accumulate (more than 20 ray::IDLE processes for every running ray process) and eventually cause the node to run out of memory or out of available processes, which results in the input either ending early with a failure or failing to make progress.
Terminology note: in RAVEN runs RAVEN (RrR), an outer RAVEN controls the inner RAVEN instances.
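For reference, the ray usage pattern being described is roughly the following. This is a minimal sketch, not RAVEN's actual job handler code; the function and task counts are placeholders.

```python
import ray

ray.init()  # on a cluster this would connect to an already-running ray head node

@ray.remote
def run_sample(sample_id):
    # placeholder for evaluating one inner RAVEN sample
    return sample_id

# the outer RAVEN dispatches many such tasks and collects the results
refs = [run_sample.remote(i) for i in range(100)]
results = ray.get(refs)

# the tasks are now done, but the worker processes that ran them can linger
# as ray::IDLE processes (visible in top/ps) instead of being cleaned up
```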
Problems with ray have been investigated before, resulting in the following bugs being filed:
Request to allow stopping a specific ray instance, which would allow multiple ray instances to be running (note that not having this feature blocks simply doing a qsub from a qsub and starting a new ray server for each inner):
ray-project/ray#12264
Request to add a way to keep ray::IDLE processes from taking up too many resources:
ray-project/ray#27499
Ray timeline status fails if the timeline is too big:
ray-project/ray#27952
Ray completely hangs if one node can't get ports:
ray-project/ray#28071
Ray fails if too much data is returned:
ray-project/ray#28855
Things tried recently:
Deleting the remote at completion and using ray only on the inner were tried in:
#2038
The ray::IDLE problem continued despite these changes. This points strongly to a problem in ray itself, since the inners were finishing in minutes, but ray::IDLE processes persisted for hours after the inner processes exited.
Using only one outer thread while using ray parallelism on the inner was also tried, but this did not fix the ray problem.
As a quick feasibility check, the dask library ( https://www.dask.org/ ) was tried; it was capable of running a small test on INL's Sawtooth cluster and distributing a function to a remote node.
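A distribution test of that kind with dask looks roughly like the following; the scheduler address and function here are placeholders, not the actual test that was run.

```python
from dask.distributed import Client

# connect to a dask scheduler already started on the cluster
client = Client("tcp://scheduler-node:8786")

def run_sample(sample_id):
    # placeholder for evaluating one inner RAVEN sample on a remote node
    return sample_id

futures = [client.submit(run_sample, i) for i in range(10)]
results = client.gather(futures)
client.close()
```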
Possible Future directions:
Try different ways of using ray (such as using one wait over all the job handler tasks instead of checking each one individually; see the ray.wait sketch at the end of this list).
Time estimate: 2 weeks
Probability of success estimate: 25%
Switching the RAVEN parallel library away from ray.
Time estimate: 2 months (1 week testing different libraries on INL clusters, 4 weeks converting RAVEN to the new library, 3 extra weeks for more testing)
Probability of success: 70%
Eliminate using RAVEN runs RAVEN for HERON by rewriting RAVEN and HERON as appropriate.
Time estimate: 2 to 4 months
Probability of successfully fixing the parallel problem: 40% (note that removing the need for RrR probably has advantages even if this does not fix the parallel problem)
Directly use qsub from inside of qsub. This probably requires adding a "RAVEN sampler" code interface that directly runs one sample with a subprocess (RAVEN runs RAVEN runs RAVEN), but it basically moves all parallelism to a process run with qsub; see the subprocess sketch at the end of this list.
Time estimate: 2 months
Probability of success: 70%
Fix ray. Dig into the ray code and fix the problems there.
If time spent is 1 week, probability of success is something like 2%
If time spent is 1 year, probability of success is something like 95%
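Sketch for the first option above: waiting on all outstanding job handler tasks with a single ray.wait call instead of polling each one individually. This is a minimal illustration under assumed names, not RAVEN's JobHandler code.

```python
import ray

ray.init()

@ray.remote
def run_sample(sample_id):
    # placeholder for evaluating one inner RAVEN sample
    return sample_id

pending = [run_sample.remote(i) for i in range(100)]
finished = []
while pending:
    # block until at least one task is done, then collect it and keep waiting
    ready, pending = ray.wait(pending, num_returns=1)
    finished.extend(ray.get(ready))
```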
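Sketch for the qsub-from-qsub option: a hypothetical "RAVEN sampler" interface that runs one sample as an ordinary subprocess. The raven_framework command and input file name are assumptions for illustration; a real interface would also need to handle input perturbation and output collection.

```python
import subprocess

def run_one_sample(input_file):
    # run one inner RAVEN input as a plain subprocess; parallelism is then
    # handled entirely by the processes that qsub started, not by ray
    result = subprocess.run(
        ["raven_framework", input_file],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout

rc, out = run_one_sample("inner_sample.xml")
```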
Personally, I think allocating 3 months split between the first two options (trying different ways of using ray, and switching the parallel library) might be a useful path forward to try to solve this problem.