Error in readLog() on Slurm cluster #273
Comments
This problem has been solved by the pull request here, though it may reveal another bug at …. After this bugfix I can run batchtools jobs on a Slurm cluster as expected, even when the machine needs to be provisioned. However, despite adapting my Slurm cluster NFS settings to sync as hinted at here, I still need to use a fs.latency of 70-80 seconds, though I can now use a scheduler.latency of 1.
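For reference, a minimal sketch of where these two latencies can be set in a batchtools Slurm configuration (the template path is an assumption; adjust it to your own setup):

```r
# batchtools.conf.R -- sketch only, the template file name is assumed.
cluster.functions = batchtools::makeClusterFunctionsSlurm(
  template          = "slurm.tmpl",  # assumed template location
  scheduler.latency = 1,             # poll the scheduler quickly
  fs.latency        = 80             # allow ~70-80 s for the shared FS / new node
)
```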
@stuvet: Thanks for looking into this in more detail! I am currently experiencing seemingly the same issue (…).
Hi @kpj, I'm sorry, I have no idea about LSF, but I was mistaken in thinking that I needed …. The main problem for me was the mismatch between the full log file path passed to …. I also noticed that some of the Slurm status codes were missing from …. Sorry I can't be more helpful, but hopefully it gives you some options about where to look next.
Thanks a lot for your response!
Unfortunately, I don't have access and am used to network latency related issues.
I cloned mllg/batchtools and am on the latest commit (63a6f81).
When I print …
What kind of delay do you mean?
Yes. E.g., right now I still have a few jobs running while my …
No worries, you already gave me some useful pointers :-)
In my case it was a provisioning delay: because the Slurm partition autoscaled down to 0 workers when not in use, there was a 60-90 second window in which the worker was being recruited before a job could start. During this window Slurm would return a pending status code for the job that wasn't being handled properly by ….
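If you want to see what Slurm reports during that window, a quick check from R could look like this (plain squeue options, nothing batchtools-specific):

```r
# Rough check of the state codes Slurm reports while an autoscaled node is
# still being provisioned; %i is the job id, %t the compact state code.
out <- system2("squeue",
               args = c("--user", Sys.getenv("USER"), "--noheader",
                        "--format", "%i,%t"),
               stdout = TRUE)
print(out)  # expect states such as "PD" (pending) or "CF" (configuring)
```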
Sounds like it may be worth chasing up that …. Best of luck with it! This took me ages to debug, especially since I had to track these errors through 3 different R packages & troubleshoot an HPC system that I didn't understand at all!
I see, very interesting! I think this is not the case for me.
When I explicitly removed NA entries from the list of returned job ids, I ended up running into the …
As far as I can tell, …
Thanks a lot, this is very reassuring 😄
Now you're definitely on the right track! (I hope...) I'd leave the NAs in and try to work out how they're being produced - do those …?
Options for …
Sorry for the late response, our cluster was down...
The NAs apparently come from the interspersed message …
The whole command being executed is actually …
In my tests, …
That's definitely something to check later on, but for now none of my jobs get suspended while triggering the crash.
Ah, I misread the …
Interesting. I wonder if the batchtools registry sees 3 jobs in your example, or 6 (3 without the appropriate job ids & statuses etc.)?
To get the list of running jobs, batchtools looks at the result of … (batchtools/R/clusterFunctions.R, lines 271 to 276 at 63a6f81; see also lines 130 to 133 at 63a6f81).
Simulating a call to …
But in the end they remove NA entries, so it seems all good (batchtools/R/clusterFunctions.R, line 279 at 63a6f81).
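As an illustration (not the actual batchtools internals), this is roughly how an interspersed message line turns into an NA job id and then gets dropped again:

```r
# Illustration only: a message line mixed into the scheduler output becomes NA
# once the job ids are coerced to integer, and na.omit() later drops it.
status_output <- c("1234567", "some interspersed message", "1234568")  # made-up lines
ids <- as.integer(status_output)  # 1234567 NA 1234568 (with a coercion warning)
na.omit(ids)                      # NA entry removed, as at clusterFunctions.R line 279
```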
I guess I have to keep searching 😬
FWIW, I gave up on debugging this complex network of packages and ended up implementing my own LSF scheduling wrapper.
FWIW, in the next release of future.batchtools you can tweak the underlying cluster functions parameters, e.g.

plan(batchtools_slurm, scheduler.latency = 60, fs.latency = 20)

Until it is on CRAN, you can install the develop version using:

remotes::install_github("HenrikBengtsson/future.batchtools", ref = "develop")

I'm not sure if this fixes any of the problems here, but I thought I'd comment in case it does.
Problem
I'll cross-post a related feature request for future.batchtools, but I'm reliably getting:
Error : Log file ... for job with id 1 not available
when submitting jobs using future.batchtools on a small Slurm cluster.
Importantly, it only ever happens when a new machine has to be recruited. Because I allow certain partitions to scale down to 0 machines when not in use, I reliably get this error the first time I submit code to these partitions, either via future.batchtools or directly from batchtools. Jobs run as expected once a node is running & idle.
The log files do exist, and the jobs continue to run on the cluster after batchtools::readLog errors.
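For context, the submission that triggers this looks roughly like the following (the template file name is an assumption):

```r
# Minimal sketch of the failing workflow, assuming a local "slurm.tmpl" template
# and a partition that has scaled down to 0 nodes.
library(future.batchtools)
plan(batchtools_slurm, template = "slurm.tmpl")

f <- future(Sys.info()[["nodename"]])
value(f)  # errors with "Log file ... not available" while the node is provisioned
```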
Environment
Slurm cluster on Google Compute Engine, set up like this with a near-default config, but with preemptible partitions which can scale down to 0 machines.
SessionInfo
Template
Potential Fix
Jobs reliably take 60 seconds (almost always), and at the outside 69 seconds, to begin after submission when a partition has to recruit a new machine. This means waitForFile times out with the default fs.latency = 65.
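To make the timing concrete, here is a rough illustration of the waiting behaviour as I understand it (not the actual waitForFile implementation):

```r
# Illustrative only: poll for the log file until a timeout; with a 60-69 s
# provisioning delay, a 65 s budget is easily exceeded.
wait_for_log <- function(fn, timeout = 65, sleep = 1) {
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    if (file.exists(fn)) return(TRUE)
    Sys.sleep(sleep)
  }
  FALSE
}
```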
I've played around with both scheduler.latency & fs.latency, and extending fs.latency even to 260 seconds doesn't solve the problem (why? Is waitForFile getting an unexpected value for path by trying to run fs::path_dir(fn) before the machine is fully set up?). My problem is solved by increasing scheduler.latency to 60-70 seconds, which allows me to drop fs.latency down to 10-20. This solves my problem, but makes batchtools slow to recognise that the job has completed.
Feature Request
Divide scheduler.latency into two values - one for the initial sleep after submission, and one for subsequent responses.
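A hypothetical sketch of what that could look like in the cluster functions (the initial.latency name is my invention, not an existing batchtools argument):

```r
# Hypothetical interface, for illustration only; the template path is assumed.
cluster.functions = batchtools::makeClusterFunctionsSlurm(
  template = "slurm.tmpl"
  # initial.latency   = 70,  # proposed: one long sleep right after submission
  # scheduler.latency = 1    # existing: latency for all subsequent polls
)
```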