Proposed bugfix for batchtools reveals bug in future.batchtools? #74
Comments
This seems to be caused by
Here's an example output of
Perhaps an explicit call to
@stuvet I tried out the development version of
The job still automatically fails once loaded with:

    Error: Log file '%s' for job with id %i not available

I've also attempted a crude hack to set

    cluster.functions = batchtools::makeClusterFunctionsSlurm(
      scheduler.latency = 60,  # default is 1
      fs.latency = 120)        # default is 65

However, even with inserting a shim directly into
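For context, a minimal sketch of how such latencies can be wired up, assuming a ~/.batchtools.conf.R configuration file, a Slurm template at ~/slurm.tmpl, and that the plan is created with future.batchtools::batchtools_custom() so that it picks up this configuration (the paths and the exact plumbing here are assumptions for illustration, not the actual setup):

    ## ~/.batchtools.conf.R -- sourced by batchtools when a registry is created
    cluster.functions = batchtools::makeClusterFunctionsSlurm(
      template          = "~/slurm.tmpl",  # placeholder template path
      scheduler.latency = 60,              # poll the scheduler less aggressively
      fs.latency        = 120              # allow more time for files on the shared filesystem
    )

    ## In the R session driving the futures
    library(future.batchtools)
    plan(batchtools_custom, conf.file = "~/.batchtools.conf.R")

The higher fs.latency only gives batchtools more grace time before it declares the log file missing; as noted above, it does not fix the underlying status problem.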
This was my first attempt (before I understood the problem completely) & it didn't solve the problem entirely. Perhaps this may help - obviously made for users of
Hope it helps you.

Also, if I remember right there are some tweaks for the mount flags in /etc/fstab that could help files appear faster in heavy I/O scenarios - if the bugfixes don't help, perhaps the workers actually can't see the log files before they time out. I'm no expert here. I did take a long look at the flags here & changed some. I'll try to find the Slurm-specific documentation & I'll update with the flags that worked for me.

EDIT: On second thoughts, is it possible the spike in I/O you're seeing actually reflects/is associated with recruitment of new worker machines? If so, that's exactly what I was seeing without the I/O - I was scaling up from 0 & seeing this reliably.
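For illustration of the /etc/fstab tweak mentioned above: on NFS, the attribute and lookup caches are what typically delay a freshly created log file from becoming visible on other nodes, and they can be shortened per mount. A sketch only - the server, export path, mount point, and values are placeholders to adapt, not flags taken from a known-working Slurm setup:

    # /etc/fstab (placeholder names): shorten attribute/lookup caching so new
    # files written by one node become visible to the other nodes sooner.
    nfs-server:/export/shared  /shared  nfs  rw,actimeo=3,lookupcache=positive  0  0
    # A stronger (and more expensive) variant is to disable attribute caching
    # altogether with the 'noac' option.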
I have submitted a simple pull request for a bugfix in batchtools, which fuelled some of the behaviour mentioned in #73. I'm writing here because this proposed bugfix reveals an error in future.batchtools, though based on the future.debug output I don't believe the two are related - just that the batchtools bug previously threw the error first.
Describe the bug
When a Slurm worker is already available for the job (& when batchtools::waitForFile is not required by batchtools::getLog -> batchtools::readLog), everything functions correctly (note the status):

But when a Slurm worker needs to be provisioned to run the job (& so batchtools::waitForFile will also be called), the initial result of the call to future.batchtools::status(future) in future.batchtools::await is incorrect:

At this point, logging inserted into batchtools:::waitForFile begins to appear. No more future.debug messages appear until the log file has been detected (now, after the proposed bugfix) & batchtools::waitForFile exits.

To be clear, the logged output does exist, and continues to be written by the running job after future.batchtools::await flags the job as expired & exits.
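For reference, a minimal sketch of the kind of setup that triggers this; the template path and the toy expression are placeholders rather than the actual code from my session:

    library(future.batchtools)
    plan(batchtools_slurm, template = "slurm.tmpl")  # placeholder template file

    ## A job that takes long enough that Slurm has to bring up a fresh worker,
    ## so the log file does not yet exist when future.batchtools first polls.
    f <- future({
      Sys.sleep(120)
      42L
    })

    ## With the behaviour described above, this errors with the job reported as
    ## expired even though the Slurm job is still running and writing its log.
    value(f)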
Please let me know if there's anything else I can do to help resolve this issue.
Expected behavior
future.batchtools::await waits for the running job to exit, even when workers need to be provisioned or batchtools::waitForFile is triggered.
Session information