
Bugfix in waitForFile where fn was full file path #274

Closed

Conversation

stuvet (Contributor) commented Jun 5, 2021

95f27ca

The bug corrected by 95f27ca was identified while waiting for a worker to be provisioned after submitting jobs on a Slurm cluster. It occurs when waitForFile is called from readLog with fs.latency > 0: while waiting for the log file to appear, fn is a full file path, so it isn't found in the output of list.files unless the full.names argument is set.

This solution handles fn as either a full file path or a file name, and provides the full file path in the timeout error message.
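To illustrate the approach, here is a minimal sketch of a waitForFile-style helper. The argument names and defaults are illustrative, not the exact batchtools internals: `fn` is accepted as either a bare file name inside `dir` or a full file path, and the timeout error reports the full path.

```r
# Sketch only: not the actual batchtools implementation.
waitForFile <- function(dir, fn, timeout = 300, sleep = 0.5) {
  # If `fn` has no directory component, build the full path from `dir`;
  # otherwise treat `fn` as the full path itself.
  path <- if (basename(fn) == fn) file.path(dir, fn) else fn

  started <- Sys.time()
  repeat {
    # Listing the directory (rather than only calling file.exists) mirrors the
    # original approach and can help refresh stale NFS directory caches;
    # full.names = TRUE ensures a full-path `fn` is matched as well.
    if (path %in% list.files(dirname(path), full.names = TRUE) ||
        file.exists(path))
      return(TRUE)
    if (difftime(Sys.time(), started, units = "secs") > timeout)
      stop(sprintf("Timeout while waiting for file '%s'", path))
    Sys.sleep(sleep)
  }
}
```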

548129f

The first bugfix revealed that batchtools::getStatus was returning an incorrect 'expired' status for jobs while machines were still being provisioned, which caused future.batchtools::await to handle the 'expired' job & terminate early, as described here. This is addressed, somewhat inelegantly, in 548129f. It would perhaps be better to update the log.file directly in the registry once its value is known, rather than adding it on the fly, but I don't know the full implications of doing that. I'm also not convinced the specified timed.out is right: it won't expire at the same time as the readLog -> waitForFile loop & it may not take queuing into account, but it seems to work well in my environment.
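To make the intent concrete, here is a hypothetical sketch of the status logic (the function name, arguments, and threshold are mine, not the actual batchtools::getStatus code): a submitted job whose log file hasn't appeared yet is reported as 'provisioning' until a grace period elapses, rather than being flagged 'expired' immediately.

```r
# Hypothetical illustration of the idea behind 548129f; not batchtools code.
jobStatus <- function(submitted.at, log.file, finished = FALSE,
                      provision.timeout = 600) {
  if (finished)
    return("finished")
  if (file.exists(log.file))
    return("running")
  waited <- as.numeric(difftime(Sys.time(), submitted.at, units = "secs"))
  if (waited <= provision.timeout)
    "provisioning"  # node still spinning up; don't let await() treat it as lost
  else
    "expired"       # log never appeared within the grace period
}
```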

The new 'provisioning' status is strictly unnecessary, as preventing the 'expired' status would have the same effect on future.batchtools::await, but it feels more explicit.

I don't know about compatibility with other environments, but it seemed solid when tested across 500 jobs on a Slurm cluster starting at 0 nodes & limited to 20 nodes with a queue size of 50. Previously the same series of jobs simply wouldn't run unless nodes were persistent & pre-provisioned.

stuvet (Contributor, Author) commented Jul 24, 2021

PR closed as superseded by #276 & #277
