
Descriptive missing job.rds file error #278

Closed
stuvet wants to merge 8 commits from the sr_slurm_requeue_patch branch

Conversation


@stuvet (Contributor) commented on Jul 24, 2021

I've been troubleshooting the stability of batchtools on Slurm with the default makeClusterFunctionsSlurm (see PRs #276 and #277).

The last (rare) error I can still reproduce is described below.

Expected Behaviour

  • If a submitted job is requeued by Slurm:
    1. batchtools should not report an expired status (addressed by PR #277, Mapped missing Slurm job state codes).
    2. If the job would have run without error on its first submission, the requeued job should also complete successfully (assuming no fatal hardware errors).

Problem

  • Slurm jobs that are requeued after a hardware failure fail within 30 seconds of starting their second run.

Reprex

  • Awkward, because it relies on an available (non-mission-critical) Slurm cluster. Manually deleting the worker node (via GCP) of a running, error-free job triggers a requeue, a delay, and then a reliable error about 20 seconds after the job begins its second run (file path redacted for posting):
Error in gzfile(file, "rb") : cannot open the connection
Calls: <Anonymous> -> doJobCollection.character -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file '.../jobs/job929872958e6074e5662a4c9hd3f312f4.rds', probable reason 'No such file or directory'
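
For illustration, the same error is easy to reproduce outside batchtools: readRDS() on a path that has already been deleted fails in exactly this way. A minimal sketch in plain R:

  f <- tempfile(fileext = ".rds")
  saveRDS(list(jobs = 1:3), f)
  file.remove(f)  # simulates the first run deleting the job collection file
  readRDS(f)      # Error in gzfile(file, "rb") : cannot open the connection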

Cause

batchtools:::doJobCollection.character deletes the job collection .rds file at the end of the first run, so when the failed job is requeued the file is no longer there, which triggers the error above.
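
A descriptive check is what this PR's title suggests; the following is only a hypothetical sketch (the readJobCollection name and the message text are illustrative, not batchtools internals verbatim):

  readJobCollection <- function(file) {
    if (!file.exists(file)) {
      stop(sprintf(paste0(
        "Job collection file '%s' not found. It is removed after the first run, ",
        "so a job requeued by the scheduler cannot read it again. Submitting with ",
        "resources = list(chunks.as.arrayjobs = TRUE) avoids this."), file),
        call. = FALSE)
    }
    readRDS(file)
  }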

Workaround

  • Passing chunks.as.arrayjobs = TRUE in the resources request prevents this error (even when jobs are submitted singly), because it stops the first run from deleting the job collection .rds file; see the sketch below.
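
A minimal sketch of the workaround, assuming a Slurm-backed registry configured elsewhere (e.g. in batchtools.conf.R):

  library(batchtools)
  reg <- makeRegistry(file.dir = "registry")  # registry on storage shared with the workers
  # cluster functions come from the configuration file, e.g.:
  # cluster.functions = makeClusterFunctionsSlurm(template = "slurm")
  batchMap(function(x) x^2, x = 1:4, reg = reg)
  submitJobs(resources = list(chunks.as.arrayjobs = TRUE), reg = reg)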

Questions

  • Apart from needing to clean up the files afterwards, are there any downsides to using chunks.as.arrayjobs = TRUE for single jobs too? If not, it could be a useful default for @HenrikBengtsson when submitting jobs from future.batchtools, simply to avoid triggering an unhandled error and to let jobs requeue as expected (assuming the backend configuration allows it).
  • Perhaps a more explicit option would be better: allow.requeue or prevent.requeue? A sketch of how a template might honour such a flag follows this list.
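
A hypothetical brew-template fragment showing how a job template could gate requeueing on such a resource (allow.requeue is the name proposed above, not an existing batchtools resource; --requeue/--no-requeue are real sbatch flags):

  #!/bin/bash
  <%= if (isTRUE(resources$allow.requeue)) "#SBATCH --requeue" else "#SBATCH --no-requeue" %>
  #SBATCH --job-name=<%= job.name %>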

@stuvet force-pushed the sr_slurm_requeue_patch branch from bcfb7a4 to 9a3bfc1 on July 24, 2021 at 18:44
@stuvet force-pushed the sr_slurm_requeue_patch branch from 9a3bfc1 to c1f645f on July 24, 2021 at 18:47
@stuvet closed this on Jul 24, 2021
@stuvet (Author) commented on Jul 24, 2021

Changed to 'Issue'
