
(3.11.0) Job submission failure caused by race condition in Pyxis configuration


The issue

We have discovered an issue in the way we configure the Pyxis Slurm plugin in ParallelCluster that can lead to job submission failures. When this issue occurs, the cluster enters an invalid state and any subsequent job will fail to run, including jobs that do not use the Pyxis plugin.

If your cluster is affected by this issue, jobs will fail with the following error in their output:

[ec2-user@ip-27-6-21-47 ~]$ cat slurm-1.out
srun: error: spank: Failed to open /opt/slurm/etc/plugstack.conf.d/sed6Yj8Ga: Permission denied
srun: error: Plug-in initialization failed

When the issue occurs, the cluster cannot recover from it automatically, and all subsequently submitted jobs will fail to run. Jobs that are already running are not affected.

The issue is caused by a race condition during the compute node bootstrap process: multiple processes write temporary files into the shared Slurm configuration directory, and the presence of these temporary files causes Slurm to fail when loading the SPANK plugins. If the temporary files are not removed, the cluster is left inoperable.
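
To check whether your cluster has already hit this condition, you can inspect the shared SPANK configuration directory on the head node. The path below assumes the default ParallelCluster Slurm installation under /opt/slurm.

# List the shared SPANK plugin configuration directory.
# Files with random suffixes (e.g. sed6Yj8Ga) are leftover temporary files from
# the compute node bootstrap and indicate that the race condition has occurred.
ls -la /opt/slurm/etc/plugstack.conf.d/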

Affected versions (OSes, schedulers)

  • ParallelCluster 3.11.0

Mitigation

We suggest you run the mitigation steps below if your cluster is running the affected version, even if your jobs are still working as expected, to avoid future failures and cluster inoperability.

sudo mv /opt/slurm/etc/plugstack.conf /opt/slurm/etc/backup.plugstack.conf
sudo -i scontrol reconfigure
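
After the reconfiguration, you can verify that the cluster accepts jobs again with a simple test submission (sinfo and srun are standard Slurm commands):

# Confirm the scheduler loaded the new configuration and runs jobs again.
sinfo
srun -N 1 hostname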

Please note that this mitigation disables SPANK plugins. If your cluster is configured to use SPANK plugins, you need to restore your plugstack.conf while avoiding the wildcard inclusion of plugin configuration files (i.e. do not use include /opt/slurm/etc/plugstack.conf.d/*). For more details, please refer to the SPANK documentation.
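
As a sketch, a restored plugstack.conf that keeps Pyxis enabled could include the plugin configuration file explicitly instead of using the wildcard. The pyxis.conf file name and path below are assumptions based on the default ParallelCluster layout; adjust them to match your installation, and run sudo -i scontrol reconfigure again after editing the file.

# /opt/slurm/etc/plugstack.conf
# Include only the specific plugin configuration files you need, so that stray
# temporary files in the directory are never picked up by Slurm.
include /opt/slurm/etc/plugstack.conf.d/pyxis.conf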
