We have discovered an issue in the official ParallelCluster AMI for Amazon Linux 2023 that consistently leads to job submission failure on p4 compute nodes.
If your cluster is affected by this issue, you will experience job submission failures caused by compute nodes failing to bootstrap. The bootstrap error is:
================================================================================
Error executing action `configure` on resource 'fabric_manager[Configure fabric manager]'
================================================================================
Mixlib::ShellOut::CommandTimeout
--------------------------------
service[nvidia-fabricmanager] (aws-parallelcluster-platform::nvidia_config line 35) had an error: Mixlib::ShellOut::CommandTimeout: Command timed out after 900s:
Command exceeded allowed execution time, process terminated
After repeated bootstrap failures, the cluster is eventually placed in protected mode, in which the partitions are deactivated. See here for how to recover from protected mode.
We are investigating the root cause that prevents the nvidia-fabricmanager service from starting. The issue affects NVIDIA driver version 550.90.07 on Amazon Linux 2023. This driver version is included in the ParallelCluster 3.11.0 and 3.11.1 AMIs.
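To tell this failure apart from other bootstrap errors, you could scan a compute node's bootstrap log for the error signature shown above. The following is a minimal sketch, not an official diagnostic: the function name `has_fabric_manager_timeout` is hypothetical, and the only strings it matches are the ones quoted in the error above.

```python
# Signature strings copied verbatim from the bootstrap error in this issue.
FABRIC_MANAGER_RESOURCE = "fabric_manager[Configure fabric manager]"
TIMEOUT_ERROR = "Mixlib::ShellOut::CommandTimeout"
FABRIC_MANAGER_SERVICE = "service[nvidia-fabricmanager]"


def has_fabric_manager_timeout(log_text: str) -> bool:
    """Return True if log_text contains all parts of the reported error.

    Hypothetical helper: it only does substring matching, so it cannot
    distinguish this issue from another fabric manager timeout that
    happens to produce the same messages.
    """
    return (
        FABRIC_MANAGER_RESOURCE in log_text
        and TIMEOUT_ERROR in log_text
        and FABRIC_MANAGER_SERVICE in log_text
    )
```

To use it, read the bootstrap log of a failed compute node into a string and pass it to the function. The location of that log depends on your setup and is not stated in this issue.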
Affected versions (OSes, schedulers)
Official AMIs of ParallelCluster 3.11.0 and 3.11.1 for Amazon Linux 2023
We observed impact on p4 instances, but we cannot exclude impact on other GPU instance types.
We tested g4dn, g5, g6, and p5 instances and found this issue does not manifest on them.
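The affected scope above can be summarized as a small lookup. This is a sketch under stated assumptions: the function name `classify_exposure` and the result labels are made up for illustration, and `alinux2023` is assumed to be the OS identifier used in your cluster configuration. It applies only to the official AMIs and does not check whether an instance type has GPUs at all.

```python
# Versions and instance families taken from this issue.
AFFECTED_PCLUSTER_VERSIONS = {"3.11.0", "3.11.1"}
AFFECTED_OS = "alinux2023"  # assumption: OS identifier for Amazon Linux 2023
CONFIRMED_FAMILY_PREFIX = "p4"  # impact observed on p4 instances
TESTED_UNAFFECTED_FAMILIES = {"g4dn", "g5", "g6", "p5"}  # tested, no impact seen


def classify_exposure(pcluster_version: str, os_name: str, instance_type: str) -> str:
    """Classify a compute resource against the scope reported in this issue.

    Hypothetical helper; the returned labels are illustrative only.
    """
    if pcluster_version not in AFFECTED_PCLUSTER_VERSIONS or os_name != AFFECTED_OS:
        return "outside reported scope"
    # The instance family is the part before the dot, e.g. "p4d" in "p4d.24xlarge".
    family = instance_type.split(".")[0]
    if family.startswith(CONFIRMED_FAMILY_PREFIX):
        return "affected"
    if family in TESTED_UNAFFECTED_FAMILIES:
        return "not observed"
    # Other GPU instance types were not tested and cannot be excluded.
    return "unknown"
```

For example, `classify_exposure("3.11.1", "alinux2023", "p4d.24xlarge")` returns `"affected"`, while an untested family such as `g6e` returns `"unknown"` rather than being assumed safe.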
Mitigation
You can find a detailed explanation and the mitigation of the problem here: (3.11.x) Job submission failure with Amazon Linux 2023