(3.11.x) Job submission failure with Amazon Linux 2023 #6571

hgreebe · 2024-11-18T02:01:49Z

The issue

We have discovered an issue in the official ParallelCluster AMI for Amazon Linux 2023 that consistently leads to job submission failure on p4 compute nodes.

If your cluster is affected by this issue, you will experience job submission failures caused by compute nodes failing to bootstrap. The bootstrap error is:

================================================================================
Error executing action `configure` on resource 'fabric_manager[Configure fabric manager]'
================================================================================

Mixlib::ShellOut::CommandTimeout
--------------------------------
service[nvidia-fabricmanager] (aws-parallelcluster-platform::nvidia_config line 35) had an error: Mixlib::ShellOut::CommandTimeout: Command timed out after 900s:
Command exceeded allowed execution time, process terminated
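
To check whether a failed compute node hit this specific error, one option is to search its bootstrap log text for the signature quoted above. This is a minimal sketch, not part of ParallelCluster: the function name `has_fabric_manager_timeout` is hypothetical, and because this issue does not say where the log is stored, the function takes the log contents as a string.

```python
# Signature strings copied verbatim from the bootstrap error quoted in this issue.
FAILED_RESOURCE = "fabric_manager[Configure fabric manager]"
TIMEOUT_ERROR = "Mixlib::ShellOut::CommandTimeout"
FAILED_SERVICE = "service[nvidia-fabricmanager]"


def has_fabric_manager_timeout(log_text: str) -> bool:
    """Return True if log_text contains all three markers of this issue's error.

    Hypothetical helper: it only does substring matching and does not
    read any file or contact the cluster.
    """
    markers = (FAILED_RESOURCE, TIMEOUT_ERROR, FAILED_SERVICE)
    return all(marker in log_text for marker in markers)


# Usage: pass in log text you have already collected from the node.
sample = (
    "Error executing action `configure` on resource "
    "'fabric_manager[Configure fabric manager]'\n"
    "service[nvidia-fabricmanager] had an error: "
    "Mixlib::ShellOut::CommandTimeout: Command timed out after 900s"
)
print(has_fabric_manager_timeout(sample))
```

Requiring all three markers keeps the check narrow, so an unrelated command timeout elsewhere in the bootstrap is not reported as this issue.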

After repeated bootstrap failures, the cluster is eventually put into protected mode, in which the partitions are deactivated. See here for how to recover from protected mode.

We are investigating the root cause that prevents the nvidia-fabricmanager service from starting. The issue affects NVIDIA driver version 550.90.07 on Amazon Linux 2023. This driver version is included in the ParallelCluster 3.11.0 and 3.11.1 AMIs.

Affected versions (OSes, schedulers)

- Official AMIs of ParallelCluster 3.11.0 and 3.11.1 for Amazon Linux 2023
- We observed impact on p4 instances, but we cannot exclude impact on other GPU instance types.
  - We tested g4dn, g5, g6, and p5 instances and found this issue does not manifest on them.
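
The scope above can be written down as a small lookup, which may help when auditing several cluster configurations. This is a sketch under stated assumptions: the function name `classify` and its return labels are hypothetical, the OS identifier `alinux2023` is assumed to be the value used in the cluster configuration, and the cluster is assumed to use the official AMI. The function does not check whether an instance type actually has GPUs.

```python
# Scope as described in this issue; anything not listed here is an assumption.
AFFECTED_VERSIONS = {"3.11.0", "3.11.1"}
AFFECTED_OS = "alinux2023"  # assumed cluster-config identifier for Amazon Linux 2023
TESTED_UNAFFECTED_FAMILIES = {"g4dn", "g5", "g6", "p5"}


def classify(os_name: str, pcluster_version: str, instance_type: str) -> str:
    """Classify a configuration against the scope listed in this issue.

    Returns one of four hypothetical labels:
      "out of scope" - OS or version is not among those listed as affected
      "affected"     - p4 instance family, where impact was observed
      "not observed" - family was tested and the issue did not manifest
      "unknown"      - family was not tested; impact cannot be excluded
    """
    if os_name != AFFECTED_OS or pcluster_version not in AFFECTED_VERSIONS:
        return "out of scope"
    family = instance_type.split(".")[0]  # e.g. "p4d.24xlarge" -> "p4d"
    if family.startswith("p4"):
        return "affected"
    if family in TESTED_UNAFFECTED_FAMILIES:
        return "not observed"
    return "unknown"


print(classify("alinux2023", "3.11.1", "p4d.24xlarge"))
```

Families are matched exactly against the tested list, so a variant that this issue does not mention is reported as "unknown" rather than assumed safe.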

Mitigation

You can find a detailed explanation and the mitigation of the problem here: (3.11.x) Job submission failure with Amazon Linux 2023

Bingjiling · 2024-11-25T21:01:30Z

Hi team, my team is blocked on this issue as well. Do you have an estimate for when the patch will be released?

adebayoj · 2024-11-26T22:07:42Z

Following up here as well. Does this issue manifest for the official ParallelCluster AMIs for Ubuntu 20.04 or 22.04? Thanks!

gmarciani · 2024-11-27T15:40:01Z

Hi @adebayoj,

To the best of our knowledge, the issue affects only Amazon Linux 2023.

gmarciani · 2024-11-27T15:41:27Z

@Bingjiling we are actively working on fixing this issue.
The patch release is planned for end of December.

Bingjiling · 2024-11-27T19:28:35Z

@gmarciani Thanks for the update! I will try to use Amazon Linux 2 AMI instead.

hgreebe added the known issue label Nov 18, 2024

gmarciani mentioned this issue Nov 25, 2024 in: p4d instance not able to run job with pcluster 3.11.1 #6549 (Open)

gmarciani added the pending release label Nov 25, 2024
