(3.0.0 - 3.6.0) Compute nodes belonging to more than one partition cause compute to over-scale
If a cluster is created with a custom Slurm configuration that places static compute nodes into more than one partition, ParallelCluster will attempt to launch as many EC2 instances for a given node as the number of partitions that node belongs to. This results in over-scaling and in node terminations, because multiple instances end up backing a single node.
The example cluster configuration below defines a Slurm partition named queue that contains two static nodes, queue-st-compute-1 and queue-st-compute-2.
# Cluster configuration snippet
SlurmQueues:
  - Name: queue
    ComputeResources:
      - Name: compute
        InstanceType: c5.2xlarge
        MinCount: 2
        MaxCount: 2
The example below customizes the Slurm configuration by specifying two additional partitions not managed by ParallelCluster: CustomerPartition1 and CustomerPartition2. These additional partitions are both configured to use compute node names (queue-st-compute-1 and queue-st-compute-2) that overlap with the partition managed by ParallelCluster (defined by NodeSet).
# The following slurm.conf snippet reuses the queue-st-compute-1 and queue-st-compute-2 node names
# for both partitions CustomerPartition1 and CustomerPartition2
NodeSet=nodeset Nodes=queue-st-compute-[1-2]
PartitionName=CustomerPartition1 Nodes=nodeset PreemptMode=REQUEUE PriorityTier=10 GraceTime=600 Default=YES MaxTime=2:00:00
PartitionName=CustomerPartition2 Nodes=nodeset PreemptMode=REQUEUE PriorityTier=4 GraceTime=1800 MaxTime=8:00:00
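With this custom configuration, each static node is a member of three partitions: queue, CustomerPartition1, and CustomerPartition2. A quick way to confirm a node's partition membership from the head node (a sketch; it assumes the standard Slurm client tools are on the PATH and reuses the node names from the snippet above):
# The Partitions= field lists every partition that references the node.
scontrol show node queue-st-compute-1 | grep -o "Partitions=[^ ]*"
# Alternatively, sinfo prints the node once per partition it belongs to.
sinfo -N --nodes=queue-st-compute-1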
In the logs, you will see both nodes, queue-st-compute-1 and queue-st-compute-2, repeated three times in the Found the following unhealthy static nodes: log line. You should also see the instance IDs repeated three times in the Terminating instances log line:
2023-01-01 01:00:00,629 - [slurm_plugin.slurm_resources:_is_static_node_configuration_valid] - WARNING - Node state check: static node without nodeaddr set, node queue-st-compute-1(queue-st-compute-1), node state DOWN+CLOUD+NOT_RESPONDING:
2023-01-01 01:00:00,629 - [slurm_plugin.slurm_resources:_is_static_node_configuration_valid] - WARNING - Node state check: static node without nodeaddr set, node queue-st-compute-2(queue-st-compute-2), node state DOWN+CLOUD+NOT_RESPONDING:
2023-01-01 01:00:00,630 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Found the following unhealthy static nodes: (x6) ['queue-st-compute-1(queue-st-compute-1)',
'queue-st-compute-2(queue-st-compute-2)', 'queue-st-compute-1(queue-st-compute-1)',
'queue-st-compute-2(queue-st-compute-2)', 'queue-st-compute-1(queue-st-compute-1)',
'queue-st-compute-2(queue-st-compute-2)']
2023-01-01 01:00:00,630 - [slurm_plugin.clustermgtd:_handle_unhealthy_static_nodes] - INFO - Setting unhealthy static nodes to DOWN
2023-01-01 01:00:00,634 - [slurm_plugin.clustermgtd:_handle_unhealthy_static_nodes] - INFO - Terminating instances backing unhealthy static nodes
2023-01-01 01:00:00,639 - [slurm_plugin.instance_manager:delete_instances] - INFO - Terminating instances (x6) ['i-instanceid1',
'i-instanceid0', 'i-instanceid1', 'i-instanceid0', 'i-instanceid1', 'i-instanceid0']
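These messages are written by clustermgtd on the head node. A quick way to check whether a cluster is hitting this issue (assuming the default clustermgtd log location, /var/log/parallelcluster/clustermgtd):
# Each static node should appear at most once per line; duplicated entries
# indicate the node is being counted once per partition it belongs to.
grep "Found the following unhealthy static nodes" /var/log/parallelcluster/clustermgtd
# The resulting instance terminations are logged here as well.
grep "Terminating instances" /var/log/parallelcluster/clustermgtd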
Affected versions:
- ParallelCluster versions >= 3.0.0 and <= 3.6.0, on all OSes.
- Only the Slurm scheduler is affected.
The following mitigation has been tested only on ParallelCluster versions 3.5.0-3.6.0.
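Before applying it, you can confirm which ParallelCluster version your cluster was created with. One way, from a machine with the ParallelCluster CLI installed (a sketch; my-cluster is a placeholder for your cluster name):
# The describe-cluster output includes the ParallelCluster version used to create the cluster.
pcluster describe-cluster --cluster-name my-cluster | grep '"version"'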
- Save the following text as /tmp/pcluster.patch on your head node:
diff --git a/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/clustermgtd.py b/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/clustermgtd.py
index 557c798..727c62c 100644
--- /opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/clustermgtd.py
+++ /opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/clustermgtd.py
@@ -1098,10 +1098,10 @@ class ClusterManager:
 
     @staticmethod
     def _find_active_nodes(partitions_name_map):
-        active_nodes = []
+        active_nodes = set()
         for partition in partitions_name_map.values():
             if partition.state != "INACTIVE":
-                active_nodes += partition.slurm_nodes
+                active_nodes |= set(partition.slurm_nodes)
         return active_nodes
 
     def _is_node_in_replacement_valid(self, node, check_node_is_valid):
- Create and run the following script on the head node as the root user:
#!/bin/bash
set -e
. "/etc/parallelcluster/cfnconfig"
# The patch must be applied from the root path
pushd /
# Apply the patch to clustermgtd.py, save backup to clustermgtd.py.orig
cat /tmp/pcluster.patch | patch -p0 -b
# Restart clustermgtd
source /opt/parallelcluster/pyenv/versions/cookbook_virtualenv/bin/activate
supervisorctl reload
popd
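Once the script completes, you can verify that the patch was applied and that clustermgtd restarted. A minimal check, reusing the paths from the patch and script above (the clustermgtd log defaults to /var/log/parallelcluster/clustermgtd):
# The patched line should now be present; patch -b kept a clustermgtd.py.orig backup alongside it.
grep -n "active_nodes = set()" /opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/clustermgtd.py
# Confirm the daemons came back up after the supervisord reload.
source /opt/parallelcluster/pyenv/versions/cookbook_virtualenv/bin/activate
supervisorctl status
# Watch the log to confirm nodes are no longer reported in duplicate.
tail -f /var/log/parallelcluster/clustermgtd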