(3.0.0 - 3.6.0) Compute nodes belonging to more than one partition cause compute to over-scale
If a cluster is created with a custom Slurm configuration that places static compute nodes into more than one partition, ParallelCluster will attempt to launch as many EC2 instances for a given node as the number of partitions that node belongs to. This results in over-scaling and in node terminations, because multiple instances end up backing a single node.
The example cluster configuration below defines a Slurm partition named queue that contains two static nodes, queue-st-compute-1 and queue-st-compute-2.
# Cluster configuration snippet
SlurmQueues:
  - Name: queue
    ComputeResources:
      - Name: compute
        InstanceType: c5.2xlarge
        MinCount: 2
        MaxCount: 2
The example below customizes the Slurm configuration by specifying two additional partitions not managed by ParallelCluster: CustomerPartition1 and CustomerPartition2. These additional partitions are both configured to use compute node names (queue-st-compute-1 and queue-st-compute-2) that overlap with the partition managed by ParallelCluster (defined by NodeSet).
# The following slurm.conf snippet reuses the queue-st-compute-1 and queue-st-compute-2 node names
# for both partitions CustomerPartition1 and CustomerPartition2
NodeSet=nodeset Nodes=queue-st-compute-[1-2]
PartitionName=CustomerPartition1 Nodes=nodeset PreemptMode=REQUEUE PriorityTier=10 GraceTime=600 Default=YES MaxTime=2:00:00
PartitionName=CustomerPartition2 Nodes=nodeset PreemptMode=REQUEUE PriorityTier=4 GraceTime=1800 MaxTime=8:00:00
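With this custom configuration, each static node is a member of three partitions: queue, CustomerPartition1, and CustomerPartition2. A quick way to confirm a node's partition membership from the head node (a sketch; it assumes the standard Slurm client tools are on the PATH and reuses the node names from the snippet above):
# The Partitions= field lists every partition that references the node.
scontrol show node queue-st-compute-1 | grep -o "Partitions=[^ ]*"
# Alternatively, sinfo prints the node once per partition it belongs to.
sinfo -N --nodes=queue-st-compute-1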
In the logs, you will see both nodes, queue-st-compute-1 and queue-st-compute-2, repeated three times in the Found the following unhealthy static nodes: log line. You should also see the instance IDs repeated three times in the Terminating instances log line:
2023-01-01 01:00:00,629 - [slurm_plugin.slurm_resources:_is_static_node_configuration_valid] - WARNING - Node state check: static node without nodeaddr set, node queue-st-compute-1(queue-st-compute-1), node state DOWN+CLOUD+NOT_RESPONDING:
2023-01-01 01:00:00,629 - [slurm_plugin.slurm_resources:_is_static_node_configuration_valid] - WARNING - Node state check: static node without nodeaddr set, node queue-st-compute-2(queue-st-compute-2), node state DOWN+CLOUD+NOT_RESPONDING:
2023-01-01 01:00:00,630 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Found the following unhealthy static nodes: (x6) ['queue-st-compute-1(queue-st-compute-1)',
'queue-st-compute-2(queue-st-compute-2)', 'queue-st-compute-1(queue-st-compute-1)',
'queue-st-compute-2(queue-st-compute-2)', 'queue-st-compute-1(queue-st-compute-1)',
'queue-st-compute-2(queue-st-compute-2)']
2023-01-01 01:00:00,630 - [slurm_plugin.clustermgtd:_handle_unhealthy_static_nodes] - INFO - Setting unhealthy static nodes to DOWN
2023-01-01 01:00:00,634 - [slurm_plugin.clustermgtd:_handle_unhealthy_static_nodes] - INFO - Terminating instances backing unhealthy static nodes
2023-01-01 01:00:00,639 - [slurm_plugin.instance_manager:delete_instances] - INFO - Terminating instances (x6) ['i-instanceid1',
'i-instanceid0', 'i-instanceid1', 'i-instanceid0', 'i-instanceid1', 'i-instanceid0']
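These messages are written by clustermgtd on the head node. A quick way to check whether a cluster is hitting this issue (assuming the default clustermgtd log location, /var/log/parallelcluster/clustermgtd):
# Each static node should appear at most once per line; duplicated entries
# indicate the node is being counted once per partition it belongs to.
grep "Found the following unhealthy static nodes" /var/log/parallelcluster/clustermgtd
# The resulting instance terminations are logged here as well.
grep "Terminating instances" /var/log/parallelcluster/clustermgtd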
Affected versions:
- ParallelCluster versions >= 3.0.0 and <= 3.6.0, on all OSes.
- Only the Slurm scheduler is affected.
The following mitigation has been tested only on ParallelCluster versions 3.5.0-3.6.0.
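Before applying it, you can confirm which ParallelCluster version your cluster was created with. One way, from a machine with the ParallelCluster CLI installed (a sketch; my-cluster is a placeholder for your cluster name):
# The describe-cluster output includes the ParallelCluster version used to create the cluster.
pcluster describe-cluster --cluster-name my-cluster | grep '"version"'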
- Save the following text as /tmp/pcluster.patch on your head node:
diff --git a/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/clustermgtd.py b/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/clustermgtd.py
index 557c798..727c62c 100644
--- /opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/clustermgtd.py
+++ /opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/clustermgtd.py
@@ -1098,10 +1098,10 @@ class ClusterManager:
 
     @staticmethod
     def _find_active_nodes(partitions_name_map):
-        active_nodes = []
+        active_nodes = set()
         for partition in partitions_name_map.values():
             if partition.state != "INACTIVE":
-                active_nodes += partition.slurm_nodes
+                active_nodes |= set(partition.slurm_nodes)
         return active_nodes
 
     def _is_node_in_replacement_valid(self, node, check_node_is_valid):
- Create and run the following script on the head node as the root user:
#!/bin/bash
set -e
. "/etc/parallelcluster/cfnconfig"
# The patch must be applied from the root path
pushd /
# Apply the patch to clustermgtd.py, save backup to clustermgtd.py.orig
cat /tmp/pcluster.patch | patch -p0 -b
# Restart clustermgtd
source /opt/parallelcluster/pyenv/versions/cookbook_virtualenv/bin/activate
supervisorctl reload
popd
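Once the script completes, you can verify that the patch was applied and that clustermgtd restarted. A minimal check, reusing the paths from the patch and script above (the clustermgtd log defaults to /var/log/parallelcluster/clustermgtd):
# The patched line should now be present; patch -b kept a clustermgtd.py.orig backup alongside it.
grep -n "active_nodes = set()" /opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/clustermgtd.py
# Confirm the daemons came back up after the supervisord reload.
source /opt/parallelcluster/pyenv/versions/cookbook_virtualenv/bin/activate
supervisorctl status
# Watch the log to confirm nodes are no longer reported in duplicate.
tail -f /var/log/parallelcluster/clustermgtd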