Possible performance degradation on ALinux2 when using ParallelCluster 2.11.0 and custom AMIs from 2.6.0 to 2.11.0
The performance of tightly coupled / MPI workloads on clusters with Amazon Linux 2 operating system may be impacted by enabling CloudWatch logging.
Our preliminary analysis indicates that this is likely related to CloudWatch Agent version 1.247348.0b251302. You can check which version is installed by running the command: yum list amazon-cloudwatch-agent
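To make the version check scriptable, the comparison can be wrapped in a small shell helper. This is a minimal sketch rather than part of the official guidance: the function name `is_affected_cw_agent` is hypothetical, and it assumes the version string reported by rpm begins with the version named above.

```shell
#!/bin/bash
# Version that the analysis above identifies as likely affected.
AFFECTED_CW_AGENT_VERSION="1.247348.0b251302"

# Hypothetical helper: returns 0 when the given version string starts with
# the affected version, 1 otherwise.
is_affected_cw_agent() {
    local installed="$1"
    case "${installed}" in
        "${AFFECTED_CW_AGENT_VERSION}"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use on a node where the agent is installed (assumes rpm reports
# the version in the same form as above):
#   installed="$(rpm -q --queryformat '%{VERSION}' amazon-cloudwatch-agent)"
#   if is_affected_cw_agent "${installed}"; then echo "affected"; fi
```

The helper takes the version as an argument instead of querying the package database itself, so it can be exercised on any machine.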
This performance issue may affect workloads differently depending on cluster size and applications used.
There are multiple options to work around the issue.
This option downgrades the CloudWatch Agent on the compute nodes through a post installation script. It can be applied to new clusters, or to existing clusters after an update operation. The steps are:
- Create a bash script, e.g. disable-cw-script.sh, with the following content (or add the code to your existing post installation script)
#!/bin/bash
# Load the ParallelCluster node settings, including cfn_node_type
. "/etc/parallelcluster/cfnconfig"
case "${cfn_node_type}" in
    ComputeFleet)
        # Downgrade the CloudWatch Agent on compute nodes only
        sudo systemctl stop amazon-cloudwatch-agent.service
        sudo yum -y downgrade amazon-cloudwatch-agent-1.247347.4-1.amzn2
        sudo systemctl start amazon-cloudwatch-agent.service
        ;;
    *)
        ;;
esac
- Upload the script to an S3 bucket with correct permissions, see: https://docs.aws.amazon.com/parallelcluster/latest/ug/pre_post_install.html
E.g.:
aws s3 cp disable-cw-script.sh s3://yourbucket/
- Add the following setting to your cluster configuration
[cluster yourcluster]
post_install = s3://yourbucket/disable-cw-script.sh
...
- Either create a new cluster or follow the next steps to update an existing cluster
Update an existing cluster with the post installation script configured in the previous steps.
- Stop the cluster with the pcluster stop command
- Update the cluster with the pcluster update command
- Restart the cluster with the pcluster start command
All compute nodes will then start with a CloudWatch Agent version that does not cause the performance degradation.
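The stop, update, and start sequence above can be sketched as a single shell function. This is an illustrative wrapper, not an official tool: the function name `update_cluster_with_post_install` and the `PCLUSTER_CMD` override are hypothetical, the cluster name is a placeholder, and `pcluster update` may prompt for confirmation depending on the installed version.

```shell
#!/bin/bash
# Hypothetical wrapper around the three steps above. PCLUSTER_CMD can be
# overridden (e.g. with "echo pcluster") to preview the commands.
PCLUSTER_CMD="${PCLUSTER_CMD:-pcluster}"

update_cluster_with_post_install() {
    local cluster_name="$1"
    if [ -z "${cluster_name}" ]; then
        echo "usage: update_cluster_with_post_install CLUSTER_NAME" >&2
        return 2
    fi
    # Stop the compute fleet, apply the updated configuration, then restart.
    # Abort at the first failing step so the fleet is not restarted on error.
    ${PCLUSTER_CMD} stop "${cluster_name}" || return 1
    ${PCLUSTER_CMD} update "${cluster_name}" || return 1
    ${PCLUSTER_CMD} start "${cluster_name}" || return 1
}

# Example use:
#   update_cluster_with_post_install yourcluster
```

Setting `PCLUSTER_CMD="echo pcluster"` before calling the function prints the three commands instead of running them, which is a cheap way to review the sequence first.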
This option disables CloudWatch logging and applies only to new clusters.
Create a cluster with the following configuration:
[cluster yourcluster]
cw_log_settings = custom-cw
...
[cw_log custom-cw]
enable = false
CloudWatch logging and the CloudWatch Agent service will be disabled by default, avoiding the possible performance degradation issue.
This option builds a custom AMI with the downgraded CloudWatch Agent and applies only to new clusters.
- Follow the official documentation to modify an existing ParallelCluster AMI
- As part of the AMI customization step, connect to the instance and run the following command:
sudo yum -y downgrade amazon-cloudwatch-agent-1.247347.4-1.amzn2
- Complete the steps to create a custom AMI
- Create a cluster using the generated AMI with the custom_ami parameter.
This option can be applied to existing Slurm clusters.
Customize your job submission script by adding the steps that downgrade the CloudWatch Agent on the nodes allocated to the job. Example:
#!/bin/bash
#SBATCH --job-name=yourjob
# add your options
# downgrade the CloudWatch Agent on every node allocated to the job
for i in $(scontrol show hostnames "$SLURM_JOB_NODELIST")
do
    ssh "$i" "sudo systemctl stop amazon-cloudwatch-agent.service"
    ssh "$i" "sudo yum -y downgrade amazon-cloudwatch-agent-1.247347.4-1.amzn2"
    ssh "$i" "sudo systemctl start amazon-cloudwatch-agent.service"
done
# start your application
sleep 100
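Because the job script above repeats the downgrade on every submission, it may be worth skipping nodes that already run the target package. The following is a minimal sketch of such a guard, not part of the original instructions: the function name `needs_cw_agent_downgrade` is hypothetical, and it assumes the output of `rpm -q amazon-cloudwatch-agent` on the node starts with the package name used in the downgrade command.

```shell
#!/bin/bash
# Package name and version used by the downgrade command above.
TARGET_CW_AGENT_PKG="amazon-cloudwatch-agent-1.247347.4-1.amzn2"

# Hypothetical helper: given the output of `rpm -q amazon-cloudwatch-agent`
# from a node, returns 0 if the downgrade should run and 1 if it can be
# skipped.
needs_cw_agent_downgrade() {
    local installed_pkg="$1"
    case "${installed_pkg}" in
        "${TARGET_CW_AGENT_PKG}"*) return 1 ;;  # already at the target version
        *"is not installed"*) return 1 ;;       # nothing to downgrade
        *) return 0 ;;
    esac
}

# Example use inside the loop of the job script:
#   pkg="$(ssh "$i" "rpm -q amazon-cloudwatch-agent")"
#   if needs_cw_agent_downgrade "${pkg}"; then
#       ssh "$i" "sudo yum -y downgrade ${TARGET_CW_AGENT_PKG}"
#   fi
```

With this guard, the agent is stopped and restarted only on nodes that still have a different version installed.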