AWS ParallelCluster v3.6.0

We're excited to announce the release of AWS ParallelCluster Cookbook 3.6.0.

This release is associated with AWS ParallelCluster v3.6.0.

ENHANCEMENTS

  • Add support for RHEL8.
  • Add support for customizing the cluster Slurm configuration via the ParallelCluster configuration YAML file (see the configuration sketch after this list).
  • Build Slurm with support for LUA.
  • Add a health check manager and a GPU health check, which can be activated through the cluster configuration (see the second sketch after this list).
    Health check manager execution is triggered by a Slurm prolog script. The GPU check verifies the health of a node by running the NVIDIA DCGM level 2 diagnostic.
  • Add log rotation support for ParallelCluster managed logs.
  • Track head node memory and root volume disk utilization using the mem_used_percent and disk_used_percent metrics collected through the CloudWatch Agent.
  • Enforce the DCV Authenticator Server to use at least the TLS 1.2 protocol when creating the SSL socket.
  • Load kernel module nvidia-uvm by default to provide Unified Virtual Memory (UVM) functionality to the CUDA driver.
  • Install NVIDIA Persistence Daemon as a system service.
  • Install the NVIDIA Data Center GPU Manager (DCGM) package on all supported OSes except for CentOS 7 and Amazon Linux 2 on aarch64.
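
The Slurm customization feature above is exposed through the cluster configuration YAML. The fragment below is a minimal sketch only: the CustomSlurmSettings key names, the queue name, and the instance type are assumptions, so verify them against the ParallelCluster 3.6 configuration reference before use.

```yaml
# Illustrative cluster configuration fragment (key names are assumptions).
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    # Cluster-wide Slurm parameters added to the generated Slurm configuration
    CustomSlurmSettings:
      - MaxJobCount: 15000
  SlurmQueues:
    - Name: queue1
      # Queue-level (partition) Slurm parameters
      CustomSlurmSettings:
        PriorityTier: 10
      ComputeResources:
        - Name: compute1
          InstanceType: c5.xlarge
          MinCount: 0
          MaxCount: 10
```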
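
For the GPU health check, a queue-level switch in the cluster configuration enables the DCGM diagnostic that the prolog runs. This is a sketch assuming a HealthChecks/Gpu/Enabled key path and a hypothetical queue name and instance type; check the 3.6 configuration reference for the exact schema.

```yaml
# Illustrative queue fragment enabling the GPU health check (key path assumed).
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-queue
      HealthChecks:
        Gpu:
          Enabled: true   # run the DCGM level 2 diagnostic from the Slurm prolog
      ComputeResources:
        - Name: gpu-compute
          InstanceType: p4d.24xlarge
          MinCount: 0
          MaxCount: 4
```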

CHANGES

  • Upgrade Slurm to version 23.02.2.
  • Upgrade munge to version 0.5.15.
  • Set Slurm default TreeWidth to 30.
  • Set Slurm prolog and epilog configurations to target the directories /opt/slurm/etc/scripts/prolog.d/ and /opt/slurm/etc/scripts/epilog.d/, respectively.
  • Set Slurm BatchStartTimeout to 3 minutes to allow up to 3 minutes of prolog execution during compute node registration (a sketch for overriding such defaults follows this list).
  • Upgrade EFA installer to 1.22.1.
    • Dkms: 2.8.3-2
    • Efa-driver: efa-2.1.1g
    • Efa-config: efa-config-1.13-1
    • Efa-profile: efa-profile-1.5-1
    • Libfabric-aws: libfabric-aws-1.17.1-1
    • Rdma-core: rdma-core-43.0-1
    • Open MPI: openmpi40-aws-4.1.5-1
  • Upgrade Lustre client version to 2.12 on Amazon Linux 2 (same version available on Ubuntu 20.04, 18.04 and CentOS >= 7.7).
  • Upgrade Lustre client version to 2.10.8 on CentOS 7.6.
  • Upgrade aws-cfn-bootstrap to version 2.0-24.
  • Upgrade NVIDIA driver to version 470.182.03.
  • Upgrade NVIDIA Fabric Manager to version 470.182.03.
  • Upgrade NVIDIA CUDA Toolkit to version 11.8.0.
  • Upgrade NVIDIA CUDA sample to version 11.8.0.
  • Upgrade Intel MPI Library to 2021.9.0.43482.
  • Upgrade NICE DCV to version 2023.0-15022.
    • server: 2023.0.15022-1
    • xdcv: 2023.0.547-1
    • gl: 2023.0.1027-1
    • web_viewer: 2023.0.15022-1
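
Defaults such as BatchStartTimeout and TreeWidth are plain Slurm parameters, so in principle they could be tuned through the Slurm customization feature described in the enhancements section. The sketch below is purely illustrative: the key names are assumptions, and parameters managed directly by ParallelCluster may be deny-listed and rejected at configuration validation time.

```yaml
# Illustrative only: overriding scheduler defaults via custom Slurm settings
# (key names are assumptions; ParallelCluster may reject managed parameters).
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    CustomSlurmSettings:
      - BatchStartTimeout: 300   # extend beyond the new 3-minute default
      - TreeWidth: 45            # example override of the new default of 30
```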

BUG FIXES

  • Fix an issue that was causing a misalignment of compute node IP addresses on instances with multiple network interfaces.
  • Fix replacement of StoragePass in slurm_parallelcluster_slurmdbd.conf when a queue parameter update is performed and the Slurm accounting configurations are not updated.
  • Fix an issue causing the cfn-hup daemon to fail when it is restarted.
  • Fix an issue causing compute nodes with NVIDIA GPUs not to resume correctly after an scontrol reboot command.