AWS ParallelCluster v3.6.0
We're excited to announce the release of AWS ParallelCluster Cookbook 3.6.0.
This release is associated with AWS ParallelCluster v3.6.0.
ENHANCEMENTS
- Add support for RHEL8.
- Add support for customizing the cluster Slurm configuration via the ParallelCluster configuration YAML file.
- Build Slurm with support for LUA.
- Add health check manager and GPU health check, which can be activated through cluster configuration. Health check manager execution is triggered by a Slurm prolog script. The GPU check verifies the health of a node by executing the NVIDIA DCGM L2 diagnostic.
- Add log rotation support for ParallelCluster managed logs.
- Track head node memory and root volume disk utilization using the mem_used_percent and disk_used_percent metrics collected through the CloudWatch Agent.
- Enforce the DCV Authenticator Server to use at least the TLS-1.2 protocol when creating the SSL Socket.
- Load kernel module nvidia-uvm by default to provide Unified Virtual Memory (UVM) functionality to the CUDA driver.
- Install NVIDIA Persistence Daemon as a system service.
- Install NVIDIA Data Center GPU Manager (DCGM) package on all supported OSes except for aarch64 centos7 and alinux2.
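As an illustration of the TLS item in the list above, the following is a minimal Python sketch of how a server can refuse protocol versions older than TLS 1.2 using the standard ssl module. It is not the DCV Authenticator Server's actual code: the function name make_tls12_server_context and its certfile/keyfile parameters are hypothetical and chosen only for this example.

```python
import ssl


def make_tls12_server_context(certfile=None, keyfile=None):
    """Build a server-side SSL context that rejects anything older than TLS 1.2.

    Hypothetical helper for illustration; not taken from ParallelCluster.
    """
    # PROTOCOL_TLS_SERVER negotiates the highest version both peers support.
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Set the floor: handshakes offering only TLS 1.0/1.1 (or SSLv3) will fail.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    # Loading a certificate is optional here so the sketch runs without files.
    if certfile:
        context.load_cert_chain(certfile=certfile, keyfile=keyfile)
    return context


context = make_tls12_server_context()
```

A real server would pass its certificate and key paths and then wrap its listening socket with context.wrap_socket(sock, server_side=True).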
CHANGES
- Upgrade Slurm to version 23.02.2.
- Upgrade munge to version 0.5.15.
- Set Slurm default TreeWidth to 30.
- Set Slurm prolog and epilog configurations to target a directory, /opt/slurm/etc/scripts/prolog.d/ and /opt/slurm/etc/scripts/epilog.d/ respectively.
- Set Slurm BatchStartTimeout to 3 minutes to allow up to 3 minutes of Prolog execution during compute node registration.
- Upgrade EFA installer to 1.22.1
  - Dkms: 2.8.3-2
  - Efa-driver: efa-2.1.1g
  - Efa-config: efa-config-1.13-1
  - Efa-profile: efa-profile-1.5-1
  - Libfabric-aws: libfabric-aws-1.17.1-1
  - Rdma-core: rdma-core-43.0-1
  - Open MPI: openmpi40-aws-4.1.5-1
- Upgrade Lustre client version to 2.12 on Amazon Linux 2 (same version available on Ubuntu 20.04, 18.04 and CentOS >= 7.7).
- Upgrade Lustre client version to 2.10.8 on CentOS 7.6.
- Upgrade aws-cfn-bootstrap to version 2.0-24.
- Upgrade NVIDIA driver to version 470.182.03.
- Upgrade NVIDIA Fabric Manager to version 470.182.03.
- Upgrade NVIDIA CUDA Toolkit to version 11.8.0.
- Upgrade NVIDIA CUDA sample to version 11.8.0.
- Upgrade Intel MPI Library to 2021.9.0.43482.
- Upgrade NICE DCV to version 2023.0-15022.
  - server: 2023.0.15022-1
  - xdcv: 2023.0.547-1
  - gl: 2023.0.1027-1
  - web_viewer: 2023.0.15022-1
BUG FIXES
- Fix an issue that was causing misalignment of compute node IP addresses on instances with multiple network interfaces.
- Fix replacement of StoragePass in slurm_parallelcluster_slurmdbd.conf when a queue parameter update is performed and the Slurm accounting configurations are not updated.
- Fix issue causing cfn-hup daemon to fail when it gets restarted.
- Fix issue causing NVIDIA GPU compute nodes not to resume correctly after executing an scontrol reboot command.