AWS ParallelCluster v3.6.0

We're excited to announce the release of AWS ParallelCluster Cookbook 3.6.0.

This release is associated with AWS ParallelCluster v3.6.0.

ENHANCEMENTS

  • Add support for RHEL8.
  • Add support for customizing the cluster Slurm configuration via the ParallelCluster configuration YAML file (see the configuration sketch after this list).
  • Build Slurm with support for LUA.
  • Add a health check manager and a GPU health check, which can be activated through the cluster configuration (see the second sketch after this list).
    Health check manager execution is triggered by a Slurm prolog script. The GPU check verifies the health of a node by running the NVIDIA DCGM level 2 diagnostic.
  • Add log rotation support for ParallelCluster managed logs.
  • Track head node memory and root volume disk utilization using the mem_used_percent and disk_used_percent metrics collected through the CloudWatch Agent.
  • Enforce the DCV Authenticator Server to use at least the TLS 1.2 protocol when creating the SSL socket.
  • Load kernel module nvidia-uvm by default to provide Unified Virtual Memory (UVM) functionality to the CUDA driver.
  • Install NVIDIA Persistence Daemon as a system service.
  • Install the NVIDIA Data Center GPU Manager (DCGM) package on all supported OSes except for CentOS 7 and Amazon Linux 2 on aarch64.
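
The Slurm customization feature above is exposed through the cluster configuration YAML. The fragment below is a minimal sketch only: the CustomSlurmSettings key names, the queue name, and the instance type are assumptions, so verify them against the ParallelCluster 3.6 configuration reference before use.

```yaml
# Illustrative cluster configuration fragment (key names are assumptions).
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    # Cluster-wide Slurm parameters added to the generated Slurm configuration
    CustomSlurmSettings:
      - MaxJobCount: 15000
  SlurmQueues:
    - Name: queue1
      # Queue-level (partition) Slurm parameters
      CustomSlurmSettings:
        PriorityTier: 10
      ComputeResources:
        - Name: compute1
          InstanceType: c5.xlarge
          MinCount: 0
          MaxCount: 10
```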
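
For the GPU health check, a queue-level switch in the cluster configuration enables the DCGM diagnostic that the prolog runs. This is a sketch assuming a HealthChecks/Gpu/Enabled key path and a hypothetical queue name and instance type; check the 3.6 configuration reference for the exact schema.

```yaml
# Illustrative queue fragment enabling the GPU health check (key path assumed).
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: gpu-queue
      HealthChecks:
        Gpu:
          Enabled: true   # run the DCGM level 2 diagnostic from the Slurm prolog
      ComputeResources:
        - Name: gpu-compute
          InstanceType: p4d.24xlarge
          MinCount: 0
          MaxCount: 4
```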

CHANGES

  • Upgrade Slurm to version 23.02.2.
  • Upgrade munge to version 0.5.15.
  • Set Slurm default TreeWidth to 30.
  • Set Slurm prolog and epilog configurations to target the directories /opt/slurm/etc/scripts/prolog.d/ and /opt/slurm/etc/scripts/epilog.d/, respectively.
  • Set Slurm BatchStartTimeout to 3 minutes to allow up to 3 minutes of prolog execution during compute node registration (a sketch for overriding such defaults follows this list).
  • Upgrade EFA installer to 1.22.1.
    • Dkms: 2.8.3-2
    • Efa-driver: efa-2.1.1g
    • Efa-config: efa-config-1.13-1
    • Efa-profile: efa-profile-1.5-1
    • Libfabric-aws: libfabric-aws-1.17.1-1
    • Rdma-core: rdma-core-43.0-1
    • Open MPI: openmpi40-aws-4.1.5-1
  • Upgrade Lustre client version to 2.12 on Amazon Linux 2 (same version available on Ubuntu 20.04, 18.04 and CentOS >= 7.7).
  • Upgrade Lustre client version to 2.10.8 on CentOS 7.6.
  • Upgrade aws-cfn-bootstrap to version 2.0-24.
  • Upgrade NVIDIA driver to version 470.182.03.
  • Upgrade NVIDIA Fabric Manager to version 470.182.03.
  • Upgrade NVIDIA CUDA Toolkit to version 11.8.0.
  • Upgrade NVIDIA CUDA sample to version 11.8.0.
  • Upgrade Intel MPI Library to 2021.9.0.43482.
  • Upgrade NICE DCV to version 2023.0-15022.
    • server: 2023.0.15022-1
    • xdcv: 2023.0.547-1
    • gl: 2023.0.1027-1
    • web_viewer: 2023.0.15022-1
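
Defaults such as BatchStartTimeout and TreeWidth are plain Slurm parameters, so in principle they could be tuned through the Slurm customization feature described in the enhancements section. The sketch below is purely illustrative: the key names are assumptions, and parameters managed directly by ParallelCluster may be deny-listed and rejected at configuration validation time.

```yaml
# Illustrative only: overriding scheduler defaults via custom Slurm settings
# (key names are assumptions; ParallelCluster may reject managed parameters).
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    CustomSlurmSettings:
      - BatchStartTimeout: 300   # extend beyond the new 3-minute default
      - TreeWidth: 45            # example override of the new default of 30
```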

BUG FIXES

  • Fix an issue that was causing a misalignment of compute node IP addresses on instances with multiple network interfaces.
  • Fix replacement of StoragePass in slurm_parallelcluster_slurmdbd.conf when a queue parameter update is performed and the Slurm accounting configurations are not updated.
  • Fix an issue causing the cfn-hup daemon to fail when it is restarted.
  • Fix an issue causing compute nodes with NVIDIA GPUs not to resume correctly after an scontrol reboot command.