AWS ParallelCluster v3.2.0
We're excited to announce the release of AWS ParallelCluster Cookbook 3.2.0.
This release is associated with AWS ParallelCluster v3.2.0.
ENHANCEMENTS
- Add support for multiple Elastic File Systems.
- Add support for multiple FSx File Systems.
- Add support for attaching existing FSx for ONTAP and FSx for OpenZFS File Systems.
- Install NVIDIA GDRCopy 2.3 to enable low-latency GPU memory copy on supported instance types.
- During cluster update, set the state of Slurm nodes according to the strategy set through the configuration parameter `Scheduling/SchedulerSettings/QueueUpdateStrategy`.
- Add support for memory-based scheduling in Slurm.
  - Configure `RealMemory` on compute nodes by default as 95% of the EC2 memory.
  - Move `SelectTypeParameters` to the `slurm_parallelcluster.conf` include file.
  - Move `ConstrainRAMSpace` to the `slurm_parallelcluster_cgroup.conf` include file.
  - Add support for new configuration parameter `Scheduling/SlurmSettings/EnableMemoryBasedScheduling` to configure memory-based scheduling in Slurm.
  - Add support for new configuration parameter `Scheduling/SlurmQueues/ComputeResources/SchedulableMemory` to override the default value of the memory seen by the scheduler on compute nodes.
- Add support for rebooting compute nodes via Slurm.
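
The memory-based scheduling defaults above can be illustrated with a little arithmetic. The following is a minimal Python sketch, not the cookbook's own code: the function name `real_memory_mib` is hypothetical, and rounding down to a whole MiB is an assumption. Only the 95% default and the `SchedulableMemory` override come from the notes above.

```python
from typing import Optional

# From the release note: RealMemory defaults to 95% of the EC2 memory.
DEFAULT_MEMORY_PERCENT = 95


def real_memory_mib(ec2_memory_mib: int,
                    schedulable_memory_mib: Optional[int] = None) -> int:
    """Return the memory (MiB) the scheduler would see for a compute node.

    Hypothetical helper: an explicit SchedulableMemory value, when given,
    overrides the default; otherwise 95% of the instance memory is used.
    Rounding down to a whole MiB is an assumption of this sketch.
    """
    if schedulable_memory_mib is not None:
        return schedulable_memory_mib
    # Integer arithmetic avoids floating-point rounding surprises.
    return ec2_memory_mib * DEFAULT_MEMORY_PERCENT // 100


# An instance type with 16 GiB (16384 MiB) of memory:
print(real_memory_mib(16384))         # 15564
print(real_memory_mib(16384, 12000))  # 12000
```

Under these assumptions, a node with 16384 MiB of EC2 memory is advertised to Slurm with 15564 MiB unless `SchedulableMemory` is set for its compute resource.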
CHANGES
- Restart `clustermgtd` and `slurmctld` daemons at cluster update time only when `Scheduling` parameters are updated in the cluster configuration.
- Update slurmctld and slurmd systemd service files.
- Upgrade NICE DCV to version 2022.0-12760.
- Upgrade NVIDIA driver to version 470.129.06.
- Upgrade NVIDIA Fabric Manager to version 470.129.06.
- Upgrade EFA installer to version 1.17.2.
  - EFA driver: `efa-1.16.0-1`
  - EFA configuration: `efa-config-1.10-1`
  - EFA profile: `efa-profile-1.5-1`
  - Libfabric: `libfabric-aws-1.16.0~amzn2.0-1`
  - RDMA core: `rdma-core-41.0-2`
  - Open MPI: `openmpi40-aws-4.1.4-2`
- Restrict IPv6 access to IMDS to root and cluster admin users only, when configuration parameter `HeadNode/Imds/Secured` is enabled.
- Set Slurm configuration `AuthInfo=cred_expire=70` to reduce the time requeued jobs must wait before starting again when nodes are not available.
- Move `SelectTypeParameters` and `ConstrainRAMSpace` to the `parallelcluster_slurm*.conf` include files.
- Upgrade third-party cookbook dependencies:
  - apt-7.4.2 (from apt-7.4.0)
  - line-4.5.2 (from line-4.0.1)
  - openssh-2.10.3 (from openssh-2.9.1)
  - pyenv-3.5.1 (from pyenv-3.4.2)
  - selinux-6.0.4 (from selinux-3.1.1)
  - yum-7.4.0 (from yum-6.1.1)
  - yum-epel-4.5.0 (from yum-epel-4.1.2)
- Disable `aws-ubuntu-eni-helper` service, available in Deep Learning AMIs, to avoid conflicts with `configure_nw_interface.sh` when configuring instances with multiple network cards.
- Set MTU to 9001 for all the network interfaces when configuring instances with multiple network cards.
- Remove the trailing dot when configuring the compute node FQDN.
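
To illustrate the last item, here is a minimal Python sketch of dropping a single trailing dot from a fully qualified domain name. It is an illustration only, not the cookbook's implementation; the function name `strip_trailing_dot` and the sample hostname are hypothetical.

```python
def strip_trailing_dot(fqdn: str) -> str:
    """Return the FQDN without its final root dot, if one is present.

    Hypothetical helper for illustration; only one trailing dot is removed.
    """
    return fqdn[:-1] if fqdn.endswith(".") else fqdn


# Hypothetical compute node name with a trailing root dot:
print(strip_trailing_dot("queue1-st-compute-1.example.internal."))
# queue1-st-compute-1.example.internal
```

A name that already lacks the trailing dot is returned unchanged, so applying the helper more than once is harmless.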