AWS ParallelCluster v2.4.1

@demartinofra demartinofra released this 29 Jul 10:37
f0b50ba

We're excited to announce the release of AWS ParallelCluster Cookbook 2.4.1.

This is associated with AWS ParallelCluster v2.4.1.

Enhancements

  • Install Intel MPI on Amazon Linux, CentOS 7 and Ubuntu 16.04
  • Upgrade EFA to version 1.4.1
  • Run all node daemons and cookbook recipes in isolated Python virtualenvs. This lets our code always
    run with its required Python dependencies and resolves the conflicts and runtime failures that were
    caused by user packages installed in the system Python
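
To illustrate the virtualenv isolation described above, here is a minimal sketch of how a Python process can tell whether it is running inside a virtualenv rather than the system Python. The helper name in_virtualenv is hypothetical and is not part of the cookbook; the check relies only on standard sys attributes.

```python
import sys


def in_virtualenv():
    """Return True if this interpreter is running inside a virtualenv/venv."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.base_prefix differ from sys.prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or base_prefix != sys.prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("isolated from system Python:", in_virtualenv())
```

Running this with the interpreter a daemon uses shows whether that daemon would see packages installed in the system Python.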

Changes

  • Torque: upgrade to version 6.1.2
  • Run all node daemons with Python 3.6
  • Torque: changed the following parameters in the global configuration:
    • server node_check_rate = 120 - Specifies the minimum duration (in seconds) that a node can fail to send a status update before being marked down by the pbs_server daemon. Previously 600. This reduces scaling reaction times in case of instance failure or unexpected termination (especially with Spot instances).
    • server node_ping_rate = 60 - Specifies the maximum interval (in seconds) between successive "pings" sent from the pbs_server daemon to the pbs_mom daemon to determine node/daemon health. Previously 300. It is now set to half of node_check_rate.
    • server timeout_for_job_delete = 30 - The timeout (in seconds) used when deleting jobs because the node they are executing on is being deleted. Previously 120. This prevents job deletion from hanging for more than 30 seconds when the node the jobs are running on is being deleted.
    • server timeout_for_job_requeue = 30 - The timeout (in seconds) used when requeuing jobs because the node they are executing on is being deleted. Previously 120. This prevents node deletion from hanging for more than 30 seconds when a job cannot be rescheduled.
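
The Torque server parameters listed above can be expressed as qmgr commands. The sketch below only builds and prints those commands. It assumes the standard Torque `qmgr -c "set server <name> = <value>"` syntax; the names TORQUE_SERVER_SETTINGS and qmgr_commands are hypothetical, and the cookbook itself may apply the settings differently.

```python
import shlex

# Values taken from the release notes above.
TORQUE_SERVER_SETTINGS = {
    "node_check_rate": 120,
    "node_ping_rate": 60,
    "timeout_for_job_delete": 30,
    "timeout_for_job_requeue": 30,
}


def qmgr_commands(settings):
    """Build one qmgr argument list per server setting (nothing is executed)."""
    return [
        ["qmgr", "-c", f"set server {name} = {value}"]
        for name, value in settings.items()
    ]


if __name__ == "__main__":
    for cmd in qmgr_commands(TORQUE_SERVER_SETTINGS):
        # shlex.join quotes the command so it can be pasted into a shell.
        print(shlex.join(cmd))
```

Printing the commands instead of running them keeps the sketch safe to try on a machine without Torque; on a master node they could be reviewed and then run by a Torque manager.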

Bug Fixes

  • Restore the correct value for filehandle_limit, which was being reset when memory_limit was set for EFA
  • Torque: fix configuration of server operators that was preventing compute nodes from disabling themselves
    before termination
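
As a rough way to check the limits fix above on a node, the following sketch reads the open-file and locked-memory limits of the current process. Mapping filehandle_limit to RLIMIT_NOFILE and memory_limit to RLIMIT_MEMLOCK is an assumption made for illustration, as is the helper name current_limits; the release notes do not state which system limits these settings control.

```python
import resource

# Assumed mapping from the cookbook setting names to POSIX resource limits.
LIMITS = {
    "nofile": resource.RLIMIT_NOFILE,    # open file handles (assumption)
    "memlock": resource.RLIMIT_MEMLOCK,  # locked memory, relevant to EFA (assumption)
}


def current_limits():
    """Return {label: (soft, hard)} for the limits of the current process."""
    return {label: resource.getrlimit(rlimit) for label, rlimit in LIMITS.items()}


def show(value):
    """Render RLIM_INFINITY as 'unlimited' for readability."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label}: soft={show(soft)} hard={show(hard)}")
```

The values reflect the limits inherited by the process that runs the script, so it should be started the same way as the workload being checked.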

Support

Need help / have a feature request?
AWS Support: https://console.aws.amazon.com/support/home
ParallelCluster Issues tracker on GitHub: https://github.com/aws/aws-parallelcluster
The HPC Forum on the AWS Forums page: https://forums.aws.amazon.com/forum.jspa?forumID=192