Release 2.5.0
Merge Release 2.5.0
sean-smith authored Nov 15, 2019
2 parents 8f5359f + da173a8 commit c4eab44
Showing 262 changed files with 17,530 additions and 11,645 deletions.
27 changes: 27 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,27 @@
---
name: Bug report
about: Please create a detailed report by completing the following information
title: ''
labels: ''
assignees: ''

---

**Environment:**
- AWS ParallelCluster / CfnCluster version [e.g. aws-parallelcluster-2.4.1]
- OS: [e.g. alinux]
- Scheduler: [e.g. SGE]
- Master instance type: [e.g. m5.xlarge]
- Compute instance type: [e.g. c5.8xlarge]

**Bug description and how to reproduce:**
A clear and concise description of what the bug is and the steps to reproduce the behavior.

**Additional context:**
Any other context about the problem. E.g.:
- configuration file without any credentials or personal data.
- pre/post-install scripts, if any
- screenshots, if useful
- if the cluster fails during creation, please re-execute the `create` action using the `--norollback` option and attach the `/var/log/cfn-init.log`, `/var/log/cloud-init.log` and `/var/log/cloud-init-output.log` files from the Master node
- if a compute node was terminated due to a failure, there will be a directory `/home/logs/compute`. Attach one of the `instance-id.tar.gz` files from that directory
- if you encounter scaling problems, please attach `/var/log/nodewatcher` from the Compute node and `/var/log/jobwatcher` and `/var/log/sqswatcher` from the Master node
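As a convenience when filing such a report, the Master-node logs listed above can be bundled into a single archive with a small script like the one below. This is a hypothetical helper, not part of ParallelCluster; the paths simply follow the list above.

```shell
#!/bin/sh
# Hypothetical helper: archive whichever of the requested Master-node logs
# exist, so they can be attached to the GitHub issue as one file.
collect_logs() {
    out="$1"; shift
    found=""
    for f in "$@"; do
        # keep only the logs that actually exist and are readable
        [ -r "$f" ] && found="$found $f"
    done
    [ -n "$found" ] || { echo "none of the requested logs were found" >&2; return 1; }
    # $found is intentionally unquoted so it splits into one path per file
    tar czf "$out" $found && echo "created $out"
}

collect_logs master-logs.tar.gz \
    /var/log/cfn-init.log /var/log/cloud-init.log /var/log/cloud-init-output.log || true
```

Run it as root on the Master node; on a machine without those logs it prints a warning and exits cleanly.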
1 change: 1 addition & 0 deletions .isort.cfg
@@ -11,3 +11,4 @@ known_third_party=boto3,botocore,awscli,tabulate,argparse,configparser,pytest,py
# )
multi_line_output=3
include_trailing_comma=true
skip=pcluster/resources/batch/custom_resources_code/crhelper
8 changes: 1 addition & 7 deletions .travis.yml
@@ -8,6 +8,7 @@ python:
- "3.5"
- "3.6"
- "3.7"
- "3.8"

matrix:
include:
@@ -19,13 +20,6 @@ matrix:
python: 3.6
stage: linters
env: TOXENV=cfn-format-check,cfn-lint
- name: Docs Checks
python: 3.6
stage: linters
env: TOXENV=docs-linters
before_install:
# Needed to run docs-linters target in tox.
- sudo apt-get update && sudo apt-get install -y enchant

install:
- pip install tox-travis
69 changes: 66 additions & 3 deletions CHANGELOG.rst
@@ -2,15 +2,78 @@
CHANGELOG
=========

2.5.0
=====

**ENHANCEMENTS**

* Add support for new OS: Ubuntu 18.04
* Add support for AWS Batch scheduler in China partition and in ``eu-north-1``.
* Revamped ``pcluster configure`` command which now supports automated networking configuration.
* Add support for NICE DCV on CentOS 7 to set up a graphical remote desktop session on the Master node.
* Add support for new EFA supported instances: ``c5n.metal``, ``m5dn.24xlarge``, ``m5n.24xlarge``, ``r5dn.24xlarge``,
``r5n.24xlarge``
* Add support for scheduling with GPU options in Slurm. Currently supports the following GPU-related options: ``-G/--gpus,
  --gpus-per-task, --gpus-per-node, --gres=gpu, --cpus-per-gpu``.
  Integrated GPU requirements into the scaling logic: the cluster scales automatically to satisfy the GPU/CPU requirements
  of pending jobs. When submitting GPU jobs, CPU/node/task information is not required but is preferred in order to
  avoid ambiguity. If only GPU requirements are specified, the cluster scales up to the minimum number of nodes
  required to satisfy all GPU requirements.
* Add new cluster configuration option to automatically disable Hyperthreading (``disable_hyperthreading = true``)
* Install Intel Parallel Studio 2019.5 Runtime on CentOS 7 when ``enable_intel_hpc_platform = true`` and share ``/opt/intel`` over NFS.
* Additional EC2 IAM policies can now be added to the role that ParallelCluster automatically creates for cluster nodes by
  simply specifying ``additional_iam_policies`` in the cluster config.
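As a sketch of the new GPU-aware scheduling, a Slurm (19.05) batch script on a 2.5.0 cluster might look like the following. The job name, task counts and the ``nvidia-smi`` step are illustrative assumptions, not taken from the release notes.

```shell
# Write an illustrative GPU job script; submit it from the Master node with
# `sbatch gpu_job.sh`. ParallelCluster scales the fleet to satisfy the GPU ask.
cat > gpu_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-demo
#SBATCH --gpus=4            # total GPUs for the job
#SBATCH --gpus-per-task=1
#SBATCH --ntasks=4          # CPU/task info is optional but avoids ambiguity
srun nvidia-smi -L          # list the GPUs each task can see
EOF
```

If only ``--gpus`` were given, the cluster would still scale to the minimum number of nodes that satisfies the GPU request, as described above.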

**CHANGES**

* Ubuntu 14.04 is no longer supported
* Upgrade Intel MPI to version 2019 Update 5.
* Upgrade EFA Installer to version 1.7.0; this also upgrades Open MPI to 4.0.2.
* Upgrade NVIDIA driver to Tesla version 418.87.
* Upgrade CUDA library to version 10.1.
* Upgrade Slurm to version 19.05.3-2.
* Install EFA in China AMIs.
* Increase the default EBS volume size from 17 GB to 25 GB.
* FSx Lustre now supports new ``storage_capacity`` options of 1,200 and 2,400 GiB.
* Enable ``flock user_xattr noatime`` Lustre mount options by default everywhere and
``x-systemd.automount x-systemd.requires=lnet.service`` for systemd based systems.
* Increase the number of hosts that can be processed by scaling daemons in a single batch from 50 to 200. This
improves the scaling time especially with increased ASG launch rates.
* Change the default sshd config in order to disable X11 forwarding and update the list of supported ciphers.
* Increase the faulty node termination timeout from 1 minute to 5 minutes in order to give the scheduler some additional
  time to recover when under heavy load.
* Extended ``pcluster createami`` command to specify the VPC and network settings when building the AMI.
* Support inline comments in the config file.
* Support Python 3.8 in pcluster CLI.
* Deprecate Python 2.6 support
* Add ``ClusterName`` tag to EC2 instances.
* Check for a newer available version only on the ``pcluster create`` action.
* Enable ``sanity_check`` by default.
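A few of the options mentioned above would appear in a cluster config roughly as follows. The section layout matches the ParallelCluster 2.x config format, but the specific values and the policy ARN are illustrative, and ``sanity_check`` is shown only for emphasis since it is now the default.

```ini
[global]
# sanity_check is now enabled by default
sanity_check = true

[cluster default]
# newly supported OS (Ubuntu 18.04)
base_os = ubuntu1804
# new in 2.5.0: disable Hyperthreading on cluster nodes
disable_hyperthreading = true
enable_intel_hpc_platform = true
# extra policies attached to the node role; ARN is an example
additional_iam_policies = arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
```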

**BUG FIXES**

* Fix sanity check for custom EC2 role. Fixes `#1241 <https://github.com/aws/aws-parallelcluster/issues/1241>`_.
* Fix bug when using the same subnet for both master and compute nodes.
* Fix bug so that Ganglia URLs are shown when Ganglia is enabled. Fixes `#1322 <https://github.com/aws/aws-parallelcluster/issues/1322>`_.
* Fix bug with the ``awsbatch`` scheduler that prevented multi-node jobs from running.
* Fix jobwatcher behaviour that was marking nodes locked by the nodewatcher as busy even if they had already been
  removed from the ASG desired count. In rare circumstances this was causing the cluster to overscale.
* Fix bug that was causing failures in sqswatcher when ADD and REMOVE events for the same host were fetched together.
* Fix bug that was preventing nodes from mounting partitioned EBS volumes.
* Implement paginated calls in ``pcluster list``.
* Fix bug when creating an ``awsbatch`` cluster with a name longer than 31 characters.
* Fix a bug that led to SSH not working after SSHing into a compute node by IP address.

2.4.1
=====

**ENHANCEMENTS**

* Add support for ap-east-1 region (Hong Kong)
* Add possibility to specify instance type to use when building custom AMIs with ``pcluster createami``
* Speed up cluster creation by having compute nodes start together with the master node. **Note**: this requires one new IAM permission in the `ParallelClusterInstancePolicy <https://docs.aws.amazon.com/en_us/parallelcluster/latest/ug/iam.html#parallelclusterinstancepolicy>`_, ``cloudformation:DescribeStackResource``.
* Enable ASG CloudWatch metrics for the ASG managing compute nodes. **Note**: this requires two new IAM permissions in the `ParallelClusterUserPolicy <https://docs.aws.amazon.com/parallelcluster/latest/ug/iam.html#parallelclusteruserpolicy>`_, ``autoscaling:DisableMetricsCollection`` and ``autoscaling:EnableMetricsCollection``.
* Install Intel MPI 2019u4 on Amazon Linux, CentOS 7 and Ubuntu 16.04.
* Upgrade Elastic Fabric Adapter (EFA) to version 1.4.1, which supports Intel MPI.
* Run all node daemons and cookbook recipes in isolated Python virtualenvs. This allows our code to always run with the
@@ -44,7 +107,7 @@ CHANGELOG
* Make FSx Substack depend on ComputeSecurityGroupIngress to keep FSx from trying to create prior to the SG
allowing traffic within itself
* Restore correct value for ``filehandle_limit`` that was getting reset when setting ``memory_limit`` for EFA
* Torque: fix compute nodes locking mechanism to prevent job scheduling on nodes being terminated
* Restore logic that was automatically adding compute nodes identity to SSH ``known_hosts`` file
* Slurm: fix issue that was causing the ParallelCluster daemons to fail when the cluster is stopped and an empty compute nodes file
is imported in Slurm config
82 changes: 58 additions & 24 deletions README.rst
@@ -19,7 +19,16 @@ You can build higher level workflows, such as a Genomics portal that automates t

Quick Start
-----------
First, install the library:
**IMPORTANT**: you will need an **Amazon EC2 Key Pair** to be able to complete the following steps.
Please see the `Official AWS Guide <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html>`_.

First, make sure you have installed the `AWS Command Line Interface <http://>`_:

.. code-block:: sh

    $ pip install awscli
Then you can install AWS ParallelCluster:

.. code-block:: sh
@@ -35,34 +44,59 @@ Next, configure your aws credentials and default region:

.. code-block:: sh

    ...
    Default region name [us-east-1]:
    Default output format [None]:
Then, run ``pcluster configure``. A list of valid options will be displayed for each
configuration parameter. Type an option number and press ``Enter`` to select a specific option,
or just press ``Enter`` to accept the default option.

.. code-block:: ini

    $ pcluster configure
    INFO: Configuration file /dir/conf_file will be written.
    Press CTRL-C to interrupt the procedure.
    Allowed values for AWS Region ID:
    1. eu-north-1
    ...
    15. us-west-1
    16. us-west-2
    AWS Region ID [us-east-1]:
    ...

Be sure to select a region containing the EC2 key pair you wish to use. You can also import a public key using
`these instructions <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#how-to-generate-your-own-key-and-import-it-to-aws>`_.

During the process you will be asked to set up your networking environment. The wizard will offer you the choice of
using an existing VPC or creating a new one on the fly.

.. code-block:: ini

    Automate VPC creation? (y/n) [n]:
Enter '``n``' if you already have a VPC suitable for the cluster; otherwise you can let ``pcluster configure``
create a VPC for you. The same choice is given for the subnet configuration: you can select a valid subnet ID for
both the master and compute nodes, or you can let ``pcluster configure`` set up everything for you. In the latter
case, just select the configuration you prefer.

.. code-block:: ini

    Automate Subnet creation? (y/n) [y]: y
    Allowed values for Network Configuration:
    1. Master in a public subnet and compute fleet in a private subnet
    2. Master and compute fleet in the same public subnet

At the end of the process a message like this one will be shown:

.. code-block:: ini

    Configuration file written to /dir/conf_file
    You can edit your configuration file or simply run 'pcluster create -c /dir/conf_file cluster-name' to create your cluster

Now you can create your first cluster:

.. code-block:: sh