Release 2.5.0
Merge Release 2.5.0
sean-smith authored Nov 15, 2019
2 parents 8f5359f + da173a8 commit c4eab44
Showing 262 changed files with 17,530 additions and 11,645 deletions.
27 changes: 27 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,27 @@
---
name: Bug report
about: Please create a detailed report by completing the following information
title: ''
labels: ''
assignees: ''

---

**Environment:**
- AWS ParallelCluster / CfnCluster version [e.g. aws-parallelcluster-2.4.1]
- OS: [e.g. alinux]
- Scheduler: [e.g. SGE]
- Master instance type: [e.g. m5.xlarge]
- Compute instance type: [e.g. c5.8xlarge]

**Bug description and how to reproduce:**
A clear and concise description of what the bug is and the steps to reproduce the behavior.

**Additional context:**
Any other context about the problem. E.g.:
- configuration file without any credentials or personal data.
- pre/post-install scripts, if any
- screenshots, if useful
- if the cluster fails during creation, please re-execute the `create` action using the `--norollback` option and attach the `/var/log/cfn-init.log`, `/var/log/cloud-init.log` and `/var/log/cloud-init-output.log` files from the Master node
- if a compute node was terminated due to a failure, there will be a directory `/home/logs/compute`. Attach one of the `instance-id.tar.gz` files from that directory
- if you encounter scaling problems, please attach `/var/log/nodewatcher` from the Compute node and `/var/log/jobwatcher` and `/var/log/sqswatcher` from the Master node
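As a convenience when filing such a report, the Master-node logs listed above can be bundled into a single archive with a small script like the one below. This is a hypothetical helper, not part of ParallelCluster; the paths simply follow the list above.

```shell
#!/bin/sh
# Hypothetical helper: archive whichever of the requested Master-node logs
# exist, so they can be attached to the GitHub issue as one file.
collect_logs() {
    out="$1"; shift
    found=""
    for f in "$@"; do
        # keep only the logs that actually exist and are readable
        [ -r "$f" ] && found="$found $f"
    done
    [ -n "$found" ] || { echo "none of the requested logs were found" >&2; return 1; }
    # $found is intentionally unquoted so it splits into one path per file
    tar czf "$out" $found && echo "created $out"
}

collect_logs master-logs.tar.gz \
    /var/log/cfn-init.log /var/log/cloud-init.log /var/log/cloud-init-output.log || true
```

Run it as root on the Master node; on a machine without those logs it prints a warning and exits cleanly.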
1 change: 1 addition & 0 deletions .isort.cfg
@@ -11,3 +11,4 @@ known_third_party=boto3,botocore,awscli,tabulate,argparse,configparser,pytest,py
# )
multi_line_output=3
include_trailing_comma=true
skip=pcluster/resources/batch/custom_resources_code/crhelper
8 changes: 1 addition & 7 deletions .travis.yml
@@ -8,6 +8,7 @@ python:
- "3.5"
- "3.6"
- "3.7"
- "3.8"

matrix:
include:
@@ -19,13 +20,6 @@ matrix:
python: 3.6
stage: linters
env: TOXENV=cfn-format-check,cfn-lint
- name: Docs Checks
python: 3.6
stage: linters
env: TOXENV=docs-linters
before_install:
# Needed to run docs-linters target in tox.
- sudo apt-get update && sudo apt-get install -y enchant

install:
- pip install tox-travis
69 changes: 66 additions & 3 deletions CHANGELOG.rst
@@ -2,15 +2,78 @@
CHANGELOG
=========

2.5.0
=====

**ENHANCEMENTS**

* Add support for new OS: Ubuntu 18.04
* Add support for AWS Batch scheduler in China partition and in ``eu-north-1``.
* Revamped ``pcluster configure`` command which now supports automated networking configuration.
* Add support for NICE DCV on CentOS 7 to set up a graphical remote desktop session on the Master node.
* Add support for new EFA supported instances: ``c5n.metal``, ``m5dn.24xlarge``, ``m5n.24xlarge``, ``r5dn.24xlarge``,
``r5n.24xlarge``
* Add support for scheduling with GPU options in Slurm. Currently supports the following GPU-related options: ``-G/--gpus,
  --gpus-per-task, --gpus-per-node, --gres=gpu, --cpus-per-gpu``.
  Integrated GPU requirements into the scaling logic: the cluster scales automatically to satisfy the GPU/CPU requirements
  of pending jobs. When submitting GPU jobs, CPU/node/task information is not required but is preferred in order to
  avoid ambiguity. If only GPU requirements are specified, the cluster scales up to the minimum number of nodes
  required to satisfy all GPU requirements.
* Add new cluster configuration option to automatically disable Hyperthreading (``disable_hyperthreading = true``)
* Install Intel Parallel Studio 2019.5 Runtime on CentOS 7 when ``enable_intel_hpc_platform = true`` and share ``/opt/intel`` over NFS.
* Additional EC2 IAM policies can now be added to the role that ParallelCluster automatically creates for cluster nodes by
  simply specifying ``additional_iam_policies`` in the cluster config.
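As a sketch of the new GPU-aware scheduling, a Slurm (19.05) batch script on a 2.5.0 cluster might look like the following. The job name, task counts and the ``nvidia-smi`` step are illustrative assumptions, not taken from the release notes.

```shell
# Write an illustrative GPU job script; submit it from the Master node with
# `sbatch gpu_job.sh`. ParallelCluster scales the fleet to satisfy the GPU ask.
cat > gpu_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-demo
#SBATCH --gpus=4            # total GPUs for the job
#SBATCH --gpus-per-task=1
#SBATCH --ntasks=4          # CPU/task info is optional but avoids ambiguity
srun nvidia-smi -L          # list the GPUs each task can see
EOF
```

If only ``--gpus`` were given, the cluster would still scale to the minimum number of nodes that satisfies the GPU request, as described above.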

**CHANGES**

* Ubuntu 14.04 is no longer supported
* Upgrade Intel MPI to version 2019 Update 5.
* Upgrade EFA Installer to version 1.7.0; this also upgrades Open MPI to 4.0.2.
* Upgrade NVIDIA driver to Tesla version 418.87.
* Upgrade CUDA library to version 10.1.
* Upgrade Slurm to version 19.05.3-2.
* Install EFA in China AMIs.
* Increase the default EBS volume size from 17 GB to 25 GB.
* FSx Lustre now supports new ``storage_capacity`` options of 1,200 and 2,400 GiB.
* Enable ``flock user_xattr noatime`` Lustre mount options by default everywhere and
``x-systemd.automount x-systemd.requires=lnet.service`` for systemd based systems.
* Increase the number of hosts that can be processed by scaling daemons in a single batch from 50 to 200. This
improves the scaling time especially with increased ASG launch rates.
* Change the default sshd config in order to disable X11 forwarding and update the list of supported ciphers.
* Increase the faulty node termination timeout from 1 minute to 5 minutes in order to give the scheduler some additional
  time to recover when under heavy load.
* Extended ``pcluster createami`` command to specify the VPC and network settings when building the AMI.
* Support inline comments in the config file.
* Support Python 3.8 in pcluster CLI.
* Deprecate Python 2.6 support
* Add ``ClusterName`` tag to EC2 instances.
* Check for a newer available version only on the ``pcluster create`` action.
* Enable ``sanity_check`` by default.
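A few of the options mentioned above would appear in a cluster config roughly as follows. The section layout matches the ParallelCluster 2.x config format, but the specific values and the policy ARN are illustrative, and ``sanity_check`` is shown only for emphasis since it is now the default.

```ini
[global]
# sanity_check is now enabled by default
sanity_check = true

[cluster default]
# newly supported OS (Ubuntu 18.04)
base_os = ubuntu1804
# new in 2.5.0: disable Hyperthreading on cluster nodes
disable_hyperthreading = true
enable_intel_hpc_platform = true
# extra policies attached to the node role; ARN is an example
additional_iam_policies = arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
```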

**BUG FIXES**

* Fix sanity check for custom EC2 role. Fixes `#1241 <https://github.com/aws/aws-parallelcluster/issues/1241>`_.
* Fix bug when using the same subnet for both master and compute nodes.
* Fix bug so that Ganglia URLs are shown when Ganglia is enabled. Fixes `#1322 <https://github.com/aws/aws-parallelcluster/issues/1322>`_.
* Fix bug with the ``awsbatch`` scheduler that prevented multi-node jobs from running.
* Fix jobwatcher behaviour that was marking nodes locked by the nodewatcher as busy even if they had already been
  removed from the ASG desired count. In rare circumstances this was causing the cluster to overscale.
* Fix bug that was causing failures in sqswatcher when ADD and REMOVE events for the same host were fetched together.
* Fix bug that was preventing nodes from mounting partitioned EBS volumes.
* Implement paginated calls in ``pcluster list``.
* Fix bug when creating an ``awsbatch`` cluster with a name longer than 31 characters.
* Fix a bug that led to SSH not working after SSHing into a compute node by IP address.

2.4.1
=====

**ENHANCEMENTS**

* Add support for ap-east-1 region (Hong Kong)
* Add possibility to specify instance type to use when building custom AMIs with ``pcluster createami``
* Speed up cluster creation by having compute nodes start together with the master node. **Note**: this requires one new IAM permission in the `ParallelClusterInstancePolicy <https://docs.aws.amazon.com/en_us/parallelcluster/latest/ug/iam.html#parallelclusterinstancepolicy>`_, ``cloudformation:DescribeStackResource``.
* Enable ASG CloudWatch metrics for the ASG managing compute nodes. **Note**: this requires two new IAM permissions in the `ParallelClusterUserPolicy <https://docs.aws.amazon.com/parallelcluster/latest/ug/iam.html#parallelclusteruserpolicy>`_, ``autoscaling:DisableMetricsCollection`` and ``autoscaling:EnableMetricsCollection``.
* Install Intel MPI 2019u4 on Amazon Linux, CentOS 7 and Ubuntu 16.04.
* Upgrade Elastic Fabric Adapter (EFA) to version 1.4.1, which supports Intel MPI.
* Run all node daemons and cookbook recipes in isolated Python virtualenvs. This allows our code to always run with the
@@ -44,7 +107,7 @@ CHANGELOG
* Make FSx Substack depend on ComputeSecurityGroupIngress to keep FSx from trying to create prior to the SG
allowing traffic within itself
* Restore correct value for ``filehandle_limit`` that was getting reset when setting ``memory_limit`` for EFA
* Torque: fix compute nodes locking mechanism to prevent job scheduling on nodes being terminated
* Restore logic that was automatically adding compute nodes identity to SSH ``known_hosts`` file
* Slurm: fix issue that was causing the ParallelCluster daemons to fail when the cluster is stopped and an empty compute nodes file
is imported in Slurm config
82 changes: 58 additions & 24 deletions README.rst
@@ -19,7 +19,16 @@ You can build higher level workflows, such as a Genomics portal that automates t

Quick Start
-----------
First, install the library:
**IMPORTANT**: you will need an **Amazon EC2 Key Pair** to be able to complete the following steps.
Please see the `Official AWS Guide <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html>`_.

First, make sure you have installed the `AWS Command Line Interface <http://>`_:

.. code-block:: sh

    $ pip install awscli
Then you can install AWS ParallelCluster:

.. code-block:: sh
@@ -35,34 +44,59 @@ Next, configure your aws credentials and default region:

.. code-block:: sh

    ...
    Default region name [us-east-1]:
    Default output format [None]:
Then, run ``pcluster configure``. A list of valid options will be displayed for each
configuration parameter. Type an option number and press ``Enter`` to select a specific option,
or just press ``Enter`` to accept the default option.

.. code-block:: ini

    $ pcluster configure
    INFO: Configuration file /dir/conf_file will be written.
    Press CTRL-C to interrupt the procedure.
    Allowed values for AWS Region ID:
    1. eu-north-1
    ...
    15. us-west-1
    16. us-west-2
    AWS Region ID [us-east-1]:
    ...

Be sure to select a region containing the EC2 key pair you wish to use. You can also import a public key using
`these instructions <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#how-to-generate-your-own-key-and-import-it-to-aws>`_.

During the process you will be asked to set up your networking environment. The wizard will offer you the choice of
using an existing VPC or creating a new one on the fly.

.. code-block:: ini

    Automate VPC creation? (y/n) [n]:
Enter '``n``' if you already have a VPC suitable for the cluster; otherwise you can let ``pcluster configure``
create a VPC for you. The same choice is given for the subnet configuration: you can select a valid subnet ID for
both the master and compute nodes, or you can let ``pcluster configure`` set up everything for you. In the latter
case, just select the configuration you prefer.

.. code-block:: ini

    Automate Subnet creation? (y/n) [y]: y
    Allowed values for Network Configuration:
    1. Master in a public subnet and compute fleet in a private subnet
    2. Master and compute fleet in the same public subnet

At the end of the process a message like this one will be shown:

.. code-block:: ini

    Configuration file written to /dir/conf_file
    You can edit your configuration file or simply run 'pcluster create -c /dir/conf_file cluster-name' to create your cluster

Now you can create your first cluster:

.. code-block:: sh