Merge branch 'develop' into develop

hanwen-pcluste · Apr 19, 2023 · 16a5239
2 parents b5b723d + e49c40e
commit 16a5239

Showing 207 changed files with 15,635 additions and 3,874 deletions.
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -94,6 +94,13 @@ jobs:
         run: pip install tox
       - name: Run Tox
         run: cd ${{ matrix.toxdir }} && tox -e ${{ matrix.toxenv }}
+      - name: Upload code coverage report to Codecov
+        uses: codecov/codecov-action@v3
+        if: ${{ endsWith(matrix.toxenv, '-cov') }}
+        with:
+          files: cli/coverage.xml
+          flags: unittests
+          verbose: true
   awsbatch-cli-tests:
     name: AWS Batch CLI Tests
     runs-on: ${{ matrix.os }}
@@ -169,7 +176,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v2
-      - uses: mikefarah/yq@v4.6.3
+      - uses: mikefarah/yq@v4.32.2
       - run: api/docker/awslambda/docker-build.sh
   shellcheck:
     name: Shellcheck

diff --git a/.github/workflows/codeql-analysis.yml b/.github/workflows/codeql-analysis.yml
@@ -22,9 +22,9 @@ jobs:
     - name: Checkout repository
       uses: actions/checkout@v2
     - name: Initialize CodeQL
-      uses: github/codeql-action/init@v1
+      uses: github/codeql-action/init@v2
       with:
         languages: ${{ matrix.language }}
         queries: +security-and-quality
     - name: Perform CodeQL Analysis
-      uses: github/codeql-action/analyze@v1
+      uses: github/codeql-action/analyze@v2
diff --git a/.gitignore b/.gitignore
@@ -17,3 +17,4 @@ report.html
 tests_outputs/
 .python-version
 test.yaml
+.vscode
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -16,6 +16,7 @@ repos:
       - id: check-symlinks
       - id: end-of-file-fixer
       - id: pretty-format-json
+        args: ['--autofix']
       - id: requirements-txt-fixer
       - id: mixed-line-ending
         args: ['--fix=no']

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,16 +4,58 @@ CHANGELOG
 3.6.0
 ----
 **ENHANCEMENTS**
+- Add a CloudFormation custom resource for creating and managing clusters from CloudFormation.
 - Add `mem_used_percent` and `disk_used_percent` metrics for head node memory and root volume disk utilization tracking on the ParallelCluster CloudWatch dashboard, and set up alarms for monitoring these metrics.
-
-**ENHANCEMENTS**
 - Add log rotation support for ParallelCluster managed logs.
+- Track common errors of compute nodes on CloudWatch Dashboard.
+- Increase the limit on the maximum number of queues per cluster from 10 to 100. Each cluster can however have a maximum number of 150 compute resources and each queue can have a maximum of 40 compute resources.
+- Allow specifying a sequence of multiple custom action scripts per event.
+- Add support for customizing the cluster Slurm configuration via the ParallelCluster configuration YAML file.
+- Track the longest dynamic node idle time in CloudWatch Dashboard.
+- Add new configuration section `HealthChecks/Gpu` for enabling the GPU Health Check in the compute node before job execution.
+- Add support for `DetailedMonitoring` in the `Monitoring` section.
+- Add support for `Tags` in the `SlurmQueues` and `SlurmQueues/ComputeResources` section.
+- Build Slurm with support for LUA.
 
 **CHANGES**
 - Increase the default `RetentionInDays` of CloudWatch logs from 14 to 180 days.
+- Set Slurm prolog and epilog configurations to target a directory, /opt/slurm/etc/scripts/prolog.d/ and /opt/slurm/etc/scripts/epilog.d/ respectively.
+- Upgrade Slurm to version 23.02.1.
+- Upgrade munge to version 0.5.15.
+- Upgrade image used by CodeBuild environment when building container images for AWS Batch clusters, from
+  `aws/codebuild/amazonlinux2-x86_64-standard:3.0` to `aws/codebuild/amazonlinux2-x86_64-standard:4.0` and from
+  `aws/codebuild/amazonlinux2-aarch64-standard:1.0` to `aws/codebuild/amazonlinux2-aarch64-standard:2.0`.
 
 **BUG FIXES**
 - Fix EFS, FSx network security groups validators to avoid reporting false errors.
+- Fix missing tagging of resources created by ImageBuilder during the `build-image` operation.
+- Fix Update policy for MaxCount to always perform numerical comparisons on MaxCount property.
+- Fix IP association on instances with multiple network cards.
+- Fix replacement of StoragePass in slurm_parallelcluster_slurmdbd.conf when a queue parameter update is performed and the Slurm accounting configurations are not updated.
+
+3.5.1
+-----
+**ENHANCEMENTS**
+- Add a new way to distribute ParallelCluster as a self-contained executable shipped with a dedicated installer.
+- Add support for US isolated region us-isob-east-1.
+
+**CHANGES**
+- Upgrade EFA installer to `1.22.0`
+  - Efa-driver: `efa-2.1.1g`
+  - Efa-config: `efa-config-1.13-1`
+  - Efa-profile: `efa-profile-1.5-1`
+  - Libfabric-aws: `libfabric-aws-1.17.0-1`
+  - Rdma-core: `rdma-core-43.0-1`
+  - Open MPI: `openmpi40-aws-4.1.5-1`
+- Upgrade NICE DCV to version `2022.2-14521`.
+  - server: `2022.2.14521-1`
+  - xdcv: `2022.2.519-1`
+  - gl: `2022.2.1012-1`
+  - web_viewer: `2022.2.14521-1`
+
+**BUG FIXES**
+- Fix an issue where updating a cluster to remove shared EBS volumes could cause node launching failures if `MountDir` matches the same pattern in `/etc/exports`.
+- Fix for compute_console_output log file being truncated at every clustermgtd iteration.
 
 3.5.0
 -----
@@ -23,7 +65,6 @@ CHANGELOG
 - Add a Python library to allow customers to use ParallelCluster functionalities in their own code.
 - Add logging of compute node console output to CloudWatch on compute node bootstrap failure.
 - Add failures field containing failure code and reason to `describe-cluster` output when cluster creation fails.
-- Add support for US isolated regions: us-iso-* and us-isob-*.
 
 **CHANGES**
 - Upgrade Slurm to version 22.05.8.
@@ -204,6 +245,25 @@ CHANGELOG
 - Fix ParallelCluster API stack update failure when upgrading from a previus version. Add resource pattern used for the `ListImagePipelineImages` action in the `EcrImageDeletionLambdaRole`.
 - Fix ParallelCluster API adding missing permissions needed to import/export from S3 when creating an FSx for Lustre storage.
 
+3.1.5
+------
+
+**CHANGES**
+- Upgrade EFA installer to `1.18.0`
+  - Efa-driver: `efa-1.16.0-1`
+  - Efa-config: `efa-config-1.11-1`
+  - Efa-profile: `efa-profile-1.5-1`
+  - Libfabric-aws: `libfabric-aws-1.16.0~amzn4.0-1`
+  - Rdma-core: `rdma-core-41.0-2`
+  - Open MPI: `openmpi40-aws-4.1.4-2`
+- Add `lambda:ListTags` and `lambda:UntagResource` to `ParallelClusterUserRole` used by ParallelCluster API stack for cluster update.
+- Upgrade Intel MPI Library to 2021.6.0.602.
+- Upgrade NVIDIA driver to version 470.141.03.
+- Upgrade NVIDIA Fabric Manager to version 470.141.03.
+
+**BUG FIXES**
+- Fix Slurm issue that prevents idle nodes termination.
+
 3.1.4
 ------
 
@@ -680,7 +740,7 @@ CHANGELOG
 - Improve retrieval of instance type info by using `DescribeInstanceType` API.
 - Remove `custom_awsbatch_template_url` configuration parameter.
 - Upgrade `pip` to latest version in virtual environments.
-- Upgrade image used by CodeBuild environment when building container images for Batch clusters, from
+- Upgrade image used by CodeBuild environment when building container images for AWS Batch clusters, from
   `aws/codebuild/amazonlinux2-x86_64-standard:1.0` to `aws/codebuild/amazonlinux2-x86_64-standard:3.0`.
 
 **BUG FIXES**

diff --git a/api/README.md b/api/README.md
@@ -91,98 +91,11 @@ correctness of the API model evey time a PR is opened.
 The ParallelCluster OpenAPI Generator workflow (`workdlows/openapi_generator.yml`) defines a `generate-openapi-model`
 build step that automatically adds to the PR the generated OpenAPI model in case this was not included in the commit.
 
-## Packaging the API as an AWS Lambda container
+## Testing
 
-The `docker/awslambda` directory contains the definition of a Dockerfile that is used to package the ParallelCluster
-API as an AWS Lambda function. Running the `docker/awslambda/docker-build.sh` script will produce a `pcluster-lambda`
-Docker container that packages and exposes the ParallelCluster API in a format which is compatible with the AWS Lambda runtime.
+The API is a facade on top of the controllers (as is the CLI), so much of the underlying functionality can be tested
+through unit tests and integration tests that exercise the operations.
 
-### Running Testing and Debugging the API locally
+In order to test the API specifically, there are integration tests which will deploy the API and test the functionality using
+the generated client.
 
-Once the Docker image has been successfully built you have the following options:
-
-#### Run a shell in the container
-Use the following to run a shell in the container: `docker run -it --entrypoint /bin/bash pcluster-lambda`.
-
-This is particularly useful to debug issues with the container runtime.
-
-#### Run a local AWS Lambda endpoint
-Use the following to run a local AWS Lambda endpoint hosting the API: `docker run -e POWERTOOLS_TRACE_DISABLED=1 -e AWS_REGION=eu-west-1 -p 9000:8080 pcluster-lambda`
-
-Then you can use the following to send requests to the local endpoint:
-`curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d @docker/awslambda/test-events/event.json`
-
-This is useful to test the integration with AWS Lambda.
-
-#### Run the Flask development server
-Use the following to run a local Flask development server hosting the API: `docker run -p 8080:8080 --entrypoint python pcluster-lambda -m pcluster.api.flask_app`
-
-Then you can navigate to the following url to test the API: `http://0.0.0.0:8080/ui`
-Note that to enable swagger-ui you have to build the docker with `--build-arg PROFILE=dev`.
-
-This is particularly useful to ignore the AWS Lambda layer and directly hit the Flask application with plain HTTP requests.
-An even simpler way to do this which also offers live reloading of the API code, is to just ignore the Docker container
-and run a local Flask server on your host by executing `cd ../cli/src && python -m pcluster.api.flask_app`
-
-## Deploy the API test infrastructure with SAM cli (API Gateway + Lambda)
-The Serverless Application Model Command Line Interface (SAM CLI) is an extension of the AWS CLI that adds functionality
-for building and testing Lambda applications. It uses Docker to run your functions in an Amazon Linux environment that
-matches Lambda. It can also emulate your application's build environment and API.
-
-To use the SAM CLI, you need the following tools.
-
-* SAM CLI - [Install the SAM CLI](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-install.html)
-* Docker - [Install Docker community edition](https://hub.docker.com/search/?type=edition&offering=community)
-
-You may need the following for local testing.
-* [Python 3 installed](https://www.python.org/downloads/)
-
-The `docker/awslambda/sam` directory contains a sample [SAM](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html)
-template that can be used to test the ParallelCluster API.
-
-### Run a local AWS APIGateway endpoint with SAM
-The SAM template can be used together with the SAM CLI to locally test the ParallelCluster API as if it were hosted
-behind an API Gateway endpoint.
-
-To do so move to the `docker/awslambda/sam` directory and run:
-
-```bash
-sam build
-sam local start-api
-```
-
-To only invoke the AWS Lambda function locally you can run:
-```bash
-sam build
-sam local invoke ParallelClusterFunction --event ../test-events/event.json
-```
-
-For further details and
-to review all the testing features available through SAM please refer to the official
-[SAM docs](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-test-and-debug.html).
-
-### Deploy the API test infrastructure
-To build and deploy your application for the first time, run the following in your shell:
-
-```bash
-sam build
-sam deploy --guided
-```
-
-The first command will build a docker image from a Dockerfile and then copy the source of your application inside the Docker image.
-The second command will package and deploy your application to AWS, with a series of prompts.
-
-#### Fetch, tail, and filter Lambda function logs
-
-To simplify troubleshooting, SAM CLI has a command called `sam logs`. `sam logs` lets you fetch logs generated by your
-deployed Lambda function from the command line. In addition to printing the logs on the terminal, this command has
-several nifty features to help you quickly find the bug.
-
-NOTE: This command works for all AWS Lambda functions; not just the ones you deploy using SAM.
-
-```bash
-sam logs -n ParallelClusterFunction --stack-name pcluster-lambda --tail
-```
-
-You can find more information and examples about filtering Lambda function logs in the 
-[SAM CLI Documentation](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-logging.html).
diff --git a/api/client/patch-client.sh b/api/client/patch-client.sh
@@ -7,6 +7,8 @@
 # OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
 # limitations under the License.
 
+set -ex
+
 cp client/resources/sigv4_auth.py client/src/pcluster_client
 patch -u -N client/src/pcluster_client/api_client.py < client/resources/api_client.py.patch
 patch -u -N client/src/requirements.txt < client/resources/client-requirements.txt.patch

diff --git a/api/client/resources/api_client.py.patch b/api/client/resources/api_client.py.patch
@@ -8,10 +8,10 @@
 
 
  class ApiClient(object):
-@@ -603,6 +604,9 @@
-         if not auth_settings:
+@@ -633,6 +634,9 @@ class ApiClient(object):
+                     headers, queries, resource_path, method, body, auth_setting)
              return
- 
+
 +        if 'aws.auth.sigv4' in auth_settings:
 +            sigv4_auth(method, self.configuration.host, resource_path, queries, body, headers)
 +

diff --git a/api/docker/awslambda/sam/template.yaml b/api/docker/awslambda/sam/template.yaml