Commit

Merge branch 'develop' into develop
hanwen-pcluste authored Apr 19, 2023
2 parents b5b723d + e49c40e commit 16a5239
Showing 207 changed files with 15,635 additions and 3,874 deletions.
9 changes: 8 additions & 1 deletion .github/workflows/ci.yml
@@ -94,6 +94,13 @@ jobs:
run: pip install tox
- name: Run Tox
run: cd ${{ matrix.toxdir }} && tox -e ${{ matrix.toxenv }}
- name: Upload code coverage report to Codecov
uses: codecov/codecov-action@v3
if: ${{ endsWith(matrix.toxenv, '-cov') }}
with:
files: cli/coverage.xml
flags: unittests
verbose: true
awsbatch-cli-tests:
name: AWS Batch CLI Tests
runs-on: ${{ matrix.os }}
@@ -169,7 +176,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: mikefarah/yq@v4.6.3
- uses: mikefarah/yq@v4.32.2
- run: api/docker/awslambda/docker-build.sh
shellcheck:
name: Shellcheck
4 changes: 2 additions & 2 deletions .github/workflows/codeql-analysis.yml
@@ -22,9 +22,9 @@ jobs:
- name: Checkout repository
uses: actions/checkout@v2
- name: Initialize CodeQL
uses: github/codeql-action/init@v1
uses: github/codeql-action/init@v2
with:
languages: ${{ matrix.language }}
queries: +security-and-quality
- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@v1
uses: github/codeql-action/analyze@v2
1 change: 1 addition & 0 deletions .gitignore
@@ -17,3 +17,4 @@ report.html
tests_outputs/
.python-version
test.yaml
.vscode
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -16,6 +16,7 @@ repos:
- id: check-symlinks
- id: end-of-file-fixer
- id: pretty-format-json
args: ['--autofix']
- id: requirements-txt-fixer
- id: mixed-line-ending
args: ['--fix=no']
68 changes: 64 additions & 4 deletions CHANGELOG.md
@@ -4,16 +4,58 @@ CHANGELOG
3.6.0
----
**ENHANCEMENTS**
- Add a CloudFormation custom resource for creating and managing clusters from CloudFormation.
- Add `mem_used_percent` and `disk_used_percent` metrics for head node memory and root volume disk utilization tracking on the ParallelCluster CloudWatch dashboard, and set up alarms for monitoring these metrics.

**ENHANCEMENTS**
- Add log rotation support for ParallelCluster managed logs.
- Track common errors of compute nodes on the CloudWatch Dashboard.
- Increase the limit on the maximum number of queues per cluster from 10 to 100. Each cluster can however have a maximum number of 150 compute resources and each queue can have a maximum of 40 compute resources.
- Allow specifying a sequence of multiple custom action scripts per event.
- Add support for customizing the cluster Slurm configuration via the ParallelCluster configuration YAML file.
- Track the longest dynamic node idle time in CloudWatch Dashboard.
- Add new configuration section `HealthChecks/Gpu` for enabling the GPU Health Check in the compute node before job execution.
- Add support for `DetailedMonitoring` in the `Monitoring` section.
- Add support for `Tags` in the `SlurmQueues` and `SlurmQueues/ComputeResources` section.
- Build Slurm with support for LUA.

**CHANGES**
- Increase the default `RetentionInDays` of CloudWatch logs from 14 to 180 days.
- Set Slurm prolog and epilog configurations to target a directory, /opt/slurm/etc/scripts/prolog.d/ and /opt/slurm/etc/scripts/epilog.d/ respectively.
- Upgrade Slurm to version 23.02.1.
- Upgrade munge to version 0.5.15.
- Upgrade image used by CodeBuild environment when building container images for AWS Batch clusters, from
`aws/codebuild/amazonlinux2-x86_64-standard:3.0` to `aws/codebuild/amazonlinux2-x86_64-standard:4.0` and from
`aws/codebuild/amazonlinux2-aarch64-standard:1.0` to `aws/codebuild/amazonlinux2-aarch64-standard:2.0`.

**BUG FIXES**
- Fix EFS, FSx network security groups validators to avoid reporting false errors.
- Fix missing tagging of resources created by ImageBuilder during the `build-image` operation.
- Fix Update policy for MaxCount to always perform numerical comparisons on MaxCount property.
- Fix IP association on instances with multiple network cards.
- Fix replacement of StoragePass in slurm_parallelcluster_slurmdbd.conf when a queue parameter update is performed and the Slurm accounting configurations are not updated.
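
The queue and compute-resource limits quoted in the 3.6.0 ENHANCEMENTS entry (100 queues per cluster, 150 compute resources per cluster, 40 compute resources per queue) can be illustrated with a small check. This is only a sketch under stated assumptions: `check_limits` and the constant names are hypothetical and are not ParallelCluster's own validators, which are not shown in this diff.

```python
from typing import Dict, List

# Limits as quoted in the 3.6.0 changelog entry.
MAX_QUEUES_PER_CLUSTER = 100
MAX_COMPUTE_RESOURCES_PER_CLUSTER = 150
MAX_COMPUTE_RESOURCES_PER_QUEUE = 40


def check_limits(queues: Dict[str, int]) -> List[str]:
    """Return limit violations for a mapping of queue name -> compute resource count.

    Hypothetical helper for illustration only.
    """
    errors: List[str] = []

    # Cluster-wide cap on the number of queues.
    if len(queues) > MAX_QUEUES_PER_CLUSTER:
        errors.append(
            f"{len(queues)} queues exceed the limit of {MAX_QUEUES_PER_CLUSTER}"
        )

    # Per-queue cap on compute resources.
    for name, count in queues.items():
        if count > MAX_COMPUTE_RESOURCES_PER_QUEUE:
            errors.append(
                f"queue {name!r} has {count} compute resources, "
                f"limit is {MAX_COMPUTE_RESOURCES_PER_QUEUE}"
            )

    # Cluster-wide cap on the total number of compute resources.
    total = sum(queues.values())
    if total > MAX_COMPUTE_RESOURCES_PER_CLUSTER:
        errors.append(
            f"{total} compute resources exceed the cluster limit of "
            f"{MAX_COMPUTE_RESOURCES_PER_CLUSTER}"
        )
    return errors
```

Note that the three limits interact: four queues with 40 compute resources each satisfy the per-queue limit but exceed the cluster-wide total of 150.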

3.5.1
-----
**ENHANCEMENTS**
- Add a new way to distribute ParallelCluster as a self-contained executable shipped with a dedicated installer.
- Add support for US isolated region us-isob-east-1.

**CHANGES**
- Upgrade EFA installer to `1.22.0`
  - Efa-driver: `efa-2.1.1g`
  - Efa-config: `efa-config-1.13-1`
  - Efa-profile: `efa-profile-1.5-1`
  - Libfabric-aws: `libfabric-aws-1.17.0-1`
  - Rdma-core: `rdma-core-43.0-1`
  - Open MPI: `openmpi40-aws-4.1.5-1`
- Upgrade NICE DCV to version `2022.2-14521`.
  - server: `2022.2.14521-1`
  - xdcv: `2022.2.519-1`
  - gl: `2022.2.1012-1`
  - web_viewer: `2022.2.14521-1`

**BUG FIXES**
- Fix an issue where updating a cluster to remove shared EBS volumes could cause node launch failures if `MountDir` matches the same pattern in `/etc/exports`.
- Fix for compute_console_output log file being truncated at every clustermgtd iteration.

3.5.0
-----
@@ -23,7 +65,6 @@ CHANGELOG
- Add a Python library to allow customers to use ParallelCluster functionalities in their own code.
- Add logging of compute node console output to CloudWatch on compute node bootstrap failure.
- Add failures field containing failure code and reason to `describe-cluster` output when cluster creation fails.
- Add support for US isolated regions: us-iso-* and us-isob-*.

**CHANGES**
- Upgrade Slurm to version 22.05.8.
@@ -204,6 +245,25 @@ CHANGELOG
- Fix ParallelCluster API stack update failure when upgrading from a previous version. Add resource pattern used for the `ListImagePipelineImages` action in the `EcrImageDeletionLambdaRole`.
- Fix ParallelCluster API adding missing permissions needed to import/export from S3 when creating an FSx for Lustre storage.

3.1.5
------

**CHANGES**
- Upgrade EFA installer to `1.18.0`
  - Efa-driver: `efa-1.16.0-1`
  - Efa-config: `efa-config-1.11-1`
  - Efa-profile: `efa-profile-1.5-1`
  - Libfabric-aws: `libfabric-aws-1.16.0~amzn4.0-1`
  - Rdma-core: `rdma-core-41.0-2`
  - Open MPI: `openmpi40-aws-4.1.4-2`
- Add `lambda:ListTags` and `lambda:UntagResource` to `ParallelClusterUserRole` used by ParallelCluster API stack for cluster update.
- Upgrade Intel MPI Library to 2021.6.0.602.
- Upgrade NVIDIA driver to version 470.141.03.
- Upgrade NVIDIA Fabric Manager to version 470.141.03.

**BUG FIXES**
- Fix Slurm issue that prevents idle nodes termination.

3.1.4
------

@@ -680,7 +740,7 @@ CHANGELOG
- Improve retrieval of instance type info by using `DescribeInstanceType` API.
- Remove `custom_awsbatch_template_url` configuration parameter.
- Upgrade `pip` to latest version in virtual environments.
- Upgrade image used by CodeBuild environment when building container images for Batch clusters, from
- Upgrade image used by CodeBuild environment when building container images for AWS Batch clusters, from
`aws/codebuild/amazonlinux2-x86_64-standard:1.0` to `aws/codebuild/amazonlinux2-x86_64-standard:3.0`.

**BUG FIXES**
97 changes: 5 additions & 92 deletions api/README.md
@@ -91,98 +91,11 @@ correctness of the API model evey time a PR is opened.
The ParallelCluster OpenAPI Generator workflow (`workdlows/openapi_generator.yml`) defines a `generate-openapi-model`
build step that automatically adds to the PR the generated OpenAPI model in case this was not included in the commit.

## Packaging the API as an AWS Lambda container
## Testing

The `docker/awslambda` directory contains the definition of a Dockerfile that is used to package the ParallelCluster
API as an AWS Lambda function. Running the `docker/awslambda/docker-build.sh` script will produce a `pcluster-lambda`
Docker container that packages and exposes the ParallelCluster API in a format which is compatible with the AWS Lambda runtime.
The API is a facade on top of the controllers (as is the CLI), so much of the underlying functionality can be tested
through unit tests and integration tests that exercise the operations.

### Running Testing and Debugging the API locally
To test the API specifically, there are integration tests that deploy the API and test its functionality using
the generated client.

Once the Docker image has been successfully built you have the following options:

#### Run a shell in the container
Use the following to run a shell in the container: `docker run -it --entrypoint /bin/bash pcluster-lambda`.

This is particularly useful to debug issues with the container runtime.

#### Run a local AWS Lambda endpoint
Use the following to run a local AWS Lambda endpoint hosting the API: `docker run -e POWERTOOLS_TRACE_DISABLED=1 -e AWS_REGION=eu-west-1 -p 9000:8080 pcluster-lambda`

Then you can use the following to send requests to the local endpoint:
`curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d @docker/awslambda/test-events/event.json`

This is useful to test the integration with AWS Lambda.

#### Run the Flask development server
Use the following to run a local Flask development server hosting the API: `docker run -p 8080:8080 --entrypoint python pcluster-lambda -m pcluster.api.flask_app`

Then you can navigate to the following url to test the API: `http://0.0.0.0:8080/ui`
Note that to enable swagger-ui you have to build the docker with `--build-arg PROFILE=dev`.

This is particularly useful to ignore the AWS Lambda layer and directly hit the Flask application with plain HTTP requests.
An even simpler way to do this which also offers live reloading of the API code, is to just ignore the Docker container
and run a local Flask server on your host by executing `cd ../cli/src && python -m pcluster.api.flask_app`

## Deploy the API test infrastructure with SAM cli (API Gateway + Lambda)
The Serverless Application Model Command Line Interface (SAM CLI) is an extension of the AWS CLI that adds functionality
for building and testing Lambda applications. It uses Docker to run your functions in an Amazon Linux environment that
matches Lambda. It can also emulate your application's build environment and API.

To use the SAM CLI, you need the following tools.

* SAM CLI - [Install the SAM CLI](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-install.html)
* Docker - [Install Docker community edition](https://hub.docker.com/search/?type=edition&offering=community)

You may need the following for local testing.
* [Python 3 installed](https://www.python.org/downloads/)

The `docker/awslambda/sam` directory contains a sample [SAM](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/what-is-sam.html)
template that can be used to test the ParallelCluster API.

### Run a local AWS APIGateway endpoint with SAM
The SAM template can be used together with the SAM CLI to locally test the ParallelCluster API as if it were hosted
behind an API Gateway endpoint.

To do so move to the `docker/awslambda/sam` directory and run:

```bash
sam build
sam local start-api
```

To only invoke the AWS Lambda function locally you can run:
```bash
sam build
sam local invoke ParallelClusterFunction --event ../test-events/event.json
```

For further details and
to review all the testing features available through SAM please refer to the official
[SAM docs](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-test-and-debug.html).

### Deploy the API test infrastructure
To build and deploy your application for the first time, run the following in your shell:

```bash
sam build
sam deploy --guided
```

The first command will build a docker image from a Dockerfile and then copy the source of your application inside the Docker image.
The second command will package and deploy your application to AWS, with a series of prompts.

#### Fetch, tail, and filter Lambda function logs

To simplify troubleshooting, SAM CLI has a command called `sam logs`. `sam logs` lets you fetch logs generated by your
deployed Lambda function from the command line. In addition to printing the logs on the terminal, this command has
several nifty features to help you quickly find the bug.

NOTE: This command works for all AWS Lambda functions; not just the ones you deploy using SAM.

```bash
sam logs -n ParallelClusterFunction --stack-name pcluster-lambda --tail
```

You can find more information and examples about filtering Lambda function logs in the
[SAM CLI Documentation](https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-logging.html).
2 changes: 2 additions & 0 deletions api/client/patch-client.sh
@@ -7,6 +7,8 @@
# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and
# limitations under the License.

set -ex

cp client/resources/sigv4_auth.py client/src/pcluster_client
patch -u -N client/src/pcluster_client/api_client.py < client/resources/api_client.py.patch
patch -u -N client/src/requirements.txt < client/resources/client-requirements.txt.patch
6 changes: 3 additions & 3 deletions api/client/resources/api_client.py.patch
@@ -8,10 +8,10 @@


class ApiClient(object):
@@ -603,6 +604,9 @@
if not auth_settings:
@@ -633,6 +634,9 @@ class ApiClient(object):
headers, queries, resource_path, method, body, auth_setting)
return

+ if 'aws.auth.sigv4' in auth_settings:
+ sigv4_auth(method, self.configuration.host, resource_path, queries, body, headers)
+
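
For context on the `aws.auth.sigv4` branch in this patch: the contents of `sigv4_auth.py` are not shown in this diff, so the following is not the project's implementation. It is a minimal standard-library sketch of one step of AWS Signature Version 4, the signing-key derivation, to show what kind of work a SigV4 helper has to do before it can add an `Authorization` header. `derive_signing_key` and its parameter names are hypothetical, and the `execute-api` service name in the usage note assumes the API is served through API Gateway.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    """Return the raw HMAC-SHA256 digest of message under key."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_signing_key(secret_key: str, date_stamp: str, region: str, service: str) -> bytes:
    """Derive a SigV4 signing key by chaining HMACs over date, region and service.

    date_stamp is the request date in YYYYMMDD form.
    Hypothetical helper for illustration only.
    """
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    # The final step binds the key to the fixed "aws4_request" terminator.
    return _hmac_sha256(k_service, "aws4_request")
```

A call such as `derive_signing_key(secret, "20230419", "eu-west-1", "execute-api")` yields a 32-byte key scoped to that date, region and service. The key is then used to sign the request's string-to-sign, which is why the patched client passes the method, host, path, query parameters, body and headers to its helper.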
72 changes: 0 additions & 72 deletions api/docker/awslambda/sam/template.yaml

This file was deleted.
