Skip to content

Self patch a Cluster Used for Submitting Multi node Parallel Jobs through AWS Batch

Nathan Stornetta edited this page Nov 4, 2019 · 3 revisions

A small number of users have been affected by an issue in submitting multi-node parallel jobs with AWS Batch as the job scheduler. This issue is fixable through self-patching by applying the script below that has been authored by the AWS ParallelCluster team.

The recommended course of action if you are affected by this issue is to copy the script below and paste it into a new .sh file of your choosing on your local machine. The script can then be run for each combination of cluster and region that are affected by this issue as follows: <filename>.sh -r <affected region> -c <affected cluster name>

For example, if I copy-pasted the script below and saved my shell script as batch_patch.sh, the name of my cluster were default, and my cluster was located in us-east-1, I would run the patching script by running this command from a command line of my local machine: batch_patch.sh -r us-east-1 -c default

Please note the following requirements to successfully run this batch script:

  • The AWS CLI (instructions for downloading the CLI can be found here)
  • The zip command
  • The tar command
!/bin/bash
set -e
usage()
{
echo "usage: script-name.sh -r region -c cluster-name"
}

CLUSTER_NAME=
REGION=

while [ "$1" != "" ]; do
case $1 in
-r | --region ) shift
REGION=$1
;;
-c | --cluster-name ) shift
CLUSTER_NAME=$1
;;
-h | --help ) usage
exit
;;

    * )                     usage
                            exit 1
esac
shift
done

if [ -z "${CLUSTER_NAME}" ] || [ -z "${REGION}" ]; then
usage
exit
fi

STACK_NAME="parallelcluster-${CLUSTER_NAME}"
S3_BUCKET=$(aws cloudformation describe-stacks --stack-name "${STACK_NAME}" --query "Stacks[0].Outputs[?OutputKey=='ResourcesS3Bucket'].OutputValue" --output text --region ${REGION})
WD="/tmp/pcluster-awsbatch-fix"

echo "Dowloading patched artifacts"
curl -s -L https://github.com/aws/aws-parallelcluster/tarball/b781987 > /tmp/pcluster-awsbatch-fix.tar.gz
mkdir -p "${WD}" && tar -xf /tmp/pcluster-awsbatch-fix.tar.gz -C "${WD}"
cd "${WD}/aws-aws-parallelcluster-b781987/cli/pcluster/resources/batch/docker/" && zip --quiet -r "${WD}/artifacts.zip" ./*

echo "Uploading patched artifacts to ${S3_BUCKET}"
aws s3 cp "${WD}/artifacts.zip" s3://${S3_BUCKET}/docker/artifacts.zip --region ${REGION}

CODEBUILD_PROJECT="${STACK_NAME}-build-docker-images-project"
BUILD_ID=$(aws codebuild start-build --project-name ${CODEBUILD_PROJECT} --query "build.id" --output text --region ${REGION})

echo "Applying patch to docker image..."
BUILD_STATUS=
while [ "${BUILD_STATUS}" != "SUCCEEDED" ] && [ "${BUILD_STATUS}" != "FAILED" ]; do
sleep 20
BUILD_STATUS=$(aws codebuild batch-get-builds --ids ${BUILD_ID} --query "builds[0].buildStatus" --output text --region ${REGION})
echo "Build status is ${BUILD_STATUS}"
done

echo "Patch applied successfully!"
Clone this wiki locally