# First round of ML updates #413

Open · wants to merge 4 commits into base: `ml`
Changes from 2 commits
`content/03-Cluster/05-run-efa-nccl-tests.md` (106 changes: 0 additions & 106 deletions)

This file was deleted.

`content/03-Cluster/06-ami.md` (65 changes: 65 additions & 0 deletions)
---
title: "e. Create a Machine Image
date: 2020-05-12T13:27:03Z
weight : 40
tags : ["tutorial", "EFA", "ec2", "NCCL", "MPI", "Benchmark", "compile"]
---

ParallelCluster provides a default AMI, but we can use a custom AMI to tailor the experience to our use case. Let's look at two ways to build a custom AMI: starting from a DLAMI, or building a fully custom AMI with Packer.

# DLAMI
[DLAMI](https://docs.aws.amazon.com/dlami/latest/devguide/what-is-dlami.html) contains GPU dependencies and ML frameworks (for example, PyTorch) that are seamlessly integrated with other AWS services such as EFA.
Let's create `ami.yml` (see the [DLAMI release notes](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html) to get the AMI ID, used as `ParentImage` in the config below):
```
Build:
  SecurityGroupIds: [<insert your SG - it requires outbound traffic>]
  SubnetId: subnet-123
  InstanceType: g5.2xlarge # you can choose a different instance
  ParentImage: ami-123
```

**Collaborator:** Can you automate this?

**@lipovsek-aws (Author), Apr 27, 2023:** It's automated with the pcluster CLI. We could add CloudFormation that sets up the VPC, subnet, and SG and runs the pcluster CLI via a bash runner (or, even better, triggers a Lambda to run it). I thought about creating a short bash script with templating for the config YAML file, but I don't think it's worth the effort since it's not abstracting anything and adds boilerplate.

**Collaborator:** 100% chance someone won't know how to find a security group or even what it is. Provide the steps to create one and retrieve the ID, or provide a retrieve-security-group-id step.
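
Picking up the reviewer's point: a minimal sketch for looking up a security group ID with the AWS CLI, assuming your credentials and default region are already configured:

```bash
# List security group names, IDs, and VPCs so you can pick one for ami.yml
aws ec2 describe-security-groups \
  --query 'SecurityGroups[].{Name:GroupName,Id:GroupId,Vpc:VpcId}' \
  --output table
```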

Now run the [pcluster command](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.build-image-v3.html) that adds all ParallelCluster dependencies to your DLAMI of choice:
```
pcluster build-image -c ami.yml -i NEW_AMI_ID -r REGION
```

**Collaborator:** variables... Suggested change: `pcluster build-image -c ami.yml -i $NEW_AMI_ID -r $AWS_REGION`
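
The build runs asynchronously; as a sketch (assuming the ParallelCluster v3 CLI), you can poll its status with:

```bash
# Check the status of the image build (same image id and region as above)
pcluster describe-image --image-id NEW_AMI_ID --region REGION
```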

# Fully custom AMI
We created a Packer configuration for the AMI that allows you to customize different aspects of the deep learning toolchain (for example, using a specific CUDA version).

#### 1 - Assets Required to Build the Image

First let's fetch the assets required to build the image:

```bash
wget https://ml.hpcworkshops.com/scripts/packer/packer.tar.gz
tar -xzf packer.tar.gz
```

**@mhuguesaws (Collaborator), Apr 27, 2023:** Explain the content of the archive before proposing to download it. For the sake of clarity, I suggest you make the reader download the 3 files separately. At least they can review the files on GitHub beforehand. That also gives reviewers an opportunity to look at the content of the files that are part of this workshop.

**Author:** @sean-smith you added this and I just moved it, any specific reason for this?

**Collaborator:** We don't need this anymore. This can just be: `git clone git@github.com:aws-samples/parallelcluster-efa-gpu-preflight-ami.git`
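
Picking up the review point above: you can list the archive contents before extracting, so you know what you are getting:

```bash
# Peek inside the tarball before (or instead of) extracting it
tar -tzf packer.tar.gz
```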

This consists of:
* `nvidia-efa-ml-al2-enroot_pyxis.json`: your main image file. It consists of several sections that define the resources (instance, base AMI, security groups, ...) you will use to build your image; the base AMI is a ParallelCluster Amazon Linux 2 base AMI. The provisioners section consists of inline scripts that will be executed serially to install the desired software stack onto your image.
* `variables.json`: contains some key variables. Packer refers to them in the image script through user-variable calls.
* `enroot.conf`: in the `enroot` directory; contains the [Enroot](https://github.com/NVIDIA/enroot) configuration that will be copied to your AMI.

#### 2 - Installing Packer

You can install Packer using [Brew](https://brew.sh/) on OSX or Linux as follows:

```bash
brew install packer
```

**@mhuguesaws (Collaborator), Apr 27, 2023:** Standardize on Cloud9 or CloudShell. If I have Windows, how do I do this? Provide a specific version to prevent regression in the future.

**Collaborator:** We'll standardize on Cloud9. CloudShell storage space is too limited. IMHO most ML devops don't need instructions on how to use the CLI. This is different than HPC.

Alternatively, you can download the Packer binary from the [tool website](https://www.packer.io/). Ensure your `PATH` is set to use the binary, or use its absolute path. Once Packer is installed, proceed to the next stage.

**Collaborator:** too many options.

**Collaborator:** We'll remove.

#### 3 - Build Your Image

Once Packer is installed, run the command below from the assets directory:

```bash
packer build -color=true -var-file variables.json nvidia-efa-ml-al2-enroot_pyxis.json | tee build_AL2.log
```
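
Optionally, you can syntax-check the template before launching a full build; a minimal sketch using Packer's built-in validator:

```bash
# Validate the template and variables without building anything
packer validate -var-file variables.json nvidia-efa-ml-al2-enroot_pyxis.json
```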

Packer will start by creating the instance and associated resources (EC2 key, security group, ...), run through the installation scripts, shut down the instance, create an image from it, and then terminate the instance.

The process is automated and the output will be displayed on your terminal. If Packer encounters an error during the installation, it stops the process and terminates all the resources. You will have to go through its log to identify where the error occurred and correct it.

Once the image is built, feel free to use it to create new clusters. The image can be retrieved from the Amazon EC2 console under "Images -> AMIs".
`content/04-Verify cluster/01-preflight.md` (10 changes: 10 additions & 0 deletions)
---
title: "b. Download, compile and run the NCCL tests"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
title: "b. Download, compile and run the NCCL tests"
title: "b. Run the NCCL tests"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed.

date: 2020-05-12T13:27:03Z
weight : 40
tags : ["tutorial", "EFA", "ec2", "NCCL", "MPI", "Benchmark", "compile"]
---

# Preflight

We created an automated cluster validator to make sure all your dependencies are set up correctly. Follow the instructions in the [preflight](https://github.com/aws-samples/parallelcluster-efa-gpu-preflight-ami/tree/main/preflight) repository.
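
A minimal sketch of fetching the validator onto the head node (the exact run steps are in that repository's README, which may change):

```bash
# Fetch the preflight validator; follow the repository README from here
git clone https://github.com/aws-samples/parallelcluster-efa-gpu-preflight-ami.git
cd parallelcluster-efa-gpu-preflight-ami/preflight
```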
`content/04-Verify cluster/02-run-efa-nccl-tests.md` (177 changes: 177 additions & 0 deletions)
---
title: "b. Download, compile and run the NCCL tests"
date: 2020-05-12T13:27:03Z
weight : 40
tags : ["tutorial", "EFA", "ec2", "NCCL", "MPI", "Benchmark", "compile"]
---



In this section, you will download, compile, and run on 2 nodes a common GPU-to-GPU communication benchmark from NVIDIA that is used in ML frameworks such as PyTorch.


#### Download and Compile the NCCL tests

You can run the script below in the home directory on the head node of your ParallelCluster to download and compile the NCCL tests:

```bash
cd ~

cat > compile_nccl.sh << EOF
#!/bin/bash

module load intelmpi

git clone -b v2.17.1-1 https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE='-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80'
cd ..

git clone -b aws https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
./autogen.sh
./configure --prefix=${HOME}/aws-ofi-nccl/install --with-mpi=/opt/amazon/openmpi --with-libfabric=/opt/amazon/efa --with-cuda=/usr/local/cuda
make
make install
cd ..

git clone -b v2.13.6 https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 CUDA_HOME=/usr/local/cuda MPI_HOME=/opt/amazon/openmpi NCCL_HOME=${HOME}/nccl/build

echo "Installation done, run a quick test!"
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${HOME}/nccl/build/lib:${HOME}/aws-ofi-nccl/install/lib
/opt/amazon/openmpi/bin/mpirun -np $(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l) ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

EOF

sh ./compile_nccl.sh
```

On `cat > compile_nccl.sh << EOF`:

**@mhuguesaws (Collaborator), Apr 27, 2023:** Absolute path. Suggested change: `cat > ~/compile_nccl.sh << EOF`

**Collaborator:** changed

On the `make -j src.build ...` line:

**Collaborator:** I hope the CUDA version does not change...

**Author:** It's the default "system" CUDA; new AMIs will have the new CUDA on the same path. I can add a note that if they have a custom CUDA they should change this path. I assumed that anyone advanced enough to pin a specific CUDA version will be familiar with these parameters. But good point, I'll add the note.

On the `make MPI=1 ...` line:

**Collaborator:** module load

**Author:** It's using OpenMPI from the pcluster AMI, no need for IntelMPI to get correct performance.

**Collaborator:** module load openmpi. That's what I said.
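
If the script finishes cleanly, the benchmark binary used below should exist; a quick check:

```bash
# The compiled NCCL test binary should now be present
ls ~/nccl-tests/build/all_reduce_perf
```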


#### Submit NCCL benchmark

Create your job submission script for the NCCL `all_reduce_perf` benchmark and use **sbatch** to submit your job:

```bash
cat > nccl_test.sbatch << \EOF
#!/bin/bash
#SBATCH -n 192
#SBATCH -N 2
#SBATCH --gres=gpu:8
#SBATCH --output=nccl.out

NCCL_TEST_PATH=${HOME}/nccl-tests/build
MPI_PATH=/opt/amazon/openmpi

export LD_LIBRARY_PATH=${HOME}/nccl/build/lib:${HOME}/aws-ofi-nccl/install/lib

export NCCL_PROTO=simple
export FI_EFA_USE_DEVICE_RDMA=1 # use for P4
export FI_EFA_FORK_SAFE=1
export FI_PROVIDER=efa
export FI_EFA_ENABLE_SHM_TRANSFER=0

export NCCL_DEBUG=INFO
export FI_LOG_LEVEL=1

${MPI_PATH}/bin/mpirun --map-by ppr:8:node --rank-by slot \
--mca pml ^cm --mca btl tcp,self \
--mca btl_tcp_if_exclude lo,docker0 --bind-to none \
${NCCL_TEST_PATH}/all_reduce_perf -b 8 -e 9G -f 2 -g 1 -c 1 -n 100
EOF

sbatch nccl_test.sbatch
watch squeue
```

On `cat > nccl_test.sbatch << \EOF`:

**@mhuguesaws (Collaborator), Apr 27, 2023:** Suggested change: `cat > ~/nccl_test.sbatch << EOF`

On `MPI_PATH=/opt/amazon/openmpi`:

**Collaborator:** module load...

On the `LD_LIBRARY_PATH` export:

**Collaborator:** use module....

On the `mpirun` line:

**@mhuguesaws (Collaborator), Apr 27, 2023:** Once the openmpi is loaded, no need to have the path like this.

**Collaborator:** Suggested change: `${MPI_PATH}/bin/mpirun --map-by ppr:4:socket \`

You have to wait a couple of minutes for your compute instances to come up. Once you see the job go from **PD** (pending) to **R** (running), you know the instances are up. Type **Ctrl-C** to exit `squeue` at any point.

After the job has completed, view the output with `cat ~/nccl.out`. You will see something like:

```bash
$ head -n 19 ~/nccl.out
# nThread 1 nGpus 1 minBytes 8 maxBytes 9663676416 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 77208 on new-st-gpu-1 device 0 [0x10] NVIDIA A100-SXM4-80GB
# Rank 1 Group 0 Pid 77209 on new-st-gpu-1 device 1 [0x10] NVIDIA A100-SXM4-80GB
# Rank 2 Group 0 Pid 77211 on new-st-gpu-1 device 2 [0x20] NVIDIA A100-SXM4-80GB
# Rank 3 Group 0 Pid 77212 on new-st-gpu-1 device 3 [0x20] NVIDIA A100-SXM4-80GB
# Rank 4 Group 0 Pid 77213 on new-st-gpu-1 device 4 [0x90] NVIDIA A100-SXM4-80GB
# Rank 5 Group 0 Pid 77214 on new-st-gpu-1 device 5 [0x90] NVIDIA A100-SXM4-80GB
# Rank 6 Group 0 Pid 77215 on new-st-gpu-1 device 6 [0xa0] NVIDIA A100-SXM4-80GB
# Rank 7 Group 0 Pid 77216 on new-st-gpu-1 device 7 [0xa0] NVIDIA A100-SXM4-80GB
# Rank 8 Group 0 Pid 95401 on new-st-gpu-2 device 0 [0x10] NVIDIA A100-SXM4-80GB
# Rank 9 Group 0 Pid 95402 on new-st-gpu-2 device 1 [0x10] NVIDIA A100-SXM4-80GB
# Rank 10 Group 0 Pid 95403 on new-st-gpu-2 device 2 [0x20] NVIDIA A100-SXM4-80GB
# Rank 11 Group 0 Pid 95404 on new-st-gpu-2 device 3 [0x20] NVIDIA A100-SXM4-80GB
# Rank 12 Group 0 Pid 95405 on new-st-gpu-2 device 4 [0x90] NVIDIA A100-SXM4-80GB
# Rank 13 Group 0 Pid 95406 on new-st-gpu-2 device 5 [0x90] NVIDIA A100-SXM4-80GB
# Rank 14 Group 0 Pid 95407 on new-st-gpu-2 device 6 [0xa0] NVIDIA A100-SXM4-80GB
# Rank 15 Group 0 Pid 95408 on new-st-gpu-2 device 7 [0xa0] NVIDIA A100-SXM4-80GB
```
This tells us we have 16 GPUs (one rank is one GPU) and 2 nodes in the job.


We can check that EFA is indeed being used. You should see lines like these:
```bash
cat nccl.out

....

new-st-gpu-2:8802:8863 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.5.0aws
new-st-gpu-2:8802:8863 [5] NCCL INFO NET/OFI Configuring AWS-specific options

...

new-st-gpu-2:8804:8862 [6] NCCL INFO NET/OFI Selected Provider is efa (found 4 nics)
new-st-gpu-2:8804:8862 [6] NCCL INFO Using network AWS Libfabric
....

```
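
Rather than scanning the whole log, you can pull out just those lines, for example:

```bash
# Confirm the EFA provider and AWS Libfabric network were selected
grep -E "Selected Provider|Using network" ~/nccl.out
```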

Alternatively, we can check a hardware counter:
```bash
cat /sys/class/infiniband/rdmap*/ports/1/hw_counters/rx_bytes
```
This reports received bytes (the counter should be larger after the test); see the full [workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/5563d004-a892-4c83-8d82-d8fa6baa0517/en-US/monitor) for more details.
`rdmap*` applies to P4 instances; you will see other device names (`efa_0`, for example) on other instance types.
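
A minimal sketch for comparing the counter around a run, assuming it is executed on a compute node (for example via `srun`) and that the device name matches `rdmap*`:

```bash
# Sum rx_bytes across EFA devices before and after the benchmark
before=$(cat /sys/class/infiniband/rdmap*/ports/1/hw_counters/rx_bytes | paste -sd+ - | bc)
# ... run the NCCL test here ...
after=$(cat /sys/class/infiniband/rdmap*/ports/1/hw_counters/rx_bytes | paste -sd+ - | bc)
echo "rx_bytes grew by $((after - before))"
```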


We can check the end of the result file (`~/nccl.out`, set in `~/nccl_test.sbatch` via `#SBATCH --output=nccl.out`; you can also see it as `StdOut` in `scontrol show job ${YOUR_JOB_ID}`):
```bash
tail -n 60 nccl.out
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 174.2 0.00 0.00 0 171.7 0.00 0.00 0
16 4 float sum -1 171.1 0.00 0.00 0 167.0 0.00 0.00 0
32 8 float sum -1 162.4 0.00 0.00 0 158.1 0.00 0.00 0
64 16 float sum -1 157.9 0.00 0.00 0 157.9 0.00 0.00 0
128 32 float sum -1 158.7 0.00 0.00 0 158.3 0.00 0.00 0
256 64 float sum -1 158.7 0.00 0.00 0 158.7 0.00 0.00 0
512 128 float sum -1 158.7 0.00 0.01 0 159.1 0.00 0.01 0
1024 256 float sum -1 161.3 0.01 0.01 0 161.4 0.01 0.01 0
2048 512 float sum -1 176.5 0.01 0.02 0 175.7 0.01 0.02 0
4096 1024 float sum -1 165.7 0.02 0.05 0 165.7 0.02 0.05 0
8192 2048 float sum -1 172.1 0.05 0.09 0 171.5 0.05 0.09 0
16384 4096 float sum -1 189.9 0.09 0.16 0 189.0 0.09 0.16 0
32768 8192 float sum -1 220.4 0.15 0.28 0 218.2 0.15 0.28 0
65536 16384 float sum -1 224.0 0.29 0.55 0 221.0 0.30 0.56 0
131072 32768 float sum -1 227.3 0.58 1.08 0 223.3 0.59 1.10 0
262144 65536 float sum -1 234.2 1.12 2.10 0 233.2 1.12 2.11 0
524288 131072 float sum -1 257.4 2.04 3.82 0 257.3 2.04 3.82 0
1048576 262144 float sum -1 307.4 3.41 6.40 0 306.7 3.42 6.41 0
2097152 524288 float sum -1 388.3 5.40 10.13 0 388.7 5.40 10.12 0
4194304 1048576 float sum -1 522.7 8.02 15.04 0 521.6 8.04 15.08 0
8388608 2097152 float sum -1 761.2 11.02 20.66 0 757.8 11.07 20.75 0
16777216 4194304 float sum -1 1200.2 13.98 26.21 0 1195.9 14.03 26.30 0
33554432 8388608 float sum -1 1565.5 21.43 40.19 0 1559.6 21.52 40.34 0
67108864 16777216 float sum -1 2724.5 24.63 46.18 0 2727.9 24.60 46.13 0
134217728 33554432 float sum -1 4071.8 32.96 61.80 0 4070.3 32.98 61.83 0
268435456 67108864 float sum -1 7390.4 36.32 68.10 0 7387.7 36.34 68.13 0
536870912 134217728 float sum -1 13605 39.46 73.99 0 13594 39.49 74.05 0
1073741824 268435456 float sum -1 25940 41.39 77.61 0 25985 41.32 77.48 0
```
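
As a sanity check on these numbers: for all_reduce, the NCCL tests derive bus bandwidth from algorithm bandwidth as `busbw = algbw * 2*(n-1)/n`, where `n` is the number of ranks. With `n = 16` the factor is 1.875, and 41.39 GB/s * 1.875 ≈ 77.6 GB/s, which matches the `busbw` column in the last row.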