-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First round of ML updates #413
base: ml
Are you sure you want to change the base?
Conversation
Lets create `ami.yml` (see [DLAMI release notes](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html) to get AMI ARN (`ParentImage` in config bellow)): | ||
``` | ||
Build: | ||
SecurityGroupIds: [<insert you SG - in requires outbound traffic>] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you automate this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's automated with pcluster CLI, we could add cloudformation that sets up VPC, subnet, SG and runs pcluster CLI via bash runner (or even better trigger lambda to run it). I though to create short bash script with templating for config yaml file but I don't think it's worth the effort since it's not abstracting anything and adds boilerplate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% chance someone won't know how to find a security group and or even what it is. Provide the steps to create one and retrieve ID or provide the retrieve security group id step.
First let's fetch the assets required to build the image: | ||
|
||
```bash | ||
wget https://ml.hpcworkshops.com/scripts/packer/packer.tar.gz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain the content of the archive before proposing to download.
For sake of clarity, I suggest you make the reader download the 3 files separately. At least they can review the file on github before hands. That's also give an opportunity to reviewer to look at the content of the files that are of this workshop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sean-smith you added this and I just moved it, any specific reason for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this anymore. This can just be:
git clone [email protected]:aws-samples/parallelcluster-efa-gpu-preflight-ami.git
You can install Packer using [Brew](https://brew.sh/) on OSX or Linux as follows: | ||
|
||
```bash | ||
brew install packer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Standardize on Cloud9 or cloudshell..
If I have windows how do I do?
Provide a specific version to prevent regression in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll standardize on Cloud9. Cloudshell storage space is too limited. IMHO most ML devops don't need instructions on how to use the cli. This is different than HPC.
brew install packer | ||
``` | ||
|
||
Alternatively, you can download the Packer binary through the [tool website](https://www.packer.io/). Ensure your `PATH` is set to use the binary or use its absolute path. Once Packer installed, proceed to the next stage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too many options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll remove.
Co-authored-by: mhuguesaws <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left comments
|
||
Now run [pcluster command](https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.build-image-v3.html) that will add all pcluster dependencies to your DLAMI of choice: | ||
``` | ||
pcluster build-image -c ami.yml -i NEW_AMI_ID -r REGION |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
variables...
pcluster build-image -c ami.yml -i NEW_AMI_ID -r REGION | |
pcluster build-image -c ami.yml -i $NEW_AMI_ID -r $AWS_REGION |
@@ -0,0 +1,10 @@ | |||
--- | |||
title: "b. Download, compile and run the NCCL tests" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
title: "b. Download, compile and run the NCCL tests" | |
title: "b. Run the NCCL tests" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed.
```bash | ||
cd ~ | ||
|
||
cat > compile_nccl.sh << EOF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolute path
cat > compile_nccl.sh << EOF | |
cat > ~/compile_nccl.sh << EOF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
Create your job submission script for *OSU Latency* and use **sbatch** to submit your job: | ||
|
||
```bash | ||
cat > nccl_test.sbatch << \EOF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cat > nccl_test.sbatch << \EOF | |
cat > ~/nccl_test.sbatch << EOF |
NCCL_TEST_PATH=${HOME}/nccl-tests/build | ||
MPI_PATH=/opt/amazon/openmpi | ||
|
||
export LD_LIBRARY_PATH=${HOME}/nccl/build/lib:${HOME}/aws-ofi-nccl/install/lib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use module....
#SBATCH --output=nccl.out | ||
|
||
NCCL_TEST_PATH=${HOME}/nccl-tests/build | ||
MPI_PATH=/opt/amazon/openmpi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
module load...
|
||
git clone -b v2.17.1-1 https://github.com/NVIDIA/nccl.git | ||
cd nccl | ||
make -j src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE='-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope cuda version does not change...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's default "system" cuda - new AMIs will have new CUDA on the same path. I can add a note that if they have custom cuda they change this path, I assumed if someone is more advanced to add specific CUDA version they will be familiar with these parameters. But good point, I'll add the note.
|
||
git clone -b v2.13.6 https://github.com/NVIDIA/nccl-tests.git | ||
cd nccl-tests | ||
make MPI=1 CUDA_HOME=/usr/local/cuda MPI_HOME=/opt/amazon/openmpi NCCL_HOME=${HOME}/nccl/build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
module load
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's using OpenMPI form pcluster AMI, no need for IntelMPI to get correct performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
module load openmpi. That's what I said.
export NCCL_DEBUG=INFO | ||
export FI_LOG_LEVEL=1 | ||
|
||
${MPI_PATH}/bin/mpirun --map-by ppr:8:node --rank-by slot \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One the openmpi is loaded no need to have path like this.
export NCCL_DEBUG=INFO | ||
export FI_LOG_LEVEL=1 | ||
|
||
${MPI_PATH}/bin/mpirun --map-by ppr:8:node --rank-by slot \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
${MPI_PATH}/bin/mpirun --map-by ppr:8:node --rank-by slot \ | |
${MPI_PATH}/bin/mpirun --map-by ppr:4:socket \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EbsSettings: | ||
VolumeType: gp3 | ||
Size: 200 | ||
Throughput: 300 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why?
tags : ["Huggingface", "data", "ML", "srun", "slurm"] | ||
--- | ||
|
||
In this section, you will learn how to run script from Huggingface examples with PyTorch FSDP and DDP. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What am I running exactly? a Script? What does it do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left comments.
Co-authored-by: mhuguesaws <[email protected]>
Issue #, if available:
Description of changes:
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.