Any example script to run multi-node training for slurm? #1378

wavy-jung · 2024-07-20T10:21:44Z

Hi, I was trying to run multi-node training on slurm nodes but I have no idea how to configure composer arguments and commands.
Is there any example script to run training on slurm nodes with composer?

The text was updated successfully, but these errors were encountered:

dakinggg · 2024-07-21T05:50:02Z

We don't have a slurm example, but here are the environment variables that the composer launcher sets/requires: https://github.com/mosaicml/composer/blob/6d4628a1043d1f118dc38eb359ede5524e0a9aa0/composer/cli/launcher.py#L344-L352. It should just be the normal torch distributed env vars.

And here are the env vars that mcli sets for you: https://docs.mosaicml.com/projects/mcli/en/latest/quick_start/environment.html#runtime-environment-variables

wavy-jung · 2024-07-22T13:26:37Z

Thanks for helping me! @dakinggg
I have configured the environment variables as you provided in the attached link and ran it.
Below is the script I ran for the training:

#!/bin/bash
#SBATCH --job-name=wavy-llmfoundry-test
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem-per-cpu=8G
#SBATCH --gres=gpu:8
#SBATCH --output=slurm-logs/%x-%j.out

GPUS_PER_NODE=8
NNODES=$SLURM_NNODES
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
MASTER_PORT=19963
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
WORK_DIR="/mnt/datafs/ib-a100-cluster-a-pri/lmt/users/wavy/llm-foundry"

export CUDA_DEVICE_MAX_CONNECTIONS=1
export CUDA_LAUNCH_BLOCKING=1
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export NCCL_DEBUG=INFO

export RANK=$NNODES
export WORLD_SIZE=$WORLD_SIZE
export MASTER_ADDR=$MASTER_ADDR
export MASTER_PORT=$MASTER_PORT
export LOCAL_WORLD_SIZE=$GPUS_PER_NODE
export NUM_NODES=$NNODES


export LAUNCHER="composer --world_size $WORLD_SIZE \
    --master_addr $MASTER_ADDR \
    --master_port 19963"

export CMD="$WORK_DIR/scripts/train/train.py \
    $WORK_DIR/scripts/train/yamls/pretrain/llama3-8b.yaml"

srun \
--container-image /mnt/datafs/ib-a100-cluster-a-pri/lmt/images/wavy-llm-foundry-v0.10.0.sqsh \
--container-mounts /mnt/datafs:/mnt/datafs \
--container-workdir $WORK_DIR \
--jobid $SLURM_JOBID \
bash -c "export NODE_RANK=$SLURM_PROCID && $LAUNCHER --node_rank $SLURM_PROCID $CMD \
    save_folder=/mnt/datafs/ib-a100-cluster-a-pri/lmt/users/wavy/checkpoints/composer/llama3-8b-slurm"

However, the error below was thrown:

So I tried with torchrun launcher and it passes the initialization stage, but stuck in the tokenizer building stage like below:

# export LAUNCHER="composer --world_size $WORLD_SIZE \
#     --master_addr $MASTER_ADDR \
#     --master_port 19963"

export LAUNCHER="torchrun \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d "

I would like to ask if there are any potential causes you can think of.
+) The yaml file used for the training is very similar to the example files and so I assume the problem is nothing to do with the yaml file!
+) sqsh image used for the training was built upon the latest version of docker image and added layer with pip install -e ".[gpu]" to setup the llmfoundry.

dakinggg · 2024-07-22T17:39:00Z

Ah looks like an issue on a shared fs (see #1253 (comment) for more discussion of this). I haven't quite finished fixing that yet.

dakinggg · 2024-07-22T18:07:14Z

Could you try this PR? #1381. You may also need composer with this pr mosaicml/composer#3485

wavy-jung · 2024-07-23T01:50:53Z

@dakinggg Thanks! I'll try with those PRs

dmakhervaks · 2024-07-25T18:41:37Z

@dakinggg It seems that 1381 was reverted -> 221d3e2

I tried pulling the latest docker image (mosaicml/llm-foundry:2.3.1_cu121-e882658) but I am still getting this error when trying to run in a multi-node setting:

[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.20.5
[rank7]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
[rank7]: Last error:
[rank7]: socketStartConnect: Connect to 172.20.1.119<43385> failed : Software caused connection abort

Is this expected? Thanks in advance!

dakinggg · 2024-07-25T18:43:24Z

Yes, we will reapply soon, but you can still try with that PR. unhandled error seems different though and suggest your distributed env is not set up correctly

wavy-jung added the enhancement New feature or request label Jul 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Any example script to run multi-node training for slurm? #1378

Any example script to run multi-node training for slurm? #1378

wavy-jung commented Jul 20, 2024

dakinggg commented Jul 21, 2024 •

edited

Loading

wavy-jung commented Jul 22, 2024 •

edited

Loading

dakinggg commented Jul 22, 2024

dakinggg commented Jul 22, 2024 •

edited

Loading

wavy-jung commented Jul 23, 2024

dmakhervaks commented Jul 25, 2024 •

edited

Loading

dakinggg commented Jul 25, 2024

Any example script to run multi-node training for slurm? #1378

Any example script to run multi-node training for slurm? #1378

Comments

wavy-jung commented Jul 20, 2024

dakinggg commented Jul 21, 2024 • edited Loading

wavy-jung commented Jul 22, 2024 • edited Loading

dakinggg commented Jul 22, 2024

dakinggg commented Jul 22, 2024 • edited Loading

wavy-jung commented Jul 23, 2024

dmakhervaks commented Jul 25, 2024 • edited Loading

dakinggg commented Jul 25, 2024

dakinggg commented Jul 21, 2024 •

edited

Loading

wavy-jung commented Jul 22, 2024 •

edited

Loading

dakinggg commented Jul 22, 2024 •

edited

Loading

dmakhervaks commented Jul 25, 2024 •

edited

Loading