Multi-GPU problem: run hangs when --ntasks-per-node is larger than 1 #170
-
I'm really struggling to run distributed training. My batch script looks like this:

```bash
#!/bin/bash
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
echo MASTER_ADDR=$MASTER_ADDR
python ../mace/scripts/run_train.py
```

The log file says `CUDA_VISIBLE_DEVICES=0,1`, and then no more MACE logs are written; I've waited for ~45 min and still nothing has happened. If I set `--ntasks-per-node=1`, training runs fine. What could be the reason for this problem? Thanks in advance.
Replies: 2 comments
-
Hi - for distributed training, you'll need to launch the script with `srun`, so the line in your batch file should be changed to `srun python /path/to/run_train.py [args...]`. Can you give that a try with `--ntasks-per-node` set to 2?

If you still have trouble, it may be due to the environment variables you're setting. The distributed environment will be configured automatically, so you shouldn't need to manually set the port/address/world size. (For reference, there's a Slurm template in the multi-GPU branch under `scripts/distributed_example.sbatch`.)

Data-wise, both HDF5 and XYZ will work; the former might be faster.
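A minimal sketch of such a batch file is below, assuming one node with two GPUs. The `#SBATCH` resource values and the `[args...]` placeholder are illustrative, not copied from the actual `scripts/distributed_example.sbatch` template:

```bash
#!/bin/bash
# Illustrative Slurm batch file for MACE distributed training (not the official
# template -- see scripts/distributed_example.sbatch on the multi-GPU branch).
#SBATCH --job-name=mace-multi-gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2        # one task per GPU
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --time=24:00:00

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

# No manual exports of MASTER_ADDR/MASTER_PORT/WORLD_SIZE here: the distributed
# environment is set up automatically when the script is launched via srun.
srun python /path/to/run_train.py [args...]
```

The key difference from a single-GPU script is the `srun` prefix, which launches one training process per task so that each GPU gets its own rank.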
-
Thanks, it works!