Multi-GPU problem: run hangs when --ntasks-per-node is larger than 1 #170
-
I'm really struggling to run distributed training. My batch script looks like this:

```bash
#!/bin/bash
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
echo MASTER_ADDR=$MASTER_ADDR
python ../mace/scripts/run_train.py
```

The log file says `CUDA_VISIBLE_DEVICES=0,1`, and then no more MACE logs are written; I've waited for ~45 min and still nothing has happened. If I set `--ntasks-per-node=1`, training runs fine. What could be the reason for this problem? Thanks in advance.
Replies: 2 comments
-
Hi - for distributed training, you'll need to launch the script with `srun`, so the line in your batch file should be changed to `srun python /path/to/run_train.py [args...]`. Can you give that a try with `--ntasks-per-node` set to 2?

If you still have trouble, it may be due to the environment variables you're setting. The distributed environment will be configured automatically, so you shouldn't need to manually set the port/address/world size. (For reference, there's a Slurm template in the multi-GPU branch under `scripts/distributed_example.sbatch`.)

Data-wise, both HDF5 and XYZ will work; the former might be faster.
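A minimal sketch of such a batch file is below, assuming one node with two GPUs. The `#SBATCH` resource values and the `[args...]` placeholder are illustrative, not copied from the actual `scripts/distributed_example.sbatch` template:

```bash
#!/bin/bash
# Illustrative Slurm batch file for MACE distributed training (not the official
# template -- see scripts/distributed_example.sbatch on the multi-GPU branch).
#SBATCH --job-name=mace-multi-gpu
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2        # one task per GPU
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=8
#SBATCH --time=24:00:00

echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

# No manual exports of MASTER_ADDR/MASTER_PORT/WORLD_SIZE here: the distributed
# environment is set up automatically when the script is launched via srun.
srun python /path/to/run_train.py [args...]
```

The key difference from a single-GPU script is the `srun` prefix, which launches one training process per task so that each GPU gets its own rank.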
-
Thanks, it works!