
Distributed computing (eg, multi-GPU) support #13

Open
erensezener opened this issue Aug 27, 2021 · 7 comments

Comments

@erensezener

I see that there is some code supporting multi-GPU training, e.g. here and here.

However, I don't see an option or flag to actually enable distributed computing. Could you clarify?

Thank you.

@fabrahman

@erensezener Did you figure this out? :)

@fabrahman

@gizacard Would you mind providing some instructions on this? Which options should be set? Thanks!

@fabrahman

@gizacard I wanted to train with multiple GPUs (4 GPUs), so I set local_rank=0 and the following environment variables:

RANK=0
NGPU=4
WORLD_SIZE=4

Although I am not running a Slurm job, the code here requires me to set MASTER_ADDR and MASTER_PORT as well. Why? Anyway, I set them to my server's IP and a free port.

After setting these parameters, when I run the code, the training never starts, though without distributed training (single GPU) it works fine.

Can you tell me whether I am doing this correctly? Thanks!
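
(For anyone reproducing this setup: a minimal single-node sketch of the environment, with illustrative values. Note that torch.distributed.launch assigns a distinct RANK to every process it spawns, so exporting a fixed RANK=0 yourself while WORLD_SIZE=4 can leave init_process_group waiting forever for the other ranks.)

# Illustrative single-node setup for 4 GPUs; NGPU follows the launch command
# shown later in this thread, the other values are placeholders.
export NGPU=4
export MASTER_ADDR=localhost   # single node: rendezvous on the local machine
export MASTER_PORT=29500       # any free port
# Do not export RANK/WORLD_SIZE by hand; torch.distributed.launch sets them per process.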

@gowtham1997

NGPU=<num of gpus in one node> python -m torch.distributed.launch --nproc_per_node=<num of gpus in one node> train_reader.py \
        --use_checkpoint \
        --lr 0.00005 \
        --optim adamw \
        --scheduler linear \
        --weight_decay 0.01 \
        --text_maxlength 250 \
        --per_gpu_batch_size <bs> \
        --n_context 100 \
        --total_step 15000 \
        --warmup_step 1000 \
        --train_data open_domain_data/NQ/train.json \
        --eval_data open_domain_data/NQ/dev.json \
        --model_size base \
        --name testing_base_model_nq \
        --checkpoint_dir pretrained_models \
        --accumulation_steps <steps>

Something like this worked for me.
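
For concreteness, a filled-in version of that command on a single node with 4 GPUs could look like the following; the batch size and accumulation steps are illustrative values, not tuned settings:

NGPU=4 python -m torch.distributed.launch --nproc_per_node=4 train_reader.py \
        --use_checkpoint \
        --lr 0.00005 \
        --optim adamw \
        --scheduler linear \
        --weight_decay 0.01 \
        --text_maxlength 250 \
        --per_gpu_batch_size 1 \
        --n_context 100 \
        --total_step 15000 \
        --warmup_step 1000 \
        --train_data open_domain_data/NQ/train.json \
        --eval_data open_domain_data/NQ/dev.json \
        --model_size base \
        --name testing_base_model_nq \
        --checkpoint_dir pretrained_models \
        --accumulation_steps 4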

@Duemoo

Duemoo commented Nov 27, 2022

@fabrahman Could you provide an update on this issue? I have exactly the same problem, and I found that the code freezes without any error message after executing line 194 of train_reader.py.

"After setting these parameters, when I run the code, the training never starts, though without distributed training (single GPU) it works fine."

@szh-max

szh-max commented Jul 11, 2023

@Duemoo I also encountered this problem: using multiple GPUs, the code freezes without any error message after executing line 194 of train_reader.py. How did you solve it? Thanks!
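
(A debugging note for anyone else stuck at this point: the variables below are standard PyTorch/NCCL switches, not specific to this repo. They make the hung processes log what they are waiting on, which usually points at either a rank that never joined or an NCCL transport problem like the one addressed in the next comment.)

# Enable verbose logging, then re-run the launch command from this thread
export NCCL_DEBUG=INFO                  # NCCL logs its transport/topology choices
export TORCH_DISTRIBUTED_DEBUG=DETAIL   # extra consistency checks in recent PyTorch versions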

@bobbyfyb


@szh-max Hi, I also encountered this problem. I solved it by updating torch to torch==1.10.0 and setting export NCCL_P2P_DISABLE=1, following this thread:
https://discuss.pytorch.org/t/distributed-data-parallel-freezes-without-error-message/8009/29
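
A sketch of that workaround in shell form, assuming a pip-managed environment (the torch version and the NCCL variable are exactly the ones from the comment above):

# Upgrade PyTorch and disable NCCL peer-to-peer transfers, then re-run the
# torch.distributed.launch command shown earlier in this thread.
pip install torch==1.10.0
export NCCL_P2P_DISABLE=1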
