Hi,
I am working on one of the projects that extend MMF. When I run it with the command below, I get the following error. I have also encountered this error in other frameworks that extend Pythia.
Command for running:
python -m torch.distributed.launch --nproc_per_node 1 tools/run.py --pretrain --tasks vqa --datasets m4c_textvqa --model m4c_split --seed 13 --config configs/vqa/m4c_textvqa/tap_base_pretrain.yml --save_dir save/m4c_split_pretrain_test training_parameters.distributed True
Error:
////////////////
I set up the environment as follows:
python=3.8
PyTorch and CUDA installed with the command:
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
GPU: 1 GeForce RTX 3090 (24 GB of GPU memory)
/////////////////
Could you help me solve this problem?
Is this error caused by using only 1 GPU?
Do I need to change the initial value of some parameters (like local_rank)?
Could this error be caused by a lack of GPU memory?
Solving this problem is very important to me, and I would be very grateful for any guidance.
Hello, it is hard to find the root cause from these logs, as anything that causes the child process to crash would produce this error. Often this is caused by running out of RAM or GPU RAM, so one quick check would be to lower the batch size and see whether that stops the issue. Otherwise, please try to get a traceback and share it here.
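As a rough illustration of how one might collect that traceback, here is a minimal sketch using only the Python standard library. It runs a command as a child process, captures its stderr, and reports how the child ended. The helper name `run_and_report` and the stand-in child command are hypothetical and not part of MMF or PyTorch; in practice you would pass your own training command as the list. On POSIX systems a negative return code means the child was killed by a signal. A SIGKILL with no Python traceback is often (though not always) a sign that the host ran out of RAM, whereas running out of GPU memory usually shows up as a Python exception in the captured stderr.

```python
import signal
import subprocess
import sys


def run_and_report(cmd, tail_lines=20):
    """Run cmd as a child process (hypothetical helper).

    Returns (returncode, description, last lines of the child's stderr).
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode < 0:
        # On POSIX, a negative return code means the child was killed by a signal.
        sig_name = signal.Signals(-proc.returncode).name
        description = f"killed by {sig_name}"
    elif proc.returncode == 0:
        description = "exited cleanly"
    else:
        description = f"exited with code {proc.returncode}"
    # The end of stderr is where a Python traceback normally appears.
    tail = "\n".join(proc.stderr.splitlines()[-tail_lines:])
    return proc.returncode, description, tail


if __name__ == "__main__":
    # Stand-in child that raises; replace this list with the real training command.
    code, description, tail = run_and_report(
        [sys.executable, "-c", "raise RuntimeError('simulated crash')"]
    )
    print(description)
    print(tail)
```

If the captured tail contains a traceback, that is the part worth sharing in the issue. If it is empty and the child was killed by a signal, lowering the batch size is a reasonable next check.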