Problems encountered when using multiple GPUs for training #95

Open
123456789live opened this issue Dec 3, 2021 · 1 comment

@123456789live

Dear author, I encountered the following error when training with two GPUs. How can I solve it?
(zq) omnisky@node01:/data01/zq/CaDDN/tools$ python -m torch.distributed.launch --nproc_per_node=2 train.py --launcher pytorch --batch_size 2 --cfg_file cfgs/kitti_models/CaDDN.yaml


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Traceback (most recent call last):
  File "train.py", line 197, in <module>
    main()
  File "train.py", line 72, in main
    assert args.batch_size % total_gpus == 0, 'Batch size should match the number of gpus'
AssertionError: Batch size should match the number of gpus
Traceback (most recent call last):
  File "train.py", line 197, in <module>
    main()
  File "train.py", line 72, in main
    assert args.batch_size % total_gpus == 0, 'Batch size should match the number of gpus'
AssertionError: Batch size should match the number of gpus
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/omnisky/zq/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/omnisky/zq/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/omnisky/zq/bin/python3', '-u', 'train.py', '--local_rank=1', '--launcher', 'pytorch', '--batch_size', '2', '--cfg_file', 'cfgs/kitti_models/CaDDN.yaml']' returned non-zero exit status 1.
(zq) omnisky@node01:/data01/zq/CaDDN/tools$ python -m torch.distributed.launch --nproc_per_node=2 train.py --launcher pytorch --batch_size 2 --cfg_file cfgs/kitti_models/CaDDN.yaml^C

@fgqile

fgqile commented Jan 19, 2022

I met it too.
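
For anyone else hitting this, the assertion in the traceback only checks that `args.batch_size` is divisible by `total_gpus`. The sketch below reproduces that arithmetic in isolation. The helper names `per_gpu_batch_size` and `failing_gpu_counts` are hypothetical and are not part of CaDDN; only the assert expression and its message are taken from the traceback.

```python
def per_gpu_batch_size(batch_size: int, total_gpus: int) -> int:
    """Apply the same divisibility check that appears in the traceback."""
    assert batch_size % total_gpus == 0, 'Batch size should match the number of gpus'
    return batch_size // total_gpus


def failing_gpu_counts(batch_size: int, max_gpus: int) -> list:
    """Hypothetical helper: GPU counts in 1..max_gpus for which the check fails."""
    return [n for n in range(1, max_gpus + 1) if batch_size % n != 0]


# With --batch_size 2 the check passes for 1 or 2 GPUs, so a failure means
# total_gpus held some other value inside train.py.
print(per_gpu_batch_size(2, 2))   # 1
print(failing_gpu_counts(2, 8))   # [3, 4, 5, 6, 7, 8]
```

Since `2 % 2 == 0`, the failure suggests that `total_gpus` was not 2 in the failing processes, even though `--nproc_per_node=2` was passed. One possible explanation, which is an assumption because the traceback does not show how `total_gpus` is computed, is that the script counts every GPU visible on the machine. If so, it may help to compare against `torch.cuda.device_count()`, limit visibility with `CUDA_VISIBLE_DEVICES=0,1` before the launch command, or pick a `--batch_size` that is a multiple of the visible GPU count.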
