Multi gpu training; RuntimeError: [...] LightningModule has parameters that were not used in producing the loss returned by training_step. #8

kunibald413 · 2023-11-22T09:09:17Z

I added devices="auto" in train.py to use multiple GPUs:

    trainer: Trainer = hydra.utils.instantiate(cfg.trainer,
                                               callbacks=callbacks,
                                               logger=logger,
                                               devices="auto")

Training terminates shortly after start with this error:

RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.

Adding strategy='ddp_find_unused_parameters_true' to the trainer instantiation fixes it (all GPUs are used):

    trainer: Trainer = hydra.utils.instantiate(cfg.trainer,
                                               callbacks=callbacks,
                                               logger=logger,
                                               devices="auto",
                                               # for multi device training
                                               strategy='ddp_find_unused_parameters_true'
                                               )

batch_idx is not used when computing the loss returned by def training_step(self, batch: Any, batch_idx: int): in baselightingmodule.py,
but I'm not sure about the consequences or effect on training and quality. Opening this issue for visibility.
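
For anyone who wants to check which weights trigger this error: DDP's check concerns model parameters (tensors registered on the module) that receive no gradient from the loss, so an unused function argument such as batch_idx would not cause it by itself. The sketch below is a plain PyTorch illustration, not code from this repository. ToyModel and find_unused_parameters are hypothetical names made up for the example. It runs one backward pass on a single process and lists the parameters whose .grad is still None.

```python
import torch
from torch import nn


class ToyModel(nn.Module):
    """Hypothetical stand-in for the real model; `unused` never touches the loss."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 1)
        self.unused = nn.Linear(4, 1)  # registered, but skipped in forward()

    def forward(self, x):
        return self.used(x)


def find_unused_parameters(model, loss):
    """Return names of trainable parameters that received no gradient from `loss`."""
    model.zero_grad(set_to_none=True)  # clear stale gradients so None means "unused"
    loss.backward()
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.grad is None
    ]


model = ToyModel()
loss = model(torch.randn(8, 4)).mean()
unused_names = find_unused_parameters(model, loss)
print(unused_names)
```

Running the same kind of check on the real LightningModule (single GPU, one batch) should show which submodules are skipped. As far as I know, find_unused_parameters=True mainly adds some per-step overhead because DDP has to traverse the autograd graph, and parameters that never receive a gradient are simply not updated.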


egorsmkv · 2024-01-01T17:09:15Z

Having the same issue with multiple GPUs

p0p4k · 2024-01-01T17:40:27Z

There is a ddp config in the config folder that sets the DDP strategy for multi-GPU training. As for the consequences of unused parameters, I am not well versed in that at the moment.

skypro1111 · 2024-01-11T00:44:47Z

In the file /configs/trainer/ddp.yaml, set strategy: ddp_find_unused_parameters_true

defaults:
  - default

strategy: ddp_find_unused_parameters_true

accelerator: gpu
devices: [0,1,2]
num_nodes: 1
sync_batchnorm: True

In the file /configs/train.yaml, set trainer: ddp

# @package _global_

# specify here default configuration
# order of defaults determines the order in which configs override each other
defaults:
  - _self_
  - data: ljspeech
  - model: pflow
  - callbacks: default
  - logger: tensorboard # set logger here or use command line (e.g. `python train.py logger=tensorboard`)
  - trainer: ddp
  - paths: default
  - extras: default
  - hydra: default

SAnsAN-9119 · 2024-08-16T13:56:09Z

In the file /configs/trainer/ddp.yaml, set strategy: ddp_find_unused_parameters_true

defaults:
  - default

strategy: ddp_find_unused_parameters_true

accelerator: gpu
devices: [0,1,2]
num_nodes: 1
sync_batchnorm: True

In the file /configs/train.yaml, set trainer: ddp

# @package _global_

# specify here default configuration
# order of defaults determines the order in which configs override each other
defaults:
  - _self_
  - data: ljspeech
  - model: pflow
  - callbacks: default
  - logger: tensorboard # set logger here or use command line (e.g. `python train.py logger=tensorboard`)
  - trainer: ddp
  - paths: default
  - extras: default
  - hydra: default

This solution also helped me when running on multiple GPUs

archei2500 mentioned this issue Nov 4, 2024

Quick run in Google Colab doesn't work #48

Closed
