
Train Stucked #44

Open

YinengXiong opened this issue Mar 29, 2021 · 4 comments
@YinengXiong

Hi ~
I use

```python
if torch.cuda.device_count() > 1:
    model = torch.nn.DataParallel(model)
    model = bnconvert(model)
model.cuda()
```

to enable sync-BN during multi-GPU training, but the training procedure gets stuck at the final batch of an epoch.

[Screenshot: training log showing the run hanging at the last batch of an epoch]

@vacancy
Owner

vacancy commented Mar 29, 2021

The implementation requires that the module replicas on different devices invoke the batchnorm exactly the SAME number of times in each forward pass. For example, you cannot call the batchnorm on GPU0 but not on GPU1. The #i (i = 1, 2, 3, ...) call of the batchnorm on each device is treated as a group, and its statistics are reduced across devices. This is tricky, but it is a good way to handle PyTorch's dynamic computation graph. Although it sounds complicated, it is usually not an issue for most models.

Can you check this?
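
For illustration, here is a minimal, hypothetical sketch (not taken from this issue) of the kind of forward pass that can violate this requirement under DataParallel; the model and the data-dependent condition are made up for the example:

```python
import torch
import torch.nn as nn

class ConditionalBNModel(nn.Module):
    """Hypothetical model whose forward() may skip a BatchNorm call
    depending on the data it receives."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(16)        # would be converted to sync-BN
        self.extra_bn = nn.BatchNorm2d(16)  # only used on some inputs

    def forward(self, x):
        x = self.bn(self.conv(x))
        # Problem: under DataParallel each GPU sees a different slice of the
        # batch, so this condition can be True on GPU0 and False on GPU1.
        # The replicas then call batchnorm a different number of times, the
        # synchronized reduction waits forever, and training appears stuck.
        if x.mean() > 0:
            x = self.extra_bn(x)
        return x
```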

@YinengXiong
Author

So if I want to use a model from e.g. torchvision, which is built with nn.BatchNorm, I should:

0. model = torchvision.models.resnet50()
1. model = bnconvert(model)
2. model = DataParallelWithCallback(model)
3. model.cuda()

Am I right?
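
As a minimal sketch of that ordering, assuming `bnconvert` corresponds to the `convert_model` helper and `DataParallelWithCallback` to the wrapper shipped in this repository's `sync_batchnorm` package:

```python
import torch
import torchvision
from sync_batchnorm import convert_model, DataParallelWithCallback

# 0. Build a model that uses plain nn.BatchNorm layers.
model = torchvision.models.resnet50()

# 1. Replace every nn.BatchNorm layer with its synchronized counterpart.
model = convert_model(model)

# 2. Wrap the converted model so the sync-BN callbacks run on replication.
if torch.cuda.device_count() > 1:
    model = DataParallelWithCallback(model)

# 3. Move the (wrapped) model to the GPUs.
model = model.cuda()
```

Converting before wrapping ensures that every replica created by the data-parallel wrapper already contains the synchronized batchnorm layers.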

@vacancy
Owner

vacancy commented Apr 7, 2021

Correct. I suspect the reason for the original hang is the requirement described in my earlier comment: each replica must invoke the batchnorm the same number of times in each forward pass.

@YinengXiong
Author

Thanks a lot!
