Batch device print statement shows only device 0 on multi-GPU setup #11

sasank-desaraju · 2022-10-31T04:07:47Z

When running on HPG, the print output that reports "Training batch is on device _" only ever reads "device 0". Does this mean the output is missing the computations on the other GPU (i.e. "device 1"), which the program reports as present at startup, or is the second GPU not being used during training for some reason?

sasank-desaraju · 2022-10-31T04:11:42Z

Oh what lol.
In Wandb, which is where I was checking, there are actually two runs going on. The second one has less information in its logs; it only says which device the operations are on, and both training and validation are on "device 1", which is the second GPU. Cool! (I think).

sasank-desaraju · 2022-10-31T04:17:32Z

Okay, now I can't find those initial logs, but both runs show that all batches are on device 1. A question for another day...
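
One way to make these logs easier to tell apart is to tag each print with the rank of the process that wrote it. The sketch below is only an illustration and rests on assumptions the thread does not confirm: that training runs one process per GPU, and that the launcher sets the `LOCAL_RANK` and `WORLD_SIZE` environment variables (as `torchrun` does). The helper name `describe_batch_device` is hypothetical and not part of the project.

```python
import os


def describe_batch_device(device, stage="Training", env=None):
    """Build a log line naming both the process rank and the batch device.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    # Assumption: the launcher exports LOCAL_RANK and WORLD_SIZE.
    # Fall back to "?" so the message still prints in a single-process run.
    local_rank = env.get("LOCAL_RANK", "?")
    world_size = env.get("WORLD_SIZE", "?")
    return f"[rank {local_rank} of {world_size}] {stage} batch is on device {device}"


if __name__ == "__main__":
    # In a real training step, `device` would come from the batch tensor.
    print(describe_batch_device("cuda:0"))
```

With a line like this, a log that only ever shows one device can be matched to the process that produced it, which should help tell "the second GPU is idle" apart from "I am looking at one process's output".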
