Batch device print statement shows only device 0 on multi-GPU setup #11

Open
sasank-desaraju opened this issue Oct 31, 2022 · 2 comments

Comments

@sasank-desaraju
Contributor

When running on HPG, the print output that reports "Training batch is on device _" only ever reads "device 0". Are we missing the computations on the other GPU (i.e. "device 1"), which the program reports as present at startup, or is the second GPU not being used during training for some reason?

@sasank-desaraju
Contributor Author

Oh what lol.
In Wandb, which is where I was checking, there are actually two runs going on. The second one has less information in its logs; it just says which device the operations are on, and both training and validation are on "device 1", which is the second GPU. Cool! (I think).

@sasank-desaraju
Contributor Author

Okay, now I can't find those initial logs, but both runs just show that all batches are on device 1. A question for another day...
