Issue with StreamingDataset when not using all GPUs on host. #91

Closed
gkroiz opened this issue Apr 3, 2024 · 6 comments · Fixed by #95
Labels
bug (Something isn't working) · help wanted (Extra attention is needed)

Comments

@gkroiz
Contributor

gkroiz commented Apr 3, 2024

🐛 Bug

When initializing a StreamingDataset object, _DistributedEnv.detect() is called (code), and inside detect() there is a world_size check (code). This check fails for my use case, where I am not using all GPUs on a host: world_size is set to 6 while torch.cuda.device_count() returns 8, and since 6 % 8 != 0 the error is raised.
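
To illustrate, the failing check behaves roughly like this (a simplified sketch of the logic described above, not the exact litdata source):

```python
import torch
import torch.distributed as dist

def detect_distributed_env():
    # Simplified sketch of the behaviour described above; not the exact litdata code.
    if dist.is_available() and dist.is_initialized():
        world_size = dist.get_world_size()
        global_rank = dist.get_rank()
    else:
        world_size, global_rank = 1, 0

    device_count = torch.cuda.device_count()
    if world_size > 1 and device_count > 0 and world_size % device_count != 0:
        # With world_size=6 and device_count=8 this branch is hit, even though
        # running on 6 of the 8 GPUs of a single node is a valid setup.
        raise RuntimeError(
            f"world_size ({world_size}) is not divisible by the number of "
            f"visible CUDA devices ({device_count})."
        )

    num_nodes = max(1, world_size // device_count) if device_count > 0 else 1
    return world_size, global_rank, num_nodes
```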

Perhaps this check failing is the intended behavior, but I do not know enough about the litdata repository to understand why the code should raise an error when world_size % device_count != 0. I would imagine that torch already runs its own checks when setting up a distributed environment, so this check should not be needed unless it serves a particular purpose for StreamingDataset or another object. Any insight on (1) whether this check is needed and (2) why it is needed would be great!

To Reproduce

This issue came up when experimenting with LitGPT. I don't think this section or the following sections are necessary, so I will leave them blank for now. Please let me know if any of them would help and I can add more detail.

Code sample

Not needed at the moment.

Expected behavior

Not needed at the moment.

Environment

Not needed at the moment.

Additional context

@gkroiz added the bug and help wanted labels on Apr 3, 2024

github-actions bot commented Apr 3, 2024

Hi! Thanks for your contribution, great first issue!

@tchaton
Collaborator

tchaton commented Apr 4, 2024

Hey @gkroiz, do you want to submit a fix for it?

@gkroiz
Contributor Author

gkroiz commented Apr 4, 2024

I realized that when using _DistributedEnv.detect() in a multi-node environment where you are not using all available GPUs, it is not possible (at least with torch) to determine the number of nodes. I will create a PR that mentions this.

@gkroiz
Contributor Author

gkroiz commented Apr 4, 2024

After putting some more thought here, a few things:

  1. It makes sense that _DistributedEnv.detect() produces the error I was seeing, since detect() does not have the information it needs from PyTorch's APIs to determine the node count. For example, if my world_size is 4 (meaning I'm using 4 GPUs) but torch.cuda.device_count() = 8, how can PyTorch tell whether there is 1 node with 4 idle GPUs versus 4 nodes each using only 1 GPU? There is too much ambiguity here, so it is reasonable for the code to raise an error. Perhaps leaving this as is makes the most sense, or perhaps adding a few lines of additional comments.

  2. Currently, users of StreamingDataset automatically go through _DistributedEnv.detect() (ref) unless they edit the source code. Maybe it makes sense to provide some optionality so that users can pass a _DistributedEnv object when initializing a StreamingDataset, for cases where _DistributedEnv.detect() is insufficient, such as multi-CPU setups or when not all available GPUs are used. A rough sketch of what that could look like follows below.
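
For illustration, something along these lines (the `env` keyword argument and the exact `_DistributedEnv` constructor signature shown here are assumptions, not the current API):

```python
from litdata import StreamingDataset
from litdata.utilities.env import _DistributedEnv  # import path assumed

# Hypothetical sketch only: StreamingDataset does not currently accept an
# `env` argument; this just illustrates the optionality proposed in point 2.
env = _DistributedEnv(world_size=6, global_rank=0, num_nodes=1)  # signature assumed
dataset = StreamingDataset(input_dir="s3://my-bucket/my-optimized-dataset", env=env)
```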

What are your thoughts here, @tchaton?

@tchaton
Collaborator

tchaton commented Apr 4, 2024

@gkroiz Lots of great points. Maybe we could have the following behaviour to unblock you.

By default, we raise an error because we don't have enough information. However, if the user provides extra environment variables (see https://github.com/Lightning-AI/pytorch-lightning/tree/master/src/lightning/fabric/plugins/environments), we could accept the execution.
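
For example, a fallback along these lines (the NNODES variable name is just an illustration here, not necessarily what the linked plugins define):

```python
import os
import torch

def resolve_num_nodes(world_size: int) -> int:
    # Illustrative fallback only; the NNODES variable name is an assumption,
    # not something the linked Lightning environment plugins are guaranteed to set.
    num_nodes = os.environ.get("NNODES")
    if num_nodes is not None:
        return int(num_nodes)

    device_count = torch.cuda.device_count()
    if device_count > 0 and world_size % device_count == 0:
        return world_size // device_count

    # Not enough information: keep today's behaviour and fail loudly.
    raise RuntimeError(
        "Cannot infer the number of nodes. Set the NNODES environment variable "
        "or pass a _DistributedEnv explicitly."
    )
```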

Passing a _DistributedEnv is also a good option.

Do you want to give either of them a try?

@gkroiz
Contributor Author

gkroiz commented Apr 5, 2024

Yes, I can put together a PR with one of the solutions when I get the time.
