Issue with StreamingDataset when not using all GPUs on host. #91
Comments
Hi! Thanks for your contribution! Great first issue!
Hey gkroiz, do you want to submit a fix for it?
I realized that when using
After putting some more thought here, a few things:
What are your thoughts here, @tchaton?
@gkroiz Lots of great comments. Maybe we could have the following behaviour to unblock you. By default, we raise an error, as we don't have enough information. However, if the user provides extra environment variables (see https://github.com/Lightning-AI/pytorch-lightning/tree/master/src/lightning/fabric/plugins/environments), we could accept the execution. Passing a Do you want to give any of them a try?
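One way to read the proposal above, as a rough sketch only: keep the strict divisibility check by default, but trust the launcher when it reports how many processes run on each node. The function name `infer_num_nodes` is hypothetical, and reading `LOCAL_WORLD_SIZE` (a variable that `torchrun` exports) is an assumption; the comment does not name specific variables, and this is not litdata's implementation.

```python
import os
from typing import Mapping, Optional


def infer_num_nodes(
    world_size: int,
    device_count: int,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Hypothetical helper: work out the node count for a distributed run."""
    env = os.environ if env is None else env

    # Assumption: the launcher exports LOCAL_WORLD_SIZE (torchrun does).
    local_world_size = env.get("LOCAL_WORLD_SIZE")
    if local_world_size is not None:
        procs_per_node = int(local_world_size)
        if procs_per_node <= 0 or world_size % procs_per_node != 0:
            raise RuntimeError(
                "The world size should be divisible by LOCAL_WORLD_SIZE."
            )
        return world_size // procs_per_node

    # Default: not enough information, so keep the strict check.
    if device_count > 0 and world_size % device_count != 0:
        raise RuntimeError(
            "The world size should be divisible by the number of GPUs."
        )
    return max(world_size // device_count, 1) if device_count > 0 else 1
```

With this sketch, `infer_num_nodes(6, 8, {"LOCAL_WORLD_SIZE": "6"})` is accepted as a single node, while `infer_num_nodes(6, 8, {})` still raises because nothing explains the mismatch.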
Yes, I can put together a PR with one of the solutions when I get the time.
🐛 Bug
When initializing a `StreamingDataset` object, `_DistributedEnv.detect()` is called (code), and in the `detect()` function there is a `world_size` check (code). This check fails for my use case, where I am not using all GPUs on a host: `world_size` is set to 6, but `torch.cuda.device_count()` returns 8. Since `6 % 8 != 0`, the error is raised.

Perhaps this check failure is the intended behavior, but I do not know enough about the `litdata` repository to understand why the code should raise an error when `world_size % device_count != 0`. I would imagine that torch runs various checks when setting up a torch distributed environment, so this check would not be needed unless it serves a particular purpose for `StreamingDataset()` or another object. If there is any insight here on (1) whether this check is needed and (2) why it is needed, that would be great!

To Reproduce
This issue came up when experimenting with LitGPT. I don't think this section or the following sections are necessary, so I will leave them blank for now. Please let me know if any of these sections would help and I can add more description.
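For readers following along, the arithmetic behind the failure can be sketched without any GPUs. This is a stand-in written for illustration, not the litdata source: `check_world_size` is a hypothetical name, and the values 6 and 8 are the ones reported in this issue.

```python
def check_world_size(world_size: int, device_count: int) -> None:
    """Hypothetical stand-in for the divisibility check described in the issue."""
    if device_count > 0 and world_size % device_count != 0:
        raise RuntimeError(
            f"world_size={world_size} is not divisible by device_count={device_count}"
        )


# 6 processes on a host where torch.cuda.device_count() reports 8 GPUs.
try:
    check_world_size(world_size=6, device_count=8)
except RuntimeError as err:
    print(f"check failed: {err}")

# Using all 8 GPUs on the host passes the check.
check_world_size(world_size=8, device_count=8)
```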
Code sample
Not needed at the moment.
Expected behavior
Not needed at the moment.
Environment
Not needed at the moment.
Additional context