Resuming Training w/ Streaming Dataset on DDP with Multiple Nodes Fails #248
Comments
Hi! Thanks for your contribution! Great first issue!
Hey @schopra8, thanks for reporting the issue. Yes, we are aware of it :) and are already looking into it. If we fix it next week, we will make a new release. cc @awaelchli. Unfortunately, the fix is coming as a series of changes (#237) that aren't backward compatible: we won't be able to load old checkpoints because the core logic has changed too much.
np! thanks for the heads up
Also wanted to flag: I tried resuming on 1 node with N devices. Everything worked for the first couple hundred steps, but then I hit the same error. So it looks like there is a similar issue with DDP on 1 node as well.
Hey @schopra8, here are the release notes: https://github.com/Lightning-AI/litdata/releases/tag/v0.2.17. Would you mind trying again with the latest version, 0.2.17? Unfortunately, old checkpoints won't work.
Awesome! I'll try in the next 1-2 days and report back my results.
Thanks @schopra8.
🐛 Bug
We trained a model for several epochs on multiple nodes, and we wanted to continue training with PyTorch Lightning and LitData.
✅ When we resume training on a single device, resumption works as expected.
✅ When we resume training on a single node with N devices, resumption works as expected.
❌ When we resume training on multiple nodes with N devices, resumption fails.
To Reproduce
Run trainer.fit with an existing checkpoint with DDP on multiple devices.
Stack trace:
Code sample
I've scrubbed my code below --
Expected behavior
Resume training on multiple nodes
Environment
Install method (conda, pip, source): poetry