Reset state_dict after resume #330
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@          Coverage Diff          @@
##           main    #330    +/-   ##
=====================================
  Coverage      ?     79%
=====================================
  Files         ?      34
  Lines         ?    4989
  Branches      ?       0
=====================================
  Hits          ?    3949
  Misses        ?    1040
  Partials      ?       0
Hey @vgurev. Great catch! Can you add a test?
OK, I will.
Hi @vgurev, thanks for reporting the bug and getting started on it!

Test script

Create Optimized Dataset

    from litdata import optimize

    def random_data(index):
        return index

    if __name__ == "__main__":
        optimize(
            fn=random_data,
            inputs=list(range(100)),
            output_dir="my_optimized_dataset",
            num_workers=4,
            chunk_bytes="64MB",
        )

    from litdata import StreamingDataLoader, StreamingDataset

    dataset = StreamingDataset("my_optimized_dataset")
    dataloader = StreamingDataLoader(dataset, batch_size=4, num_workers=2)

    # Stop partway through the first epoch.
    for batch_idx, batch in enumerate(dataloader):
        if batch_idx == 10:
            break
    assert dataset._state_dict is None

    # Simulate a resume by loading the loader's own state.
    dataloader.load_state_dict(dataloader.state_dict())
    assert dataset._state_dict is not None

    # After the resumed epoch completes, the state should be cleared.
    for batch_idx, batch in enumerate(dataloader):
        pass
    assert dataset._state_dict is None

After testing, I noticed that the state reset issue still occurs when
Thanks again for bringing this to our attention and contributing to the improvements!
Yes, we need to reset it from the DataLoader when StopIteration occurs with num_workers > 0.
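One way to picture the comment above is with a toy loader. With worker processes, each worker iterates over its own copy of the dataset, so a reset performed inside the dataset iterator never reaches the copy held by the main process. The sketch below is only an illustration under that assumption: `ToyDataset`, `ToyLoader`, and the `num_samples_yielded` key are hypothetical names, the worker copy is simulated with `copy.deepcopy`, and none of this is litdata's actual implementation.

```python
import copy


class ToyDataset:
    """Hypothetical dataset that resumes from a saved offset."""

    def __init__(self, n):
        self.n = n
        self._state_dict = None

    def __iter__(self):
        start = 0
        if self._state_dict is not None:
            start = self._state_dict["num_samples_yielded"]
            # This reset only affects the copy being iterated.
            self._state_dict = None
        yield from range(start, self.n)


class ToyLoader:
    """Hypothetical loader that iterates over a copy of its dataset."""

    def __init__(self, dataset):
        self.dataset = dataset

    def load_state_dict(self, state):
        self.dataset._state_dict = state

    def __iter__(self):
        # Simulates the copy of the dataset that a worker process would hold.
        worker_copy = copy.deepcopy(self.dataset)
        yield from worker_copy
        # Reached only once the worker iterator is exhausted, i.e. when
        # StopIteration would be raised: reset the main-process copy too.
        self.dataset._state_dict = None


loader = ToyLoader(ToyDataset(5))
loader.load_state_dict({"num_samples_yielded": 3})
first_epoch = list(loader)   # resumes from offset 3
second_epoch = list(loader)  # starts from the beginning again
```

Note that in this sketch the reset is skipped if the consumer breaks out of the loop early, since the line after `yield from` is never reached in that case.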
Reset of state_dict after resume.
What does this PR do?
Currently, resuming after a restart is triggered by checking that self._state_dict is not None when the dataset iterator is created.
However, self._state_dict is never reset after the resume. At the next epoch, when a new dataset iterator is created, the resume is triggered again from the same state_dict. To fix this bug, I assign None to self._state_dict after the resume.
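The behaviour described above can be sketched with a small stand-in class. This is a minimal illustration, not the code changed in this PR: `ToyResumableDataset` and the `num_samples_yielded` key are hypothetical, and the only point is that the saved state is consumed once and then cleared.

```python
class ToyResumableDataset:
    """Hypothetical stand-in for a dataset that can resume mid-epoch."""

    def __init__(self, items):
        self.items = list(items)
        self._state_dict = None

    def load_state_dict(self, state_dict):
        self._state_dict = state_dict

    def __iter__(self):
        start = 0
        if self._state_dict is not None:
            # Resume from the saved offset.
            start = self._state_dict["num_samples_yielded"]
            # Clear the state so the next epoch starts from the beginning.
            # Without this line, every later epoch would resume from the
            # same offset again.
            self._state_dict = None
        for i in range(start, len(self.items)):
            yield self.items[i]


ds = ToyResumableDataset(range(5))
ds.load_state_dict({"num_samples_yielded": 3})
resumed_epoch = list(ds)  # picks up at offset 3
next_epoch = list(ds)     # full epoch, since the state was cleared
```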
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.
Did you have fun?
Yes