Fix uneven batches in distributed dataloading #237
Conversation
chunk_size=190,
- num_workers=4,
+ num_workers=1,  # TODO: Want 4 here, but optimize() has deletion race condition
Looks like everywhere in the tests we use num_workers=1. Here I wanted 4, but there seems to be a race condition in the copying/deletion of chunks, causing this test to fail because of missing chunks.
__________________ test_dataset_resume_on_future_chunks[True] __________________
shuffle = True
tmpdir = local('/tmp/pytest-of-runner/pytest-0/test_dataset_resume_on_future_0')
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7f6a4124f460>
@pytest.mark.skipif(sys.platform == "win32", reason="Not tested on windows and MacOs")
@mock.patch.dict(os.environ, {}, clear=True)
@pytest.mark.timeout(60)
@pytest.mark.parametrize("shuffle", [True, False])
def test_dataset_resume_on_future_chunks(shuffle, tmpdir, monkeypatch):
"""This test is constructed to test resuming from a chunk past the first chunk, when subsequent chunks don't have
the same size."""
s3_cache_dir = str(tmpdir / "s3cache")
optimize_data_cache_dir = str(tmpdir / "optimize_data_cache")
optimize_cache_dir = str(tmpdir / "optimize_cache")
data_dir = str(tmpdir / "optimized")
monkeypatch.setenv("DATA_OPTIMIZER_DATA_CACHE_FOLDER", optimize_data_cache_dir)
monkeypatch.setenv("DATA_OPTIMIZER_CACHE_FOLDER", optimize_cache_dir)
> optimize(
fn=_simple_preprocess,
inputs=list(range(8)),
output_dir=data_dir,
chunk_size=190,
num_workers=4,
num_uploaders=1,
copying /tmp/pytest-of-runner/pytest-0/test_dataset_resume_on_future_0/optimize_cache/chunk-3-1.bin to /tmp/pytest-of-runner/pytest-0/test_dataset_resume_on_future_0/optimized/chunk-3-1.bin
putting /tmp/pytest-of-runner/pytest-0/test_dataset_resume_on_future_0/optimize_cache/chunk-3-1.bin on the remove queue
Worker 1 is done.
Worker 2 is done.
Worker 3 is done.
Worker 0 is done.
Workers are finished.
----------------------------- Captured stderr call -----------------------------
Progress: 0%| | 0/8 [00:00<?, ?it/s]Process Process-85:1:
Traceback (most recent call last):
File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/runner/work/litdata/litdata/src/litdata/processing/data_processor.py", line 259, in _upload_fn
shutil.copy(local_filepath, output_filepath)
File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/shutil.py", line 427, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/shutil.py", line 264, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pytest-of-runner/pytest-0/test_dataset_resume_on_future_0/optimize_cache/chunk-0-0.bin'
Progress: 100%|██████████| 8/8 [00:00<00:00, 122.77it/s]
=========================== short test summary info ============================
FAILED tests/streaming/test_dataset.py::test_dataset_resume_on_future_chunks[True] - RuntimeError: All the chunks should have been deleted. Found ['chunk-0-1.bin']
====== 1 failed, 191 passed, 8 skipped, 11 warnings in 247.94s (0:04:07) =======
Awesome work @awaelchli!
Fixes #233
This PR changes and fixes the implementation of how items are assigned to workers.

Before: chunks are first assigned to ranks, then samples from each rank are assigned to its workers.
Now: samples are assigned directly across the combined world size of all workers and ranks.

This allows us to correctly apply `drop_last` and ensure that each rank returns the same amount of data. However, this makes the PR a breaking change.

IMPORTANT: This changes the order in which samples are batched and returned. As a consequence, checkpoints created prior to this PR will not resume correctly.
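The new assignment scheme can be sketched as follows. This is a hypothetical helper (not litdata's API): sample indices are distributed round-robin across the combined pool of `world_size * num_workers` workers, and `drop_last` truncates the tail so every worker, and therefore every rank, gets the same count.

```python
def assign_samples(num_samples, world_size, num_workers, drop_last):
    """Sketch: assign sample indices directly across the combined
    worker pool instead of chunks-to-ranks first (hypothetical helper)."""
    total_workers = world_size * num_workers
    if drop_last:
        # Truncate so every global worker receives an equal share.
        num_samples -= num_samples % total_workers
    assignments = [[] for _ in range(total_workers)]
    for i in range(num_samples):
        assignments[i % total_workers].append(i)
    return assignments
```

With `drop_last=True`, all per-worker lists have equal length, which is what guarantees that each rank yields the same number of batches.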
TODOS