
drop_last is not respected #442

Open · robmarkcole opened this issue Jan 3, 2025 · 8 comments · May be fixed by #449
Labels: bug, help wanted

Comments

@robmarkcole (Contributor)

🐛 Bug

I pass:

StreamingDataLoader(
    dataset=dataset,
    batch_size=self.batch_size,
    shuffle=(split == "train"),
    num_workers=self.num_workers,
    collate_fn=self.collate_fn,
    prefetch_factor=self.prefetch_factor,
    persistent_workers=self.persistent_workers,
    multiprocessing_context=self.multiprocessing_context,
    drop_last=(split == "train"),
)

Configuring batch_size=2 and logging the actual batch sizes received, the final batch has a size of 1:

Epoch 0:  97% | 99/102 [00:12<00:00,  8.22it/s, v_num=749d] train size of y:  torch.Size([2, 320, 320])
Epoch 0:  98% | 100/102 [00:12<00:00,  8.27it/s, v_num=749d] train size of y:  torch.Size([1, 320, 320])
Traceback (most recent call last):
  File "/code/lightning_ai/cli.py", line 157, in <module>
    run(obj={})
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/code/lightning_ai/cli.py", line 135, in train
    run_trainer(lightning_cli.model, lightning_cli.trainer, lightning_cli.datamodule)
  File "/code/common/mlclient.py", line 712, in wrapper
    result = func(*args, **kwargs)
  File "/code/lightning_ai/cli.py", line 123, in run_trainer
    lightning_cli.trainer.fit(model, datamodule=datamodule)
  File "/usr/local/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py", line 578, in safe_patch_function
    patch_function(call_original, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py", line 251, in patch_with_managed_run
    result = patch_function(original, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/mlflow/pytorch/_lightning_autolog.py", line 537, in patched_fit
    result = original(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py", line 559, in call_original
    return call_original_fn_with_event_logging(_original_fn, og_args, og_kwargs)
  File "/usr/local/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py", line 494, in call_original_fn_with_event_logging
    original_fn_result = original_fn(*og_args, **og_kwargs)
  File "/usr/local/lib/python3.10/site-packages/mlflow/utils/autologging_utils/safety.py", line 556, in _original_fn
    original_result = original(*_og_args, **_og_kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 538, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 47, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 574, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 981, in _run
    results = self._run_stage()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1025, in _run_stage
    self.fit_loop.run()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 250, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 190, in run
    self._optimizer_step(batch_idx, closure)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 268, in _optimizer_step
    call._call_lightning_module_hook(
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 167, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/core/module.py", line 1307, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/core/optimizer.py", line 153, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 238, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/precision.py", line 122, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/optim/optimizer.py", line 487, in wrapper
    out = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/optim/optimizer.py", line 91, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/optim/adam.py", line 202, in step
    loss = closure()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/plugins/precision/precision.py", line 108, in _wrap_closure
    closure_result = closure()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
    step_output = self._step_fn()
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 317, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 319, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 390, in training_step
    return self.lightning_module.training_step(*args, **kwargs)
  File "/code/common/model/base_lightning_net.py", line 320, in training_step
    logits = self.forward(x)
  File "/code/common/model/base_lightning_net.py", line 282, in forward
    return self.net(x)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/segmentation_models_pytorch/base/model.py", line 30, in forward
    decoder_output = self.decoder(*features)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/segmentation_models_pytorch/decoders/deeplabv3/decoder.py", line 99, in forward
    aspp_features = self.aspp(features[-1])
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/container.py", line 250, in forward
    input = module(input)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/segmentation_models_pytorch/decoders/deeplabv3/decoder.py", line 187, in forward
    res.append(conv(x))
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/segmentation_models_pytorch/decoders/deeplabv3/decoder.py", line 151, in forward
    x = mod(x)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py", line 193, in forward
    return F.batch_norm(
  File "/usr/local/lib/python3.10/site-packages/torch/nn/functional.py", line 2810, in batch_norm
    _verify_batch_size(input.size())
  File "/usr/local/lib/python3.10/site-packages/torch/nn/functional.py", line 2776, in _verify_batch_size
    raise ValueError(
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 256, 1, 1])
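
For context, the failure comes from BatchNorm: in training mode it cannot compute per-channel statistics from a single value. A minimal sketch reproducing the same error with plain PyTorch:

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(256)
bn.train()  # training mode needs more than one value per channel
x = torch.randn(1, 256, 1, 1)  # a size-1 batch with 1x1 spatial extent
bn(x)  # raises ValueError: Expected more than 1 value per channel when training, ...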

To Reproduce

Create a StreamingDataLoader as above with batch_size=2 and drop_last=True, and log the batch sizes received during training.

Expected behavior

With drop_last=True, the final incomplete batch should be dropped rather than yielded.

Additional context

litdata==0.2.34

@robmarkcole added the bug and help wanted labels on Jan 3, 2025
@robmarkcole (Contributor, Author)

This actually appears to be related to num_workers; when I change it from 24 to 4:

    batch_size: 24
    num_workers: 4

No issue now

@bhimrazy (Collaborator) commented Jan 3, 2025

@robmarkcole, I have a quick question as I’m trying to understand the issue better.
Shouldn't this last batch be dropped during the training phase, regardless of the number of workers being used? Or is there a specific condition where this behavior might differ?

@robmarkcole (Contributor, Author)

@bhimrazy yes, it should be dropped due to drop_last=(split == "train"); however, with num_workers: 24 the error ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 256, 1, 1]) is raised.
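
For comparison, a minimal sketch of the expected drop_last semantics with PyTorch's standard DataLoader, which always discards a trailing incomplete batch:

import torch
from torch.utils.data import DataLoader, TensorDataset

# 9 samples with batch_size=2: the trailing batch of 1 is discarded
dataset = TensorDataset(torch.arange(9))
loader = DataLoader(dataset, batch_size=2, drop_last=True)
print([len(b[0]) for b in loader])  # [2, 2, 2, 2] -- no size-1 batch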

@tchaton (Collaborator) commented Jan 14, 2025

@robmarkcole Could you write a simple reproducible example using only integers as data?
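
A minimal sketch of such a reproducer, assuming litdata's optimize / StreamingDataset / StreamingDataLoader API and a hypothetical local output directory; chunk_size=6 mirrors the index.json posted below:

from litdata import optimize, StreamingDataset, StreamingDataLoader

def fn(i):
    return i  # integer-only samples

if __name__ == "__main__":
    # 24 samples stored in chunks of 6, matching the reported test set layout
    optimize(fn=fn, inputs=list(range(24)), output_dir="repro_data", chunk_size=6, num_workers=1)

    dataset = StreamingDataset("repro_data")
    loader = StreamingDataLoader(dataset, batch_size=4, num_workers=4, drop_last=False)
    print(sum(len(b) for b in loader))  # expected 24; fewer means samples were lost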

@tchaton (Collaborator) commented Jan 14, 2025

Or even give us the dataset length, batch_size, and num_workers causing the issue.

@robmarkcole (Contributor, Author) commented Jan 14, 2025

I have a small dataset:

Train dataset length: 200
Val dataset length: 24
Test dataset length: 24  # I am logging how many are processed

batch_size: 2

With

    batch_size: 2
    num_workers: 2

I log Test batch size: 2 and all 24 test images are processed.

batch_size: 4

I have

    batch_size: 4
    num_workers: 4

and there are no errors. I log Test batch size: 4 except for the final two batches, which log Test batch size: 2. Only 20 images are processed, i.e. 4 are missing (see the arithmetic sketch after the index.json below).

Instead with

    batch_size: 4
    num_workers: 2 

I log Test batch size: 4 and all 24 images are processed.

batch_size: 8

When I increase both to

    batch_size: 8
    num_workers: 8

I then get the error ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 256, 1, 1])

Now if I set

    batch_size: 8
    num_workers: 4

I log the batch size and get Test batch size: 6 but ONLY 3 batches (18 images processed).

Note: with num_workers: 1 I get no errors for any batch size, and all 24 images are processed.

Test index.json

{
    "chunks": [
        {
            "chunk_bytes": 19341730,
            "chunk_size": 6,
            "dim": null,
            "filename": "chunk-0-0.bin"
        },
        {
            "chunk_bytes": 20316676,
            "chunk_size": 6,
            "dim": null,
            "filename": "chunk-1-0.bin"
        },
        {
            "chunk_bytes": 20219153,
            "chunk_size": 6,
            "dim": null,
            "filename": "chunk-2-0.bin"
        },
        {
            "chunk_bytes": 20284337,
            "chunk_size": 6,
            "dim": null,
            "filename": "chunk-3-0.bin"
        }
    ],
    "config": {
        "chunk_bytes": 128000000,
        "chunk_size": null,
        "compression": null,
        "data_format": [
            "str",
            "tifffile",
            "tifffile"
        ],
        "data_spec": "[1, {\"type\": \"builtins.dict\", \"context\": \"[\\\"image_id\\\", \\\"mask\\\", \\\"image\\\"]\", \"children_spec\": [{\"type\": null, \"context\": null, \"children_spec\": []}, {\"type\": null, \"context\": null, \"children_spec\": []}, {\"type\": null, \"context\": null, \"children_spec\": []}]}]",
        "encryption": null,
        "item_loader": "PyTreeLoader"
    },
    "updated_at": "1735919317.2183607"
}
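
A plausible reading of the batch_size: 4 / num_workers: 4 case, assuming each worker is assigned one 6-item chunk and forms batches independently; this is an interpretation of the numbers above, not the confirmed internal logic:

# 24 test samples stored as 4 chunks of 6 (see index.json above)
chunk_size, num_workers, batch_size = 6, 4, 4

full, rem = divmod(chunk_size, batch_size)          # 1 full batch of 4 + 2 leftover per worker
expected = num_workers * (full * batch_size + rem)  # 4 * 6 = 24 samples
observed = 4 * batch_size + 2 * rem                 # 4 full batches, but only 2 partial batches = 20
print(expected, observed, expected - observed)      # 24 20 4 -- the 4 "missing" images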

@tchaton (Collaborator) commented Jan 14, 2025

Hey @robmarkcole

Yes, you are right. After double-checking the chunk-association logic, it turns out to be wrong. But this will take some time to fix.

@tchaton linked a pull request on Jan 14, 2025 that will close this issue
@tchaton (Collaborator) commented Jan 14, 2025

@robmarkcole Can you try this PR and let me know if it fixes the issue: https://github.com/Lightning-AI/litdata/pull/449/files?
