Merge branch 'main' into prevent_dataset_to_break_if_already_exists
tchaton committed Feb 21, 2024
2 parents 6772f33 + 2279a48 commit 69bffd0
Showing 7 changed files with 59 additions and 34 deletions.
6 changes: 4 additions & 2 deletions .github/workflows/ci-checks.yml
@@ -15,9 +15,11 @@ jobs:
uses: Lightning-AI/utilities/.github/workflows/[email protected]

check-typing:
uses: Lightning-AI/utilities/.github/workflows/[email protected]
# TODO: switch to main after fix lends
uses: Lightning-AI/utilities/.github/workflows/check-typing.yml@ci/mypy-dir
with:
actions-ref: v0.10.1
actions-ref: ci/mypy-dir
source-dir: ""

check-schema:
uses: Lightning-AI/utilities/.github/workflows/[email protected]
68 changes: 48 additions & 20 deletions README.md
@@ -13,15 +13,54 @@

We developed `StreamingDataset` to optimize training of large datasets stored on the cloud while prioritizing speed, affordability, and scalability.

Specifically crafted for multi-node, distributed training with large models, it enhances accuracy, performance, and user-friendliness. Now, training efficiently is possible regardless of the data's location. Simply stream in the required data when needed.
Specifically crafted for multi-GPU & multi-node distributed training of large models (with [DDP](https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html), [FSDP](https://lightning.ai/docs/pytorch/stable/advanced/model_parallel/fsdp.html), etc.), it enhances accuracy, performance, and user-friendliness. Now, training efficiently is possible regardless of the data's location. Simply stream in the required data when needed.

The `StreamingDataset` is compatible with any data type, including **images, text, video, and multimodal data** and it is a drop-in replacement for your PyTorch [IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) class. For example, it is used by [Lit-GPT](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py) to pretrain LLMs.
The `StreamingDataset` is compatible with any data type, including **images, text, video, audio, geo-spatial, and multimodal data**, and it is a drop-in replacement for your PyTorch [IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) class. For example, it is used by [Lit-GPT](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py) to pretrain LLMs.
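
As a minimal sketch (assuming the `input_dir` below points at a dataset that was previously written by `optimize` and uploaded to a bucket), streaming could look like this:

```python
from torch.utils.data import DataLoader

from lightning_data import StreamingDataset

# Assumption: this S3 prefix holds a dataset previously converted with `optimize`.
dataset = StreamingDataset(input_dir="s3://my-bucket/my-optimized-dataset")

# StreamingDataset is an IterableDataset, so it plugs straight into a DataLoader.
dataloader = DataLoader(dataset)

for sample in dataloader:
    ...  # feed the streamed samples to your training loop
```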

Finally, the `StreamingDataset` is fast! Check out our [benchmark](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries).
# 🚀 Benchmarks

Here is an illustration showing how the `StreamingDataset` works.
[Imagenet-1.2M](https://www.image-net.org/) is a dataset commonly used to compare computer vision models. Its training set contains `1,281,167` images.

![An illustration showing how the Streaming Dataset works.](https://pl-flash-data.s3.amazonaws.com/streaming_dataset.gif)
In this benchmark, we measured the streaming speed (`images per second`) when loading from [AWS S3](https://aws.amazon.com/s3/) with several frameworks.

Find the fully reproducible [Lightning Studio](https://lightning.ai/) benchmark [here](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries).

### Imagenet-1.2M Streaming from AWS S3

| Framework | Images / sec 1st Epoch (float32) | Images / sec 2nd Epoch (float32) | Images / sec 1st Epoch (float16) | Images / sec 2nd Epoch (float16) |
|---|---|---|---|---|
| PL Data | ${\textbf{\color{Fuchsia}5800.34}}$ | ${\textbf{\color{Fuchsia}6589.98}}$ | ${\textbf{\color{Fuchsia}6282.17}}$ | ${\textbf{\color{Fuchsia}7221.88}}$ |
| Web Dataset | 3134.42 | 3924.95 | 3343.40 | 4424.62 |
| Mosaic ML | 2898.61 | 5099.93 | 2809.69 | 5158.98 |

Higher is better.

### Imagenet-1.2M Conversion

| Framework | Train Conversion Time | Val Conversion Time | Dataset Size | # Files |
|---|---|---|---|---|
| PL Data | ${\textbf{\color{Fuchsia}10:05 min}}$ | ${\textbf{\color{Fuchsia}00:30 min}}$ | ${\textbf{\color{Fuchsia}143.1 GB}}$ | 2,339 |
| Web Dataset | 32:36 min | 01:22 min | 147.8 GB | 1,144 |
| Mosaic ML | 49:49 min | 01:04 min | ${\textbf{\color{Fuchsia}143.1 GB}}$ | 2,298 |

The dataset needs to be converted into an optimized format for cloud streaming. We measured how fast the 1.2 million images are converted.

Faster is better.
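
For concreteness, a conversion with `lightning_data` is a single `optimize` call. The snippet below is only a rough sketch under stated assumptions: the images are randomly generated placeholders and the chunk size is arbitrary, not the settings used in the benchmark above.

```python
import numpy as np
from PIL import Image

from lightning_data import optimize


def fake_imagenet_sample(index):
    # Placeholder: the real benchmark reads Imagenet-1.2M images from disk instead.
    image = Image.fromarray(np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8))
    return {"index": index, "image": image}


if __name__ == "__main__":
    optimize(
        fn=fake_imagenet_sample,            # called once per input below
        inputs=list(range(1000)),           # 1,000 placeholder items instead of 1.2M images
        output_dir="my_optimized_dataset",  # local path or cloud prefix (e.g. s3://...)
        chunk_bytes="128MB",                # assumption: group samples into ~128MB chunks
    )
```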

# 📚 Real World Examples

We have built end-to-end free [Studios](https://lightning.ai) showing all the steps to prepare the following datasets:

| Dataset | Data type | Studio |
| -------------------------------------------------------------------------------------------------------------------------------------------- | :-----------------: | --------------------------------------------------------------------------------------------------------------------------------------: |
| [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) | Image & description | [Use or explore LAION-400MILLION dataset](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset) |
| [Chesapeake Roads Spatial Context](https://github.com/isaaccorley/chesapeakersc) | Image & Mask | [Convert GeoSpatial data to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-spatial-data-to-lightning-streaming) |
| [Imagenet 1M](https://paperswithcode.com/sota/image-classification-on-imagenet?tag_filter=171) | Image & Label | [Benchmark cloud data-loading libraries](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries) |
| [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) & [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) | Text | [Prepare the TinyLlama 1T token dataset](https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset) |
| [English Wikipedia](https://huggingface.co/datasets/wikipedia) | Text | [Embed English Wikipedia under 5 dollars](https://lightning.ai/lightning-ai/studios/embed-english-wikipedia-under-5-dollars) |
| Generated | Parquet Files | [Convert parquets to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-parquets-to-lightning-streaming) |

[Lightning Studios](https://lightning.ai) are fully reproducible cloud IDEs with data, code, dependencies, and more.

# 🎬 Getting Started

@@ -102,6 +141,10 @@ cls = sample['class']
dataloader = DataLoader(dataset)
```

Here is an illustration showing how the `StreamingDataset` works under the hood.

![An illustration showing how the Streaming Dataset works.](https://pl-flash-data.s3.amazonaws.com/streaming_dataset.gif)

## Transform data

Similar to `optimize`, the `map` operator can be used to transform data by applying a function over a list of items and persisting all the files written inside the output directory.
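
As a rough sketch (the resize example, directory names, and the `(item, output_dir)` callback signature are illustrative assumptions), `map` could be used like this:

```python
import os

from PIL import Image

from lightning_data import map


def resize_image(image_path, output_dir):
    # Every file written inside `output_dir` is persisted by `map`.
    output_path = os.path.join(output_dir, os.path.basename(image_path))
    Image.open(image_path).resize((224, 224)).save(output_path)


if __name__ == "__main__":
    input_dir = "my_images"
    inputs = [os.path.join(input_dir, f) for f in os.listdir(input_dir)]

    map(
        fn=resize_image,
        inputs=inputs,
        output_dir="my_resized_images",
    )
```
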
@@ -154,21 +197,6 @@ if __name__ == "__main__":
)
```

# 📚 End-to-end Lightning Studio Templates

We have end-to-end free [Studios](https://lightning.ai) showing all the steps to prepare the following datasets:

| Dataset | Data type | Studio |
| -------------------------------------------------------------------------------------------------------------------------------------------- | :-----------------: | --------------------------------------------------------------------------------------------------------------------------------------: |
| [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) | Image & description | [Use or explore LAION-400MILLION dataset](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset) |
| [Chesapeake Roads Spatial Context](https://github.com/isaaccorley/chesapeakersc) | Image & Mask | [Convert GeoSpatial data to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-spatial-data-to-lightning-streaming) |
| [Imagenet 1M](https://paperswithcode.com/sota/image-classification-on-imagenet?tag_filter=171) | Image & Label | [Benchmark cloud data-loading libraries](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries) |
| [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) & [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) | Text | [Prepare the TinyLlama 1T token dataset](https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset) |
| [English Wikipedia](https://huggingface.co/datasets/wikipedia) | Text | [Embed English Wikipedia under 5 dollars](https://lightning.ai/lightning-ai/studios/embed-english-wikipedia-under-5-dollars) |
| Generated | Parquet Files | [Convert parquets to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-parquets-to-lightning-streaming) |

[Lightning Studios](https://lightning.ai) are fully reproducible cloud IDEs with data, code, dependencies, and more. Finally, reproducible science.

# 📈 Easily scale data processing

To scale data processing, create a free account on the [lightning.ai](https://lightning.ai/) platform. With the platform, `optimize` and `map` can run across multiple machines to make data processing drastically faster, as follows:
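
A minimal sketch of what that can look like (the `num_nodes` and `machine` parameters are assumptions; `Machine` is only re-exported when `lightning_sdk` is installed, as the `__init__.py` change below shows):

```python
from lightning_data import Machine, optimize  # Machine requires the `lightning_sdk` package


def to_sample(index):
    return {"index": index}


if __name__ == "__main__":
    optimize(
        fn=to_sample,
        inputs=list(range(1_000_000)),
        output_dir="s3://my-bucket/my-optimized-dataset",
        chunk_bytes="64MB",
        num_nodes=32,               # assumption: split the work across 32 machines
        machine=Machine.DATA_PREP,  # assumption: machine type offered by the platform
    )
```
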
3 changes: 1 addition & 2 deletions lightning_data/__about__.py
@@ -14,6 +14,7 @@

import time

__version__ = "0.2.0.dev"
__author__ = "Lightning AI et al."
__author_email__ = "[email protected]"
__license__ = "Apache-2.0"
@@ -39,5 +40,3 @@
"__license__",
"__version__",
]

__version__ = "0.2.0.dev"
6 changes: 2 additions & 4 deletions lightning_data/__init__.py
@@ -1,22 +1,20 @@
from lightning_utilities.core.imports import RequirementCache

from lightning_data.__about__ import * # noqa: F403
from lightning_data.processing.functions import map, optimize, walk
from lightning_data.streaming.combined import CombinedStreamingDataset
from lightning_data.streaming.dataloader import StreamingDataLoader
from lightning_data.streaming.dataset import StreamingDataset

__all__ = [
"LightningDataset",
"StreamingDataset",
"CombinedStreamingDataset",
"StreamingDataLoader",
"LightningIterableDataset",
"map",
"optimize",
"walk",
]

if RequirementCache("lightning_sdk"):
from lightning_sdk import Machine # noqa: F401

__all__.append("Machine")
__all__ + ["Machine"]
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -131,7 +131,7 @@ max-complexity = 10

[tool.mypy]
files = [
"src/lightning",
"lightning_data",
]
# This section is for folders with "-" as they are not valid python modules
exclude = [
3 changes: 3 additions & 0 deletions requirements.txt
@@ -2,8 +2,11 @@ lightning-utilities >=0.8.0, <0.10.0
lightning-cloud == 0.5.64 # Must be pinned to ensure compatibility
# to be able to include also PL 2.0 and preserve `>` needed for CI min version bypass
torch >=2.1.0, <=2.2.0
lightning >=2.2.0
filelock
tqdm
numpy
torchvision
pillow
viztracer
pyarrow
5 changes: 0 additions & 5 deletions requirements/test.txt
@@ -4,9 +4,4 @@ pytest-cov ==4.1.0
pytest-timeout ==2.1.0
pytest-rerunfailures ==12.0
pytest-random-order ==1.1.0
viztracer
pandas
pyarrow
pillow
lightning
mypy
