
Improve README #10

Merged 56 commits on Feb 21, 2024.

Commits:
- a3b2bc6 resolve setup (tchaton, Feb 15, 2024)
- 298f3f0 update (tchaton, Feb 16, 2024)
- 387dae6 update (tchaton, Feb 16, 2024)
- 4a93a03 update (tchaton, Feb 16, 2024)
- 50529dd update (tchaton, Feb 16, 2024)
- f1c4c4a update (tchaton, Feb 16, 2024)
- d27e5d7 update (tchaton, Feb 16, 2024)
- 621ffef update (tchaton, Feb 16, 2024)
- 5a500f6 update (tchaton, Feb 16, 2024)
- 5f8c9b4 update (tchaton, Feb 16, 2024)
- 2085e81 update (tchaton, Feb 16, 2024)
- b7eb720 update (tchaton, Feb 16, 2024)
- 1b837f9 upate (tchaton, Feb 16, 2024)
- 24efee8 update (tchaton, Feb 16, 2024)
- fadc5b2 update (tchaton, Feb 16, 2024)
- 7a29e67 update (tchaton, Feb 16, 2024)
- e0a5adb update (tchaton, Feb 16, 2024)
- 2d85494 update (tchaton, Feb 16, 2024)
- d91e219 update (tchaton, Feb 16, 2024)
- 9df7780 Merge pull request #3 from Lightning-AI/add_ci (tchaton, Feb 16, 2024)
- 71fdcf2 Add python files to manifest (#4) (tchaton, Feb 16, 2024)
- 3103fa7 Merge branch 'main' of https://github.com/Lightning-AI/lit-data (tchaton, Feb 16, 2024)
- 142c890 Merge branch 'main' of https://github.com/Lightning-AI/lit-data (tchaton, Feb 19, 2024)
- d891715 Merge branch 'main' of https://github.com/Lightning-AI/lit-data (tchaton, Feb 19, 2024)
- 014d75a Merge branch 'main' of https://github.com/Lightning-AI/lit-data (tchaton, Feb 21, 2024)
- 0d18e1b update (tchaton, Feb 21, 2024)
- bea0832 update (tchaton, Feb 21, 2024)
- 28bdde2 update (tchaton, Feb 21, 2024)
- af7c501 update (tchaton, Feb 21, 2024)
- 0f5082a update (tchaton, Feb 21, 2024)
- b1b9006 update (tchaton, Feb 21, 2024)
- 65b0499 update (tchaton, Feb 21, 2024)
- cb9372d update (tchaton, Feb 21, 2024)
- 23c723b update (tchaton, Feb 21, 2024)
- dd152f4 update (tchaton, Feb 21, 2024)
- 0b82ce8 update (tchaton, Feb 21, 2024)
- 976c2f7 update (tchaton, Feb 21, 2024)
- 5c9405f update (tchaton, Feb 21, 2024)
- 26f7cf9 update (tchaton, Feb 21, 2024)
- 0a21b52 update (tchaton, Feb 21, 2024)
- 56757fb update (tchaton, Feb 21, 2024)
- 7f694b6 update (tchaton, Feb 21, 2024)
- 0426a69 update (tchaton, Feb 21, 2024)
- 0003174 update (tchaton, Feb 21, 2024)
- d9bea80 update (tchaton, Feb 21, 2024)
- 6bb65d2 update (tchaton, Feb 21, 2024)
- 6fac1ff update (tchaton, Feb 21, 2024)
- e8cdf9f update (tchaton, Feb 21, 2024)
- 303fa34 update (tchaton, Feb 21, 2024)
- c7e9b00 update (tchaton, Feb 21, 2024)
- 7866628 update (tchaton, Feb 21, 2024)
- 7c92fcb update (tchaton, Feb 21, 2024)
- 73545ca update (tchaton, Feb 21, 2024)
- 3703094 update (tchaton, Feb 21, 2024)
- c65f522 update (tchaton, Feb 21, 2024)
- 1d0bf8d update (tchaton, Feb 21, 2024)
68 changes: 48 additions & 20 deletions README.md
@@ -13,15 +13,54 @@

We developed `StreamingDataset` to optimize training of large datasets stored on the cloud while prioritizing speed, affordability, and scalability.

Specifically crafted for multi-GPU and multi-node distributed training of large models (with [DDP](https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html), [FSDP](https://lightning.ai/docs/pytorch/stable/advanced/model_parallel/fsdp.html), etc.), it enhances accuracy, performance, and ease of use. Training efficiently is now possible regardless of where the data lives: simply stream in the required data as it is needed.

The `StreamingDataset` is compatible with any data type, including **images, text, video, audio, geo-spatial, and multimodal data**, and it is a drop-in replacement for the PyTorch [IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) class. For example, it is used by [Lit-GPT](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py) to pretrain LLMs.
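In practice, "drop-in replacement" just means the dataset implements the iterator protocol that `IterableDataset` formalizes. A minimal pure-Python sketch of that contract (illustrative only, not litdata's implementation):

```python
# Conceptual sketch of the IterableDataset contract: any object whose
# __iter__ yields samples can feed a DataLoader-style consumer.
class TinySamples:
    """Mimics torch.utils.data.IterableDataset's protocol in plain Python."""

    def __init__(self, samples):
        self.samples = samples

    def __iter__(self):
        # A real streaming dataset would lazily fetch chunks from cloud
        # storage here instead of holding everything in memory.
        for sample in self.samples:
            yield sample

ds = TinySamples([{"x": i} for i in range(3)])
print(list(ds))  # [{'x': 0}, {'x': 1}, {'x': 2}]
```

A streaming dataset keeps this interface while swapping the in-memory list for on-demand downloads, which is why existing training loops need no changes.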

# 🚀 Benchmarks

[Imagenet-1.2M](https://www.image-net.org/) is a dataset commonly used to compare computer vision models. Its training split contains `1,281,167` images.

In this benchmark, we measured streaming speed (in `images per second`) when loading the dataset from [AWS S3](https://aws.amazon.com/s3/) with several frameworks.

Find the fully reproducible benchmark as a free [Lightning Studio](https://lightning.ai/): [Benchmark cloud data-loading libraries](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries).

### Imagenet-1.2M Streaming from AWS S3

| Framework | Images / sec 1st Epoch (float32) | Images / sec 2nd Epoch (float32) | Images / sec 1st Epoch (float16) | Images / sec 2nd Epoch (float16) |
|---|---|---|---|---|
| PL Data | ${\textbf{\color{Fuchsia}5800.34}}$ | ${\textbf{\color{Fuchsia}6589.98}}$ | ${\textbf{\color{Fuchsia}6282.17}}$ | ${\textbf{\color{Fuchsia}7221.88}}$ |
| Web Dataset | 3134.42 | 3924.95 | 3343.40 | 4424.62 |
| Mosaic ML | 2898.61 | 5099.93 | 2809.69 | 5158.98 |

Higher is better.
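To put the throughput numbers in context, here is a quick back-of-the-envelope conversion from images per second to time per epoch over the 1,281,167 training images (using the first-epoch float32 figures from the table above):

```python
# Back-of-the-envelope: seconds per epoch = images / (images per second).
NUM_IMAGES = 1_281_167  # ImageNet-1.2M training images

def epoch_seconds(images_per_sec: float) -> float:
    return NUM_IMAGES / images_per_sec

# First-epoch float32 throughputs from the benchmark table
for name, ips in [("PL Data", 5800.34),
                  ("Web Dataset", 3134.42),
                  ("Mosaic ML", 2898.61)]:
    print(f"{name}: ~{epoch_seconds(ips) / 60:.1f} min per epoch")
```

At roughly 5,800 images/sec one epoch takes about 3.7 minutes, versus about 7.4 minutes at roughly 2,900 images/sec.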

### Imagenet-1.2M Conversion

| Framework |Train Conversion Time | Val Conversion Time | Dataset Size | # Files |
|---|---|---|---|---|
| PL Data | ${\textbf{\color{Fuchsia}10:05 min}}$ | ${\textbf{\color{Fuchsia}00:30 min}}$ | ${\textbf{\color{Fuchsia}143.1 GB}}$ | 2,339 |
| Web Dataset | 32:36 min | 01:22 min | 147.8 GB | 1,144 |
| Mosaic ML | 49:49 min | 01:04 min | ${\textbf{\color{Fuchsia}143.1 GB}}$ | 2,298 |

The dataset needs to be converted into an optimized format for cloud streaming. We measured how fast the 1.2 million images are converted.

Faster is better.

# 📚 Real World Examples

We have built free, end-to-end [Studios](https://lightning.ai) showing all the steps to prepare the following datasets:

| Dataset | Data type | Studio |
| -------------------------------------------------------------------------------------------------------------------------------------------- | :-----------------: | --------------------------------------------------------------------------------------------------------------------------------------: |
| [LAION-400M](https://laion.ai/blog/laion-400-open-dataset/) | Image & description | [Use or explore LAION-400MILLION dataset](https://lightning.ai/lightning-ai/studios/use-or-explore-laion-400million-dataset) |
| [Chesapeake Roads Spatial Context](https://github.com/isaaccorley/chesapeakersc) | Image & Mask | [Convert GeoSpatial data to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-spatial-data-to-lightning-streaming) |
| [Imagenet 1M](https://paperswithcode.com/sota/image-classification-on-imagenet?tag_filter=171) | Image & Label | [Benchmark cloud data-loading libraries](https://lightning.ai/lightning-ai/studios/benchmark-cloud-data-loading-libraries) |
| [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) & [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) | Text | [Prepare the TinyLlama 1T token dataset](https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset) |
| [English Wikipedia](https://huggingface.co/datasets/wikipedia) | Text | [Embed English Wikipedia under 5 dollars](https://lightning.ai/lightning-ai/studios/embed-english-wikipedia-under-5-dollars) |
| Generated | Parquet Files | [Convert parquets to Lightning Streaming](https://lightning.ai/lightning-ai/studios/convert-parquets-to-lightning-streaming) |

[Lightning Studios](https://lightning.ai) are fully reproducible cloud IDEs with data, code, and dependencies included.

# 🎬 Getting Started

@@ -102,6 +141,10 @@
cls = sample['class']
dataloader = DataLoader(dataset)
```

Here is an illustration showing how the `StreamingDataset` works under the hood.

![An illustration showing how the Streaming Dataset works.](https://pl-flash-data.s3.amazonaws.com/streaming_dataset.gif)
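A pure-Python sketch of the chunking idea the illustration conveys (hypothetical names, not litdata's actual internals): samples are grouped into fixed-size chunks, an index maps a sample id to its chunk and offset, and a chunk is only fetched the first time one of its samples is requested.

```python
# Hypothetical sketch of chunk-based streaming: each sample id maps to a
# (chunk_id, offset) pair, and a chunk is "downloaded" only on first access.
class ChunkedStore:
    def __init__(self, samples, chunk_size=2):
        self.chunk_size = chunk_size
        self.chunks = [samples[i:i + chunk_size]
                       for i in range(0, len(samples), chunk_size)]
        self.cache = {}      # chunk_id -> chunk; simulates a local chunk cache
        self.downloads = 0   # counts simulated cloud fetches

    def __getitem__(self, idx):
        chunk_id, offset = divmod(idx, self.chunk_size)
        if chunk_id not in self.cache:
            self.downloads += 1                      # fetch from "cloud" once
            self.cache[chunk_id] = self.chunks[chunk_id]
        return self.cache[chunk_id][offset]

store = ChunkedStore(list(range(6)))
_ = [store[i] for i in range(6)]
print(store.downloads)  # 3 chunks fetched for 6 samples
```

Because fetches happen at chunk granularity, sequential reads amortize the download cost, which is the core of why streaming can keep GPUs fed.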

## Transform data

Similar to `optimize`, the `map` operator can be used to transform data by applying a function over a list of items and persisting the files written to the output directory.
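As a hedged sketch of that pattern (plain Python, illustrative only; the function and directory names are made up, and this ignores the parallelism a real `map` operator provides):

```python
import os
import tempfile

# Illustrative stand-in for a map-style operator: apply `fn` to every item
# and persist whatever files `fn` writes into `output_dir`.
def map_items(fn, items, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    for item in items:
        fn(item, output_dir)
    return sorted(os.listdir(output_dir))

def upper_case_to_file(text, output_dir):
    # Hypothetical per-item transform: write the upper-cased text to a file.
    with open(os.path.join(output_dir, f"{text}.txt"), "w") as f:
        f.write(text.upper())

out = tempfile.mkdtemp()
print(map_items(upper_case_to_file, ["a", "b"], out))  # ['a.txt', 'b.txt']
```

The key design point is that the transform is a plain function of one item, so the operator is free to run it on as many workers or machines as are available.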
@@ -154,21 +197,6 @@
if __name__ == "__main__":
)
```

# 📈 Easily scale data processing

To scale data processing, create a free account on the [lightning.ai](https://lightning.ai/) platform. With the platform, `optimize` and `map` can start multiple machines to make data processing drastically faster.
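Conceptually, the speedup comes from sharding the input list across machines so each one processes a disjoint subset; a simplified sketch (not the platform's actual scheduler):

```python
# Simplified sketch: split N inputs across W workers so each machine
# processes a disjoint shard; real platforms add scheduling and retries.
def shard(items, num_workers):
    return [items[w::num_workers] for w in range(num_workers)]

shards = shard(list(range(10)), 4)
print([len(s) for s in shards])  # [3, 3, 2, 2]
```

With W machines of equal speed, wall-clock time drops roughly by a factor of W, since every input is processed exactly once.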