Update README.md #18

Merged 1 commit on Feb 23, 2024
README.md: 30 changes (15 additions, 15 deletions)
Specifically crafted for multi-gpu & multi-node (with [DDP](https://lightning.ai)) …

The `StreamingDataset` is compatible with any data type, including **images, text, video, audio, geo-spatial, and multimodal data**, and is a drop-in replacement for the PyTorch [IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) class. For example, it is used by [Lit-GPT](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py) to pretrain LLMs.
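
Once a dataset has been optimized, streaming it takes only a few lines. A minimal sketch (the bucket path is illustrative):

```python
from litdata import StreamingDataset, StreamingDataLoader

# Stream the optimized dataset directly from cloud storage (illustrative path)
dataset = StreamingDataset("s3://my-bucket/my-optimized-dataset", shuffle=True)
dataloader = StreamingDataLoader(dataset, batch_size=64)

for batch in dataloader:
    ...  # consume batches as with any PyTorch DataLoader
```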

# Benchmarks

[ImageNet-1.2M](https://www.image-net.org/) is a dataset commonly used to benchmark computer vision models. Its training set contains `1,281,167` images.

The dataset needs to be converted into an optimized format for cloud streaming.

Faster is better.

# Real World Examples

We have built free, end-to-end [Studios](https://lightning.ai) showing all the steps to prepare the following datasets:


[Lightning Studios](https://lightning.ai) are fully reproducible cloud IDEs with data, code, dependencies, and more.

# Getting Started

## Installation

Lightning Data can be installed with `pip`:

```bash
pip install -U litdata
```

## Quick Start

### 1. Prepare Your Data

Convert your dataset into an optimized streaming format with `optimize`. A minimal sketch (the sample-generating function and its parameters are illustrative):

```python
import numpy as np
from PIL import Image
from litdata import optimize

def random_images(index):
    # Generate one sample per input index; any picklable data works.
    return {
        "index": index,
        "image": Image.fromarray(np.random.randint(0, 256, (32, 32, 3), np.uint8)),
        "class": np.random.randint(10),
    }

if __name__ == "__main__":
    optimize(
        fn=random_images,          # the function applied to each input
        inputs=list(range(1000)),  # 1,000 illustrative inputs
        output_dir="my_dataset",   # where the optimized chunks are written
        chunk_bytes="64MB",        # maximum number of bytes per chunk
    )
```

# Easily scale data processing

To scale data processing, create a free account on the [lightning.ai](https://lightning.ai/) platform. With the platform, `optimize` and `map` can start multiple machines to make data processing drastically faster, as follows:
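
A minimal sketch of the `map` call (the paths are illustrative; it assumes `map` passes each input plus the output directory to the function, and the multi-machine fan-out is configured when launching the job on the platform):

```python
import os
from PIL import Image
from litdata import map  # note: shadows the Python builtin

def resize_image(image_path, output_dir):
    # Resize one input image and write it to the output directory.
    output_image_path = os.path.join(output_dir, os.path.basename(image_path))
    Image.open(image_path).resize((224, 224)).save(output_image_path)

if __name__ == "__main__":
    map(
        fn=resize_image,
        inputs=["data/img-0.jpeg", "data/img-1.jpeg"],  # illustrative inputs
        output_dir="my_resized_images",
    )
```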

The Data Prep Job UI from the LAION 400M Studio.

# Key Features

## Multi-GPU / Multi-Node

The `StreamingDataset` and `StreamingDataLoader` take care of everything for you. They automatically make sure each rank receives a different batch of data, so there is nothing for you to do if you use them, as the sketch below shows.
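
A minimal sketch of a distributed training script (the dataset path is illustrative; the same code runs unchanged on one or many GPUs):

```python
# Launch with, e.g.: torchrun --nproc_per_node=8 train.py
from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset("s3://my-bucket/my-data")  # illustrative path
dataloader = StreamingDataLoader(dataset, batch_size=64)

for batch in dataloader:
    ...  # each rank automatically receives different batches
```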

## Easy data mixing

You can easily experiment with dataset mixtures using the `CombinedStreamingDataset`.

A minimal sketch (the bucket paths and weights are illustrative):

```python
import os
from tqdm import tqdm
from litdata import StreamingDataset, CombinedStreamingDataset, StreamingDataLoader

# Two optimized datasets streamed from cloud storage (illustrative paths)
dataset_1 = StreamingDataset("s3://my-bucket/dataset-1")
dataset_2 = StreamingDataset("s3://my-bucket/dataset-2")

# Sample roughly 70% of items from dataset_1 and 30% from dataset_2
combined_dataset = CombinedStreamingDataset(
    datasets=[dataset_1, dataset_2],
    weights=[0.7, 0.3],  # assumed: the weights argument controls the mixture
)

train_dataloader = StreamingDataLoader(combined_dataset, batch_size=64, num_workers=os.cpu_count())

for batch in tqdm(train_dataloader):
    pass
```

## Stateful StreamingDataLoader

Lightning Data provides a stateful `StreamingDataLoader`. This simplifies resuming training over large datasets.

A minimal sketch (the dataset path is illustrative):

```python
import os
import torch
from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset("s3://my-bucket/my-data", shuffle=True)  # illustrative path
dataloader = StreamingDataLoader(dataset, num_workers=os.cpu_count(), batch_size=64)

# Restore the dataloader state if a checkpoint exists
if os.path.isfile("dataloader_state.pt"):
    state_dict = torch.load("dataloader_state.pt")
    dataloader.load_state_dict(state_dict)

# Iterate over the data
for batch_idx, batch in enumerate(dataloader):
    # Store the state every 1000 batches
    if batch_idx % 1000 == 0:
        torch.save(dataloader.state_dict(), "dataloader_state.pt")
```

## Profiling

The `StreamingDataLoader` supports profiling your data loading. Simply use the `profile_batches` argument as follows:

```python
StreamingDataLoader(..., profile_batches=5)
```

This generates a Chrome trace called `result.json`. You can visualize it by opening the `chrome://tracing` URL in the Chrome browser and loading the trace there.

## Random access

Access the data you need when you need it.

```python
from litdata import StreamingDataset

dataset = StreamingDataset("s3://my-bucket/my-data")  # illustrative path

print(len(dataset))  # display the length of your data
print(dataset[42])   # show the 42nd element of the dataset
```

## Use data transforms

A minimal sketch, assuming the stored samples are PIL images (the transform, path, and subclass pattern are illustrative):

```python
from torchvision import transforms
from litdata import StreamingDataset, StreamingDataLoader

class TransformedDataset(StreamingDataset):
    """Applies a torchvision transform to every streamed sample (a sketch)."""

    def __init__(self, input_dir):
        super().__init__(input_dir)
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])

    def __getitem__(self, index):
        image = super().__getitem__(index)  # assumes a PIL image was stored
        return self.transform(image)

dataset = TransformedDataset("s3://my-bucket/my-images")  # illustrative path
dataloader = StreamingDataLoader(dataset, batch_size=4)

for batch in dataloader:
    print(batch.shape)
    # Out: (4, 3, 224, 224)
```

## Disk usage limits

Limit the size of the cache holding the chunks.

```python
from litdata import StreamingDataset

dataset = StreamingDataset(..., max_cache_size="10GB")
```

## Support yield

When processing large files, such as compressed [parquet files](https://en.wikipedia.org/wiki/Apache_Parquet), you can use Python's `yield` to process and store one item at a time, as sketched below.
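
A minimal sketch, assuming `optimize` accepts a generator function (the parquet paths and reader parameters are illustrative):

```python
import pyarrow.parquet as pq
from litdata import optimize

def process_parquet(filepath):
    # Read the file in small batches and yield one row at a time,
    # so the whole file never has to fit in memory.
    parquet_file = pq.ParquetFile(filepath)
    for record_batch in parquet_file.iter_batches(batch_size=1024):
        for row in record_batch.to_pylist():
            yield row

if __name__ == "__main__":
    optimize(
        fn=process_parquet,
        inputs=["data/part-0.parquet", "data/part-1.parquet"],  # illustrative paths
        output_dir="my_optimized_dataset",
        chunk_bytes="64MB",
    )
```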
