diff --git a/README.md b/README.md
index 9229cdcd..46674769 100644
--- a/README.md
+++ b/README.md
@@ -178,19 +178,6 @@ ld.map(
## Features for transforming datasets
-
- ✅ Multi-GPU / Multi-Node Support
-
-
-
-The `StreamingDataset` and `StreamingDataLoader` automatically make sure each rank receives the same quantity of varied batches of data, so it works out of the box with your favorite frameworks ([PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/), [Lightning Fabric](https://lightning.ai/docs/fabric/stable/), or [PyTorch](https://pytorch.org/docs/stable/index.html)) to do distributed training.
-
-Here you can see an illustration showing how the Streaming Dataset works with multi node / multi gpu under the hood.
-
-![An illustration showing how the Streaming Dataset works with multi node.](https://pl-flash-data.s3.amazonaws.com/streaming_dataset.gif)
-
-
-
✅ Map transformations
@@ -223,6 +210,44 @@ map(
+
+ ✅ Support S3-Compatible Object Storage
+
+
+Integrate S3-compatible object storage servers such as [MinIO](https://min.io/) with litdata, which is ideal for on-premises infrastructure. Configure the endpoint and credentials using environment variables or AWS configuration files.
+
+Set up the environment variables to connect to MinIO:
+
+```bash
+export AWS_ACCESS_KEY_ID=access_key
+export AWS_SECRET_ACCESS_KEY=secret_key
+export AWS_ENDPOINT_URL=http://localhost:9000 # MinIO endpoint
+```
+
+Alternatively, configure credentials and endpoint in `~/.aws/{credentials,config}`:
+
+```bash
+mkdir -p ~/.aws && \
+cat <<EOL >> ~/.aws/credentials
+[default]
+aws_access_key_id = access_key
+aws_secret_access_key = secret_key
+EOL
+
+cat <<EOL >> ~/.aws/config
+[default]
+endpoint_url = http://localhost:9000 # MinIO endpoint
+EOL
+```
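+
+With the endpoint and credentials in place, streaming from a MinIO bucket works the same as from S3. Below is a minimal sketch; the bucket and prefix (`s3://my-bucket/optimized-data/`) are placeholders for your own optimized dataset:
+
+```python
+from litdata import StreamingDataset, StreamingDataLoader
+
+# Point the s3:// URI at a bucket on the MinIO server configured above.
+dataset = StreamingDataset(input_dir="s3://my-bucket/optimized-data/")
+dataloader = StreamingDataLoader(dataset, batch_size=64)
+
+for batch in dataloader:
+    pass  # your training loop
+```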
+
+For a complete, practical example of litdata with MinIO, see the [LitData with MinIO](https://github.com/bhimrazy/litdata-with-minio) repository.
+
+
+
+
+
+## Features for optimizing and streaming datasets for model training
+
+
✅ Stream datasets
@@ -244,51 +269,22 @@ for batch in dataloader:
- ✅ Combine datasets
-
-
-
-Easily experiment with dataset mixtures using the `CombinedStreamingDataset` class.
-
-As an example, this mixture of [Slimpajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) & [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) was used in the [TinyLLAMA](https://github.com/jzhang38/TinyLlama) project to pretrain a 1.1B Llama model on 3 trillion tokens.
+ ✅ Multi-GPU / Multi-Node Support
-```python
-from litdata import StreamingDataset, CombinedStreamingDataset, StreamingDataLoader, TokensLoader
-from tqdm import tqdm
-import os
+
-train_datasets = [
- StreamingDataset(
- input_dir="s3://tinyllama-template/slimpajama/train/",
- item_loader=TokensLoader(block_size=2048 + 1), # Optimized loader for tokens used by LLMs
- shuffle=True,
- drop_last=True,
- ),
- StreamingDataset(
- input_dir="s3://tinyllama-template/starcoder/",
- item_loader=TokensLoader(block_size=2048 + 1), # Optimized loader for tokens used by LLMs
- shuffle=True,
- drop_last=True,
- ),
-]
+The `StreamingDataset` and `StreamingDataLoader` automatically ensure that each rank receives the same number of varied data batches, so distributed training works out of the box with your favorite frameworks ([PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/), [Lightning Fabric](https://lightning.ai/docs/fabric/stable/), or plain [PyTorch](https://pytorch.org/docs/stable/index.html)).
-# Mix SlimPajama data and Starcoder data with these proportions:
-weights = (0.693584, 0.306416)
-combined_dataset = CombinedStreamingDataset(datasets=train_datasets, seed=42, weights=weights)
+Here is an illustration of how the Streaming Dataset works across multiple nodes and GPUs under the hood:
-train_dataloader = StreamingDataLoader(combined_dataset, batch_size=8, pin_memory=True, num_workers=os.cpu_count())
+![An illustration showing how the Streaming Dataset works with multi node.](https://pl-flash-data.s3.amazonaws.com/streaming_dataset.gif)
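+
+For example, with a minimal sketch like the one below (the dataset path is a placeholder), the same script can be launched on multiple GPUs or nodes, e.g. via `torchrun` or a Lightning Trainer, and each rank is automatically assigned its own portion of the batches; no extra distributed sampler is needed:
+
+```python
+from litdata import StreamingDataset, StreamingDataLoader
+
+# Each rank builds the same dataset; sharding across ranks is handled internally.
+dataset = StreamingDataset(input_dir="s3://my-bucket/optimized-data/", shuffle=True)
+dataloader = StreamingDataLoader(dataset, batch_size=64, num_workers=4)
+
+for batch in dataloader:
+    pass  # your distributed training step
+```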
-# Iterate over the combined datasets
-for batch in tqdm(train_dataloader):
- pass
-```
✅ Pause & Resume data streaming
-
LitData provides a stateful `StreamingDataLoader`, i.e. you can `pause` and `resume` your training whenever you want.
Info: The `StreamingDataLoader` was used by [Lit-GPT](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py) to pretrain LLMs. Restarting from an older checkpoint was critical to complete pretraining of the full model after several failures (network issues, CUDA errors, etc.).
@@ -316,49 +312,54 @@ for batch_idx, batch in enumerate(dataloader):
+
- ✅ Support S3-Compatible Object Storage
+ ✅ Combine datasets
-Integrate S3-compatible object storage servers like [MinIO](https://min.io/) with litdata, ideal for on-premises infrastructure setups. Configure the endpoint and credentials using environment variables or configuration files.
-Set up the environment variables to connect to MinIO:
-
-```bash
-export AWS_ACCESS_KEY_ID=access_key
-export AWS_SECRET_ACCESS_KEY=secret_key
-export AWS_ENDPOINT_URL=http://localhost:9000 # MinIO endpoint
-```
+Easily experiment with dataset mixtures using the `CombinedStreamingDataset` class.
-Alternatively, configure credentials and endpoint in `~/.aws/{credentials,config}`:
+As an example, a mixture of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) and [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) was used in the [TinyLlama](https://github.com/jzhang38/TinyLlama) project to pretrain a 1.1B Llama model on 3 trillion tokens.
-```bash
-mkdir -p ~/.aws && \
-cat <<EOL >> ~/.aws/credentials
-[default]
-aws_access_key_id = access_key
-aws_secret_access_key = secret_key
-EOL
+```python
+from litdata import StreamingDataset, CombinedStreamingDataset, StreamingDataLoader, TokensLoader
+from tqdm import tqdm
+import os
-cat <<EOL >> ~/.aws/config
-[default]
-endpoint_url = http://localhost:9000 # MinIO endpoint
-EOL
-```
-Explore an example setup of litdata with MinIO in the [LitData with MinIO](https://github.com/bhimrazy/litdata-with-minio) repository for practical implementation details.
+train_datasets = [
+ StreamingDataset(
+ input_dir="s3://tinyllama-template/slimpajama/train/",
+ item_loader=TokensLoader(block_size=2048 + 1), # Optimized loader for tokens used by LLMs
+ shuffle=True,
+ drop_last=True,
+ ),
+ StreamingDataset(
+ input_dir="s3://tinyllama-template/starcoder/",
+ item_loader=TokensLoader(block_size=2048 + 1), # Optimized loader for tokens used by LLMs
+ shuffle=True,
+ drop_last=True,
+ ),
+]
-
+# Mix SlimPajama data and Starcoder data with these proportions:
+weights = (0.693584, 0.306416)
+combined_dataset = CombinedStreamingDataset(datasets=train_datasets, seed=42, weights=weights)
-
+train_dataloader = StreamingDataLoader(combined_dataset, batch_size=8, pin_memory=True, num_workers=os.cpu_count())
-## Features for optimizing datasets
+# Iterate over the combined datasets
+for batch in tqdm(train_dataloader):
+ pass
+```
+
✅ Subsample and split datasets
-You can split your dataset with more ease with `train_test_split`.
+Split a dataset into train, validation, and test subsets with `train_test_split`.
```python
from litdata import StreamingDataset, train_test_split
@@ -405,7 +406,7 @@ print(len(dataset)) # display the length of your data
- ✅ Append or Overwrite optimized datasets
+ ✅ Append or overwrite optimized datasets
@@ -490,7 +491,7 @@ for batch in dataloader:
- ✅ Support Profiling
+ ✅ Profile loading speed
The `StreamingDataLoader` supports profiling of your data loading process. Simply use the `profile_batches` argument to specify the number of batches you want to profile:
@@ -543,7 +544,7 @@ outputs = optimize(
- ✅ Configure Cache Size Limit
+ ✅ Reduce disk space with caching limits
Adapt the local caching limit of the `StreamingDataset`. This is useful to make sure the downloaded data chunks are deleted when used and the disk usage stays low.