Update README.md #18

Merged 1 commit on Feb 23, 2024
README.md: 30 changes (15 additions, 15 deletions)
Specifically crafted for multi-gpu & multi-node (with [DDP](https://lightning.ai)) …

The `StreamingDataset` is compatible with any data type, including **images, text, video, audio, geo-spatial, and multimodal data**, and is a drop-in replacement for the PyTorch [IterableDataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset) class. For example, it is used by [Lit-GPT](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py) to pretrain LLMs.
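
Once a dataset has been optimized, streaming it takes only a few lines. A minimal sketch (the bucket path is illustrative):

```python
from litdata import StreamingDataset, StreamingDataLoader

# Stream the optimized dataset directly from cloud storage (illustrative path)
dataset = StreamingDataset("s3://my-bucket/my-optimized-dataset", shuffle=True)
dataloader = StreamingDataLoader(dataset, batch_size=64)

for batch in dataloader:
    ...  # consume batches as with any PyTorch DataLoader
```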

# Benchmarks

[ImageNet-1.2M](https://www.image-net.org/) is a dataset commonly used to benchmark computer vision models. Its training set contains `1,281,167` images.

The dataset needs to be converted into an optimized format for cloud streaming.

Faster is better.

# Real World Examples

We have built free, end-to-end [Studios](https://lightning.ai) showing all the steps to prepare the following datasets:


[Lightning Studios](https://lightning.ai) are fully reproducible cloud IDEs with data, code, dependencies, and more.

# Getting Started

## Installation

Lightning Data can be installed with `pip`:

```bash
pip install -U litdata
```

## Quick Start

### 1. Prepare Your Data

Convert your dataset into an optimized streaming format with `optimize`. A minimal sketch (the sample-generating function and its parameters are illustrative):

```python
import numpy as np
from PIL import Image
from litdata import optimize

def random_images(index):
    # Generate one sample per input index; any picklable data works.
    return {
        "index": index,
        "image": Image.fromarray(np.random.randint(0, 256, (32, 32, 3), np.uint8)),
        "class": np.random.randint(10),
    }

if __name__ == "__main__":
    optimize(
        fn=random_images,          # the function applied to each input
        inputs=list(range(1000)),  # 1,000 illustrative inputs
        output_dir="my_dataset",   # where the optimized chunks are written
        chunk_bytes="64MB",        # maximum number of bytes per chunk
    )
```

# Easily scale data processing

To scale data processing, create a free account on the [lightning.ai](https://lightning.ai/) platform. With the platform, `optimize` and `map` can start multiple machines to make data processing drastically faster, as follows:
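
A minimal sketch of the `map` call (the paths are illustrative; it assumes `map` passes each input plus the output directory to the function, and the multi-machine fan-out is configured when launching the job on the platform):

```python
import os
from PIL import Image
from litdata import map  # note: shadows the Python builtin

def resize_image(image_path, output_dir):
    # Resize one input image and write it to the output directory.
    output_image_path = os.path.join(output_dir, os.path.basename(image_path))
    Image.open(image_path).resize((224, 224)).save(output_image_path)

if __name__ == "__main__":
    map(
        fn=resize_image,
        inputs=["data/img-0.jpeg", "data/img-1.jpeg"],  # illustrative inputs
        output_dir="my_resized_images",
    )
```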

The Data Prep Job UI from the LAION 400M Studio.

# Key Features

## Multi-GPU / Multi-Node

The `StreamingDataset` and `StreamingDataLoader` take care of everything for you. They automatically make sure each rank receives a different batch of data, so there is nothing for you to do if you use them, as the sketch below shows.
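
A minimal sketch of a distributed training script (the dataset path is illustrative; the same code runs unchanged on one or many GPUs):

```python
# Launch with, e.g.: torchrun --nproc_per_node=8 train.py
from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset("s3://my-bucket/my-data")  # illustrative path
dataloader = StreamingDataLoader(dataset, batch_size=64)

for batch in dataloader:
    ...  # each rank automatically receives different batches
```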

## Easy data mixing

You can easily experiment with dataset mixtures using the `CombinedStreamingDataset`.

A minimal sketch (the bucket paths and weights are illustrative):

```python
import os
from tqdm import tqdm
from litdata import StreamingDataset, CombinedStreamingDataset, StreamingDataLoader

# Two optimized datasets streamed from cloud storage (illustrative paths)
dataset_1 = StreamingDataset("s3://my-bucket/dataset-1")
dataset_2 = StreamingDataset("s3://my-bucket/dataset-2")

# Sample roughly 70% of items from dataset_1 and 30% from dataset_2
combined_dataset = CombinedStreamingDataset(
    datasets=[dataset_1, dataset_2],
    weights=[0.7, 0.3],  # assumed: the weights argument controls the mixture
)

train_dataloader = StreamingDataLoader(combined_dataset, batch_size=64, num_workers=os.cpu_count())

for batch in tqdm(train_dataloader):
    pass
```

## Stateful StreamingDataLoader

Lightning Data provides a stateful `StreamingDataLoader`. This simplifies resuming training over large datasets.

A minimal sketch (the dataset path is illustrative):

```python
import os
import torch
from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset("s3://my-bucket/my-data", shuffle=True)  # illustrative path
dataloader = StreamingDataLoader(dataset, num_workers=os.cpu_count(), batch_size=64)

# Restore the dataloader state if a checkpoint exists
if os.path.isfile("dataloader_state.pt"):
    state_dict = torch.load("dataloader_state.pt")
    dataloader.load_state_dict(state_dict)

# Iterate over the data
for batch_idx, batch in enumerate(dataloader):
    # Store the state every 1000 batches
    if batch_idx % 1000 == 0:
        torch.save(dataloader.state_dict(), "dataloader_state.pt")
```

## Profiling

The `StreamingDataLoader` supports profiling your data loading. Simply use the `profile_batches` argument as follows:

```python
StreamingDataLoader(..., profile_batches=5)
```

This generates a Chrome trace called `result.json`. You can visualize it by opening the `chrome://tracing` URL in the Chrome browser and loading the trace there.

## Random access

Access the data you need when you need it.

```python
from litdata import StreamingDataset

dataset = StreamingDataset("s3://my-bucket/my-data")  # illustrative path

print(len(dataset))  # display the length of your data
print(dataset[42])   # show the 42nd element of the dataset
```

## Use data transforms

A minimal sketch, assuming the stored samples are PIL images (the transform, path, and subclass pattern are illustrative):

```python
from torchvision import transforms
from litdata import StreamingDataset, StreamingDataLoader

class TransformedDataset(StreamingDataset):
    """Applies a torchvision transform to every streamed sample (a sketch)."""

    def __init__(self, input_dir):
        super().__init__(input_dir)
        self.transform = transforms.Compose([
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
        ])

    def __getitem__(self, index):
        image = super().__getitem__(index)  # assumes a PIL image was stored
        return self.transform(image)

dataset = TransformedDataset("s3://my-bucket/my-images")  # illustrative path
dataloader = StreamingDataLoader(dataset, batch_size=4)

for batch in dataloader:
    print(batch.shape)
    # Out: (4, 3, 224, 224)
```

## Disk usage limits

Limit the size of the cache holding the chunks.

```python
from litdata import StreamingDataset

dataset = StreamingDataset(..., max_cache_size="10GB")
```

## Support yield

When processing large files, such as compressed [parquet files](https://en.wikipedia.org/wiki/Apache_Parquet), you can use Python's `yield` to process and store one item at a time, as sketched below.
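
A minimal sketch, assuming `optimize` accepts a generator function (the parquet paths and reader parameters are illustrative):

```python
import pyarrow.parquet as pq
from litdata import optimize

def process_parquet(filepath):
    # Read the file in small batches and yield one row at a time,
    # so the whole file never has to fit in memory.
    parquet_file = pq.ParquetFile(filepath)
    for record_batch in parquet_file.iter_batches(batch_size=1024):
        for row in record_batch.to_pylist():
            yield row

if __name__ == "__main__":
    optimize(
        fn=process_parquet,
        inputs=["data/part-0.parquet", "data/part-1.parquet"],  # illustrative paths
        output_dir="my_optimized_dataset",
        chunk_bytes="64MB",
    )
```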
