diff --git a/README.md b/README.md
index 0c263b8e..96b2b5fc 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ Transform Optimize
![PyPI](https://img.shields.io/pypi/v/litdata)
![Downloads](https://img.shields.io/pypi/dm/litdata)
![License](https://img.shields.io/github/license/Lightning-AI/litdata)
-[![Discord](https://img.shields.io/discord/822497400078196796?label=Get%20Help%20on%20Discord)](https://discord.gg/VptPCZkGNa)
+[![Discord](https://img.shields.io/discord/1077906959069626439?label=Get%20Help%20on%20Discord)](https://discord.gg/VptPCZkGNa)
Lightning AI •
@@ -45,20 +45,22 @@ Transform Optimize
# Transform data at scale. Optimize for fast model training.
-LitData enables two key data workflows [transform datasets](#transform-datasets) and [optimize to speed up AI model training](#speed-up-model-training):
+LitData helps scale and speed up two key data workflows:
+
+[Transform datasets](#transform-datasets) - Parallelize (map) transforms across 1000s of machines.
+[Optimize datasets](#speed-up-model-training) - Accelerate AI model training by 20x.
+
+
+✅ Speed up training: Train models 20x faster with optimized datasets.
+✅ Stream cloud datasets: Work with huge datasets directly from cloud storage without downloading.
+✅ PyTorch-first: Works with PyTorch and libraries like PyTorch Lightning, Lightning Fabric, and Hugging Face.
+✅ Easy collaboration: Share and access datasets in the cloud, streamlining team projects.
+✅ Scale across GPUs: Streamed data automatically scales to all GPUs.
+✅ Flexible storage: Use S3, GCS, Azure, or your own cloud account for data storage.
+✅ Run local or cloud: Auto-scale to 1000s of cloud GPUs with Lightning Studios.
+✅ Enterprise security: Self host or process data on your cloud account with Lightning Studios.
+
-[Transform](#transform-datasets) - datasets across 1000s of machines.
-[Optimize](#speed-up-model-training) - datasets for fast loading to speed up AI training by 20x.
-
-✅ **Blazing fast training** - Speed up model training by 20x with optimized datasets.
-✅ **Stream from the cloud** - Work with huge datasets directly from cloud storage without downloading.
-✅ **Pytorch-first** - Works with PyTorch Lightning, Lightning Fabric, and PyTorch.
-✅ **Easy collaboration** - Works with PyTorch Lightning, Lightning Fabric, and PyTorch.
-✅ **Scale across GPUs** - Share and access datasets in the cloud, streamlining team projects.
-✅ **Flexible storage options** - Use S3, GCS, Azure, or your own cloud account for data storage.
-✅ **Run local or cloud-** Auto-scale to 1000s of cloud GPUs with Lightning Studios.
-✅ **Own VPC or cloud account-** Self host or process data on your cloud account with Lightning Studios.
-
# Quick start
@@ -207,7 +209,7 @@ for batch in dataloader:
- ✅ Scale across multiple GPUs or machines
+ ✅ Streams on multi-GPU, multi-node
@@ -222,7 +224,7 @@ Here you can see an illustration showing how the Streaming Dataset works with mu
- ✅ Pause & Resume data streaming
+ ✅ Pause, resume data streaming
Stream data during long training; if interrupted, pick up right where you left off without any issues.
@@ -296,7 +298,7 @@ for batch in tqdm(train_dataloader):
- ✅ Split datasets for training, validation, and testing
+ ✅ Split datasets for train, val, test
@@ -325,10 +327,11 @@ print(test_dataset)
- ✅ Work with smaller subsets of a dataset
+ ✅ Load a subset of the remote dataset
-Work on a smaller, manageable portion of your data to save time and resources.
+Work on a smaller, manageable portion of your data to save time and resources.
+
```python
from litdata import StreamingDataset, train_test_split
@@ -342,7 +345,7 @@ print(len(dataset)) # display the length of your data
- ✅ Add or replace data in an optimized dataset
+ ✅ Easily modify optimized cloud datasets
Add new data to an existing dataset or start fresh if needed, providing flexibility in data management.
@@ -383,7 +386,7 @@ The `overwrite` mode will delete the existing data and start from fresh.
- ✅ Access dataset parts without downloading everything
+ ✅ Access samples without full data download
Look at specific parts of a large dataset without downloading the whole thing or loading it on a local machine.
@@ -429,7 +432,7 @@ for batch in dataloader:
- ✅ Measure and optimize data loading speed
+ ✅ Profile data loading speed
Measure and optimize how fast your data is being loaded, improving efficiency.
@@ -485,7 +488,7 @@ outputs = optimize(
- ✅ Reduce disk space with caching limits
+ ✅ Limit local cache space
Limit the amount of disk space used by temporary files, preventing storage issues.
@@ -501,7 +504,7 @@ dataset = StreamingDataset(..., max_cache_size="10GB")
- ✅ Optimize data loading on networked drives
+ ✅ Optimize loading on networked drives
Optimize data handling for computers on a local network to improve performance for on-site setups.
@@ -521,7 +524,7 @@ dataset = StreamingDataset(input_dir="local:/data/shared-drive/some-data")
## Features for transforming datasets
- ✅ Map transformations
+ ✅ Parallelize data transformations (map)
Apply the same change to different parts of the dataset at once to save time and effort.
@@ -712,5 +715,5 @@ Below are templates for real-world applications of LitData at scale.
# Community
LitData is a community project accepting contributions. Let's make the world's most advanced AI data processing framework.
-💬 [Get help from 5,0000+ developers on our Discord](https://discord.com/invite/XncpTy7DSt)
-📋 [Licensed under the Apache 2.0 License](https://github.com/Lightning-AI/litdata/blob/main/LICENSE)
+💬 [Get help on Discord](https://discord.com/invite/XncpTy7DSt)
+📋 [License: Apache 2.0](https://github.com/Lightning-AI/litdata/blob/main/LICENSE)