diff --git a/README.md b/README.md
index 0c263b8e..96b2b5fc 100644
--- a/README.md
+++ b/README.md
@@ -21,7 +21,7 @@ Transform Optimize
 ![PyPI](https://img.shields.io/pypi/v/litdata)
 ![Downloads](https://img.shields.io/pypi/dm/litdata)
 ![License](https://img.shields.io/github/license/Lightning-AI/litdata)
-[![Discord](https://img.shields.io/discord/822497400078196796?label=Get%20Help%20on%20Discord)](https://discord.gg/VptPCZkGNa)
+[![Discord](https://img.shields.io/discord/1077906959069626439?label=Get%20Help%20on%20Discord)](https://discord.gg/VptPCZkGNa)

 Lightning AI •
@@ -45,20 +45,22 @@ Transform Optimize
 
 # Transform data at scale. Optimize for fast model training.
-LitData enables two key data workflows [transform datasets](#transform-datasets) and [optimize to speed up AI model training](#speed-up-model-training):
+LitData helps scale and speed up two key data workflows:
+
+[Transform datasets](#transform-datasets) - Parallelize (map) transforms across 1000s of machines.
+[Optimize datasets](#speed-up-model-training) - Accelerate AI model training by 20x.
+
+<br/>

+✅ Speed up training:         Train models up to 20x faster with optimized datasets.
+✅ Stream cloud datasets:     Work with huge datasets directly from cloud storage without downloading.    
+✅ PyTorch-first:             Works with PyTorch and libraries like PyTorch Lightning, Lightning Fabric, and Hugging Face.
+✅ Easy collaboration:        Share and access datasets in the cloud, streamlining team projects.     
+✅ Scale across GPUs:         Streamed data automatically scales to all GPUs.      
+✅ Flexible storage:          Use S3, GCS, Azure, or your own cloud account for data storage.    
+✅ Run local or cloud:        Auto-scale to 1000s of cloud GPUs with Lightning Studios.     
+✅ Enterprise security:       Self-host or process data on your own cloud account with Lightning Studios.
+
-[Transform](#transform-datasets) - datasets across 1000s of machines.
-[Optimize](#speed-up-model-training) - datasets for fast loading to speed up AI training by 20x.
-
-✅ **Blazing fast training** - Speed up model training by 20x with optimized datasets.
-✅ **Stream from the cloud** - Work with huge datasets directly from cloud storage without downloading.
-✅ **Pytorch-first** - Works with PyTorch Lightning, Lightning Fabric, and PyTorch.
-✅ **Easy collaboration** - Works with PyTorch Lightning, Lightning Fabric, and PyTorch.
-✅ **Scale across GPUs** - Share and access datasets in the cloud, streamlining team projects.
-✅ **Flexible storage options** - Use S3, GCS, Azure, or your own cloud account for data storage.
-✅ **Run local or cloud-** Auto-scale to 1000s of cloud GPUs with Lightning Studios.
-✅ **Own VPC or cloud account-** Self host or process data on your cloud account with Lightning Studios.
-
 
 # Quick start
@@ -207,7 +209,7 @@ for batch in dataloader:
- ✅ Scale across multiple GPUs or machines
+ ✅ Streams on multi-GPU, multi-node
 
@@ -222,7 +224,7 @@ Here you can see an illustration showing how the Streaming Dataset works with mu
- ✅ Pause & Resume data streaming
+ ✅ Pause, resume data streaming
 
 Stream data during long training, if interrupted, pick up right where you left off without any issues.
@@ -296,7 +298,7 @@ for batch in tqdm(train_dataloader):
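The pause-and-resume feature above hinges on the loader exposing its current position as serializable state. A minimal pure-Python sketch of that idea (illustrative only; litdata's `StreamingDataLoader` implements this internally, and the class name here is hypothetical):

```python
class ResumableLoader:
    """Toy loader that can export and restore its position (conceptual sketch)."""

    def __init__(self, samples, start=0):
        self.samples = samples
        self.position = start  # index of the next sample to yield

    def __iter__(self):
        while self.position < len(self.samples):
            sample = self.samples[self.position]
            self.position += 1
            yield sample

    def state_dict(self):
        # Enough information to resume: where we stopped.
        return {"position": self.position}

    @classmethod
    def load_state_dict(cls, samples, state):
        return cls(samples, start=state["position"])


loader = ResumableLoader(list(range(10)))
seen = []
for sample in loader:
    seen.append(sample)
    if sample == 4:          # training interrupted here
        state = loader.state_dict()
        break

resumed = ResumableLoader.load_state_dict(list(range(10)), state)
seen += list(resumed)        # picks up at sample 5: no repeats, no gaps
```

Because the state is a plain dict, it can be checkpointed alongside model weights and restored after an interruption.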
- ✅ Split datasets for training, validation, and testing
+ ✅ Split datasets for train, val, test
 
@@ -325,10 +327,11 @@ print(test_dataset)
- ✅ Work with smaller subsets of a dataset
+ ✅ Load a subset of the remote dataset
 
-Work on a smaller, manageable portion of your data to save time and resources.
+Work on a smaller, manageable portion of your data to save time and resources.
+
 ```python
 from litdata import StreamingDataset, train_test_split
@@ -342,7 +345,7 @@ print(len(dataset)) # display the length of your data
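The splitting in the hunk above boils down to carving one sequence into contiguous fractional parts. A stdlib-only sketch of that logic (litdata's `train_test_split` operates on `StreamingDataset` objects, not plain lists; this helper is hypothetical):

```python
def split_by_fractions(items, fractions):
    """Split a sequence into contiguous parts by fraction (sketch of the
    idea behind a train/val/test split)."""
    assert abs(sum(fractions) - 1.0) < 1e-9, "fractions must sum to 1"
    splits, start = [], 0
    for frac in fractions[:-1]:
        end = start + int(len(items) * frac)
        splits.append(items[start:end])
        start = end
    splits.append(items[start:])  # last split absorbs rounding leftovers
    return splits


train, val, test = split_by_fractions(list(range(100)), [0.8, 0.1, 0.1])
# the three parts are disjoint and together cover all 100 items
```

Keeping the parts contiguous means a streaming reader only has to touch the chunks that back its own slice.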
- ✅ Add or replace data in an optimized dataset
+ ✅ Easily modify optimized cloud datasets
 
 Add new data to an existing dataset or start fresh if needed, providing flexibility in data management.
@@ -383,7 +386,7 @@ The `overwrite` mode will delete the existing data and start from fresh.
- ✅ Access dataset parts without downloading everything
+ ✅ Access samples without full data download
 
 Look at specific parts of a large dataset without downloading the whole thing or loading it on a local machine.
@@ -429,7 +432,7 @@ for batch in dataloader:
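Random access without a full download works because an optimized dataset keeps an index of how many samples live in each chunk, so a reader can fetch only the one chunk that holds the requested sample. A sketch of that lookup with a hypothetical helper (not litdata's API):

```python
import bisect

def locate_sample(chunk_sizes, index):
    """Map a global sample index to (chunk_number, offset_in_chunk)."""
    # cumulative[i] = number of samples in chunks 0..i
    cumulative, total = [], 0
    for size in chunk_sizes:
        total += size
        cumulative.append(total)
    if index < 0 or index >= total:
        raise IndexError(index)
    chunk = bisect.bisect_right(cumulative, index)
    offset = index - (cumulative[chunk - 1] if chunk else 0)
    return chunk, offset


# Three chunks holding 100, 100, and 50 samples:
locate_sample([100, 100, 50], 150)  # sample 150 lives in chunk 1, offset 50
```

Only that one chunk then needs to be downloaded, which is what makes `dataset[index]`-style access cheap on a remote dataset.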
- ✅ Measure and optimize data loading speed
+ ✅ Profile data loading speed
 
 Measure and optimize how fast your data is being loaded, improving efficiency.
@@ -485,7 +488,7 @@ outputs = optimize(
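Profiling data loading is, at its simplest, timing one full pass over an iterable and reporting samples per second. A generic stdlib-only profiler in that spirit (litdata ships its own tooling; this helper is just an assumption-free baseline you can point at any dataloader):

```python
import time

def measure_throughput(iterable):
    """Time one iteration pass and return samples per second."""
    start = time.perf_counter()
    count = 0
    for _ in iterable:
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")


rate = measure_throughput(range(1000))  # any iterable, e.g. a DataLoader
```

Comparing the rate before and after a change (chunk size, number of workers) shows whether the change actually helped.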
- ✅ Reduce disk space with caching limits
+ ✅ Limit local cache space
 
 Limit the amount of disk space used by temporary files, preventing storage issues.
@@ -501,7 +504,7 @@ dataset = StreamingDataset(..., max_cache_size="10GB")
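A `max_cache_size` limit amounts to a byte-budgeted cache that evicts the oldest entries once the budget is exceeded. A conceptual sketch of that policy (litdata's real cache manages chunk files on disk, not dict entries, and this class is hypothetical):

```python
from collections import OrderedDict

class BoundedCache:
    """Byte-budgeted cache that evicts least-recently-added entries."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.entries = OrderedDict()  # name -> size in bytes

    def add(self, name, size):
        self.entries[name] = size
        self.used += size
        # Evict oldest entries until we fit the budget again.
        while self.used > self.max_bytes:
            evicted, evicted_size = self.entries.popitem(last=False)
            self.used -= evicted_size


cache = BoundedCache(max_bytes=250)
for chunk in ["chunk-0", "chunk-1", "chunk-2"]:
    cache.add(chunk, 100)  # 300 bytes requested, only 250 allowed
# chunk-0 was evicted; chunk-1 and chunk-2 remain (200 bytes used)
```

Because eviction happens as chunks arrive, disk usage stays bounded no matter how large the remote dataset is.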
- ✅ Optimize data loading on networked drives
+ ✅ Optimize loading on networked drives
 
 Optimize data handling for computers on a local network to improve performance for on-site setups.
@@ -521,7 +524,7 @@ dataset = StreamingDataset(input_dir="local:/data/shared-drive/some-data")
 
 ## Features for transforming datasets
- ✅ Map transformations
+ ✅ Parallelize data transformations (map)
 
 Apply the same change to different parts of the dataset at once to save time and effort.
@@ -712,5 +715,5 @@ Below are templates for real-world applications of LitData at scale.
 
 # Community
 LitData is a community project accepting contributions - Let's make the world's most advanced AI data processing framework.
-💬 [Get help from 5,0000+ developers on our Discord](https://discord.com/invite/XncpTy7DSt)
-📋 [Licensed under the Apache 2.0 License](https://github.com/Lightning-AI/litdata/blob/main/LICENSE)
+💬 [Get help on Discord](https://discord.com/invite/XncpTy7DSt)
+📋 [License: Apache 2.0](https://github.com/Lightning-AI/litdata/blob/main/LICENSE)