From 70997219487be39f90f39914719564f57149778e Mon Sep 17 00:00:00 2001
From: William Falcon
Date: Sat, 6 Jul 2024 08:18:50 -0400
Subject: [PATCH 1/5] Update README.md

---
 README.md | 49 ++++++++++++++++++++-----------------------------
 1 file changed, 20 insertions(+), 29 deletions(-)

diff --git a/README.md b/README.md
index ddbfc259..242be937 100644
--- a/README.md
+++ b/README.md
@@ -46,34 +46,7 @@ Transform Optimize
 &nbsp;
 
 # Transform data at scale. Optimize for fast model training.
 
-LitData scales data processing tasks (data scraping, image resizing, distributed inference, embedding creation) on local or cloud machines. It also enables optimizing datasets to accelerate AI model training and work with large remote datasets without local loading.
-
-### Transform datasets
-Accelerate data processing tasks (data scraping, image resizing, embedding creation) by parallelizing (map) the work across many machines at once.
-
-<pre>
-✅ Parallelize processing:    Reduce processing time by transforming data across multiple machines simultaneously.
-✅ Scale to large data:       Increase the size of datasets you can efficiently handle.    
-✅ Flexible use cases:        Resize images, create embeddings, scrape the internet, etc...
-✅ Run local or cloud:        Run on your own machines or auto-scale to 1000s of cloud GPUs with Lightning Studios.         
-✅ Enterprise security:       Self host or process data on your cloud account with Lightning Studios.  
-
-</pre>
-
-&nbsp;
-
-### Optimize datasets
-Accelerate model training (20x faster) by optimizing datasets for streaming directly from cloud storage. Work with remote data without local downloads with features like loading data subsets, accessing individual samples, and resumable streaming.
-
-<pre>
-✅ Speed up training:         Speed up model training by 20x with optimized datasets.   
-✅ Stream cloud datasets:     Work with cloud data without downloading it.    
-✅ PyTorch-first:             Works with PyTorch libraries like PyTorch Lightning, Lightning Fabric, Hugging Face.
-✅ Easy collaboration:        Share and access datasets in the cloud, streamlining team projects.     
-✅ Scale across GPUs:         Streamed data automatically scales to all GPUs.      
-✅ Flexible storage:          Use S3, GCS, Azure, or your own cloud account for data storage.    
-✅ Run local or cloud:        Run on your own machines or auto-scale to 1000s of cloud GPUs with Lightning Studios.         
-✅ Enterprise security:       Self host or process data on your cloud account with Lightning Studios.  
-
+LitData scales [data processing tasks](#transform-datasets) (data scraping, image resizing, distributed inference, embedding creation) on local or cloud machines. It also enables [optimizing datasets](#speed-up-model-training) to accelerate AI model training and work with large remote datasets without local loading.   @@ -108,7 +81,6 @@ pip install 'litdata[extras]' # Speed up model training Accelerate model training (20x faster) by optimizing datasets for streaming directly from cloud storage. Work with remote data without local downloads with features like loading data subsets, accessing individual samples, and resumable streaming. - **Step 1: Optimize the data** This step will format the dataset for fast loading (binary, chunked, etc...) @@ -161,6 +133,17 @@ for sample in dataloader: img, cls = sample['image'], sample['class'] ``` +**Key benefits:** + +✅ Speed up training: Speed up model training by 20x with optimized datasets. +✅ Stream cloud datasets: Work with cloud data without downloading it. +✅ Pytorch-first: Works with PyTorch libraries like PyTorch Lightning, Lightning Fabric, Hugging Face. +✅ Easy collaboration: Share and access datasets in the cloud, streamlining team projects. +✅ Scale across GPUs: Streamed data automatically scales to all GPUs. +✅ Flexible storage: Use S3, GCS, Azure, or your own cloud account for data storage. +✅ Run local or cloud: Run on your own machines or auto-scale to 1000s of cloud GPUs with Lightning Studios. +✅ Enterprise security: Self host or process data on your cloud account with Lightning Studios. +   ---- @@ -192,6 +175,14 @@ ld.map( ) ``` +**Key benefits:** + +✅ Paralellize processing: Reduce processing time by transforming data across multiple machines simultaneously. +✅ Scale to large data: Increase the size of datasets you can efficiently handle. +✅ Flexible usecases: Resize images, create embeddings, scrape the internet, etc... +✅ Run local or cloud: Run on your own machines or auto-scale to 1000s of cloud GPUs with Lightning Studios. +✅ Enterprise security: Self host or process data on your cloud account with Lightning Studios. +   ---- From b675184adebde6b8937c59ebe593fb6f5cca9486 Mon Sep 17 00:00:00 2001 From: William Falcon Date: Sat, 6 Jul 2024 08:25:16 -0400 Subject: [PATCH 2/5] Update README.md --- README.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/README.md b/README.md index 242be937..a07775a1 100644 --- a/README.md +++ b/README.md @@ -98,17 +98,16 @@ def random_images(index): "class": fake_labels } - # The data is serialized into bytes and stored into data chunks by the optimize operator. return data if __name__ == "__main__": - # optimize supports any data structures and types + # the optimize function formats data in an optimized format (chunked, binerized, etc...) ld.optimize( - fn=random_images, # The function applied over each input. - inputs=list(range(1000)), # Provide any inputs. The fn is applied on each item. - output_dir="my_optimized_dataset", # The directory where the optimized data are stored. - num_workers=4, # The number of workers. The inputs are distributed among them. - chunk_bytes="64MB" # The maximum number of bytes to write into a data chunk. 
+        fn=random_images, # the function applied to each input
+        inputs=list(range(1000)), # the inputs to the function (here it's a list of numbers)
+        output_dir="my_optimized_dataset", # optimized data is stored here
+        num_workers=4, # the number of workers on the same machine
+        chunk_bytes="64MB" # size of each chunk
     )
 ```

From 9135d45188390dfd18bcf3332147d02e31e2467b Mon Sep 17 00:00:00 2001
From: William Falcon
Date: Sat, 6 Jul 2024 08:26:58 -0400
Subject: [PATCH 3/5] Update README.md

---
 README.md | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index a07775a1..51c881f9 100644
--- a/README.md
+++ b/README.md
@@ -91,17 +91,15 @@ import litdata as ld
 
 def random_images(index):
     fake_images = Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))
-    fake_labels = np.random.randint(10)
-    data = {
-        "index": index,
-        "image": fake_images,
-        "class": fake_labels
-    }
+    fake_labels = np.random.randint(10)
+
+    # use any key:value pairs
+    data = {"index": index, "image": fake_images, "class": fake_labels}
     return data
 
 if __name__ == "__main__":
-    # the optimize function formats data in an optimized format (chunked, binarized, etc...)
+    # the optimize function outputs data in an optimized format (chunked, binarized, etc...)
     ld.optimize(

From 5152d26dec114bb2425131e1f5e56c677dab0274 Mon Sep 17 00:00:00 2001
From: William Falcon
Date: Sat, 6 Jul 2024 08:36:42 -0400
Subject: [PATCH 4/5] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 51c881f9..162c3295 100644
--- a/README.md
+++ b/README.md
@@ -132,7 +132,7 @@ for sample in dataloader:
 
 **Key benefits:**
 
-✅ Speed up training: Speed up model training by 20x with optimized datasets.
+✅ Accelerate training: Optimized datasets load 20x faster.
 ✅ Stream cloud datasets: Work with cloud data without downloading it.
 ✅ PyTorch-first: Works with PyTorch libraries like PyTorch Lightning, Lightning Fabric, Hugging Face.
 ✅ Easy collaboration: Share and access datasets in the cloud, streamlining team projects.

From 70dc4a4b3c378bf0ee5740dab5a175c6ddb29b90 Mon Sep 17 00:00:00 2001
From: thomas chaton
Date: Sun, 7 Jul 2024 09:35:53 +0100
Subject: [PATCH 5/5] Update README.md (#211)

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 162c3295..0b6e6575 100644
--- a/README.md
+++ b/README.md
@@ -645,9 +645,9 @@ Time to optimize 1.2 million ImageNet images (Faster is better):
 
 ## Parallelize data transforms
 
-Transformations with LitServe are linearly parallelizable across machines.
+Transformations with LitData are linearly parallelizable across machines.
 
-For example, let's say that it takes 56 hours to embed a dataset on a single A10G machine. With LitServe,
+For example, let's say that it takes 56 hours to embed a dataset on a single A10G machine. With LitData,
 this can be sped up by adding more machines in parallel
 
 | Number of machines | Hours |
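
Taken together, the optimize-and-stream flow that patches 1-3 document looks like the sketch below — a minimal, hedged example, not the verbatim README code. It assumes `litdata` exposes `StreamingDataset` and `StreamingDataLoader` (the classes behind the `dataloader` loop visible in the hunk context), that a local directory is accepted as the input path the same way a cloud URI is, and that `numpy` and `PIL` are installed.

```python
import numpy as np
from PIL import Image
import litdata as ld


def random_images(index):
    # generate one synthetic sample; any key:value pairs are supported
    fake_image = Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))
    fake_label = np.random.randint(10)
    return {"index": index, "image": fake_image, "class": fake_label}


if __name__ == "__main__":
    # Step 1: write the samples into chunked, binary form (as in patch 2/5)
    ld.optimize(
        fn=random_images,
        inputs=list(range(1000)),
        output_dir="my_optimized_dataset",
        num_workers=4,
        chunk_bytes="64MB",
    )

    # Step 2: stream the optimized dataset back; an s3:// or gs:// URI should
    # work here too (assumption: local paths are accepted like remote ones)
    dataset = ld.StreamingDataset("my_optimized_dataset")
    dataloader = ld.StreamingDataLoader(dataset)
    for sample in dataloader:
        img, cls = sample["image"], sample["class"]
```

The design point the patches emphasize is that `optimize` pays the formatting cost once, so training jobs can then stream the chunked data repeatedly without re-downloading or re-decoding the raw dataset.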
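Likewise, the `ld.map(` call whose closing lines appear in the patch-1 hunks can be fleshed out as below. This is a sketch under stated assumptions: the `resize_image` function, the input list, and the `num_workers` argument are illustrative placeholders (only `ld.map(` and `)` survive in the diff), and it assumes `map` forwards `output_dir` to the worker function.

```python
import os
from PIL import Image
import litdata as ld


def resize_image(image_path, output_dir):
    # called once per input; the inputs are distributed among the workers
    os.makedirs(output_dir, exist_ok=True)
    output_image_path = os.path.join(output_dir, os.path.basename(image_path))
    Image.open(image_path).resize((224, 224)).save(output_image_path)


if __name__ == "__main__":
    # hypothetical input list; any list of items works
    inputs = [f"images/{i}.jpeg" for i in range(1000)]
    ld.map(
        fn=resize_image,            # the function applied to each input
        inputs=inputs,              # distributed among the workers
        output_dir="resized_images",  # where each worker writes its results
        num_workers=4,
    )
```

Because each input is processed independently, the work is embarrassingly parallel — which is the arithmetic behind the table in patch 5/5: under the linear scaling it claims, the 56-hour single-machine embedding job drops to roughly 28 hours on 2 machines and roughly 7 hours on 8.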