
Commit

Merge branch 'main' into feat/adds-mosaic-mds-support
bhimrazy authored Jul 6, 2024
2 parents 2dedd0b + bb3d46a commit 281329f
Showing 1 changed file with 26 additions and 11 deletions.
README.md (37 changes: 26 additions & 11 deletions)
@@ -11,9 +11,10 @@ Optimize data for fast AI model training.**
<pre>
Transform                        Optimize

-✅ Parallelize data processing   ✅ Stream large cloud datasets
-✅ Create vector embeddings      ✅ Accelerate training by 20x
-✅ Transform any data type       ✅ Pause and resume data streaming
+✅ Parallelize data processing   ✅ Stream large cloud datasets
+✅ Create vector embeddings      ✅ Accelerate training by 20x
+✅ Run distributed inference     ✅ Pause and resume data streaming
+✅ Scrape websites at scale      ✅ Use remote data without local loading
</pre>

---
@@ -45,19 +46,32 @@ Transform                        Optimize
&nbsp;

# Transform data at scale. Optimize for fast model training.
-LitData helps scale and speed up two key data workflows:
+LitData scales data processing tasks (data scraping, image resizing, distributed inference, embedding creation) on local or cloud machines. It also optimizes datasets to accelerate AI model training and lets you work with large remote datasets without loading them locally.

-[Transform datasets](#transform-datasets) - Parallelize (map) transforms across 1000s of machines.
-[Optimize datasets](#speed-up-model-training) &nbsp; - Accelerate AI model training by 20x.
+### Transform datasets
+Accelerate data processing tasks (data scraping, image resizing, embedding creation) by parallelizing (mapping) the work across many machines at once.

-<pre style="background-color: transparent !important;">
+<pre>
+✅ Parallelize processing: Reduce processing time by transforming data across multiple machines simultaneously.
+✅ Scale to large data: Increase the size of datasets you can efficiently handle.
+✅ Flexible use cases: Resize images, create embeddings, scrape the internet, etc.
+✅ Run local or cloud: Run on your own machines or auto-scale to 1000s of cloud GPUs with Lightning Studios.
+✅ Enterprise security: Self-host or process data on your cloud account with Lightning Studios.
+</pre>
+
+&nbsp;
+
+### Optimize datasets
+Accelerate model training (20x faster) by optimizing datasets for streaming directly from cloud storage. Work with remote data, without downloading it locally, using features like loading data subsets, accessing individual samples, and resumable streaming (see the sketch after this list).
+
+<pre>
✅ Speed up training: Train models 20x faster with optimized datasets.
-✅ Stream cloud datasets: Work with huge datasets directly from cloud storage without downloading.
+✅ Stream cloud datasets: Work with cloud data without downloading it.
✅ PyTorch-first: Works with PyTorch libraries like PyTorch Lightning, Lightning Fabric, and Hugging Face.
✅ Easy collaboration: Share and access datasets in the cloud, streamlining team projects.
✅ Scale across GPUs: Streamed data automatically scales to all GPUs.
✅ Flexible storage: Use S3, GCS, Azure, or your own cloud account for data storage.
-✅ Run local or cloud: Auto-scale to 1000s of cloud GPUs with Lightning Studios.
+✅ Run local or cloud: Run on your own machines or auto-scale to 1000s of cloud GPUs with Lightning Studios.
✅ Enterprise security: Self-host or process data on your cloud account with Lightning Studios.
</pre>

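As a rough orientation for the streaming side described above, here is a minimal sketch, assuming litdata's `StreamingDataset` and `StreamingDataLoader` APIs and a placeholder S3 path (the README's full snippets appear further down in the file):

```python
from litdata import StreamingDataset, StreamingDataLoader

# Point at an already-optimized dataset in cloud storage (placeholder path);
# chunks are fetched on demand instead of being downloaded up front.
dataset = StreamingDataset("s3://my-bucket/my-optimized-dataset", shuffle=True)

# Behaves like a regular PyTorch DataLoader, and its state can be saved and
# restored to pause and resume streaming mid-epoch.
dataloader = StreamingDataLoader(dataset, batch_size=64, num_workers=4)

for batch in dataloader:
    ...  # run the training step on each streamed batch
```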
@@ -92,7 +106,8 @@ pip install 'litdata[extras]'
----

# Speed up model training
-Significantly speed up model training by optimizing datasets for fast loading (20x faster) and streaming from cloud storage.
+Accelerate model training (20x faster) by optimizing datasets for streaming directly from cloud storage. Work with remote data, without downloading it locally, using features like loading data subsets, accessing individual samples, and resumable streaming.
+

**Step 1: Optimize the data**
This step formats the dataset for fast loading (binary, chunked, etc.).
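A minimal sketch of this step, assuming `litdata.optimize` with a user-defined sample function and a placeholder output directory (random images are used purely as an illustration):

```python
import numpy as np
from PIL import Image
from litdata import optimize

def random_images(index):
    # Each call produces one sample; optimize() serializes samples into chunked binary files.
    return {
        "index": index,
        "image": Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)),
        "class": np.random.randint(10),
    }

if __name__ == "__main__":
    optimize(
        fn=random_images,            # function applied to each input
        inputs=list(range(1000)),    # 1,000 inputs, processed in parallel
        output_dir="fast_data",      # placeholder; a local dir or cloud path
        chunk_bytes="64MB",          # target size of each chunk file
    )
```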
@@ -151,7 +166,7 @@ for sample in dataloader:
----

# Transform datasets
-Use LitData to apply transforms to large datasets across multiple machines in parallel. Common usecases are to create vector embeddings, run distributed inference and more.
+Accelerate data processing tasks (data scraping, image resizing, embedding creation, distributed inference) by parallelizing (mapping) the work across many machines at once.

Here's an example that resizes and crops a large image dataset:

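The snippet itself is collapsed in this view; as a minimal sketch of such a job, assuming `litdata.map` passes each input together with the `output_dir` to the worker function (directory names are placeholders):

```python
import os
from PIL import Image
from litdata import map as litdata_map  # aliased to avoid shadowing Python's built-in map

input_dir = "my_large_images"  # placeholder local folder (an s3:// path also works)
inputs = [os.path.join(input_dir, f) for f in os.listdir(input_dir)]

def resize_and_crop(image_path, output_dir):
    # Resize to 256x256, center-crop to 224x224, and write the result to the output directory.
    img = Image.open(image_path).resize((256, 256)).crop((16, 16, 240, 240))
    img.save(os.path.join(output_dir, os.path.basename(image_path)))

litdata_map(
    fn=resize_and_crop,
    inputs=inputs,
    output_dir="my_resized_images",  # placeholder; can also be a cloud bucket path
)
```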
