Commit 51173c5: Update README.md
williamFalcon authored Jul 5, 2024 (1 parent: 25e2879)
Showing 1 changed file (README.md) with 24 additions and 6 deletions.
<details>
<summary> ✅ Stream large cloud datasets</summary>
&nbsp;

Use data stored on the cloud without downloading it all to your computer, saving time and space. Streaming transfers data from remote storage to the training machine as it is needed, and with optimized formatting like LitData's chunking, the transfer can even be faster than local disk access.

Imagine you're working on a project with a huge amount of data stored online. Instead of waiting hours to download it all, you can start working with the data almost immediately by streaming it.

Once you've optimized the dataset with LitData, stream it as follows:
```python
from litdata import StreamingDataset, StreamingDataLoader

# Point the dataset at your optimized data (illustrative bucket path)
dataset = StreamingDataset("s3://my-bucket/my-data", shuffle=True, drop_last=True)
dataloader = StreamingDataLoader(dataset, batch_size=64)

for batch in dataloader:
    ...  # your training step
```

&nbsp;

Data optimized and loaded with Lightning automatically streams efficiently during distributed training, across multiple GPUs or multiple nodes.

The `StreamingDataset` and `StreamingDataLoader` automatically make sure each rank receives the same quantity of varied batches of data, so it works out of the box with your favorite frameworks ([PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/), [Lightning Fabric](https://lightning.ai/docs/fabric/stable/), or [PyTorch](https://pytorch.org/docs/stable/index.html)) to do distributed training.

Here is an illustration showing how the streaming dataset works with multiple nodes and multiple GPUs under the hood.
</details>

<details>
<summary> ✅ Pause & Resume data streaming</summary>
&nbsp;

Stream data during long training runs; if a run is interrupted, pick up right where you left off.

LitData provides a stateful `StreamingDataLoader`, i.e. you can pause and resume your training whenever you want.

Info: The `StreamingDataLoader` was used by [Lit-GPT](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py) to pretrain LLMs. Restarting from an older checkpoint was critical to completing the full pretraining run because of several failures (network, CUDA errors, etc.).
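
A minimal sketch of how this can look (the dataset path, checkpoint filename, and save frequency are illustrative):

```python
import os
import torch
from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset("s3://my-bucket/my-data", shuffle=True)  # illustrative path
dataloader = StreamingDataLoader(dataset, batch_size=64)

# Restore the dataloader state if a previous run saved one
if os.path.isfile("dataloader_state.pt"):
    dataloader.load_state_dict(torch.load("dataloader_state.pt"))

for batch_idx, batch in enumerate(dataloader):
    # Save the dataloader state periodically so an interrupted run can resume mid-epoch
    if batch_idx % 1000 == 0:
        torch.save(dataloader.state_dict(), "dataloader_state.pt")
```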
</details>

<details>
<summary> ✅ Combine datasets</summary>
&nbsp;

Mix and match different sets of data to experiment and create better models.

Combine datasets with `CombinedStreamingDataset`. As an example, this mixture of [Slimpajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) & [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) was used in the [TinyLlama](https://github.com/jzhang38/TinyLlama) project to pretrain a 1.1B Llama model on 3 trillion tokens.

```python
from litdata import StreamingDataset, CombinedStreamingDataset, StreamingDataLoader, TokensLoader

# Illustrative bucket paths; TokensLoader suits pre-tokenized LLM data
train_datasets = [
    StreamingDataset("s3://my-bucket/slimpajama/train", item_loader=TokensLoader(block_size=2049), shuffle=True, drop_last=True),
    StreamingDataset("s3://my-bucket/starcoder", item_loader=TokensLoader(block_size=2049), shuffle=True, drop_last=True),
]

# Sample from each dataset with the given probabilities (illustrative weights)
combined_dataset = CombinedStreamingDataset(datasets=train_datasets, seed=42, weights=(0.7, 0.3))
dataloader = StreamingDataLoader(combined_dataset, batch_size=8)

for batch in dataloader:
    ...  # your training step
```
</details>

<details>
<summary> ✅ Load a dataset subsample</summary>
&nbsp;

Work on a smaller, manageable portion of your data to save time and resources.

```python
from litdata import StreamingDataset

# Load only a fraction of the optimized dataset (illustrative path and fraction)
dataset = StreamingDataset("s3://my-bucket/my-data", subsample=0.1)

print(len(dataset))  # display the length of your data
```

</details>

<details>
<summary> ✅ Append or overwrite optimized datasets</summary>
&nbsp;

Add new data to an existing dataset or start fresh if needed, providing flexibility in data management.

LitData optimized datasets are assumed to be immutable. However, you can modify them by setting the mode to either `append` or `overwrite`.

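A minimal sketch of appending to an existing optimized dataset (the function, inputs, and output path are illustrative):

```python
from litdata import optimize

def fn(index):
    return index, index ** 2  # illustrative sample

if __name__ == "__main__":
    # Add new samples to an already optimized dataset
    optimize(
        fn=fn,
        inputs=list(range(100, 200)),
        output_dir="my_optimized_dataset",
        chunk_bytes="64MB",
        mode="append",  # use "overwrite" to replace the existing data instead
    )
```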

The `overwrite` mode deletes the existing data and starts fresh.

</details>

<details>
<summary> ✅ Access subsets of large cloud datasets</summary>
&nbsp;

Access specific parts of a large dataset without downloading the whole thing or loading it all on the local machine.

```python
from litdata import StreamingDataset

# Connect to the dataset without downloading it (illustrative path)
dataset = StreamingDataset("s3://my-bucket/my-data")

print(dataset[42])  # show the 42nd element of the dataset
```

</details>

<details>
<summary> ✅ Use any data transforms</summary>
&nbsp;

Customize how your data is processed to better fit your needs.

Subclass the `StreamingDataset` and override its `__getitem__` method to add any extra data transformations.

</details>

<details>
<summary> ✅ Profile loading speed</summary>
&nbsp;

Measure and optimize how fast your data is being loaded, improving efficiency.

The `StreamingDataLoader` supports profiling of your data loading process. Simply use the `profile_batches` argument to specify the number of batches you want to profile:

```python
from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset("s3://my-bucket/my-data")  # illustrative path
dataloader = StreamingDataLoader(dataset, batch_size=64, profile_batches=5)

for batch in dataloader:
    pass
```

This generates a Chrome trace called `result.json`. Then, visualize this trace by opening the `chrome://tracing` URL in the Chrome browser and loading the trace file.

</details>

<details>
<summary> ✅ Reduce memory footprint</summary>
&nbsp;

Handle large data files efficiently without using too much of your computer's memory.

When processing large files like compressed [parquet files](https://en.wikipedia.org/wiki/Apache_Parquet), use the Python `yield` keyword to process and store one item at a time, reducing the memory footprint of the entire program.

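A minimal sketch of this pattern, assuming parquet files with a `text` column (file layout, column name, and paths are illustrative):

```python
import pyarrow.parquet as pq
from litdata import optimize

def process_file(filepath):
    parquet_file = pq.ParquetFile(filepath)
    # Read the file in small batches and yield one item at a time,
    # so the whole file is never held in memory
    for batch in parquet_file.iter_batches(batch_size=8192, columns=["text"]):
        for text in batch.to_pandas()["text"]:
            yield text

if __name__ == "__main__":
    optimize(
        fn=process_file,
        inputs=["data/part-0.parquet", "data/part-1.parquet"],  # illustrative inputs
        output_dir="my_optimized_dataset",
        chunk_bytes="64MB",
    )
```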
</details>

<details>
<summary> ✅ Reduce disk space with caching limits</summary>
&nbsp;

Limit the amount of disk space used by temporary files, preventing storage issues.

Adapt the local caching limit of the `StreamingDataset`. This is useful to make sure downloaded data chunks are deleted once used, so disk usage stays low.

```python
from litdata import StreamingDataset

# Keep at most 10 GB of downloaded chunks on the local disk
dataset = StreamingDataset(..., max_cache_size="10GB")
```

</details>
<details>
<summary> ✅ On-Prem Optimizations</summary>
&nbsp;

Optimize data handling for computers on a local network to improve performance for on-site setups.

On-prem compute nodes can mount and use a network drive. A network drive is a shared storage device on a local area network. To reduce network overhead, the `StreamingDataset` supports caching the data chunks.

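A minimal sketch based on the snippet in this section (the drive path is illustrative):

```python
from litdata import StreamingDataset

# The `local:` prefix points at data on a mounted network drive;
# downloaded chunks are cached locally to reduce network traffic
dataset = StreamingDataset(input_dir="local:/data/shared-drive/some-data")
```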
</details>

<details>
<summary> ✅ Map transformations</summary>
&nbsp;

Apply the same change to different parts of the dataset at once to save time and effort.

The `map` operator can be used to apply a function over a list of inputs.

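A minimal sketch (the image folder, transform, and output location are illustrative):

```python
import os
from PIL import Image
from litdata import map

input_dir = "my_large_images"  # illustrative local folder
inputs = [os.path.join(input_dir, f) for f in os.listdir(input_dir)]

# The function receives one input and the output directory where results are persisted
def resize_image(image_path, output_dir):
    output_path = os.path.join(output_dir, os.path.basename(image_path))
    Image.open(image_path).resize((224, 224)).save(output_path)

map(
    fn=resize_image,
    inputs=inputs,
    output_dir="my_resized_images",  # illustrative output location
)
```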
</details>

<details>
<summary> ✅ Support S3-Compatible Object Storage</summary>
&nbsp;

Use different cloud storage services, offering data storage flexibility and cost-saving options.

Integrate S3-compatible object storage servers like [MinIO](https://min.io/) with litdata, ideal for on-premises infrastructure setups. Configure the endpoint and credentials using environment variables or configuration files.

Set up the environment variables to connect to MinIO:
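
One way to do this is with standard AWS-style variables that the underlying S3 client reads. This is a sketch assuming a local MinIO server on port 9000; adjust the variable names and values to your setup:

```python
import os

# Credentials and endpoint for the S3-compatible server (values are illustrative)
os.environ["AWS_ACCESS_KEY_ID"] = "access_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "secret_key"
os.environ["AWS_ENDPOINT_URL"] = "http://localhost:9000"  # MinIO endpoint
```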
</details>
