Update README.md
williamFalcon authored Jul 5, 2024
1 parent 5b47672 commit 8d12ea6
Showing 1 changed file with 78 additions and 77 deletions: README.md

## Features for transforming datasets


<details>
<summary> ✅ Map transformations</summary>
&nbsp;
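A minimal sketch of a map-style transform over a folder of images, assuming a hypothetical `resize_image` helper and that `map` passes each input together with the output directory to `fn` (check the litdata docs for the exact signature):

```python
import os

import litdata as ld
from PIL import Image


def resize_image(image_path, output_dir):
    # Assumption: `map` calls `fn` with each input plus the output directory.
    output_path = os.path.join(output_dir, os.path.basename(image_path))
    Image.open(image_path).resize((224, 224)).save(output_path)


if __name__ == "__main__":
    input_dir = "my_images"  # hypothetical local folder of images
    inputs = [os.path.join(input_dir, f) for f in os.listdir(input_dir)]

    ld.map(
        fn=resize_image,
        inputs=inputs,
        output_dir="my_resized_images",  # hypothetical output folder
    )
```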

</details>

<details>
<summary> ✅ Support S3-Compatible Object Storage</summary>
&nbsp;

Integrate S3-compatible object storage servers like [MinIO](https://min.io/) with litdata, ideal for on-premises infrastructure setups. Configure the endpoint and credentials using environment variables or configuration files.

Set up the environment variables to connect to MinIO:

```bash
export AWS_ACCESS_KEY_ID=access_key
export AWS_SECRET_ACCESS_KEY=secret_key
export AWS_ENDPOINT_URL=http://localhost:9000 # MinIO endpoint
```

Alternatively, configure credentials and endpoint in `~/.aws/{credentials,config}`:

```bash
mkdir -p ~/.aws && \
cat <<EOL >> ~/.aws/credentials
[default]
aws_access_key_id = access_key
aws_secret_access_key = secret_key
EOL

cat <<EOL >> ~/.aws/config
[default]
endpoint_url = http://localhost:9000 # MinIO endpoint
EOL
```
Explore an example setup of litdata with MinIO in the [LitData with MinIO](https://github.com/bhimrazy/litdata-with-minio) repository for practical implementation details.
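
Once the endpoint and credentials are configured, streaming works the same way as with AWS S3. A minimal sketch (the bucket and dataset path are placeholders):

```python
from litdata import StreamingDataset, StreamingDataLoader

# The s3:// URI is resolved against the configured endpoint (AWS_ENDPOINT_URL or
# endpoint_url in ~/.aws/config), so this reads from the MinIO server set up above.
dataset = StreamingDataset(input_dir="s3://my-bucket/my-optimized-data")
dataloader = StreamingDataLoader(dataset, batch_size=64)

for batch in dataloader:
    pass
```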

</details>

&nbsp;

## Features for optimizing and streaming datasets for model training


<details>
<summary> ✅ Stream datasets</summary>
&nbsp;
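
A minimal sketch of streaming an optimized dataset directly from cloud storage (the S3 path is a placeholder):

```python
from litdata import StreamingDataset, StreamingDataLoader

# Chunks are downloaded on demand and cached locally while you iterate.
dataset = StreamingDataset(input_dir="s3://my-bucket/my-optimized-data", shuffle=True)
dataloader = StreamingDataLoader(dataset, batch_size=64)

for batch in dataloader:
    pass
```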
</details>

<details>
<summary> ✅ Multi-GPU / Multi-Node Support</summary>
&nbsp;

The `StreamingDataset` and `StreamingDataLoader` automatically make sure each rank receives the same quantity of varied batches of data, so they work out of the box with your favorite frameworks ([PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/), [Lightning Fabric](https://lightning.ai/docs/fabric/stable/), or [PyTorch](https://pytorch.org/docs/stable/index.html)) for distributed training.

Here is an illustration of how the streaming dataset works with multiple nodes and multiple GPUs under the hood.

![An illustration showing how the Streaming Dataset works with multi node.](https://pl-flash-data.s3.amazonaws.com/streaming_dataset.gif)
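
A minimal sketch of what this looks like in a distributed job; the S3 path and the `torchrun` launch line are placeholders:

```python
from litdata import StreamingDataset, StreamingDataLoader

# The same script runs on every rank, e.g. launched with:
#   torchrun --nproc_per_node=8 --nnodes=2 train.py
# Each rank automatically receives its own, equally sized share of batches,
# so no DistributedSampler is required.
dataset = StreamingDataset(input_dir="s3://my-bucket/my-optimized-data", shuffle=True, drop_last=True)
dataloader = StreamingDataLoader(dataset, batch_size=64, num_workers=4)

for batch in dataloader:
    ...
```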
</details>

<details>
<summary> ✅ Pause & Resume data streaming</summary>
&nbsp;


LitData provides a stateful `StreamingDataLoader`, i.e. you can `pause` and `resume` your training whenever you want.

Info: The `StreamingDataLoader` was used by [Lit-GPT](https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py) to pretrain LLMs. Restarting from an older checkpoint was critical to pretraining the full model, given the repeated failures along the way (network issues, CUDA errors, etc.).
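
A minimal sketch of saving and restoring the loader state, assuming the `state_dict()` / `load_state_dict()` methods of `StreamingDataLoader` (the S3 path and checkpoint file are placeholders):

```python
import os

import torch
from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset(input_dir="s3://my-bucket/my-optimized-data", shuffle=True)
dataloader = StreamingDataLoader(dataset, batch_size=64, num_workers=4)

# Resume from a previous run if a loader checkpoint exists.
if os.path.isfile("dataloader_state.pt"):
    dataloader.load_state_dict(torch.load("dataloader_state.pt"))

for batch_idx, batch in enumerate(dataloader):
    # Save the loader state periodically so training can be resumed mid-epoch.
    if batch_idx % 1000 == 0:
        torch.save(dataloader.state_dict(), "dataloader_state.pt")
```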

</details>


<details>
<summary> ✅ Combine datasets</summary>
&nbsp;

Easily experiment with dataset mixtures using the `CombinedStreamingDataset` class.

As an example, this mixture of [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B) & [StarCoder](https://huggingface.co/datasets/bigcode/starcoderdata) was used in the [TinyLlama](https://github.com/jzhang38/TinyLlama) project to pretrain a 1.1B Llama model on 3 trillion tokens.

```python
from litdata import StreamingDataset, CombinedStreamingDataset, StreamingDataLoader, TokensLoader
from tqdm import tqdm
import os

train_datasets = [
    StreamingDataset(
        input_dir="s3://tinyllama-template/slimpajama/train/",
        item_loader=TokensLoader(block_size=2048 + 1),  # Optimized loader for tokens used by LLMs
        shuffle=True,
        drop_last=True,
    ),
    StreamingDataset(
        input_dir="s3://tinyllama-template/starcoder/",
        item_loader=TokensLoader(block_size=2048 + 1),  # Optimized loader for tokens used by LLMs
        shuffle=True,
        drop_last=True,
    ),
]

# Mix SlimPajama data and StarCoder data with these proportions:
weights = (0.693584, 0.306416)
combined_dataset = CombinedStreamingDataset(datasets=train_datasets, seed=42, weights=weights)

train_dataloader = StreamingDataLoader(combined_dataset, batch_size=8, pin_memory=True, num_workers=os.cpu_count())

# Iterate over the combined datasets
for batch in tqdm(train_dataloader):
    pass
```
</details>

<details>
<summary> ✅ Subsample and split datasets</summary>

&nbsp;

Split a dataset into train, val, and test splits with `train_test_split`.

```python
from litdata import StreamingDataset, train_test_split

dataset = StreamingDataset(input_dir="s3://my-bucket/my-data")  # placeholder path; data are stored in the cloud

print(len(dataset))  # display the length of your data

# Assumed signature: pass split fractions that sum to at most 1.0
train_dataset, val_dataset, test_dataset = train_test_split(dataset, splits=[0.8, 0.1, 0.1])
```
</details>

<details>
<summary> ✅ Append or overwrite optimized datasets</summary>
&nbsp;
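
A minimal sketch, assuming `optimize` accepts a `mode` argument ("append" to add to an existing optimized dataset, "overwrite" to replace it); the sample function and output path are placeholders:

```python
from litdata import optimize


def compress(index):
    # Hypothetical sample function: each item becomes an (index, index^2) pair.
    return index, index ** 2


if __name__ == "__main__":
    optimize(
        fn=compress,
        inputs=list(range(1000)),
        output_dir="my_optimized_dataset",
        chunk_bytes="64MB",
        mode="append",  # assumption: use "overwrite" to rebuild the dataset from scratch
    )
```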


</details>

<details>
<summary> ✅ Profile loading speed</summary>
&nbsp;

The `StreamingDataLoader` supports profiling of your data loading process. Simply use the `profile_batches` argument to specify the number of batches you want to profile:
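
For example (the dataset path below is a placeholder):

```python
from litdata import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset(input_dir="s3://my-bucket/my-optimized-data")

# Profile the first 5 batches of the data-loading pipeline.
dataloader = StreamingDataLoader(dataset, batch_size=64, profile_batches=5)

for batch in dataloader:
    pass
```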
</details>

<details>
<summary> ✅ Reduce disk space with caching limits</summary>
&nbsp;

Adapt the local caching limit of the `StreamingDataset`. This is useful to make sure downloaded data chunks are deleted once they have been used, so disk usage stays low.
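
A minimal sketch, assuming the `max_cache_size` argument of `StreamingDataset` (the path is a placeholder):

```python
from litdata import StreamingDataset

# Keep at most ~10GB of downloaded chunks on disk; chunks are deleted after use.
dataset = StreamingDataset(input_dir="s3://my-bucket/my-optimized-data", max_cache_size="10GB")
```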
