litdata with huggingface instead of S3 #64

Closed
ehartford opened this issue Mar 8, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@ehartford

🚀 Feature

I want to use litdata to stream the Hugging Face dataset cerebras/SlimPajama-627B directly from Hugging Face, not from S3.

Motivation

How can I stream a Hugging Face dataset instead of one stored on S3?

Pitch

I want to stream a Hugging Face dataset rather than one stored on S3.

Alternatives

Just streaming a Hugging Face dataset instead of S3.

Additional context

I want to use a Hugging Face dataset, not S3.

@ehartford added the enhancement and help wanted labels on Mar 8, 2024

github-actions bot commented Mar 8, 2024

Hi! Thanks for your contribution! Great first issue!

@tchaton
Collaborator

tchaton commented Mar 8, 2024

Hey @ehartford. I have already prepared a version of SlimPajama. It is ready to use on the platform.

@tchaton
Collaborator

tchaton commented Mar 8, 2024

Here is the code:

from litdata import StreamingDataset, CombinedStreamingDataset
from litdata.streaming.item_loader import TokensLoader
from tqdm import tqdm
import os
from torch.utils.data import DataLoader

train_datasets = [
    StreamingDataset(
        input_dir="s3://tinyllama-template/slimpajama/train/",
        item_loader=TokensLoader(block_size=2048 + 1), # Optimized loader for tokens used by LLMs 
        shuffle=True,
        drop_last=True,
    ),
    StreamingDataset(
        input_dir="s3://tinyllama-template/starcoder/",
        item_loader=TokensLoader(block_size=2048 + 1), # Optimized loader for tokens used by LLMs 
        shuffle=True,
        drop_last=True,
    ),
]

# Mix SlimPajama data and Starcoder data with these proportions:
weights = (0.693584, 0.306416)
combined_dataset = CombinedStreamingDataset(datasets=train_datasets, seed=42, weights=weights)

train_dataloader = DataLoader(combined_dataset, batch_size=8, pin_memory=True, num_workers=os.cpu_count())

# Iterate over the combined datasets
for batch in tqdm(train_dataloader):
    pass
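
To illustrate what the `weights` argument above expresses, here is a minimal pure-Python sketch of weighted mixing between two streams. It is not litdata's implementation: `mix_streams` is a hypothetical helper, and stopping as soon as any source is exhausted is a simplifying assumption.

```python
import random
from itertools import count


def mix_streams(streams, weights, seed=42):
    """Yield (source_index, item) pairs, choosing the source of each
    item at random in proportion to `weights`."""
    rng = random.Random(seed)  # seeded so the mixing order is reproducible
    iterators = [iter(s) for s in streams]
    indices = list(range(len(iterators)))
    while True:
        idx = rng.choices(indices, weights=weights, k=1)[0]
        try:
            yield idx, next(iterators[idx])
        except StopIteration:
            # Simplifying assumption: stop once any source runs dry.
            return


# Two endless stand-in streams in place of the real datasets.
slimpajama = (f"slimpajama-{i}" for i in count())
starcoder = (f"starcoder-{i}" for i in count())

mixed = mix_streams([slimpajama, starcoder], weights=(0.693584, 0.306416))
sample = [next(mixed) for _ in range(10_000)]

# Fraction of items drawn from the first stream; it should land near 0.69.
share = sum(1 for idx, _ in sample if idx == 0) / len(sample)
print(f"share of stream 0: {share:.3f}")
```

With enough samples, the observed share of each stream approaches its weight, which is the behaviour the proportions in the snippet above are meant to produce.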

@ehartford
Author

OK, but is it better to support Hugging Face directly rather than having to copy the dataset to S3? AWS charges for ingress and egress.

@Borda
Member

Borda commented Mar 8, 2024

OK, but is it better to support Hugging Face directly rather than having to copy the dataset to S3?

We have had issues with the stability and reachability of HF models and datasets in the past, so I would say that S3 is the more reliable alternative.

@tchaton
Collaborator

tchaton commented Mar 8, 2024

Hey @ehartford. In order to stream a dataset, we need to optimize it first. We could offer an auto-optimize path for HF datasets, but it would still require downloading the dataset and converting it.
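
As a rough sketch of that download-and-convert step, the snippet below runs `litdata.optimize` over shards that have already been downloaded locally. The file layout is an assumption, not something stated in this thread: it supposes plain `.jsonl` shards whose records carry a `"text"` field. `iter_records` and `convert` are hypothetical helper names.

```python
import json
from pathlib import Path


def iter_records(jsonl_path):
    """Yield one {"text": ...} item per non-empty JSON line of a local shard.

    Assumes each line is a JSON object with a "text" field (hypothetical layout).
    """
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield {"text": json.loads(line)["text"]}


def convert(shard_dir, output_dir):
    """Convert every *.jsonl shard in `shard_dir` into litdata's chunked format."""
    # Imported lazily so iter_records can be used without litdata installed.
    from litdata import optimize

    shards = sorted(str(p) for p in Path(shard_dir).glob("*.jsonl"))
    optimize(
        fn=iter_records,      # called once per shard path, yields items
        inputs=shards,        # one input per downloaded shard
        output_dir=output_dir,
        chunk_bytes="64MB",   # target size of each written chunk
    )
```

Since `optimize` spawns worker processes, `convert("shards/", "optimized/")` would normally be called from under an `if __name__ == "__main__":` guard. The resulting `output_dir` can then be passed as `input_dir` to `StreamingDataset`.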

HF supports some streaming with a webdataset backend, but I gave up on it because it was too unreliable for anything serious: the pipe breaks, it doesn't support multi-node, and so on.

If you are interested in using any particular dataset, I recommend trying out the Lightning AI platform.

Here is an example where I prepare Wikipedia Swedish: https://lightning.ai/lightning-ai/studios/tokenize-2m-swedish-wikipedia-articles

And another one where I prepared SlimPajama & StarCoder: https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset

Don't hesitate to ask any other questions :)

@Borda removed the help wanted label on Apr 18, 2024
@tchaton closed this as completed on Jul 12, 2024