litdata with huggingface instead of S3 #64
Comments
Hi! Thanks for your contribution, great first issue!
Hey @ehartford. I have already prepared a version of SlimPajama. It is ready to use on the platform.
Here is the code:

from litdata import StreamingDataset, CombinedStreamingDataset
from litdata.streaming.item_loader import TokensLoader
from tqdm import tqdm
import os
from torch.utils.data import DataLoader

train_datasets = [
    StreamingDataset(
        input_dir="s3://tinyllama-template/slimpajama/train/",
        item_loader=TokensLoader(block_size=2048 + 1),  # Optimized loader for tokens used by LLMs
        shuffle=True,
        drop_last=True,
    ),
    StreamingDataset(
        input_dir="s3://tinyllama-template/starcoder/",
        item_loader=TokensLoader(block_size=2048 + 1),  # Optimized loader for tokens used by LLMs
        shuffle=True,
        drop_last=True,
    ),
]

# Mix SlimPajama data and Starcoder data with these proportions:
weights = (0.693584, 0.306416)
combined_dataset = CombinedStreamingDataset(datasets=train_datasets, seed=42, weights=weights)

train_dataloader = DataLoader(combined_dataset, batch_size=8, pin_memory=True, num_workers=os.cpu_count())

# Iterate over the combined datasets
for batch in tqdm(train_dataloader):
    pass
OK, but wouldn't it be better to support Hugging Face instead of having to copy the dataset to S3? AWS charges for ingress and egress.
We have had issues with the stability and reachability of HF models and datasets in the past, so I would say that S3 is a more reliable alternative...
Hey @ehartford. In order to stream datasets, we need to optimize the dataset first. We could have an auto-optimize version for HF datasets, but it would still require downloading and converting the dataset. HF supports some streaming with the webdataset backend, but I gave up on it because it was too unreliable for anything serious: the pipe breaks, it doesn't support multi-node, etc. If you are interested in using any particular dataset, I recommend trying out the Lightning AI platform. Here is an example where I prepare Swedish Wikipedia: https://lightning.ai/lightning-ai/studios/tokenize-2m-swedish-wikipedia-articles and another one where I prepared SlimPajama & StarCoder: https://lightning.ai/lightning-ai/studios/prepare-the-tinyllama-1t-token-dataset. Don't hesitate to ask any other questions :)
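To make the "download and convert" path above concrete, here is a minimal sketch of optimizing an HF dataset for litdata streaming. It is only an illustration, not the project's official recipe: the tokenizer (EleutherAI/gpt-neox-20b), the small 10,000-sample slice, and the output directory are placeholders, and the chunk_size / item_loader arguments may need adjusting for your litdata version.

from functools import partial

import numpy as np
from datasets import load_dataset            # Hugging Face datasets
from transformers import AutoTokenizer
from litdata import optimize
from litdata.streaming.item_loader import TokensLoader


def tokenize_fn(example, tokenizer=None):
    # Yield one int32 token array per document; litdata packs them into chunks.
    tokens = tokenizer.encode(example["text"], add_special_tokens=False)
    yield np.asarray(tokens, dtype=np.int32)


if __name__ == "__main__":
    # This still downloads the data first -- there is no way around the
    # download/convert step described in the comment above.
    dataset = load_dataset("cerebras/SlimPajama-627B", split="train")

    # Placeholder: only a small slice so the sketch stays cheap to run.
    # The full dataset would be processed shard by shard instead.
    samples = list(dataset.select(range(10_000)))

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # placeholder tokenizer

    optimize(
        fn=partial(tokenize_fn, tokenizer=tokenizer),
        inputs=samples,                     # any sequence of work items
        output_dir="slimpajama-optimized",  # placeholder local or S3 output path
        chunk_size=(2049 * 8012),           # tokens per chunk; tune as needed
        item_loader=TokensLoader(),         # may be unnecessary/unsupported on older litdata versions
        num_workers=4,
    )

The resulting directory can then be read back with StreamingDataset(input_dir="slimpajama-optimized", item_loader=TokensLoader(block_size=2048 + 1)), just like the S3 directories in the code above.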
🚀 Feature
I want to use litdata to stream the Hugging Face dataset cerebras/SlimPajama-627B (not S3).
Motivation
How can I stream a Hugging Face dataset instead of S3?
Pitch
I want to stream a Hugging Face dataset, not S3.
Alternatives
Just stream the Hugging Face dataset instead of S3.
Additional context
I want to use a Hugging Face dataset, not S3.
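For context on the alternative being proposed: the Hugging Face datasets library can already stream shards lazily from the Hub on its own, outside litdata. A minimal sketch, assuming the cerebras/SlimPajama-627B repo exposes its documents under a "text" field:

from datasets import load_dataset

# streaming=True fetches shards lazily from the Hub instead of downloading the whole dataset.
ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

# Peek at a few documents without ever materializing the 627B-token corpus locally.
for example in ds.take(5):
    print(example["text"][:200])

This covers ad-hoc reading, but it is not a drop-in replacement for litdata's optimized format (no TokensLoader, no pre-tokenized chunks), which is the gap this issue is about.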