Fast random access for StreamingDataset #14

Open
ethanwharris opened this issue Feb 23, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@ethanwharris
Member

🚀 Feature

Support a way to request just a single sample from a StreamingDataset without internally pulling the whole chunk.

Motivation

Streaming chunks works well when you want to visit the whole dataset, but it is suboptimal if you only want to view individual samples. Right now, indexing a StreamingDataset directly has very high latency, because the whole chunk is pulled internally. This is a problem if you want to explore the dataset interactively (e.g. in a streamlit or gradio app).

Pitch

We could have a way to request a single sample from the dataset that would download only the bytes of that sample instead of downloading the whole chunk. This would enable building visualizations etc. on top of streaming datasets.
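
To make the idea concrete, here is a minimal sketch of reading one sample with two small ranged reads instead of a full chunk download. Everything in it is an assumption for illustration: the chunk layout (a `uint32` item count, then a `uint32` offset table, then the sample bytes), and the helper names `build_chunk`, `read_sample`, and `fetch_range` are hypothetical and not part of the litdata API. In a real setting `fetch_range` would issue a ranged read against the remote store; here it slices a local byte string so the example runs offline.

```python
import struct
from typing import Callable, List

# Assumed (hypothetical) chunk layout:
#   [uint32 num_items][uint32 offsets[num_items + 1]][sample bytes ...]
# offsets[i] is the absolute byte position where sample i starts.


def build_chunk(samples: List[bytes]) -> bytes:
    """Pack samples into the assumed layout so the sketch can run locally."""
    header_size = 4 + 4 * (len(samples) + 1)
    offsets = [header_size]
    for sample in samples:
        offsets.append(offsets[-1] + len(sample))
    header = struct.pack("<I", len(samples))
    header += struct.pack(f"<{len(offsets)}I", *offsets)
    return header + b"".join(samples)


def read_sample(fetch_range: Callable[[int, int], bytes], index: int) -> bytes:
    """Read one sample with two ranged reads.

    fetch_range(start, end) must return bytes [start, end) of the chunk.
    """
    # First read: the two table entries that delimit sample `index`.
    entry = 4 + 4 * index
    begin, end = struct.unpack("<2I", fetch_range(entry, entry + 8))
    # Second read: only the bytes of the requested sample.
    return fetch_range(begin, end)


chunk = build_chunk([b"first", b"second sample", b"third"])
fetched_sizes: List[int] = []  # records how many bytes each read transferred


def local_fetch(start: int, end: int) -> bytes:
    """Stand-in for a remote ranged read; slices the in-memory chunk."""
    fetched_sizes.append(end - start)
    return chunk[start:end]


sample = read_sample(local_fetch, 1)
print(sample, sum(fetched_sizes), len(chunk))
```

The trade-off is an extra round trip for the offset lookup in exchange for transferring far fewer bytes, which matters most when chunks are large and only a handful of samples are viewed.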

Alternatives

Additional context

@ethanwharris added the enhancement and help wanted labels on Feb 23, 2024
@Borda removed the help wanted label on Apr 18, 2024
@deependujha
Collaborator

Hey @ethanwharris, we have a feature to subsample from the dataset, although the subsamples are optimized to come from as few chunks as possible. Indexing and slicing are also supported.

These features don't exactly fulfill your requirements, but I believe they address them effectively.

from litdata import StreamingDataset

# The data is stored in the cloud; subsample=0.01 keeps about 1% of the samples.
dataset = StreamingDataset("s3://my-bucket/my-data", subsample=0.01)

print(len(dataset))  # display the length of your data
# out: 1000

Without unpacking the bin file, it might be challenging to get the exact item. Encrypted chunks pose another challenge for this. Still, please let us know if this is what you would like to have.

Otherwise, feel free to close the issue.

@tchaton
Collaborator

tchaton commented Jul 26, 2024

The goal here was to add support for multi-range fetching from the client side, so that we don't fetch the entire binary file but only what the user requests.
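
As a rough illustration of what client-side multi-range fetching could involve, the sketch below merges the byte ranges of the requested samples and formats them as an HTTP `Range` header value. The helper names `coalesce` and `range_header` and the `max_gap` parameter are hypothetical, not litdata functions. It also assumes the server honors several ranges in one request; some servers and object stores accept only a single range per request, in which case one request per merged range would be needed.

```python
from typing import List, Tuple

ByteRange = Tuple[int, int]  # half-open interval [start, end)


def coalesce(ranges: List[ByteRange], max_gap: int = 0) -> List[ByteRange]:
    """Merge ranges that overlap or lie within `max_gap` bytes of each other."""
    merged: List[ByteRange] = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous range: extend it instead of
            # paying for a separate range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def range_header(ranges: List[ByteRange]) -> str:
    """Format half-open ranges as an HTTP Range header value.

    HTTP byte ranges are inclusive on both ends, hence `end - 1`.
    """
    return "bytes=" + ",".join(f"{start}-{end - 1}" for start, end in ranges)


# Byte ranges of four requested samples (illustrative numbers only).
wanted = [(38, 43), (20, 25), (25, 38), (100, 120)]
merged = coalesce(wanted)
header = range_header(merged)
print(merged, header)
```

Raising `max_gap` trades a few wasted bytes for fewer ranges, which can help when the requested samples sit close together in the same chunk.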
