Fast random access for `StreamingDataset` #14

ethanwharris · 2024-02-23T09:09:48Z

🚀 Feature

Support a way to request just a single sample from a StreamingDataset without internally pulling the whole chunk.

Motivation

Streaming chunks is great when you want to visit the whole dataset, but sub-optimal when you just want to view individual samples. Right now, indexing a StreamingDataset directly has very high latency, because the whole chunk containing the sample is pulled first. This is a bit of an issue if you want to explore the dataset interactively (e.g. in a streamlit or gradio app).

Pitch

We could have a way to request a single sample from the dataset that would download only the bytes of that sample instead of downloading the whole chunk. This would enable building visualizations etc. on top of streaming datasets.

Alternatives

Additional context

deependujha · 2024-07-25T19:12:18Z

Hey @ethanwharris, we have a feature to subsample the dataset, though the subsamples are chosen to come from as few chunks as possible. Indexing and slicing are also supported.

These features don't exactly fulfill your requirements, but I believe they largely address them.

from litdata import StreamingDataset

dataset = StreamingDataset("s3://my-bucket/my-data", subsample=0.01) # data are stored in the cloud

print(len(dataset)) # display the length of your data
# out: 1000

Without unpacking the bin file, it might be challenging to get the exact item, and encrypted chunks pose a further challenge. But please let us know if this is what you would like to have.

Otherwise, you can close the issue.

tchaton · 2024-07-26T10:39:39Z

The goal here was to add support for multi-range fetching on the client side, so that we don't fetch the entire binary file but only what the user requests.
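
To make the idea concrete, here is a minimal sketch of fetching one sample with ranged reads. It is not litdata's implementation: the chunk layout (a `uint32` sample count, then `uint32` absolute offsets, then the sample bytes), the helpers `build_chunk` and `fetch_sample`, and the `CountingStore` stand-in for remote storage are all assumptions made up for illustration. Against a real object store, `read_range(start, end)` would map to an HTTP `Range: bytes=start-(end-1)` request.

```python
import struct
from typing import Callable, List

# Hypothetical chunk layout (an assumption, not necessarily litdata's format):
#   [uint32 num_samples][uint32 offsets x (num_samples + 1)][sample bytes ...]
# Offsets are absolute byte positions inside the chunk.

RangeReader = Callable[[int, int], bytes]  # (start, end_exclusive) -> bytes


def build_chunk(samples: List[bytes]) -> bytes:
    """Pack samples into the hypothetical layout described above."""
    header_size = 4 + 4 * (len(samples) + 1)
    offsets = [header_size]
    for sample in samples:
        offsets.append(offsets[-1] + len(sample))
    header = struct.pack("<I", len(samples))
    header += struct.pack(f"<{len(offsets)}I", *offsets)
    return header + b"".join(samples)


def fetch_sample(read_range: RangeReader, index: int) -> bytes:
    """Read one sample with three small ranged reads instead of the whole chunk."""
    (num_samples,) = struct.unpack("<I", read_range(0, 4))
    if not 0 <= index < num_samples:
        raise IndexError(index)
    # Two consecutive offsets delimit the sample: [start, end).
    pos = 4 + 4 * index
    start, end = struct.unpack("<2I", read_range(pos, pos + 8))
    return read_range(start, end)


class CountingStore:
    """In-memory stand-in for remote storage that counts bytes transferred."""

    def __init__(self, blob: bytes) -> None:
        self.blob = blob
        self.bytes_read = 0

    def read_range(self, start: int, end: int) -> bytes:
        self.bytes_read += end - start
        return self.blob[start:end]


chunk = build_chunk([b"alpha", b"bravo!", b"charlie"])
store = CountingStore(chunk)
sample = fetch_sample(store.read_range, 1)
print(sample, store.bytes_read, len(chunk))
```

In this sketch the three reads could be reduced to two by caching the offset table per chunk, which is probably what you would want for an interactive viewer. Compressed or encrypted chunks would not work with this approach as-is, since byte offsets into the stored file no longer line up with sample boundaries.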

ethanwharris added the enhancement and help wanted labels on Feb 23, 2024

Borda removed the help wanted label on Apr 18, 2024
