🚀 Feature
Support a way to request just a single sample from a StreamingDataset without internally pulling the whole chunk.
Motivation
Streaming chunks is great when you want to visit the whole dataset, but sub-optimal if you just want to view individual samples. Right now, if you index a StreamingDataset directly, the latency is very high. This is a bit of an issue if you want to explore the dataset (e.g. in a Streamlit or Gradio app).
Pitch
We could have a way to request a single sample from the dataset that would download only the bytes of that sample instead of downloading the whole chunk. This would enable building visualizations etc. on top of streaming datasets.
Alternatives
Additional context
Hey @ethanwharris, we have a feature to subsample the dataset, though the subsamples are optimized to come from as few chunks as possible. Indexing and slicing are also supported.
They don't exactly fulfill your requirements, but I believe these features address them effectively.
    from litdata import StreamingDataset, train_test_split

    dataset = StreamingDataset("s3://my-bucket/my-data", subsample=0.01)  # data are stored in the cloud

    print(len(dataset))  # display the length of your data
    # out: 1000
Without unpacking the bin file, it might be challenging to get the exact item. Encrypted chunks pose a further challenge. But please let us know if this is what you would like to have.
The goal here was to add support for multi-range fetching from the client side, so that we don't fetch the entire binary file but only what the user requests.
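To make the range-fetching idea concrete, here is a minimal sketch of reading one sample with two small range requests instead of downloading the whole chunk. Everything in it is an assumption for illustration: the chunk layout (a uint32 item count, then count + 1 uint32 offsets, then the payloads) is hypothetical and not necessarily litdata's actual format, and `build_chunk`, `RangeStore`, and `read_sample` are made-up names. `RangeStore` stands in for object storage; against a real HTTP endpoint, `fetch_range(start, end)` would correspond to a `Range: bytes=start-(end-1)` request header.

```python
import struct


def build_chunk(samples):
    """Pack samples as: uint32 count, (count + 1) uint32 offsets, then payloads.

    Hypothetical layout, used only for this illustration.
    """
    header_size = 4 + 4 * (len(samples) + 1)
    offsets = [header_size]
    for payload in samples:
        offsets.append(offsets[-1] + len(payload))
    header = struct.pack(f"<I{len(offsets)}I", len(samples), *offsets)
    return header + b"".join(samples)


class RangeStore:
    """In-memory stand-in for remote storage that can serve byte ranges."""

    def __init__(self, blob):
        self._blob = blob
        self.bytes_served = 0  # counts how much data was "transferred"

    def fetch_range(self, start, end):
        """Return bytes in [start, end), like an HTTP range request would."""
        data = self._blob[start:end]
        self.bytes_served += len(data)
        return data


def read_sample(store, index):
    """Fetch a single sample without reading the rest of the chunk."""
    # Request 1: the two consecutive offsets that bracket the sample.
    pos = 4 + 4 * index
    begin, end = struct.unpack("<2I", store.fetch_range(pos, pos + 8))
    # Request 2: only the sample's own bytes.
    return store.fetch_range(begin, end)


samples = [b"a" * 1000, b"hello", b"c" * 2000]
chunk = build_chunk(samples)
store = RangeStore(chunk)
sample = read_sample(store, 1)
print(sample, store.bytes_served, len(chunk))
```

In this toy case the reader transfers 13 bytes out of a 3025-byte chunk. The approach only works when sample boundaries in the stored file are addressable by byte offset; if a chunk is compressed or encrypted as a whole, the stored byte ranges no longer map to individual samples, which is the difficulty raised above.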