custom cache directory for local path #224

Closed
csy1204 opened this issue Jul 12, 2024 · 5 comments · Fixed by #229
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@csy1204
Contributor

csy1204 commented Jul 12, 2024

🚀 Feature

There are many situations where it would help users to be able to specify the cache directory. I would like to contribute a feature that allows passing the desired path as a parameter.

Motivation

We use several file systems with different performance characteristics and features, so we need to choose the cache directory per use case.

Pitch

from litdata import StreamingDataset

dataset = StreamingDataset(..., cache_dir="./chunks", max_cache_size="10GB")
  • Add a cache_dir parameter to StreamingDataset.

Alternatives

Additional context

@csy1204 csy1204 added enhancement New feature or request help wanted Extra attention is needed labels Jul 12, 2024

Hi! Thanks for your contribution! Great first issue!

@tchaton
Collaborator

tchaton commented Jul 12, 2024

Hey @csy1204, you can already do this today.

from litdata import StreamingDataset
from litdata.streaming.cache import Dir

dataset = StreamingDataset(input_dir=Dir(cache_dir, data_dir), max_cache_size="10GB")

But I do agree this isn't very straightforward. Feel free to make a PR to expose it on the StreamingDataset. This should be fairly simple.

@csy1204
Contributor Author

csy1204 commented Jul 12, 2024

@tchaton

Thanks! I would like to use remote storage for input_dir and also store the cache in a custom directory,
so the following code is closer to what I have in mind. I will work on this feature. 🙋🏻‍♂️

from litdata import StreamingDataset
from litdata.streaming.cache import Dir

dataset = StreamingDataset(input_dir="s3://data-bucket/train", cache_dir="/fast_fs/.cache", max_cache_size="10GB")

def _try_create_cache_dir(input_dir: Optional[str]) -> Optional[str]:
    hash_object = hashlib.md5((input_dir or "").encode())  # noqa: S324
    if "LIGHTNING_CLUSTER_ID" not in os.environ or "LIGHTNING_CLOUD_PROJECT_ID" not in os.environ:
        cache_dir = os.path.join(_DEFAULT_CACHE_DIR, hash_object.hexdigest())
        os.makedirs(cache_dir, exist_ok=True)
        return cache_dir
    cache_dir = os.path.join("/cache", "chunks", hash_object.hexdigest())
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir
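
The function quoted above derives the cache location from an MD5 hash of input_dir, so the caller has no say in where chunks are stored. As a rough, self-contained sketch of what an explicit override could look like (the function name resolve_cache_dir and the parameters cache_dir and default_root are hypothetical, chosen for illustration, and are not litdata's actual API):

```python
import hashlib
import os
import tempfile
from typing import Optional


def resolve_cache_dir(
    input_dir: Optional[str],
    cache_dir: Optional[str] = None,      # hypothetical: explicit user override
    default_root: Optional[str] = None,   # hypothetical: root for hashed defaults
) -> str:
    """Return a cache directory for input_dir, creating it if needed."""
    if cache_dir is not None:
        # An explicit choice wins over the hash-based default.
        resolved = os.path.abspath(cache_dir)
    else:
        # Same idea as the quoted function: one subdirectory per input_dir,
        # named after the MD5 hex digest of the input path or URL.
        root = default_root or os.path.join(tempfile.gettempdir(), "chunks")
        digest = hashlib.md5((input_dir or "").encode()).hexdigest()  # noqa: S324
        resolved = os.path.join(root, digest)
    os.makedirs(resolved, exist_ok=True)
    return resolved
```

With this shape, two datasets reading the same input_dir share a default cache directory, while a caller on a fast file system could pass cache_dir="/fast_fs/.cache" to bypass the hash entirely.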

@deependujha
Collaborator

Have you tried:

ds = StreamingDataset(input_dir=Dir(path="/fast_fs/.cache", url="s3://data-bucket/train"), max_cache_size="10GB")

@csy1204
Contributor Author

csy1204 commented Jul 12, 2024

@deependujha cc. @tchaton

While working on this feature myself, I came to understand what Dir actually means: its path and url fields serve different purposes, so a local cache path can be paired with a remote URL. Fortunately, this also deepened my understanding of litdata. 😂 Thank you for the excellent explanation.
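
The path/url distinction discussed in this thread can be illustrated without installing litdata. The sketch below uses a stand-in Dir dataclass that models only the two fields shown in the comments above, plus a hypothetical helper build_input_dir that is not part of litdata. It assumes input_dir is a remote URL such as s3://...

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Dir:
    """Stand-in for litdata's Dir, reduced to the two fields used in this thread."""

    path: Optional[str] = None  # local directory, used here as the cache location
    url: Optional[str] = None   # remote location of the data


def build_input_dir(input_dir: str, cache_dir: Optional[str] = None) -> Union[str, Dir]:
    """Hypothetical helper: pair a remote input_dir with a local cache_dir."""
    if cache_dir is None:
        # No override requested: pass the input through unchanged.
        return input_dir
    if "://" not in input_dir:
        # Assumption of this sketch: only remote inputs are paired with a cache.
        raise ValueError("this sketch only handles remote input_dir values such as s3://...")
    return Dir(path=cache_dir, url=input_dir)
```

Calling build_input_dir("s3://data-bucket/train", cache_dir="/fast_fs/.cache") yields the same Dir(path=..., url=...) pairing suggested earlier in the thread, which would then be passed as input_dir to StreamingDataset.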

3 participants