custom cache directory for local path #224

csy1204 · 2024-07-12T07:59:20Z

🚀 Feature

There have been many situations where it would be beneficial for users to specify the cache directory. I would like to contribute by developing a feature that allows passing the desired path as a parameter.

Motivation

We use several file systems with different performance characteristics and features, so we need to choose the cache directory for each use case.

Pitch

from litdata import StreamingDataset

dataset = StreamingDataset(..., cache_dir="./chunks", max_cache_size="10GB")

Add a cache_dir parameter to StreamingDataset.

Alternatives

Additional context

github-actions · 2024-07-12T07:59:42Z

Hi! Thanks for your contribution! Great first issue!

tchaton · 2024-07-12T08:15:01Z

Hey @csy1204, you can already do this right now.

from litdata import StreamingDataset
from litdata.streaming.cache import Dir

dataset = StreamingDataset(input_dir=Dir(cache_dir, data_dir), max_cache_size="10GB")

But I do agree this isn't very straightforward. Feel free to make a PR to expose it on the StreamingDataset. This should be fairly simple.

csy1204 · 2024-07-12T08:31:41Z

@tchaton

Thanks! I would like to use remote storage for input_dir and store the cache in a custom directory as well, so the following code is more accurate. I will work on this feature. 🙋🏻‍♂️

from litdata import StreamingDataset
from litdata.streaming.cache import Dir

dataset = StreamingDataset(input_dir="s3://data-bucket/train", cache_dir="/fast_fs/.cache", max_cache_size="10GB")

litdata/src/litdata/utilities/dataset_utilities.py

Lines 93 to 101 in df8dcd1

    
def _try_create_cache_dir(input_dir: Optional[str]) -> Optional[str]:
    hash_object = hashlib.md5((input_dir or "").encode())  # noqa: S324
    if "LIGHTNING_CLUSTER_ID" not in os.environ or "LIGHTNING_CLOUD_PROJECT_ID" not in os.environ:
        cache_dir = os.path.join(_DEFAULT_CACHE_DIR, hash_object.hexdigest())
        os.makedirs(cache_dir, exist_ok=True)
        return cache_dir
    cache_dir = os.path.join("/cache", "chunks", hash_object.hexdigest())
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir
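
The helper quoted above names the default cache folder after an MD5 hash of input_dir, so each distinct input directory gets its own cache folder. Here is a minimal standalone sketch of that idea using only the standard library. The function name sketch_cache_dir and the base_dir argument are hypothetical; base_dir stands in for _DEFAULT_CACHE_DIR, and the environment-variable branch is left out.

```python
import hashlib
import os
import tempfile
from typing import Optional


def sketch_cache_dir(input_dir: Optional[str], base_dir: str) -> str:
    """Return (and create) a cache folder under base_dir keyed by input_dir."""
    # Same hashing step as the quoted helper: MD5 of the input directory string.
    digest = hashlib.md5((input_dir or "").encode()).hexdigest()  # noqa: S324
    cache_dir = os.path.join(base_dir, digest)
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir


if __name__ == "__main__":
    # base_dir is an assumption for illustration; a temp folder keeps the demo tidy.
    with tempfile.TemporaryDirectory() as base:
        print(sketch_cache_dir("s3://data-bucket/train", base))
```

Because the folder name depends only on the input_dir string, the same input always maps to the same folder. That is likely why choosing a different file system for the cache needs an explicit override.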

deependujha · 2024-07-12T09:28:47Z

Have you tried:

ds = StreamingDataset(input_dir=Dir(path="/fast_fs/.cache", url="s3://data-bucket/train"), max_cache_size="10GB")
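
As the call above suggests, Dir carries two separate fields: a local path (where chunks are cached) and a remote url (where the data lives). The sketch below uses a stand-in dataclass rather than litdata's own class, so it runs without the library installed. The LocalRemoteDir name and the describe helper are hypothetical and only illustrate the split.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LocalRemoteDir:
    """Stand-in for Dir: a local cache path plus an optional remote URL."""

    path: Optional[str] = None  # local directory used for cached chunks
    url: Optional[str] = None  # remote location the data is streamed from


def describe(d: LocalRemoteDir) -> str:
    """Summarize how a path/url pair would be used."""
    if d.url is None:
        # No remote source: the data is read from the local path directly.
        return f"read directly from {d.path}"
    return f"stream from {d.url}, cache in {d.path}"


if __name__ == "__main__":
    print(describe(LocalRemoteDir(path="/fast_fs/.cache", url="s3://data-bucket/train")))
    print(describe(LocalRemoteDir(path="./chunks")))
```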

csy1204 · 2024-07-12T10:10:21Z

@deependujha cc. @tchaton

While working on this feature myself, I came to understand exactly what this means: in Dir, path and url serve different purposes. Fortunately, this helped me deepen my understanding of litdata. 😂 Thank you for the excellent explanation.

csy1204 added the "enhancement" (New feature or request) and "help wanted" (Extra attention is needed) labels Jul 12, 2024

csy1204 closed this as completed Jul 12, 2024

csy1204 mentioned this issue Jul 13, 2024

docs: add Specify cache directory #229

Merged

4 tasks
