litdata.optimize accidentally deletes files from the local filesystem #93

Closed · hubertsiuzdak opened this issue Apr 5, 2024 · 13 comments
Labels: bug, help wanted

Comments

@hubertsiuzdak

🐛 Bug

When filepaths are passed as inputs to litdata.optimize, it attempts to resolve input_dir. This input_dir is later used in DataWorker to cache these files and manage cleanup.

But _get_input_dir is very error-prone, as it only looks at the first element of inputs:

indexed_paths = _get_indexed_paths(inputs[0])

and assumes that input_dir is always three directories deep from the root:

return "/" + os.path.join(*str(absolute_path).split("/")[:4])

However, if our input files don't follow these assumptions, e.g. they come from different top-level directories, things can go badly wrong. That's because when clearing the cache, filepaths are determined simply by replacing input_dir with cache_dir:

for path in paths:
    if input_dir:
        if not path.startswith(cache_dir) and input_dir.path is not None:
            path = path.replace(input_dir.path, cache_dir)
        if os.path.exists(path):
            os.remove(path)

But if input_dir.path does not occur in path, replace does nothing, and the code then proceeds to delete a valid file! Removing these paths should be done with much more caution.

To Reproduce

Create a directory and ensure Python can write to it:

sudo mkdir /mnt/litdata-example
sudo chmod 777 /mnt/litdata-example/

Then run a simple python script:

import os
import uuid
from glob import glob

import litdata
import torch

base_dir = "/mnt/litdata-example"

for _ in range(4):  # create 4 random directories with pytorch tensors
    dir_name = os.path.join(base_dir, str(uuid.uuid4())[:8])
    os.makedirs(dir_name, exist_ok=True)
    torch.save(torch.randn(4, 4), os.path.join(dir_name, "tensor.pt"))  # Save a random pytorch tensor

files_before = glob(os.path.join(base_dir, "*/*.pt"))
print(files_before)  # print the paths of the saved tensors to confirm creation

litdata.optimize(fn=lambda x: x, inputs=files_before, output_dir="output_dir", num_workers=1, chunk_bytes="64MB")

files_after = glob(os.path.join(base_dir, "*/*.pt"))
print(files_after)  # some files are gone! 👋
assert len(files_before) == len(files_after)

And yes... this actually happened to me. I was quite astonished to see some of my files just deleted 🤯

Environment

  • litdata==0.2.3

Additional context

Is caching input files in litdata.optimize actually necessary? The most common use case is to retrieve a file only once during dataset preparation. If we simply set an empty input directory input_dir = Dir() in DataProcessor, we can avoid all of this.

hubertsiuzdak added the bug and help wanted labels on Apr 5, 2024

github-actions bot commented Apr 5, 2024

Hi! Thanks for your contribution, and great first issue!

@tchaton
Collaborator

tchaton commented Apr 5, 2024

Omg @hubertsiuzdak, my sincere apologies for this. This should happen only on the Lightning AI platform. I will disable this behaviour when running outside of it.

@fdalvi

fdalvi commented Jun 8, 2024

Hello, is this fixed yet? I had a similar issue where I had files symlinked from various directories, and running optimize deleted quite a bit of data!

@tchaton
Collaborator

tchaton commented Jun 8, 2024

Hey @fdalvi, I didn't have time to look into it. It should be quite easy to fix. It is coming from here: https://github.com/Lightning-AI/litdata/blob/main/src/litdata/processing/data_processor.py#L1079

Do you want to make a PR to fix it ?

@deependujha
Collaborator

Is this just about running the cleanup only when executing in a Studio, and doing nothing otherwise?

The modified code would be something like:

def _cleanup_cache(self) -> None:
    if not _IS_IN_STUDIO:
        return
    cache_dir = _get_cache_dir()

    # Cleanup the cache dir folder to avoid corrupted files from previous run to be there.
    if os.path.exists(cache_dir):
        shutil.rmtree(cache_dir, ignore_errors=True)

    os.makedirs(cache_dir, exist_ok=True)

    cache_data_dir = _get_cache_data_dir()

    # Cleanup the cache data folder to avoid corrupted files from previous run to be there.
    if os.path.exists(cache_data_dir):
        shutil.rmtree(cache_data_dir, ignore_errors=True)

    os.makedirs(cache_data_dir, exist_ok=True)

Or is there more to be done if litdata is used outside a Studio?

@tchaton
Collaborator

tchaton commented Jun 8, 2024

Looks correct for now. If users use litdata outside Studios and don't clean up the cache, it is possible they would get corrupted chunks locally and need to clean them up manually.

However, the weirdness is this:

def _get_default_cache() -> str:
    return "/cache" if _IS_IN_STUDIO else tempfile.gettempdir()


def _get_cache_dir(name: Optional[str] = None) -> str:
    """Returns the cache directory used by the Cache to store the chunks."""
    cache_dir = os.getenv("DATA_OPTIMIZER_CACHE_FOLDER", f"{_get_default_cache()}/chunks")
    if name is None:
        return cache_dir
    return os.path.join(cache_dir, name.lstrip("/"))

It should be using a tempdir outside of Studios. Can you confirm, @fdalvi?

@deependujha
Collaborator

How about logging a warning when running outside a Studio?

@fdalvi

fdalvi commented Jun 9, 2024

> It should be using a tempdir outside of Studios. Can you confirm, @fdalvi?

I'm not 100% sure I understand the question, but all of this was running locally (I'm not super familiar with Studios, but I guess that's an online offering?). I did have to set the DATA_OPTIMIZER_CACHE_FOLDER env var, since my /var/tmp was on a disk with limited capacity.

I'm happy to send in a PR once I've understood the issue and the planned solution; the fix indicated above only stops the cleanup from happening. Is that the intention, or should no cache be used at all outside of Studios?

@tchaton
Collaborator

tchaton commented Jun 9, 2024

Hey @fdalvi. Outside of Studios, it means we won't clean up the cache, so it might lead to unexpected behaviour. Another option is to change the cache path to ~/.lightning/cache.

@tchaton
Collaborator

tchaton commented Jun 13, 2024

Hey @hubertsiuzdak @fdalvi,

This should be fixed with this PR: https://github.com/Lightning-AI/litdata/pull/166/files. You can check by trying out master.

import os
import uuid
from glob import glob
import litdata
import torch

def fn(x):
    return x

if __name__ == "__main__":

    base_dir = "/tmp/litdata-example"

    for _ in range(4):  # create 4 random directories with pytorch tensors
        dir_name = os.path.join(base_dir, str(uuid.uuid4())[:8])
        os.makedirs(dir_name, exist_ok=True)
        torch.save(torch.randn(4, 4), os.path.join(dir_name, "tensor.pt"))  # Save a random pytorch tensor

    files_before = glob(os.path.join(base_dir, "*/*.pt"))
    print(files_before)  # print the paths of the saved tensors to confirm creation

    litdata.optimize(fn=fn, inputs=files_before, output_dir="output_dir", num_workers=1, chunk_bytes="64MB")

    files_after = glob(os.path.join(base_dir, "*/*.pt"))
    print(files_after)  # all the files should still be there
    assert len(files_before) == len(files_after)

Can you carefully double-check?

@hubertsiuzdak
Author

Yes, I can confirm this is fixed by #166.

@yuzc19

yuzc19 commented Jun 24, 2024

Hi @tchaton, is there a way to set DATA_OPTIMIZER_CACHE_FOLDER from the Python script rather than as an environment variable? I didn't find such an interface. Thank you!

@deependujha
Collaborator

Hi, @yuzc19,

You can set the DATA_OPTIMIZER_CACHE_FOLDER environment variable at the top of your script to specify the cache directory. This way, the cache_dir will be set to your desired directory without needing to modify the existing code.

import os

# Set your desired cache directory
os.environ["DATA_OPTIMIZER_CACHE_FOLDER"] = "/path/to/your/cache_dir"
 
# your code ...
