litdata.optimize accidentally deletes files from the local filesystem #93
Comments
Hi! Thanks for your contribution, great first issue!
Omg @hubertsiuzdak, my apologies for this. This should happen only on the Lightning AI platform. I will disable this behaviour if you are running it outside of it.
Hello, is this fixed yet? I had a similar issue where I had files symlinked from various directories, and running optimize deleted quite a bit of data!
Hey @fdalvi, I didn't have time to look into it. It should be quite easy to fix. It is coming from here: https://github.com/Lightning-AI/litdata/blob/main/src/litdata/processing/data_processor.py#L1079. Do you want to make a PR to fix it?
Is it all about running only if it is executing in a Studio, and doing nothing otherwise? I modified the code to be something like:

def _cleanup_cache(self) -> None:
    if not _IS_IN_STUDIO:
        return

    cache_dir = _get_cache_dir()

    # Clean up the cache dir so corrupted files from a previous run aren't left there.
    if os.path.exists(cache_dir):
        shutil.rmtree(cache_dir, ignore_errors=True)

    os.makedirs(cache_dir, exist_ok=True)

    cache_data_dir = _get_cache_data_dir()

    # Clean up the cache data folder so corrupted files from a previous run aren't left there.
    if os.path.exists(cache_data_dir):
        shutil.rmtree(cache_data_dir, ignore_errors=True)

    os.makedirs(cache_data_dir, exist_ok=True)

Or is there more to be done if they are using litdata outside of a Studio?
Looks correct for now. If users try to use litdata outside Studios and don't clean up the cache, it is possible they would get corrupted chunks locally and need to clean them up manually. However, the weirdness is this: it should be using a tempdir outside of Studios. Can you confirm, @fdalvi?
How about logging a warning for this if they are running it outside?
I'm not 100% sure I understand the question, but all of this was running locally (I'm not super familiar with Studios, but I guess that's an online offering?). I did have to set the DATA_OPTIMIZER_CACHE_FOLDER environment variable. I'm happy to send in a PR once I've understood the issue and planned solution; the solution indicated above only stops the "cleanup" from happening. Is that the intention, or to not use any cache at all when outside of a Studio?
Hey @fdalvi. Outside of Studios, it means we won't clean up the cache, so it might lead to unexpected behaviour. Another option is to change the cache path to a temporary directory.
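For illustration, the idea of switching the cache path to a temporary directory outside of Studios could look roughly like this. This is only a sketch: the helper name and the Studio cache path are assumptions, not litdata's actual implementation.

import os
import tempfile

def resolve_cache_dir(is_in_studio: bool) -> str:
    """Pick a cache directory that is always safe to wipe."""
    if is_in_studio:
        # Inside a Lightning Studio, a fixed cache location can be cleaned up freely.
        return "/cache/data_optimizer"  # illustrative path, not necessarily litdata's
    # Outside Studios, prefer an explicit override and otherwise a fresh tempdir,
    # so the cleanup step can never collide with user files.
    return os.environ.get("DATA_OPTIMIZER_CACHE_FOLDER") or tempfile.mkdtemp(prefix="litdata-cache-")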
Hey @hubertsiuzdak @fdalvi, this should be fixed with this PR: https://github.com/Lightning-AI/litdata/pull/166/files. You can check by trying out master:

import os
import uuid
from glob import glob

import litdata
import torch


def fn(x):
    return x


if __name__ == "__main__":
    base_dir = "/tmp/litdata-example"

    for _ in range(4):  # create 4 random directories with pytorch tensors
        dir_name = os.path.join(base_dir, str(uuid.uuid4())[:8])
        os.makedirs(dir_name, exist_ok=True)
        torch.save(torch.randn(4, 4), os.path.join(dir_name, "tensor.pt"))  # save a random pytorch tensor

    files_before = glob(os.path.join(base_dir, "*/*.pt"))
    print(files_before)  # print the paths of the saved tensors to confirm creation

    litdata.optimize(fn=fn, inputs=files_before, output_dir="output_dir", num_workers=1, chunk_bytes="64MB")

    files_after = glob(os.path.join(base_dir, "*/*.pt"))
    print(files_after)  # some files are gone! 👋

    assert len(files_before) == len(files_after)

Can you carefully double check?
Yes, I can confirm this is fixed by #166.
Hi @tchaton, is there a way to set DATA_OPTIMIZER_CACHE_FOLDER in the Python script rather than as an environment variable? I didn't find such an interface. Thank you!
Hi @yuzc19, you can set the DATA_OPTIMIZER_CACHE_FOLDER environment variable at the top of your script to specify the cache directory. This way, the cache_dir will be set to your desired directory without needing to modify the existing code:

import os

# Set your desired cache directory
os.environ["DATA_OPTIMIZER_CACHE_FOLDER"] = "/path/to/your/cache_dir"

# your code ...
🐛 Bug
When filepaths are passed as inputs to litdata.optimize, it attempts to resolve input_dir. This input_dir is later used in DataWorker to cache these files and manage cleanup.

But _get_input_dir is very error-prone, as it only looks at the first element of inputs:

litdata/src/litdata/processing/functions.py, Line 53 in ee69581

and assumes that input_dir is always three directories deep from the root:

litdata/src/litdata/processing/functions.py, Line 71 in ee69581

However, if our input files don't follow these assumptions, e.g. they come from different top-level directories, it can really mess things up. That's because when clearing the cache, filepaths are determined simply by replacing input_dir with cache_dir:

litdata/src/litdata/processing/data_processor.py, Lines 198 to 204 in ee69581
But if input_dir.path is not in path, replace does nothing, and then it just proceeds to delete a valid file! Removing these paths should be done with much more caution.
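To make the failure mode concrete, here is a small standalone sketch of the replace-based mapping. The paths and cache location are made up for illustration; this is not litdata's exact code.

input_dir = "/tmp/litdata-example/abc123"   # what _get_input_dir inferred from inputs[0]
cache_dir = "/cache/data_optimizer"         # hypothetical cache location
path = "/data/other_project/tensor.pt"      # a real input that is NOT under input_dir

# The cleanup logic maps an input path into the cache by string replacement.
cached_path = path.replace(input_dir, cache_dir)

# Since input_dir is not a prefix of path, replace() is a no-op,
# so cached_path still points at the original file...
assert cached_path == path

# ...and a subsequent os.remove(cached_path) deletes the user's data.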
To Reproduce
Create a directory and ensure python can save to it:
Then run a simple python script:
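A minimal sketch covering both steps, assuming the same layout as the maintainer's verification snippet above (the base directory path is illustrative):

import os
import uuid
from glob import glob

import litdata
import torch


def fn(x):
    return x


if __name__ == "__main__":
    # Create a few directories with files that python can write to.
    base_dir = "/tmp/litdata-repro"
    for _ in range(4):
        dir_name = os.path.join(base_dir, str(uuid.uuid4())[:8])
        os.makedirs(dir_name, exist_ok=True)
        torch.save(torch.randn(4, 4), os.path.join(dir_name, "tensor.pt"))

    files_before = glob(os.path.join(base_dir, "*/*.pt"))

    # On affected versions, optimize() maps input paths into its cache by string
    # replacement and may delete originals that fall outside the inferred input_dir.
    litdata.optimize(fn=fn, inputs=files_before, output_dir="output_dir", num_workers=1, chunk_bytes="64MB")

    files_after = glob(os.path.join(base_dir, "*/*.pt"))
    print(files_before, files_after)  # some of the original files may be gone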
And yes... this actually happened to me. I was quite astonished to see some of my files just deleted 🤯
Environment
Additional context
Is caching input files in litdata.optimize actually necessary? The most common use case is to retrieve a file only once during dataset preparation. If we simply set an empty input directory input_dir = Dir() in DataProcessor, we can avoid all of this.
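For reference, that suggestion would amount to something like the following. This is only a sketch: the import locations and constructor arguments are assumptions, not a verified workaround.

# Hypothetical illustration of the "empty input_dir" idea from the paragraph above.
from litdata.streaming.resolver import Dir
from litdata.processing.data_processor import DataProcessor

# With an empty Dir(), there is no input_dir to substitute into the cache path,
# so the replace-based cleanup has nothing to map (and nothing to delete).
processor = DataProcessor(input_dir=Dir(), output_dir=Dir(path="output_dir"))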