Speed up `is_dir` on entries returned by `iterdir` and `glob` #176
Comments
It's a good idea to cache the result. As you point out, there are probably points in the code where we can populate that cache explicitly when we create the path instances. I think we'd definitely take a PR that had this improvement, with tests and a set of benchmarks showing improved timing and reduced network calls.
I think we'll want to write down some principles for how we expect this caching to work. Cache invalidation is classically tricky, so we're going to want to be clear about what categories of things get cached and how long they stay cached (lifetime of …).
The simplest thing to do would be to provide ways in the public API to clear or ignore the cached value. I would start out with that. E.g.:
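A minimal sketch of what that could look like. The names here (`FakePath`, `clear_cache`, the `use_cache` argument, and the `stat_calls` counter that stands in for a network request) are hypothetical and are not part of cloudpathlib's API:

```python
from typing import Optional


class FakePath:
    """Toy stand-in for a cloud path; not the real cloudpathlib class."""

    def __init__(self, key: str, remote_is_dir: bool) -> None:
        self.key = key
        self._remote_is_dir = remote_is_dir  # what the "server" would report
        self._is_dir_cache: Optional[bool] = None
        self.stat_calls = 0  # counts simulated network round trips

    def _fetch_is_dir(self) -> bool:
        # Stand-in for the expensive remote check.
        self.stat_calls += 1
        return self._remote_is_dir

    def is_dir(self, use_cache: bool = True) -> bool:
        # Return the cached answer unless the caller asks to ignore it.
        if use_cache and self._is_dir_cache is not None:
            return self._is_dir_cache
        self._is_dir_cache = self._fetch_is_dir()
        return self._is_dir_cache

    def clear_cache(self) -> None:
        # Forget the cached answer so the next call goes to the server.
        self._is_dir_cache = None
```

With this shape, `path.is_dir(use_cache=False)` ignores the cached value for one call, while `path.clear_cache()` drops it for all later calls.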
If you want to get fancy, you could add an expiration time along with suitable methods for controlling the duration. I think I would probably go with the easy way first. This really is an important optimization, because doing individual file/directory checks on thousands of entries in a large data directory kills performance.
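For the "fancy" variant, an expiring cache entry might look roughly like the sketch below. `ExpiringFlag`, its `ttl` parameter, and the injectable `clock` are all hypothetical names chosen for illustration, not anything cloudpathlib provides:

```python
import time
from typing import Callable, Optional, Tuple


class ExpiringFlag:
    """Caches one boolean for `ttl` seconds (hypothetical helper)."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self._clock = clock
        self._entry: Optional[Tuple[bool, float]] = None  # (value, stored_at)

    def set(self, value: bool) -> None:
        # Remember the value together with the time it was stored.
        self._entry = (value, self._clock())

    def get(self) -> Optional[bool]:
        """Return the cached value, or None if it is missing or expired."""
        if self._entry is None:
            return None
        value, stored_at = self._entry
        if self._clock() - stored_at >= self.ttl:
            self._entry = None  # expired: drop it so the caller re-checks
            return None
        return value
```

Passing the clock in as a parameter keeps the expiry logic testable without sleeping.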
BTW, it would probably also be a good idea to cache other information returned in the directory entries that could be used to populate the …
Note that methods that create an entry should make sure to fix the cached state appropriately.
You might want to instead cache this information in the client object in a weak dictionary.
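A rough sketch of the weak-dictionary idea, using the standard library's `weakref.WeakKeyDictionary`. `TrackedPath`, `WeakCacheClient`, and the `remember`/`lookup`/`forget` methods are hypothetical; the sketch also assumes path objects are hashable and weak-referenceable, which is true for this toy class:

```python
import weakref
from typing import Optional


class TrackedPath:
    """Toy path object; plain instances are hashable and weak-referenceable."""

    def __init__(self, key: str) -> None:
        self.key = key


class WeakCacheClient:
    """Toy client that keeps per-path state without keeping paths alive."""

    def __init__(self) -> None:
        # Entries disappear automatically once the path object is collected.
        self._is_dir_cache = weakref.WeakKeyDictionary()

    def remember(self, path: TrackedPath, is_dir: bool) -> None:
        # Called wherever the client learns (or changes) a path's state.
        self._is_dir_cache[path] = is_dir

    def lookup(self, path: TrackedPath) -> Optional[bool]:
        # None means "not cached"; the caller must do the real check.
        return self._is_dir_cache.get(path)

    def forget(self, path: TrackedPath) -> None:
        self._is_dir_cache.pop(path, None)
```

Keeping the cache on the client means a method that creates or removes an entry can call `remember` or `forget` in one place, and the cache never outlives the path objects it describes.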
The `is_dir` check is fairly expensive, but at least for S3 and Azure, when the entries were created as a result of the client's `_list_dir` method, you can tell for each entry whether it is a directory or a file and immediately set the result on the created CloudPath instance.

For example, `S3Client._list_dir` could record that result on each path it creates, and `S3Path.is_dir` could be modified to return the recorded value when one is present.

This makes a HUGE performance difference if you need to call `is_dir` on the entries returned from `iterdir` or `glob` (in my case, when implementing a file dialog that works for cloud paths).

Not sure if this particular implementation is the best way to do this, but something like this is needed.
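A minimal sketch of the idea, using toy in-memory classes rather than the real `S3Client` and `S3Path`. Every name here (`ListingClient`, `ListedPath`, `known_is_dir`, the `requests` counter) is hypothetical and only illustrates the pattern:

```python
from typing import Dict, Iterator, Optional


class ListingClient:
    """In-memory stand-in for a cloud client; counts simulated requests."""

    def __init__(self, entries: Dict[str, bool]) -> None:
        self._entries = entries  # key -> is_dir, as a listing would report it
        self.requests = 0

    def list_dir(self, prefix: str) -> Iterator["ListedPath"]:
        self.requests += 1  # one listing request covers every entry
        for key, is_dir in self._entries.items():
            if key.startswith(prefix):
                # Record what the listing already told us on the new path.
                yield ListedPath(self, key, known_is_dir=is_dir)

    def is_dir(self, key: str) -> bool:
        self.requests += 1  # one request per individual check
        return self._entries[key]


class ListedPath:
    """Toy path that remembers whether it is a directory once known."""

    def __init__(self, client: ListingClient, key: str,
                 known_is_dir: Optional[bool] = None) -> None:
        self.client = client
        self.key = key
        self._is_dir = known_is_dir

    def is_dir(self) -> bool:
        # Only ask the client when the answer was not supplied at creation.
        if self._is_dir is None:
            self._is_dir = self.client.is_dir(self.key)
        return self._is_dir
```

In this toy setup, calling `is_dir()` on every path produced by `list_dir` costs no requests beyond the listing itself, while a path constructed directly still falls back to the individual check.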