Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up is_dir on entries returned by iterdir and glob #176

Open
analog-cbarber opened this issue Oct 6, 2021 · 6 comments · May be fixed by #275
Open

Speed up is_dir on entries returned by iterdir and glob #176

analog-cbarber opened this issue Oct 6, 2021 · 6 comments · May be fixed by #275

Comments

@analog-cbarber
Copy link

The is_dir check is fairly expensive, but at least for S3 and Azure when the entries were created as a result of the client's _list_dir method, you can tell for each entry whether it is a directory or a file and immediately set the result on the created CloudPath instance.

For example for the S3Client._list_dir, you could write something like:

            paginator = self.client.get_paginator("list_objects_v2")

            for result in paginator.paginate(
                Bucket=cloud_path.bucket, Prefix=prefix, Delimiter="/", MaxKeys=1000
            ):

                # sub directory names
                for result_prefix in result.get("CommonPrefixes", []):
                    path = S3Path(f"s3://{cloud_path.bucket}/{result_prefix.get('Prefix')}")
                    path._is_dir = True
                    yield path

                # files in the directory
                for result_key in result.get("Contents", []):
                    path = S3Path(f"s3://{cloud_path.bucket}/{result_key.get('Key')}")
                    path._is_dir = False
                    yield path

and modify S3Path.is_dir:

    def is_dir(self) -> bool:
        if self._is_dir is None:
            self._is_dir = self.client._is_file_or_dir(self) == "dir"
        return self._is_dir

This makes a HUGE performance difference if you need to call is_dir on the entries returned from iterdir or glob (in my case, when implementing a file dialog that works for cloud paths).

Not sure if this particular implementation is the best way to do this, but something like this is needed.

@pjbull
Copy link
Member

pjbull commented Oct 7, 2021

It's a good idea to cache the _is_file_or_dir result for the life of the CloudPath object—I do think it's unlikely to change from one to the other in normal scenarios.

As you point out, there are probably points in the code where we can populate that cache explicitly when we create the CloudPath object since we know at that point what it is.

I think we'd definitely take a PR that had this improvement with tests and a set of benchmarks showing improved timing and also reduced network calls.

@jayqi
Copy link
Member

jayqi commented Oct 7, 2021

I think we'll perhaps want to write down some principles of how we expect this caching to work. Cache invalidation is classically tricky, so we're going to want to be clear what categories of things get cached, how long it gets cached (lifetime of CloudPath object introduces a new concept not dealt with before in either cloudpathlib or regular pathlib), how to invalidate the cache, etc.

@analog-cbarber
Copy link
Author

The simplest thing to do would be to provide ways in the public API to clear or ignore the cached value. I would start out with that. E.g.:

  • Add a sync keyword on the is_dir, is_file and exists that can be used to force the call to sync.
  • Add a file_type property on CloudPath that can be 'file', 'dir', 'missing', or None (or perhaps use an enum) and allow users to set it to None.
  • Add a clear_cached method or the like to CloudPath.

If you want to get fancy, you could just add an expiration time along with suitable methods for controlling the duration. I think I would probably go with the easy way first.

This really is an important optimization because doing individual file/directory checks on thousands of entries in a large data directory kills performance.

@analog-cbarber
Copy link
Author

BTW, it probably would also be a good idea to cache other information returned in the directory entries that could be used to populate the stat() call, e.g. the file size and last modified time and perhaps the etag as well.

@analog-cbarber
Copy link
Author

Note that methods that create an entry should make sure to fix the cached state appropriately.

@analog-cbarber
Copy link
Author

You might want to instead cache this information in the client object in a weak dictionary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants