Speed up `is_dir` on entries returned by `iterdir` and `glob` #176

analog-cbarber · 2021-10-06T20:10:12Z

The is_dir check is fairly expensive, but at least for S3 and Azure when the entries were created as a result of the client's _list_dir method, you can tell for each entry whether it is a directory or a file and immediately set the result on the created CloudPath instance.

For example for the S3Client._list_dir, you could write something like:

            paginator = self.client.get_paginator("list_objects_v2")

            for result in paginator.paginate(
                Bucket=cloud_path.bucket, Prefix=prefix, Delimiter="/", MaxKeys=1000
            ):

                # sub directory names
                for result_prefix in result.get("CommonPrefixes", []):
                    path = S3Path(f"s3://{cloud_path.bucket}/{result_prefix.get('Prefix')}")
                    path._is_dir = True
                    yield path

                # files in the directory
                for result_key in result.get("Contents", []):
                    path = S3Path(f"s3://{cloud_path.bucket}/{result_key.get('Key')}")
                    path._is_dir = False
                    yield path

and modify S3Path.is_dir:

    def is_dir(self) -> bool:
        if self._is_dir is None:
            self._is_dir = self.client._is_file_or_dir(self) == "dir"
        return self._is_dir

This makes a HUGE performance difference if you need to call is_dir on the entries returned from iterdir or glob (in my case, when implementing a file dialog that works for cloud paths).

Not sure if this particular implementation is the best way to do this, but something like this is needed.

The text was updated successfully, but these errors were encountered:

pjbull · 2021-10-07T03:53:40Z

It's a good idea to cache the _is_file_or_dir result for the life of the CloudPath object—I do think it's unlikely to change from one to the other in normal scenarios.

As you point out, there are probably points in the code where we can populate that cache explicitly when we create the CloudPath object since we know at that point what it is.

I think we'd definitely take a PR that had this improvement with tests and a set of benchmarks showing improved timing and also reduced network calls.

jayqi · 2021-10-07T14:01:11Z

I think we'll perhaps want to write down some principles of how we expect this caching to work. Cache invalidation is classically tricky, so we're going to want to be clear what categories of things get cached, how long it gets cached (lifetime of CloudPath object introduces a new concept not dealt with before in either cloudpathlib or regular pathlib), how to invalidate the cache, etc.

analog-cbarber · 2021-10-07T14:54:05Z

The simplest thing to do would be to provide ways in the public API to clear or ignore the cached value. I would start out with that. E.g.:

Add a sync keyword on the is_dir, is_file and exists that can be used to force the call to sync.
Add a file_type property on CloudPath that can be 'file', 'dir', 'missing', or None (or perhaps use an enum) and allow users to set it to None.
Add a clear_cached method or the like to CloudPath.

If you want to get fancy, you could just add an expiration time along with suitable methods for controlling the duration. I think I would probably go with the easy way first.

This really is an important optimization because doing individual file/directory checks on thousands of entries in a large data directory kills performance.

analog-cbarber · 2021-10-07T15:47:33Z

BTW, it probably would also be a good idea to cache other information returned in the directory entries that could be used to populate the stat() call, e.g. the file size and last modified time and perhaps the etag as well.

analog-cbarber · 2021-10-07T18:33:01Z

Note that methods that create an entry should make sure to fix the cached state appropriately.

analog-cbarber · 2021-10-07T18:48:46Z

You might want to instead cache this information in the client object in a weak dictionary.

pjbull mentioned this issue Feb 5, 2022

Re-implement glob logic #202

Merged

This was referenced Oct 7, 2022

glob and rglob can be slow for large directories or lots of files #274

Closed

Caching file/dir metadata for GSClient #275

Open

pjbull mentioned this issue Dec 18, 2022

Use list_objects_v2 API for S3 #302

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up `is_dir` on entries returned by `iterdir` and `glob` #176

Speed up `is_dir` on entries returned by `iterdir` and `glob` #176

analog-cbarber commented Oct 6, 2021

pjbull commented Oct 7, 2021

jayqi commented Oct 7, 2021

analog-cbarber commented Oct 7, 2021

analog-cbarber commented Oct 7, 2021

analog-cbarber commented Oct 7, 2021

analog-cbarber commented Oct 7, 2021

Speed up is_dir on entries returned by iterdir and glob #176

Speed up is_dir on entries returned by iterdir and glob #176

Comments

analog-cbarber commented Oct 6, 2021

pjbull commented Oct 7, 2021

jayqi commented Oct 7, 2021

analog-cbarber commented Oct 7, 2021

analog-cbarber commented Oct 7, 2021

analog-cbarber commented Oct 7, 2021

analog-cbarber commented Oct 7, 2021

Speed up `is_dir` on entries returned by `iterdir` and `glob` #176

Speed up `is_dir` on entries returned by `iterdir` and `glob` #176