Support for Shared Drives #40

Open
rhunwicks opened this issue Mar 1, 2024 · 3 comments

@rhunwicks
Contributor

rhunwicks commented Mar 1, 2024

Currently, gdrivefs doesn't support shared drives.

I have a setup like:

    import json
    import os

    root_folder: str = "gdrive://Discovery Folder/Worksheets"
    storage_options: dict = {
        "token": "service_account",
        "access": "read_only",
        "creds": json.loads(os.environ["GOOGLE_APPLICATION_CREDENTIALS"]),
        "root_file_id": "0123456789ABCDEFGH",
    }

If I attempt to access that file (using commit 2b48baa), I get the error:

FileNotFoundError: Directory 0123456789ABCDEFGH has no child named Discovery Folder

  File "./pipelines/assets/base.py", line 210, in original_files
    with p.fs.open(p.path, mode="rb") as f:
  File "./lib/python3.10/site-packages/fsspec/spec.py", line 1295, in open
    f = self._open(
  File "./lib/python3.10/site-packages/gdrivefs/core.py", line 249, in _open
    return GoogleDriveFile(self, path, mode=mode, **kwargs)
  File "./lib/python3.10/site-packages/gdrivefs/core.py", line 270, in __init__
    super().__init__(fs, path, mode, block_size, autocommit=autocommit,
  File "./lib/python3.10/site-packages/fsspec/spec.py", line 1651, in __init__
    self.size = self.details["size"]
  File "./lib/python3.10/site-packages/fsspec/spec.py", line 1664, in details
    self._details = self.fs.info(self.path)
  File "./lib/python3.10/site-packages/fsspec/spec.py", line 662, in info
    out = self.ls(path, detail=True, **kwargs)
  File "./lib/python3.10/site-packages/gdrivefs/core.py", line 174, in ls
    files = self._ls_from_cache(path)
  File "./lib/python3.10/site-packages/fsspec/spec.py", line 372, in _ls_from_cache
    raise FileNotFoundError(path)

The root_file_id is set to the folder id of a GDrive Shared Drive (see https://support.google.com/a/users/answer/7212025?hl=en).

As per https://developers.google.com/drive/api/guides/enable-shareddrives#:~:text=The%20supportsAllDrives%3Dtrue%20parameter%20informs,require%20additional%20shared%20drive%20functionality. we need to set supportsAllDrives=True and includeItemsFromAllDrives=True when calling files.list so that the API client can find files that live on a shared drive.

In my case, if I change the existing:

    def _list_directory_by_id(self, file_id, trashed=False, path_prefix=None):
        all_files = []
        page_token = None
        afields = 'nextPageToken, files(%s)' % fields
        query = f"'{file_id}' in parents  "
        if not trashed:
            query += "and trashed = false "
        while True:
            response = self.service.list(q=query,
                                         spaces=self.spaces, fields=afields,
                                         pageToken=page_token,
                                         ).execute()
            for f in response.get('files', []):
                all_files.append(_finfo_from_response(f, path_prefix))
            more = response.get('incompleteSearch', False)
            page_token = response.get('nextPageToken', None)
            if page_token is None:
                break
        return all_files

to

    def _list_directory_by_id(self, file_id, trashed=False, path_prefix=None):
        all_files = []
        page_token = None
        afields = 'nextPageToken, files(%s)' % fields
        query = f"'{file_id}' in parents  "
        if not trashed:
            query += "and trashed = false "
        while True:
            response = self.service.list(
                q=query,
                spaces=self.spaces, fields=afields,
                pageToken=page_token,
                includeItemsFromAllDrives=True,  # Required for shared drive support
                supportsAllDrives=True,    # Required for shared drive support
            ).execute()
            for f in response.get('files', []):
                all_files.append(_finfo_from_response(f, path_prefix))
            more = response.get('incompleteSearch', False)
            page_token = response.get('nextPageToken', None)
            if page_token is None:
                break
        return all_files

(note the change in the call to self.service.list)

then my code works, and the filesystem can find the file and open it successfully.

I am happy to prepare an MR, but you would need to decide whether you are happy for me to enable shared drive support in all cases, or whether you want to control it via storage_options. If it is controlled via storage_options, you would also need to decide whether it defaults to off (completely backwards compatible) or on (existing users with shared drives may see files that gdrivefs does not currently return to them).
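
To make the storage_options route concrete, here is a minimal sketch of the default-off variant. Both the `shared_drives` flag and the `list_kwargs` helper are hypothetical names made up for illustration; neither exists in gdrivefs today. Only `includeItemsFromAllDrives` and `supportsAllDrives` are real files.list parameters.

```python
def list_kwargs(query, fields, page_token=None, spaces="drive",
                shared_drives=False):
    """Build the keyword arguments for a Drive files.list call.

    `shared_drives` is a hypothetical storage option (not part of
    gdrivefs); it defaults to False to stay backwards compatible.
    """
    kwargs = {
        "q": query,
        "spaces": spaces,
        "fields": fields,
        "pageToken": page_token,
    }
    if shared_drives:
        # Only opt in to shared drive items when the user asked for them.
        kwargs["includeItemsFromAllDrives"] = True
        kwargs["supportsAllDrives"] = True
    return kwargs
```

Inside `_list_directory_by_id` the call would then become something like `self.service.list(**list_kwargs(query, afields, page_token, self.spaces, shared_drives)).execute()`, assuming the flag is stored on the filesystem instance.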

@rhunwicks
Contributor Author

Actually, I see there was already a request for this in #26.

@martindurant
Member

Yes, exactly so. I believe this is well worth adding, but I am unsure how to expose the option to users. I suspect that simply checking all possible drives every time would be a substantial slowdown, but I am happy to be told otherwise.

@rhunwicks
Contributor Author

@martindurant when you say "checking all possible drives" do you mean in the drives property, or in _list_directory_by_id?

I've only just started using gdrivefs, but it seems that you need to specify an exact path from the root folder set in the storage options, so I don't think enabling shared drives universally would be any slower. If you don't set the shared drive folder (or one of its subfolders) as the root_file_id in storage_options, then the filesystem won't be searching it.

And the mechanism that finds the exact file id executes one request/response per path segment, so its performance seems to depend on how many levels deep your path is below the root_file_id rather than on how many other folders there are that don't match the path.
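
As a rough illustration of that point, here is a simplified model of per-segment lookup. It is not the actual gdrivefs code (which also caches listings); `resolve_path`, `list_children`, and the `TREE` data are hypothetical stand-ins, with `list_children` playing the role of one files.list request.

```python
def resolve_path(list_children, root_id, path):
    """Walk `path` one segment at a time, starting from `root_id`.

    `list_children(folder_id)` stands in for one files.list request and
    returns a {name: file_id} mapping. Returns (file_id, request_count).
    """
    current = root_id
    requests = 0
    for segment in path.strip("/").split("/"):
        if not segment:
            continue
        children = list_children(current)  # one request per path segment
        requests += 1
        if segment not in children:
            raise FileNotFoundError(
                f"Directory {current} has no child named {segment}"
            )
        current = children[segment]
    return current, requests


# Fake drive tree used only for this illustration.
TREE = {
    "ROOT": {"Discovery Folder": "F1", "Unrelated": "F2"},
    "F1": {"Worksheets": "F3"},
    "F2": {"a": "X1", "b": "X2"},
    "F3": {},
}

file_id, n = resolve_path(lambda fid: TREE[fid], "ROOT",
                          "Discovery Folder/Worksheets")
```

In this model the "Unrelated" folder is never listed, so the request count follows the path depth (two segments, two requests) regardless of how many sibling folders exist.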

rhunwicks added a commit to American-Institutes-for-Research/gdrivefs that referenced this issue Mar 2, 2024