
fixes path handling and cache #36

Open · wants to merge 1 commit into master
Conversation

sehnem (Contributor) commented Oct 31, 2023

This PR fixes some bad syntax that I introduced in the previous PR, plus two other bugs.

  • The fs did not work for paths starting with /, making it inconsistent with the other filesystems (s3, gcs and azure);
  • In some cases the cached path was not saved with a leading /, which caused issues with _ls_from_cache (see the sketch below).
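
A minimal sketch (illustrative, not the PR's code) of the second bug: fsspec's directory cache is a plain dict keyed by the stripped path, so a listing cached under one spelling of a path is invisible to a lookup under the other.

dircache = {"folder": [{"name": "folder/file.txt"}]}

def ls_from_cache(path):
    # mirrors the dict lookup that fsspec's _ls_from_cache performs:
    # the key must match the cached spelling exactly
    return dircache.get(path)

print(ls_from_cache("folder"))   # cache hit
print(ls_from_cache("/folder"))  # None: same directory, different spelling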

gdrivefs/core.py Outdated
    @classmethod
    def _strip_protocol(cls, path):
        super_path = super()._strip_protocol(path)
        return super_path.lstrip('/')
Member

I'm surprised this is required. The filesystem has root_marker = '', implying that paths with no leading "/" should be considered canonical. Of course, this could be changed, but there should be a justification. Since no other libraries refer to gdrive contents in any URL style, the decision is up to us. Not all other backends for fsspec have "/" (but most do).
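
For context, a paraphrased sketch (simplified from memory, not the exact fsspec source) of how the default _strip_protocol relates to root_marker; note that it never strips a leading "/", which is why the override above is needed as long as root_marker stays ''.

@classmethod
def _strip_protocol(cls, path):
    # drop the protocol prefix, e.g. "gdrive://a/b" -> "a/b"
    protos = cls.protocol if isinstance(cls.protocol, tuple) else (cls.protocol,)
    for protocol in protos:
        if path.startswith(protocol + "://"):
            path = path[len(protocol) + 3:]
    path = path.rstrip("/")
    # root_marker supplies the canonical spelling of "the root":
    # "" for gdrivefs today, "/" for most POSIX-like backends
    return path or cls.root_marker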

Contributor Author

I made this change to make GDrive compatible with dlt; currently we support cloud filesystems and local files, and they all work that way. If you think a leading / is not good, I can change it in the dlt sources so it is removed there.
It is good that we have a few tests there for each filesystem, so we can ensure that all of them work as expected for a few cases; we will probably add others, and I will send fixes if needed. And we can adjust things there if we are using a bad approach too.

Member

"/" could be OK, but then we should be sure to change the root_marker, and _strip_protocol will probably fix itself without your change.

I was just reading up on dlt, which I hadn't heard of before. It sounds like there is a lot in common with intake, particularly the upcoming v2. I wonder whether there is a sufficient overlap to share resources (Intake could be a wrapper for dlt pipelines, for example).

Contributor Author

Cool! I had never heard of Intake; I will take a look at it and maybe discuss it with the team. Thanks!

Member

You may also want to see datasette, which seems to have a similar set of (API) sources, but is specialised to sqlite as the query engine. It does however run in browsers via wasm/pyodide, which is cool, and I suppose dlt would too, since duckdb does. Intake is more data science/analysis focused, or at least it has been.

gdrivefs/core.py Outdated
@@ -88,7 +88,7 @@ def connect(self, method=None):
         cred = self._connect_cache()
     elif method == 'anon':
         cred = AnonymousCredentials()
-    elif method is "service_account":
+    elif method == "service_account":
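
To see why the old line was a bug: "is" tests object identity, not equality, so the branch only matched when CPython happened to intern both strings.

a = "service_account"
b = "".join(["service_", "account"])  # equal content, distinct object

print(a == b)  # True  -> the corrected check always works
print(a is b)  # False -> the old check fails for non-interned strings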
Member

Good spot. This might have worked in very specific circumstances in tests.

It shows that this repo cannot be considered production ready without CI. I am really hoping there is some better way to do that without vcrpy.

sehnem (Contributor Author), Nov 2, 2023

I was looking at the way gcsfs handles the tokens, and it is quite different from the implementation that I used here. I am thinking of changing it to make it more compatible, as it supports the same authentication methods.
Here is what I am thinking of using; I am not sure whether I should just copy the code or make gcsfs a dependency, which would let us drop pydata_google_auth.
Or do you think I should make my own implementation for gdrive?

Member

  • no, do not make gcsfs a dep
  • gcsfs generally relies on JSONs of various types or in-GCP machine data
  • pydata_google_auth allows for the browser notification flow, which is familiar for gdrive users, but not for GCP use (and doesn't work in gcsfs, because the scopes in pydata_google_auth cannot be changed)

"do you think I should make my own implementation for gdrive"

I'm afraid so, sorry.

Contributor Author

No problem. I can use some of the code from there and just allow a path/dict to be passed in the token parameter instead of "service_account" and another attribute, to make their use more compatible, if you think that would make sense.
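
As a hedged sketch of what that could look like (hypothetical helper, not the PR's code): google-auth can load a service-account key from either an in-memory dict or a path to the JSON key file, which is the dual behaviour of the gcsfs token parameter being referred to.

from google.oauth2.service_account import Credentials

DRIVE_SCOPE = "https://www.googleapis.com/auth/drive"

def service_account_credentials(token, scopes=(DRIVE_SCOPE,)):
    # "token" may be the key itself as a dict, or a path to the key file
    if isinstance(token, dict):
        return Credentials.from_service_account_info(token, scopes=list(scopes))
    return Credentials.from_service_account_file(token, scopes=list(scopes))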

sehnem (Contributor Author) commented Nov 6, 2023

@martindurant I made some changes to bring the service_account option more in line with gcsfs; can you review them and let me know what I need to fix?
Thanks!

martindurant (Member)

Please continue to ping me here until I review - things are pretty busy right now

@@ -77,8 +73,8 @@ def __init__(self, root_file_id=None, token="browser",
         self.scopes = [scope_dict[access]]
         self.token = token
         self.spaces = spaces
-        self.root_file_id = root_file_id or 'root'
+        self.creds = creds
+        if not self.root_file_id:
Member

Where would self.root_file_id come from?

Contributor Author

It would come from a parameter, as is done for s3fs.

            with open(method) as f:
                sa_creds = json.load(f)
        except:
            raise ValueError(f"Invalid connection method or path `{method}`.")
Member

If neither of the two conditions is met, that is also an error; you would get a NameError because sa_creds isn't defined.
Actually, the message on this line isn't right either: it only fires if the path doesn't exist or doesn't contain valid JSON (while the "connection method" itself is valid).
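
One hedged way to address both points (illustrative only; the function name and message are assumptions, not the PR's final code):

import json

def _load_service_account(method):
    if isinstance(method, dict):
        return method
    if isinstance(method, str):
        try:
            with open(method) as f:
                return json.load(f)
        except (OSError, json.JSONDecodeError) as e:
            # the path-missing-or-unparseable failure, distinct from a bad method
            raise ValueError(
                f"Credentials path `{method}` is missing or not valid JSON"
            ) from e
    raise ValueError(f"Invalid connection method `{method}`.")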

        inferred_url = infer_storage_options(path)
        path = inferred_url["path"]
        path.partition("?RootId=")
        if not getattr(cls, "root_file_id", None):
Member

where is this root_file_id?

        if not getattr(cls, "root_file_id", None):
            query = inferred_url.get("url_query")
            if query:
                cls.root_file_id = query.split("RootId=")[-1]
Member

Setting on the class seems like a bad idea. What if multiple instances exist?
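
A tiny illustration of the concern (a simplified stand-in for the snippet above): class attributes are shared, so the last URL parsed silently wins for every live instance.

class FS:
    root_file_id = None

    @classmethod
    def _strip_protocol(cls, url):
        cls.root_file_id = url.split("RootId=")[-1]

FS._strip_protocol("gdrive://a?RootId=AAA")
fs_a = FS()
FS._strip_protocol("gdrive://b?RootId=BBB")
fs_b = FS()
print(fs_a.root_file_id, fs_b.root_file_id)  # BBB BBB, not AAA BBB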

Contributor Author

Yeah, that is right; I will try to figure out another way of setting it.
