fixes path handling and cache #36
base: master
Conversation
gdrivefs/core.py
Outdated
@classmethod
def _strip_protocol(cls, path):
    super_path = super()._strip_protocol(path)
    return super_path.lstrip('/')
I'm surprised this is required. The filesystem has root_marker = '', implying that paths with no leading "/" should be considered canonical. Of course, this could be changed, but there should be a justification. Since no other libraries refer to gdrive contents in any URL style, the decision is up to us. Not all other backends for fsspec have "/" (but most do).
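For context, a minimal sketch (an illustrative subclass, not the real gdrivefs code) of how fsspec's base _strip_protocol interacts with root_marker; note that it does not remove a leading "/", which is what the lstrip('/') above works around:

```python
from fsspec import AbstractFileSystem

class DemoDriveFS(AbstractFileSystem):
    # Illustrative only; gdrivefs defines its own protocol and root_marker.
    protocol = "gdrive"
    root_marker = ""  # empty marker: canonical paths carry no leading "/"

# The base implementation strips the protocol prefix and falls back to
# root_marker only for the empty path:
print(DemoDriveFS._strip_protocol("gdrive://folder/file.txt"))  # folder/file.txt
print(DemoDriveFS._strip_protocol("gdrive:///folder"))          # /folder  (slash kept)
print(DemoDriveFS._strip_protocol("gdrive://"))                 # ""  (root_marker)
```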
I made this change to make GDrive compatible with dlt. Currently we support cloud filesystems and local files, and they work like that; if you think that a leading "/" is not good, I can change the dlt sources to remove it there.
It is good that we have a few tests there for each filesystem, so we can ensure that all of them work as expected for a few cases; we will probably add others, and I will send fixes if needed. We can also adjust things there if we are using a bad approach.
"/" could be OK, but then we should be sure to change the root_marker, and _strip_protocol will probably fix itself without your change.
I was just reading up on dlt
, which I hadn't heard of before. It sounds like there is a lot in common with intake
, particularly the upcoming v2. I wonder whether there is a sufficient overlap to share resources (Intake could be a wrapper for dlt pipelines, for example).
Cool! I had never heard of Intake; I will take a look at it and maybe discuss it with the team. Thanks!
You may also want to see datasette, which seems to have a similar set of (API) sources, but is specialised to sqlite as the query engine. It does, however, run in browsers via wasm/pyodide, which is cool - and I suppose dlt would too, since duckdb does. Intake is more data science/analysis focussed, or at least it has been.
gdrivefs/core.py
Outdated
@@ -88,7 +88,7 @@ def connect(self, method=None):
             cred = self._connect_cache()
         elif method == 'anon':
             cred = AnonymousCredentials()
-        elif method is "service_account":
+        elif method == "service_account":
Good spot. This might have worked in very specific circumstances in tests.
It shows that this repo cannot be considered production-ready without CI. I am really hoping there is some better way to do that without vcrpy.
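For the record, a quick sketch of why `is` is a bug here: `is` compares object identity, and the original code only appeared to work because CPython happens to intern some string literals:

```python
a = "service_account"
b = "service_" + "account".lower()  # built at runtime, so not interned

print(a == b)  # True: same contents
print(a is b)  # False in CPython: two distinct string objects
```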
I was looking at the way gcsfs handles the tokens, and it is quite different from the implementation that I used here. I am thinking of changing it to make it more compatible, as it supports the same authentication methods.
Here is what I am thinking of using; I am not sure if I should just copy the code or whether I can make gcsfs a dependency - it should remove pydata_google_auth if so.
Or do you think I should make my own implementation for gdrive?
- no, do not make gcsfs a dep
- gcsfs generally relies on JSONs of various types or in-GCP machine data
- pydata_google_auth allows for the browser notification flow, which is familiar for gdrive users, but not for GCP use (and doesn't work in gcsfs, because the scopes in pydata_google_auth cannot be changed)

> do you think I should make my own implementation for gdrive

I'm afraid so, sorry.
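For a standalone gdrive implementation, loading service-account credentials directly with google-auth could look like this minimal sketch (the scope URL and key-file path are assumptions, not gdrivefs's actual values):

```python
import json

from google.oauth2 import service_account

SCOPES = ["https://www.googleapis.com/auth/drive"]  # assumed Drive scope

# Hypothetical key-file path supplied by the user.
with open("service-account.json") as f:
    sa_info = json.load(f)

creds = service_account.Credentials.from_service_account_info(
    sa_info, scopes=SCOPES
)
```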
No problem. I can use some of the code from there and just allow it to use a path/dict passed in the token parameter instead of "service_account", plus another attribute, just to make the use of them more compatible, if you think that would make sense.
@martindurant I made some changes to make the service_account option more in line with the gcs file storage; can you review the changes and let me know what I need to fix?
Please continue to ping me here until I review - things are pretty busy right now
@@ -77,8 +73,8 @@ def __init__(self, root_file_id=None, token="browser",
         self.scopes = [scope_dict[access]]
         self.token = token
         self.spaces = spaces
-        self.root_file_id = root_file_id or 'root'
+        self.creds = creds
+        if not self.root_file_id:
Where would self.root_file_id come from?
It would come from a parameter, like it is done in s3fs.
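For example, the intended usage would presumably look like this (the protocol name follows the rest of this thread; the file id is hypothetical):

```python
import fsspec

# root_file_id scopes the filesystem to one Drive folder id,
# analogous to per-instance parameters in s3fs.
fs = fsspec.filesystem("gdrive", root_file_id="1AbCdEfGhIjK")
```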
        with open(method) as f:
            sa_creds = json.load(f)
    except:
        raise ValueError(f"Invalid connection method or path `{method}`.")
If neither of the two conditions is met, that is also an error; you would get a NameError because sa_creds isn't defined.
Actually, the message on this line isn't right - it only happens if the path doesn't exist or doesn't contain valid JSON (but the "connection method" is valid).
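A possible restructuring that addresses both points - no NameError for an unrecognized method, and a precise message for the JSON failure - as a sketch (the helper name and the dict-or-path convention are assumptions, not the PR's code):

```python
import json
import os

def _load_service_account(method):
    # `method` may be a credentials dict or a path to a JSON key file;
    # anything else is an invalid connection method.
    if isinstance(method, dict):
        return method
    if isinstance(method, str) and os.path.exists(method):
        try:
            with open(method) as f:
                return json.load(f)
        except json.JSONDecodeError:
            raise ValueError(f"File `{method}` does not contain valid JSON.")
    raise ValueError(f"Invalid connection method or path `{method}`.")
```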
inferred_url = infer_storage_options(path)
path = inferred_url["path"]
path.partition("?RootId=")
if not getattr(cls, "root_file_id", None):
where is this root_file_id?
if not getattr(cls, "root_file_id", None):
    query = inferred_url.get("url_query")
    if query:
        cls.root_file_id = query.split("RootId=")[-1]
Setting on the class seems like a bad idea. What if multiple instances exist?
Yeah, that is right; I will try to figure out another way of setting it.
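One instance-level alternative, sketched under the assumption that the RootId stays in the URL query: implement fsspec's _get_kwargs_from_urls hook, so the id is parsed per URL and reaches __init__ as an ordinary constructor argument instead of a class attribute:

```python
from fsspec.utils import infer_storage_options

class GoogleDriveFileSystem:  # sketch; the real class subclasses AbstractFileSystem
    @staticmethod
    def _get_kwargs_from_urls(path):
        # fsspec calls this hook to turn URL parts into constructor kwargs,
        # so every instance gets its own root_file_id.
        out = {}
        query = infer_storage_options(path).get("url_query")
        if query and "RootId=" in query:
            out["root_file_id"] = query.split("RootId=")[-1]
        return out
```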
This PR fixes a bad syntax that I sent in the previous PR, and two other bugs:
- paths kept a leading /, which was not consistent with other filesystems (s3, gcs and azure);
- the leading / was also stored in the cache, and it was causing issues with the _ls_from_cache.