
fixes path handling and cache #36

Open · wants to merge 1 commit into master
Conversation

sehnem (Contributor) commented Oct 31, 2023

This PR fixes some bad syntax that I introduced in the previous PR, plus two other bugs.

  • The fs did not work for paths starting with /, making it inconsistent with the other filesystems (s3, gcs and azure);
  • In some cases the cached path was not saved with a leading /, which caused issues with _ls_from_cache (see the sketch below).
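
A minimal sketch (illustrative, not the PR's code) of the second bug: fsspec's directory cache is a plain dict keyed by the stripped path, so a listing cached under one spelling of a path is invisible to a lookup under the other.

dircache = {"folder": [{"name": "folder/file.txt"}]}

def ls_from_cache(path):
    # mirrors the dict lookup that fsspec's _ls_from_cache performs:
    # the key must match the cached spelling exactly
    return dircache.get(path)

print(ls_from_cache("folder"))   # cache hit
print(ls_from_cache("/folder"))  # None: same directory, different spelling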

gdrivefs/core.py Outdated
    @classmethod
    def _strip_protocol(cls, path):
        super_path = super()._strip_protocol(path)
        return super_path.lstrip('/')
Member

I'm surprised this is required. The filesystem has root_marker = '', implying that paths with no leading "/" should be considered canonical. Of course, this could be changed, but there should be a justification. Since no other libraries refer to gdrive contents in any URL style, the decision is up to us. Not all other backends for fsspec have "/" (but most do).
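
For context, a paraphrased sketch (simplified from memory, not the exact fsspec source) of how the default _strip_protocol relates to root_marker; note that it never strips a leading "/", which is why the override above is needed as long as root_marker stays ''.

@classmethod
def _strip_protocol(cls, path):
    # drop the protocol prefix, e.g. "gdrive://a/b" -> "a/b"
    protos = cls.protocol if isinstance(cls.protocol, tuple) else (cls.protocol,)
    for protocol in protos:
        if path.startswith(protocol + "://"):
            path = path[len(protocol) + 3:]
    path = path.rstrip("/")
    # root_marker supplies the canonical spelling of "the root":
    # "" for gdrivefs today, "/" for most POSIX-like backends
    return path or cls.root_marker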

Contributor Author

I made this change to make GDrive compatible with dlt; currently we support cloud filesystems and local files, and they all work that way. If you think a leading / is not good, I can change it in the dlt sources so it is removed there.
It is good that we have a few tests there for each filesystem, so we can ensure that all of them work as expected for a few cases; we will probably add others, and I will send fixes if needed. And we can adjust things there if we are using a bad approach too.

Member

"/" could be OK, but then we should be sure to change the root_marker, and _strip_protocol will probably fix itself without your change.

I was just reading up on dlt, which I hadn't heard of before. It sounds like there is a lot in common with intake, particularly the upcoming v2. I wonder whether there is a sufficient overlap to share resources (Intake could be a wrapper for dlt pipelines, for example).

Contributor Author

Cool! I had never heard of Intake; I will take a look at it and maybe discuss it with the team. Thanks!

Member

You may also want to see datasette, which seems to have a similar set of (API) sources, but is specialised to sqlite as the query engine. It does however run in browsers via wasm/pyodide, which is cool, and I suppose dlt would too, since duckdb does. Intake is more data science/analysis focused, or at least it has been.

gdrivefs/core.py Outdated
@@ -88,7 +88,7 @@ def connect(self, method=None):
         cred = self._connect_cache()
     elif method == 'anon':
         cred = AnonymousCredentials()
-    elif method is "service_account":
+    elif method == "service_account":
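
To see why the old line was a bug: "is" tests object identity, not equality, so the branch only matched when CPython happened to intern both strings.

a = "service_account"
b = "".join(["service_", "account"])  # equal content, distinct object

print(a == b)  # True  -> the corrected check always works
print(a is b)  # False -> the old check fails for non-interned strings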
Member

Good spot. This might have worked in very specific circumstances in tests.

It shows that this repo cannot be considered production ready without CI. I am really hoping there is some better way to do that without vcrpy.

sehnem (Contributor Author), Nov 2, 2023

I was looking at the way gcsfs handles the tokens, and it is quite different from the implementation that I used here. I am thinking of changing it to make it more compatible, as it supports the same authentication methods.
Here is what I am thinking of using; I am not sure whether I should just copy the code or make gcsfs a dependency, which would let us drop pydata_google_auth.
Or do you think I should make my own implementation for gdrive?

Member

  • no, do not make gcsfs a dep
  • gcsfs generally relies on JSONs of various types or in-GCP machine data
  • pydata_google_auth allows for the browser notification flow, which is familiar for gdrive users, but not for GCP use (and doesn't work in gcsfs, because the scopes in pydata_google_auth cannot be changed)

"do you think I should make my own implementation for gdrive"

I'm afraid so, sorry.

Contributor Author

No problem. I can use some of the code from there and just allow a path/dict to be passed in the token parameter instead of "service_account" and another attribute, to make their use more compatible, if you think that would make sense.
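
As a hedged sketch of what that could look like (hypothetical helper, not the PR's code): google-auth can load a service-account key from either an in-memory dict or a path to the JSON key file, which is the dual behaviour of the gcsfs token parameter being referred to.

from google.oauth2.service_account import Credentials

DRIVE_SCOPE = "https://www.googleapis.com/auth/drive"

def service_account_credentials(token, scopes=(DRIVE_SCOPE,)):
    # "token" may be the key itself as a dict, or a path to the key file
    if isinstance(token, dict):
        return Credentials.from_service_account_info(token, scopes=list(scopes))
    return Credentials.from_service_account_file(token, scopes=list(scopes))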

sehnem (Contributor Author) commented Nov 6, 2023

@martindurant I made some changes to bring the service_account option more in line with gcsfs; can you review them and let me know what I need to fix?
Thanks!

martindurant (Member)

Please continue to ping me here until I review - things are pretty busy right now

@@ -77,8 +73,8 @@ def __init__(self, root_file_id=None, token="browser",
         self.scopes = [scope_dict[access]]
         self.token = token
         self.spaces = spaces
-        self.root_file_id = root_file_id or 'root'
+        self.creds = creds
+        if not self.root_file_id:
Member

Where would self.root_file_id come from?

Contributor Author

It would come from a parameter, as is done for s3fs.

            with open(method) as f:
                sa_creds = json.load(f)
        except:
            raise ValueError(f"Invalid connection method or path `{method}`.")
Member

If neither of the two conditions is met, that is also an error; you would get a NameError because sa_creds isn't defined.
Actually, the message on this line isn't right either: it only fires if the path doesn't exist or doesn't contain valid JSON (while the "connection method" itself is valid).
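
One hedged way to address both points (illustrative only; the function name and message are assumptions, not the PR's final code):

import json

def _load_service_account(method):
    if isinstance(method, dict):
        return method
    if isinstance(method, str):
        try:
            with open(method) as f:
                return json.load(f)
        except (OSError, json.JSONDecodeError) as e:
            # the path-missing-or-unparseable failure, distinct from a bad method
            raise ValueError(
                f"Credentials path `{method}` is missing or not valid JSON"
            ) from e
    raise ValueError(f"Invalid connection method `{method}`.")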

        inferred_url = infer_storage_options(path)
        path = inferred_url["path"]
        path.partition("?RootId=")
        if not getattr(cls, "root_file_id", None):
Member

where is this root_file_id?

        if not getattr(cls, "root_file_id", None):
            query = inferred_url.get("url_query")
            if query:
                cls.root_file_id = query.split("RootId=")[-1]
Member

Setting on the class seems like a bad idea. What if multiple instances exist?
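
A tiny illustration of the concern (a simplified stand-in for the snippet above): class attributes are shared, so the last URL parsed silently wins for every live instance.

class FS:
    root_file_id = None

    @classmethod
    def _strip_protocol(cls, url):
        cls.root_file_id = url.split("RootId=")[-1]

FS._strip_protocol("gdrive://a?RootId=AAA")
fs_a = FS()
FS._strip_protocol("gdrive://b?RootId=BBB")
fs_b = FS()
print(fs_a.root_file_id, fs_b.root_file_id)  # BBB BBB, not AAA BBB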

Contributor Author

Yeah, that is right; I will try to figure out another way of setting it.
