
#484 Add GSClient timeout support #485

Open · wants to merge 2 commits into master
Conversation

@Mchristos Mchristos commented Nov 4, 2024

Adds a GSClient option to set a timeout in seconds. When using google.cloud.storage directly, it is possible to increase the timeout from the default of 60 seconds, but cloudpathlib provides no way to tweak it. This makes it impossible, for example, to upload larger files over a slow internet connection with cloudpathlib.

Closes #484


Contributor checklist:

  • I have read and understood CONTRIBUTING.md
  • Confirmed an issue exists for the PR, and the text Closes #issue appears in the PR summary (e.g., Closes #123).
  • Confirmed PR is rebased onto the latest base
  • Confirmed failure before change and success after change
  • Any generic new functionality is replicated across cloud providers if necessary
  • Tested manually against live server backend for at least one provider
  • Added tests for any new functionality
  • Linting passes locally
  • Tests pass locally
  • Updated HISTORY.md with the issue that is addressed and the PR you are submitting. If the top section is not `## UNRELEASED`, then you need to add a new section to the top of the document for your change.

Usage:

```python
import os

from cloudpathlib import CloudPath, GSClient

client = GSClient(timeout=0.00001)
with CloudPath("gs://path/to/test_file.txt", client=client).open("wb") as f:
    f.write(os.urandom(10_000_000))
# Raises a timeout error
```

@pjbull (Member) commented Nov 30, 2024

Thanks @Mchristos! Code looks reasonable, but I was wondering if there are other kwargs that are either shared or that we should accept as upload_kwargs and download_kwargs to the client and use in the respective methods so that we can support all of the potential kwargs. Would you mind checking the SDK and seeing what makes sense?
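A minimal stdlib-only sketch of the pattern suggested here: the client stores generic upload/download kwargs and merges them into each SDK call. All names (`SketchGSClient`, `_upload_file`) are hypothetical stand-ins for discussion, not cloudpathlib's actual implementation.

```python
# Hypothetical sketch: a client that stores generic per-direction kwargs
# and forwards them to the underlying SDK calls. Names are illustrative,
# not cloudpathlib's real API.
class SketchGSClient:
    def __init__(self, timeout=None, upload_kwargs=None, download_kwargs=None):
        self.upload_kwargs = dict(upload_kwargs or {})
        self.download_kwargs = dict(download_kwargs or {})
        # A bare `timeout` applies to both directions unless overridden.
        if timeout is not None:
            self.upload_kwargs.setdefault("timeout", timeout)
            self.download_kwargs.setdefault("timeout", timeout)

    def _upload_file(self, filename):
        # The real client would call
        # blob.upload_from_filename(filename, **self.upload_kwargs);
        # here we just return the kwargs that would be forwarded.
        return {"filename": filename, **self.upload_kwargs}


client = SketchGSClient(timeout=300, upload_kwargs={"checksum": "md5"})
print(client._upload_file("big.bin"))
# → {'filename': 'big.bin', 'checksum': 'md5', 'timeout': 300}
```

This keeps the client's constructor small while still letting users pass any kwarg the SDK methods accept.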


codecov bot commented Nov 30, 2024

Codecov Report

Attention: Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.

Project coverage is 93.2%. Comparing base (7f1a5dc) to head (9b034b7).

Files with missing lines: cloudpathlib/gs/gsclient.py (patch 85.7%, 1 line missing ⚠️)

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           master    #485     +/-   ##
========================================
- Coverage    94.4%   93.2%   -1.3%
========================================
  Files          23      23
  Lines        1776    1780      +4
========================================
- Hits         1678    1659     -19
- Misses         98     121     +23
```

File coverage: cloudpathlib/gs/gsclient.py 89.6% <85.7%> (-1.9%) ⬇️

... and 2 files with indirect coverage changes

@Mchristos (Author) commented

> Thanks @Mchristos! Code looks reasonable, but I was wondering if there are other kwargs that are either shared or that we should accept as upload_kwargs and download_kwargs to the client and use in the respective methods so that we can support all of the potential kwargs. Would you mind checking the SDK and seeing what makes sense?

Hi @pjbull, thanks a lot for reviewing my code! Below are the three function signatures we use, along with the kwargs common to all of them. I think only timeout and retry are relevant, so if it makes sense to you, I will add retry and be done with it.

The other option is to put the onus on the user to specify a valid GCP kwarg, but then of course we may run into mismatched kwargs between these functions (possibly even in a future google-cloud-storage release).

Function signatures:

```
download_to_filename(filename, client=None, start=None, end=None, raw_download=False, if_etag_match=None, if_etag_not_match=None, if_generation_match=None, if_generation_not_match=None, if_metageneration_match=None, if_metageneration_not_match=None, timeout=60, checksum='md5', retry=<google.api_core.retry.retry_unary.Retry object>)

upload_from_filename(filename, content_type=None, num_retries=None, client=None, predefined_acl=None, if_generation_match=None, if_generation_not_match=None, if_metageneration_match=None, if_metageneration_not_match=None, timeout=60, checksum=None, retry=<google.cloud.storage.retry.ConditionalRetryPolicy object>)

copy_blob(blob, destination_bucket, new_name=None, client=None, preserve_acl=True, source_generation=None, if_generation_match=None, if_generation_not_match=None, if_metageneration_match=None, if_metageneration_not_match=None, if_source_generation_match=None, if_source_generation_not_match=None, if_source_metageneration_match=None, if_source_metageneration_not_match=None, timeout=60, retry=<google.cloud.storage.retry.ConditionalRetryPolicy object>)
```

Common kwargs between these are:

```python
[
    "client",
    "if_generation_match",
    "if_generation_not_match",
    "if_metageneration_match",
    "if_metageneration_not_match",
    "timeout",
    "retry",
]
```
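For what it's worth, this common list can be derived mechanically with `inspect`. A stdlib-only sketch, using abbreviated toy stand-ins for the real Blob methods so it runs without google-cloud-storage installed:

```python
# Derive the keyword arguments shared by several functions by
# intersecting their signatures. The toy signatures below are
# abbreviated stand-ins for the real google.cloud.storage methods.
import inspect

def download_to_filename(filename, client=None, timeout=60, checksum="md5", retry=None): ...
def upload_from_filename(filename, client=None, timeout=60, checksum=None, retry=None, content_type=None): ...
def copy_blob(blob, destination_bucket, client=None, timeout=60, retry=None, new_name=None): ...

sigs = [inspect.signature(f) for f in (download_to_filename, upload_from_filename, copy_blob)]
# Keep only parameters that have a default (i.e. are optional kwargs),
# then intersect across all three signatures.
common = set.intersection(*(
    {name for name, p in s.parameters.items() if p.default is not inspect.Parameter.empty}
    for s in sigs
))
print(sorted(common))  # → ['client', 'retry', 'timeout']
```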

Of these, the client kwarg is redundant, and the kwargs around generation and metageneration matches are not relevant, since these are related to copying between blobs, and I don't think we need to support that.

That leaves timeout and retry. It seems relatively involved to add a retry configuration, but provided the user is responsible for constructing it, I don't see why not:


@pjbull let me know if you agree and I will include the retry kwarg.
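For intuition, here is a stdlib-only sketch of what a user-supplied retry object does. The real `google.api_core.retry.Retry` additionally supports exception predicates, deadlines, and jittered backoff, so this is purely illustrative:

```python
# Toy retry object: wraps a callable, retries on a transient error with
# exponential backoff, and re-raises once attempts are exhausted.
import time

class SimpleRetry:
    def __init__(self, attempts=3, initial=0.0, multiplier=2.0):
        self.attempts = attempts
        self.initial = initial
        self.multiplier = multiplier

    def __call__(self, func):
        def wrapper(*args, **kwargs):
            delay = self.initial
            for attempt in range(1, self.attempts + 1):
                try:
                    return func(*args, **kwargs)
                except ConnectionError:
                    if attempt == self.attempts:
                        raise  # out of attempts, propagate the error
                    time.sleep(delay)
                    delay *= self.multiplier
        return wrapper


calls = []

@SimpleRetry(attempts=3)
def flaky_upload():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient")
    return "uploaded"

result = flaky_upload()
print(result)  # → uploaded (succeeds on the third attempt)
```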

Successfully merging this pull request may close these issues.

Set a GSPath upload timeout