Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FTP support to jobstores #5142

Merged
merged 28 commits into from
Nov 22, 2024
Merged

Add FTP support to jobstores #5142

merged 28 commits into from
Nov 22, 2024

Conversation

stxue1
Copy link
Contributor

@stxue1 stxue1 commented Oct 25, 2024

Closes #5134

Also turns all url_exists requests and size requests to HEAD requests instead, as mentioned here: #5114 (comment)

In my local testing, this significantly improves FTP performance. (Previously an existence check on a small text file on an insecure FTP server took around 5 seconds, though I'm unsure why it would take so long originally)

Changelog Entry

To be copied to the draft changelog by merger:

  • Added proper FTP support for jobstores
  • URL existence and size gets/checks are now done with HEAD requests

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

"""
FTP access with upload.

Taken and modified from https://github.com/ohsu-comp-bio/cwl-tes/blob/03f0096f9fae8acd527687d3460a726e09190c3a/cwl_tes/ftp.py#L8
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The license at cwl-tes is also Apache 2.0. Should I add a section to the Toil license that specifically states all code under this class is under cwl-tes's Apache 2.0 license?

Copy link
Contributor

@mr-c mr-c Oct 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to repeat the license.

If there are code quotes, then you could mention the copyright of the original authors / copyright holders:

Copyright 2018 EMBL-EBI <[email protected]> (Michael R. Crusoe's original implementation, written under contract)
Copyright 2019 Adam Struck <[email protected]>
Copyright 2020 Joris Vankerschaver <[email protected]>
Copyright 2020 Dab Brill <[email protected]>
Copyright 2020 Manabu ISHII <[email protected]>

Otherwise if this implementation is different enough, you don't need to mention the authors.

I suggest this metric: if we wanted to change the license of Toil, would you have to get permission from these copyright holders because the code is essentially the same? Or is the code so adapted or different for other reasons that it is a new work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One approach is just add their Apache 2.0 license's copyright line to Toil's LICENSE file along with the others.

You could also drop it in a different file and import it to here, and in that file have Copyright 2017 Oregon Health and Science University and the copyright boilerplate we use for our files. I think.

Also the #8 at the end of this link is pointing to the wrong line number.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's easier to do the latter by putting the code in a separate file, so I'll do that

@stxue1 stxue1 changed the title Issues/5134 ftp get size Add FTP support to jobstores Oct 25, 2024
Copy link
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need some more documentation about how FTP-over-SSL (which I think is rather unusual) is being used here, and how we're handling non-SSL FTP servers vs. those without certificates (which themselves might actually be downgrade attacks).

We need to write to a user who doesn't necessarily know that FTP can happen over SSL.

"""
FTP access with upload.

Taken and modified from https://github.com/ohsu-comp-bio/cwl-tes/blob/03f0096f9fae8acd527687d3460a726e09190c3a/cwl_tes/ftp.py#L8
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One approach is just add their Apache 2.0 license's copyright line to Toil's LICENSE file along with the others.

You could also drop it in a different file and import it to here, and in that file have Copyright 2017 Oregon Health and Science University and the copyright boilerplate we use for our files. I think.

Also the #8 at the end of this link is pointing to the wrong line number.

Comment on lines 2000 to 2002
def __init__(
self, cache: Optional[dict[Any, ftplib.FTP]] = None, insecure: bool = False
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have docstrings for this and all the other functions in the class, ideally. It's especially important with a parameter called insecure; what is not being secured, and what is it not being secured from? Does it mean we're vulnerable to password sniffing and someone tampering with the FTP data in transit, or are we now executing arbitrary code from every file we download, or what?

user = env_user
if env_passwd:
passwd = env_passwd
ftp.login(user or "", passwd or "", secure=not self.insecure)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this using an undocumented secure parameter? What does that do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ftp in this case is an FTP_TLS object, which supports connecting over TLS. It's not on the official FTP_TLS python documentation, nor is there an associated docstring, but the code itself simply sets up an SSL connection to the FTP server

docs/appendices/environment_vars.rst Outdated Show resolved Hide resolved
Copy link
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should cut the feature where we require user approval for an insecure FTP connection. (We might also want to then cut/flag off .netrc support?) If we do implement that, we also need to implement it for the actual requests, not just pass off to urllib.

I also think we should address the cached FTP connections that seem to live forever. Maybe we want them to time out somehow? Or maybe we can just bypass the cache or not store a persistent FtpFsAccess in a class-level variable.

Comment on lines 1957 to 1959
cls._setup_ftp()
# mypy is unable to understand that ftp must exist by this point
assert cls.ftp is not None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you structured this as a method that returns an FtpFsAccess and not one that has a side effect of setting a class-level field, you wouldn't need all these asserts and you also wouldn't need to remember two member names.

Comment on lines 1877 to 1880
ftp = None

@classmethod
def _setup_ftp(cls) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's weird that the ftp member is public, but the function that must be called before it is actually used has an underscore prefix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be better placed in src/toil/lib.

@@ -1872,15 +1874,28 @@ class JobStoreSupport(AbstractJobStore, metaclass=ABCMeta):
stores.
"""

ftp = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep a class-level FtpFsAccess instance that is never destroyed, and the FtpFsAccess instance caches open FTP connections to servers, when do we close our FTP connections? It might be never, and that's probably not what we want.

Comment on lines 43 to 46
:param cache: cache of generated FTP objects
:param insecure: Whether to connect over FTP with TLS
"""
self.cache = cache or {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to be able to turn off using a connection cache at all, if we're going to make an FtpFsAccess that is never destroyed.

user = env_user
if env_passwd:
passwd = env_passwd
ftp.login(user or "", passwd or "", secure=not self.insecure)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so according to the obviously reliable https://tedboy.github.io/python_stdlib/_modules/ftplib.html#FTP_TLS.login what secure does here is call auth() if you haven't already set up SSL/TLS on the connection, before doing the login.

So it looks like unless the user sets TOIL_FTP_USE_SSL=False themselves, they're going to get an error here whenever they use an ftp:// URL, unless the server actually supports SSL/TLS.

One server people might want to use is ftp://ftp-trace.ncbi.nlm.nih.gov. I've just checked and it doesn't support FTP over SSL/TLS.

Another thing we don't do is call the prot_p() method on the connection to actually secure the data channel. And later we fall back to urlopen() and give it the full FTP URL with username and password, and I don't think that urlopen() can use or enforce SSL. We'd have to tinker with urlopen's handler as recommended in https://stackoverflow.com/a/73171311.

So I think that by default we should allow unencrypted FTP connections without user intervention to accept them. That's the current behavior.

If we want to change Toil to enforce secure FTP by default, then we also need to enforce security on the actual reads. And we probably would need a way to individually allow particular insecure servers?

Copy link
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The promised SSL enforcement doesn't actually work yet because of the fallback to urlopen().

@@ -1872,18 +1874,29 @@ class JobStoreSupport(AbstractJobStore, metaclass=ABCMeta):
stores.
"""

@classmethod
def _setup_ftp(cls) -> FtpFsAccess:
return FtpFsAccess(insecure=strtobool(os.environ.get('TOIL_FTP_USE_SSL', 'False')) is False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it looks like we don't reuse FTP connections at all. But I guess that's fine? There's nothing like a time-based @memoize; we'd need a thread to watch it and that would be a pain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can leave a comment that this doesn't reuse FTP connections, and that implementing it later may be a good idea

"""
if "r" in mode:
host, port, user, passwd, path = self._parse_url(fn)
handle = urlopen("ftp://{}:{}@{}:{}/{}".format(user, passwd, host, port, path))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't respect TOIL_FTP_USE_SSL.

If we implemented open() on top of https://docs.python.org/3/library/ftplib.html#ftplib.FTP.retrbinary and supporting files larger than memory, we'd need some kind of pipe file object and a thread to feed the data into it.

We have some pipe implementations over in

class ReadablePipe(ABC):
but they insist on driving the control flow to simplify cleanup; a real open() needs to produce a file object that Just Works and have all the cleanup happen automatically when it goes out of scope and is closed. We might need to use closing() to help with that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed to put into another issue as it appears to be too large of an issue

@stxue1 stxue1 mentioned this pull request Nov 20, 2024
@adamnovak adamnovak self-requested a review November 22, 2024 18:08
Copy link
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the SSL enforcement feature I think this looks OK.

@adamnovak adamnovak merged commit f3e98db into master Nov 22, 2024
3 checks passed
@adamnovak adamnovak deleted the issues/5134-ftp-get-size branch November 22, 2024 21:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add FTP support to AbstractJobStore._get_size
3 participants