Handling Multi-Factor Authentication for workers #58

Open
davidwaroquiers opened this issue Jan 30, 2024 · 10 comments

Comments

@davidwaroquiers
Member

Following a question from @JaGeo, I am opening an issue here on the MFA topic. Let's gather ideas, information, existing solutions, and problems related to the fact that clusters are slowly (or maybe rapidly?) moving to MFA.
@gpetretto did some tests (could you maybe summarize the insights here?)

@ml-evs
Member

ml-evs commented Jan 30, 2024

One "simple" approach that might work here (and that I have seen suggested on some machines) is to use SSH multiplexing, so that once an authenticated connection is created (by the user, with TOTP or whatever), then the connection is kept open and all further connections within the session go through it. This is handled with a simple socket file, so Paramiko/Fabric etc. should just seamlessly work too. This would require some structure whereby JFR notifies the user that a new TOTP code is required to keep monitoring jobs, and will depend on how stringently machines enforce these timeouts (in practice, the timeout can be infinite...). The relevant SSH config parameters are ControlPersist and ControlMaster (see https://www.man7.org/linux/man-pages/man5/ssh_config.5.html and e.g. the suggestions in the Cambridge docs https://docs.hpc.cam.ac.uk/hpc/user-guide/mfa.html#reducing-the-effort-of-mfa-connection-sharing).
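As a rough illustration of the multiplexing idea, here is a minimal Python sketch that builds `ssh` command lines sharing one control socket. The helper names, the socket path, and the `8h` persistence value are assumptions made for this example; `ControlMaster`, `ControlPath`, `ControlPersist` and `ssh -O check` are standard OpenSSH features. It shells out to the `ssh` binary and is not something jobflow-remote currently does.

```python
import subprocess
from typing import List

# Assumed location of the control socket; %r, %h and %p are expanded by ssh
# to the remote user, host and port.
DEFAULT_CONTROL_PATH = "~/.ssh/cm-%r@%h:%p"


def multiplex_options(control_path: str = DEFAULT_CONTROL_PATH,
                      persist: str = "8h") -> List[str]:
    """Return the ssh options that enable connection sharing."""
    return [
        "-o", "ControlMaster=auto",            # reuse a master, or become one
        "-o", f"ControlPath={control_path}",   # socket file shared by all sessions
        "-o", f"ControlPersist={persist}",     # keep the master alive in the background
    ]


def ssh_command(host: str, remote_cmd: str,
                control_path: str = DEFAULT_CONTROL_PATH) -> List[str]:
    """Build an ssh command that goes through the shared connection."""
    return ["ssh", *multiplex_options(control_path), host, remote_cmd]


def master_is_alive(host: str,
                    control_path: str = DEFAULT_CONTROL_PATH) -> bool:
    """Ask ssh whether the master connection for this host is still running."""
    result = subprocess.run(
        ["ssh", "-O", "check", "-o", f"ControlPath={control_path}", host],
        capture_output=True,
    )
    return result.returncode == 0
```

The user would open the first connection by hand (entering the TOTP), and a tool could call something like `master_is_alive` before each batch of commands to decide whether to ask for a new code.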

@ml-evs
Member

ml-evs commented Jan 30, 2024

Also related is the previous discussion at Matgenix/qtoolkit#14

@JaGeo
Collaborator

JaGeo commented Jan 31, 2024

Would this also require a setup without a password? We currently have both a password and MFA. Using a key pair does not help with the password.

@gpetretto
Contributor

I think in general it would not require a password (at least from my tests). The problem is that, unfortunately, multiplexing does not seem to be supported by paramiko (yet): paramiko/paramiko#852.
Until support is added, I am afraid we will have to deal with this issue in some other way.

Since jobflow-remote is meant to keep the connection to the host open, the OTP could be passed once, when the Runner starts.
The connection can still be closed, and it would indeed be good if the user could be notified when that happens. However, I am not sure there is a convenient way of doing that.
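One possible shape for such a notification, as a hedged sketch only: poll the connection state and call a callback once it is reported closed. The function name `watch_connection` and its parameters are hypothetical. The sketch relies only on an `is_connected` attribute, which fabric's `Connection` exposes as a property; that property reflects the local transport state, so a dropped link may only be noticed with some delay.

```python
import time
from typing import Callable, Optional


def watch_connection(conn,
                     notify: Callable[[str], None],
                     interval: float = 30.0,
                     max_checks: Optional[int] = None,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll conn.is_connected and notify once the connection is closed.

    Returns False if the connection was found closed, True if max_checks
    polls passed with the connection still open.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if not conn.is_connected:
            # The message text is illustrative only.
            notify("Connection closed: restart the Runner and provide a new OTP.")
            return False
        checks += 1
        sleep(interval)
    return True
```

The callback could print to the log, send an e-mail, or anything else; how the user is actually reached is left open here.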

@ml-evs
Member

ml-evs commented Jan 31, 2024

As an aside, I've just pushed #60 which can be used as a test bed for some of these approaches (both by manually building and launching the MFA-enabled Slurm container and testing locally, and by the eventual full JFR automation...). For now we should at least add clear error messages and a docs page about this until we have a real solution.

@ml-evs
Member

ml-evs commented Jan 31, 2024

Also, I'm going to assume this isn't the case for the supercomputers in question, but at least for the google-authenticator-libpam implementation I'm using, secrets are stored in plain text in the user directory (so e.g., I can manually define a new emergency backup code for next login once I'm logged in). I'm going to take a wild guess that they are encrypted in production uses (with decryption key probably depending on supercomputer... but might simply be the user's password), so depending on the machine we might have some joy writing new encrypted backup codes that jobflow remote (alone) can use (e.g., at the start of each jobflow-remote "remote tick", generate, write and store a new TOTP emergency backup code, then use it for the next tick's login). Long shot perhaps!

Again, at least for Cambridge, resetting TOTP requires a video call where you show Government ID, which I assume we don't want to try to spoof 😅

@gpetretto
Contributor

I have a solution that "works" under certain conditions:

  • The user will need to provide an OTP when starting the Runner
  • Only one password should be passed when logging in. If two are needed (e.g. ssh password + OTP), the first one should be handled with an ssh key.
  • The Runner will not be able to reconnect if the connection drops, so it requires a stable connection and a server that does not kill connections often.
  • If the connection is killed the Runner needs to be restarted.
  • The Runner can only run with a single process, not in the split mode.

These conditions are relatively strict, but within them I have tested JFR on a simple VM with MFA based on Google Authenticator. Simply setting an OTP as the password in the project configuration file and starting the Runner immediately worked fine. Of course, if this proves to be an effective solution, we can add an option when starting the Runner to ease the process, e.g. jf runner start -otp 123456.
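Outside of jobflow-remote, the same trick can be sketched directly with fabric: the OTP is handed over as the `password` entry of `connect_kwargs`, which fabric forwards to paramiko. The helper names below are hypothetical and the digits-only check is an assumption about the token format. This is a sketch of the idea described above, not the code used in jobflow-remote.

```python
def otp_connect_kwargs(otp: str) -> dict:
    """Build the connect_kwargs that pass a one-time password as 'password'."""
    code = otp.strip()
    if not code.isdigit():
        # Assumption: the token is purely numeric, as with Google Authenticator.
        raise ValueError("OTP should contain only digits")
    return {"password": code}


def open_connection(host: str, user: str, otp: str):
    """Open a fabric Connection right away, while the OTP is still valid."""
    # Imported here so the helper above stays usable without fabric installed.
    from fabric import Connection

    conn = Connection(host, user=user, connect_kwargs=otp_connect_kwargs(otp))
    conn.open()  # connect immediately; a TOTP code expires within seconds
    return conn
```

Because the code is short-lived, the connection has to be opened right after the OTP is read, and then kept open for as long as the Runner lives.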

The restriction to a single password prompt is not strictly a paramiko limitation, but as far as I have seen it cannot be lifted in fabric with the built-in options. I would need a bit more time to check how to use the lower-level paramiko machinery to properly set up the fabric Connection in that case.

I agree that it would be better not to mess with the token generation. I suppose in some cases this could lead to a ban from the cluster.

@gpetretto
Contributor

As an update, I managed to create a fabric connection even with password+OTP. It is a bit involved, but it should be possible to implement in jobflow-remote if needed.
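The details of that approach are not given here, so the following is only a guess at what the paramiko side of a password+OTP login could look like: a keyboard-interactive handler that answers each server prompt. `Transport.auth_interactive(username, handler)` is paramiko's API, and the handler receives `(title, instructions, prompt_list)` with `prompt_list` holding `(prompt, echo)` pairs. Matching on the word "password" in the prompt text is an assumption that will not hold on every server, and the function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Prompt = Tuple[str, bool]  # (prompt text, whether input should be echoed)


def make_interactive_handler(password: str, otp: str) -> Callable[..., List[str]]:
    """Return a keyboard-interactive handler answering password and OTP prompts."""

    def handler(title: str, instructions: str,
                prompt_list: Sequence[Prompt]) -> List[str]:
        answers = []
        for prompt, _echo in prompt_list:
            # Assumption: the password prompt mentions "password"; any other
            # prompt (e.g. "Verification code:") is taken to ask for the OTP.
            if "password" in prompt.lower():
                answers.append(password)
            else:
                answers.append(otp)
        return answers

    return handler


def authenticate(transport, username: str, password: str, otp: str):
    """Run keyboard-interactive auth on an already started paramiko Transport."""
    return transport.auth_interactive(
        username, make_interactive_handler(password, otp)
    )
```

Plugging an authenticated transport into a fabric `Connection` is the "involved" part and is not covered by this sketch.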

@JaGeo
Collaborator

JaGeo commented Feb 2, 2024

I am still testing with the cluster support to see if the key-pair connection could at least allow for a passwordless connection. They think it should work but, in practice, it does not work yet... I will keep you updated.

@gpetretto
Contributor

Update on this topic. I have managed to implement a solution to address this issue. In case anyone else is interested it can be found in this branch: https://github.com/Matgenix/jobflow-remote/tree/interactive. I will merge it after testing it more.

The idea is the following: if an OTP needs to be provided when the daemon is started, the CLI will allow the user to attach to the daemon process and interact with it (through supervisor's "foreground" option). In this specific case the Runner will immediately try to connect to the remote host, and the user will be prompted for the password (if requested) and the OTP. This of course still has some of the limitations listed above:

  • The Runner will not be able to reconnect if the connection drops, so it requires a stable connection and a server that does not kill connections often.
  • If the connection is killed the Runner needs to be restarted.
  • The Runner can only run with a single process, not in the split mode. (This is not a strict limitation. I should be able to easily activate the option for the split runner, but this would require providing an OTP for each of the different processes. In addition, it would create more potential points of failure, since the runner would need to be restarted if the connection of any of the processes drops. If anyone is interested I can activate this option though.)

I should add that the administrators of one computing center told us that storing the secret locally (even encrypted) is not considered an acceptable procedure for them. So I am afraid that the main limitations will remain for the moment.


4 participants