Supervisor: Make ZeroMQ socket timeout configurable, and/or increase default timeout #5620

Closed
jmcphers opened this issue Dec 4, 2024 · 2 comments
Labels: area:kallichore (Issues related to the new kernel supervisor), area: workbench (Issues related to Workbench category)

Comments

@jmcphers
Collaborator

jmcphers commented Dec 4, 2024

Currently, the kernel supervisor waits up to 20 seconds for ZeroMQ socket connections. If it is not able to connect to the kernel after 20 seconds, it shows an error like this one:

Python 3.12.4 (Global) starting.
Python 3.12.4 (Global) failed to start up (exit code 130)

Timed out waiting to session's ZeroMQ sockets after 20 seconds

20 seconds is a long time to wait for startup, but some systems are pretty slow, and in our own environments (which aren't necessarily the slowest of any we would support) we've observed start times of > 15s even when everything is working correctly. See e.g. traces in #5340.

It would be great if there were some way for us to know whether the kernel was working correctly (and slowly) or legitimately hung when we are waiting for a socket connection. However, since the sockets are how we talk to the kernel in the first place, we would need to establish some sort of side channel or heuristic to figure this out.
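One possible heuristic, sketched here purely as an illustration and not as anything the supervisor currently does: while waiting for the sockets, also check whether the kernel process has already exited. A dead process can be reported immediately, while a live one may simply be slow. The function name `wait_or_detect_exit` and the `is_connected` callback are hypothetical; the callback stands in for whatever check reports that the sockets are connected.

```python
import subprocess
import time
from typing import Callable


def wait_or_detect_exit(
    proc: subprocess.Popen,
    is_connected: Callable[[], bool],
    timeout_secs: float = 30.0,
    poll_secs: float = 0.1,
) -> str:
    """Wait for a connection, but give up early if the process has exited.

    Returns "connected", "exited", or "timeout".
    """
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        if is_connected():
            return "connected"
        # poll() returns None while the process is still running.
        if proc.poll() is not None:
            return "exited"
        time.sleep(poll_secs)
    return "timeout"
```

This only separates "exited" from "still running"; a kernel that is alive but genuinely hung would still need the timeout to catch it.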

We should, at a minimum:

  • use a longer timeout by default (probably at least 30s)
  • make this timeout configurable so that environments with slower kernel startup times can wait longer
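The two items above could look roughly like the following. This is a minimal sketch under stated assumptions, not the supervisor's actual code: it uses a plain TCP connection from the Python standard library as a stand-in for a ZeroMQ socket, and the names `wait_for_port` and `DEFAULT_CONNECTION_TIMEOUT_SECS` are hypothetical.

```python
import socket
import time

# Assumed default, following the "at least 30s" suggestion in this issue.
DEFAULT_CONNECTION_TIMEOUT_SECS = 30.0


def wait_for_port(
    host: str,
    port: int,
    timeout_secs: float = DEFAULT_CONNECTION_TIMEOUT_SECS,
    retry_secs: float = 0.1,
) -> None:
    """Retry a TCP connection until it succeeds or timeout_secs elapses."""
    deadline = time.monotonic() + timeout_secs
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(
                f"Timed out waiting for {host}:{port} after {timeout_secs} seconds"
            )
        try:
            # Cap each attempt so one slow connect cannot overshoot the deadline by much.
            with socket.create_connection((host, port), timeout=min(remaining, 1.0)):
                return
        except OSError:
            # Nothing is listening yet (or the attempt timed out); retry shortly.
            time.sleep(min(retry_secs, remaining))
```

Callers in slower environments would pass a larger `timeout_secs`; everyone else gets the longer default without any configuration.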
@jmcphers added the area:kallichore label Dec 4, 2024
@juliasilge added this to the Release Candidate milestone Dec 9, 2024
@petetronic
Collaborator

Moving this up to 2025.01.0 - I keep running into this for new workspaces on PTD. I have to restart the runtime and hope it starts in less than 20 seconds.

@jmcphers added the area: workbench label Dec 10, 2024
jmcphers added a commit that referenced this issue Dec 10, 2024
Addresses #5620 by adding an option to specify the number of seconds to
wait for the kernel to connect before giving up. This change is mostly
on the supervisor side; the client changes here just add the new option
and pass it along to the supervisor.

<img width="507" alt="image"
src="https://github.com/user-attachments/assets/89ae4191-3865-4637-b301-e5660e61cd4d">

The default was 20 (hardcoded) and is now 30 (configurable).

While I was updating the API, I also added some methods/types that will
later support #5226 (but aren't currently implemented on the client or
server side).

### QA Notes

A really easy way to test this is to set the timeout at 1 second, since
most kernels do not start up that quickly.
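
The same idea can be illustrated outside Positron: when the startup takes longer than the configured timeout, the timeout path fires. This is only a simulation under assumptions; `fake_kernel_startup` and `start_with_timeout` are hypothetical helpers and do not correspond to anything in the supervisor.

```python
import asyncio


async def fake_kernel_startup(startup_secs: float) -> str:
    """Stand-in for a kernel that needs startup_secs before it is ready."""
    await asyncio.sleep(startup_secs)
    return "ready"


async def start_with_timeout(startup_secs: float, timeout_secs: float) -> str:
    """Return "ready", or a timeout message if startup outlasts the timeout."""
    try:
        return await asyncio.wait_for(
            fake_kernel_startup(startup_secs), timeout=timeout_secs
        )
    except asyncio.TimeoutError:
        return f"timed out after {timeout_secs} seconds"
```

Running `asyncio.run(start_with_timeout(2, 1))` exercises the timeout path, which mirrors setting the option to 1 second against a kernel that starts more slowly than that.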
@jonvanausdeln
Contributor

Verified Fixed

Positron Version(s) : 2025.01.0-71
OS Version(s) :

Test scenario(s)

Tested several intervals; all work as expected. A 1s timeout triggers the timeout error as expected.

Link(s) to TestRail test cases run or created:

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 27, 2024