Supervisor: Make ZeroMQ socket timeout configurable, and/or increase default timeout #5620

jmcphers · 2024-12-04T21:50:46Z

Currently, the kernel supervisor waits up to 20 seconds for ZeroMQ socket connections. If it is not able to connect to the kernel after 20 seconds, it shows an error like this one:

Python 3.12.4 (Global) starting.
Python 3.12.4 (Global) failed to start up (exit code 130)

Timed out waiting to session's ZeroMQ sockets after 20 seconds

20 seconds is a long time to wait for startup, but some systems are pretty slow, and in our own environments (which aren't necessarily the slowest of any we would support) we've observed start times of > 15s even when everything is working correctly. See e.g. traces in #5340.

It would be great if there were some way for us to know whether the kernel was working correctly (and slowly) or legitimately hung when we are waiting for a socket connection. However, since the sockets are how we talk to the kernel in the first place, we would need to establish some sort of side channel or heuristic to figure this out.
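One partial heuristic, sketched here purely as an illustration, is to watch the kernel process itself while waiting for the sockets. This only catches the case where the process has already exited; it cannot tell a slow kernel from a hung one that is still running. The names `wait_for_ready` and `is_ready` are hypothetical and are not part of the supervisor.

```python
import subprocess
import sys
import time
from typing import Callable


def wait_for_ready(proc: subprocess.Popen, is_ready: Callable[[], bool],
                   timeout_secs: float = 30.0,
                   poll_interval_secs: float = 0.05) -> str:
    """Wait for is_ready() to return True, failing fast if the process exits.

    Returns "ready", "exited (<code>)", or "timed out".
    """
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        if is_ready():
            return "ready"
        # poll() returns None while the process is still running.
        code = proc.poll()
        if code is not None:
            return f"exited ({code})"
        time.sleep(poll_interval_secs)
    return "timed out"
```

With a check like this, a kernel that crashes during startup is reported immediately instead of after the full timeout, while a process that is alive but silent still has to run out the clock.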

We should, at a minimum:

- use a longer timeout by default (probably at least 30s)
- make this timeout configurable so that environments with slower kernel startup times can wait longer
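As a rough sketch of the two points above, the wait loop below takes the timeout as a parameter and defaults it to 30 seconds. It is an assumption-laden stand-in: it probes plain TCP reachability with Python's standard `socket` module rather than performing a real ZeroMQ connection, and `wait_for_port` is a hypothetical name, not the supervisor's API.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_secs: float = 30.0,
                  poll_interval_secs: float = 0.1) -> float:
    """Block until a TCP connection to (host, port) succeeds.

    Returns the elapsed seconds on success; raises TimeoutError if the
    deadline passes first.
    """
    start = time.monotonic()
    deadline = start + timeout_secs
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(
                f"Timed out waiting for {host}:{port} "
                f"after {timeout_secs} seconds")
        try:
            # Cap each attempt so a single connect cannot overshoot the deadline by much.
            with socket.create_connection((host, port),
                                          timeout=min(remaining, 1.0)):
                return time.monotonic() - start
        except OSError:
            # Not listening yet (or attempt timed out); retry until the deadline.
            time.sleep(min(poll_interval_secs, max(remaining, 0.0)))
```

An environment with slow kernel startup could then call `wait_for_port(host, port, timeout_secs=60)` without any code change to the loop itself.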


petetronic · 2024-12-10T15:28:46Z

Moving this up to 2025.01.0 - I keep running into this for new workspaces on PTD. I have to restart the runtime and hope it starts in less than 20 seconds.

Addresses #5620 by adding an option to specify the number of seconds to wait for the kernel to connect before giving up. This change is mostly on the supervisor side; the client changes here just add the new option and pass it along to the supervisor.

The default was 20 (hardcoded) and is now 30 (configurable).

While I was updating the API, I also added some methods/types that will later support #5226 (but aren't currently implemented on the client or server side).

### QA Notes

A really easy way to test this is to set the timeout at 1 second, since most kernels do not start up that quickly.
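The client-side half of this change (read the option, fall back to the default, pass it along) might look roughly like the sketch below. The setting key `connectionTimeout`, the payload field `connection_timeout`, and both function names are hypothetical placeholders, not the actual Positron or supervisor identifiers; only the default of 30 seconds comes from the description above.

```python
# New default per the PR description (previously a hardcoded 20).
DEFAULT_CONNECTION_TIMEOUT_SECS = 30


def resolve_connection_timeout(settings: dict) -> int:
    """Return the configured timeout in seconds, or the default if unset."""
    # "connectionTimeout" is a hypothetical setting key used for illustration.
    value = settings.get("connectionTimeout", DEFAULT_CONNECTION_TIMEOUT_SECS)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
        raise ValueError(
            f"connection timeout must be a positive integer, got {value!r}")
    return value


def build_session_request(settings: dict) -> dict:
    """Build the (hypothetical) payload the client passes to the supervisor."""
    return {"connection_timeout": resolve_connection_timeout(settings)}
```

For the QA scenario, `build_session_request({"connectionTimeout": 1})` would produce a 1 second timeout, which most kernels should fail to meet.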

jonvanausdeln · 2024-12-12T17:57:06Z

Verified Fixed

Positron Version(s) : 2025.01.0-71
OS Version(s) :

Test scenario(s)

Tested several intervals; all work as expected. A 1s timeout triggers as expected.

Link(s) to TestRail test cases run or created:

jmcphers added the area:kallichore (Issues related to the new kernel supervisor) label Dec 4, 2024

juliasilge assigned jmcphers Dec 9, 2024

juliasilge added this to the Release Candidate milestone Dec 9, 2024

petetronic modified the milestones: Release Candidate, 2025.01.0 Pre-Release Dec 10, 2024

jmcphers added the area: workbench (Issues related to Workbench category.) label Dec 10, 2024

jmcphers mentioned this issue Dec 10, 2024

Add configurable session timeouts for supervisor #5693

Merged

jmcphers mentioned this issue Dec 11, 2024

Workbench: Positron Python console slow or fails to start #5340

Closed

jonvanausdeln closed this as completed Dec 12, 2024

github-actions bot locked as resolved and limited conversation to collaborators Dec 27, 2024
