Debug intermittent failure of servers to start in LEAP hub #2047
This gives the single user server process more time to start up, pending investigation in 2i2c-org#2047
#2048 will temporarily increase the start timeout to make launches more robust, as we try to figure out why in the world it takes so long to start up.
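For reference, this is the kind of setting such a bump touches. A minimal `jupyterhub_config.py` sketch, with the caveat that the exact knob and value changed in #2048 aren't reproduced here, so the names and numbers below are illustrative:

```python
# Illustrative only: the exact setting and value changed in #2048 are not
# reproduced here.
c = get_config()  # noqa: F821 -- provided by JupyterHub when loading config

# How long JupyterHub waits for the spawner (here, the user pod) to start.
c.Spawner.start_timeout = 600

# How long JupyterHub waits for the single-user server *inside* the pod to
# respond over HTTP once the pod is running; the default is 30 seconds,
# which is the limit being hit in this issue.
c.Spawner.http_timeout = 120
```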
Here are the debug logs for the user pod:
Not much of use here.
Here's the log output if I switch to using Jupyter Server instead of Notebook Server:
This is both better and worse.
Hmm, everything seems to happen twice here, which is confusing.
I just tried the same on a different hub without the pangeo image, and it started up in under a second. So something is screwy with some extension in use.
My intuition was correct - it was an extension being screwy. I discovered this by disabling all extensions and enabling them one by one. And it turns out the extension causing the blocking is... jupyterlab?!
You'll see this starts in under a second, as expected.
sigh.
I'm unfortunately unable to reproduce this locally, with the same version of jupyterlab, and the following:
:(
Something is calling
And bam, here's the issue raised in the nb_conda_kernels repo: anaconda/nb_conda_kernels#210, and a possible fix that's been sitting idle: anaconda/nb_conda_kernels#211.
Ok, my prior conclusion was wrong - while the issue should be fixed in nb_conda_kernels too, I don't know that that is the problem here. I have an strace of
ok, so a different conclusion I've come to now, based on looking at
So I modified the code at https://github.com/jupyter-server/jupyter_server/blob/a52709cdaff55eeb36774d3e60552b3535786e06/jupyter_server/extension/utils.py#L65 inside the container, wrapping the `importlib.import_module` call there to measure how long importing each server extension takes:

```python
import time  # added next to the existing imports in jupyter_server/extension/utils.py

start = time.perf_counter()
module = importlib.import_module(package_name)
end = time.perf_counter()
duration = end - start
print(f"Importing {package_name} took {duration:0.2f}s", flush=True)
```

With all server extensions disabled, here is the appropriate output:
So we have our two culprits - the xarray_leaflet and panel jupyter server extensions :) And jupyter server itself is a culprit too, since disabling extensions does not prevent their import and initialization!
With all the other extensions enabled, there wasn't much of a difference in startup time. However, with all of them enabled, panel added 5s and xarray-leaflet added almost 2.5s to startup time.
Verifiable separately:
compared to, say:
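Here is a rough sketch of that kind of standalone check, with `json` standing in as a fast baseline; the module names are just examples and this isn't necessarily the exact check used above:

```python
# Rough sketch: time each import in isolation. Module names are examples only.
# For honest numbers, run each import in a fresh interpreter, since shared
# dependencies are cached after the first import and would skew later timings.
import importlib
import time

for name in ("panel.io", "xarray_leaflet", "json"):
    start = time.perf_counter()
    importlib.import_module(name)
    print(f"import {name}: {time.perf_counter() - start:0.2f}s")
```

CPython's `-X importtime` flag (e.g. `python -X importtime -c "import panel.io"`) gives a per-module breakdown of where that time goes.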
So the true culprits here are xarray_leaflet and panel.io's server extension, which do a ton of imports of slow modules at import time, blocking the start of the jupyterhub singleuser server.
I opened holoviz/panel#4271 in panel |
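The general shape of that kind of fix is to defer the heavy imports until they are actually needed, rather than paying for them the moment the extension module is imported. A minimal, generic sketch of the pattern (illustrative only; this is not the actual panel or xarray_leaflet code, and `somebigpackage` is a made-up stand-in):

```python
# Before: a heavy dependency is imported as soon as the extension module is
# imported, which blocks server startup.
#
#   import somebigpackage  # hypothetical heavy dependency
#
#   def _load_jupyter_server_extension(server_app):
#       somebigpackage.setup(server_app)

# After: the heavy import is deferred until the extension is initialized
# (or until the first request that needs it).
def _load_jupyter_server_extension(server_app):
    import somebigpackage  # hypothetical heavy dependency, imported lazily
    somebigpackage.setup(server_app)
```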
Panel and Xarray-Leaflet are heavy enough imports that without .pyc files, they sometimes together take as much as 15s to import?! This causes jupyterhub to fail startup in some cases. The longer-term fix is in panel and xarray-leaflet (see holoviz/panel#4271, xarray-contrib/xarray_leaflet#79). In the meantime, leaving the .pyc files in place doesn't increase the image size by much, but definitely makes startup much faster! Ref 2i2c-org/infrastructure#2047
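For completeness, one way to guarantee the bytecode caches exist in an image at build time is to pre-compile everything once, so the first user import doesn't pay the parse-and-write cost. This is a sketch only, and not necessarily what pangeo-data/pangeo-docker-images#433 does (that PR may simply stop deleting the .pyc files):

```python
# Sketch: pre-compile installed packages to .pyc at image build time.
import compileall
import site

for path in site.getsitepackages():
    # quiet=1 suppresses per-file output; workers=0 uses all available cores.
    compileall.compile_dir(path, quiet=1, workers=0)
```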
Since server startup is definitely much faster when the xarray-leaflet extension isn't imported, I also opened xarray-contrib/xarray_leaflet#79 in xarray-leaflet.
Based on some detailed debugging in 2i2c-org/infrastructure#2047 to figure out why a Jupyter Server *process* sometimes takes more than 30s to start, I found that the primary culprit was server extensions that took multiple seconds just to import. Thanks to some ad-hoc patching (2i2c-org/infrastructure#2047 (comment)), I was able to figure out which extensions were slow. This PR emits extension import time as log messages, so this information is *much* more visible. I also explicitly chose info instead of debug, primarily because I believe it is *very* important to surface this performance information to users, so they can go bug the appropriate extension. Otherwise, it just feels like 'jupyter server is slow!'. This is compounded by the fact that while notebook server doesn't import *disabled* extensions, jupyter_server does seem to - so it's hard to isolate this.
I've also made this PR to jupyter_server to surface this performance information more easily, so we can see the culprits: jupyter-server/jupyter_server#1171
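A hedged sketch of the idea in that PR - the actual patch, helper name, and log wording may differ - is to move the ad-hoc `print()` timing into the extension import path itself and emit it at info level:

```python
# Sketch only: not the actual jupyter-server/jupyter_server#1171 diff.
import importlib
import time

def _import_extension_package(package_name, log=None):  # hypothetical helper name
    start = time.perf_counter()
    module = importlib.import_module(package_name)
    duration = time.perf_counter() - start
    if log is not None:  # the logger may not be set up yet, so guard against None
        log.info(f"Extension package {package_name} took {duration:0.2f}s to import")
    return module
```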
Will make Panel 0.14.3 available some time next week, which should reduce import time by at least half based on my testing.
Amazing, @philippjfr! Thank you for your prompt work :) I also opened jupyter-server/jupyter_server#1177 to get jupyter-server to stop importing Python packages if they are disabled.
All this work upstream of us is beautiful, @yuvipanda!
Here's a PR addressing the issue of loading disabled extensions: jupyter-server/jupyter_server#1180
@jcrist suggested in https://hachyderm.io/@jcristharif/109707037385576616 that the performance issue might be the write time of .pyc files, not the parsing itself - and suggested a method to test this. So I did!

No .pyc files, no env var:

The .pyc files are created during the first invocation, so the second invocation is about 360% faster.

No .pyc files, but with the environment variable set:

I validated that no .pyc files are created, and the second invocation is about the same speed. So from this, while not creating .pyc files at all does help with startup performance (12s vs 18s), it's still a lot more than with .pyc files present (~4s). I checked the size of the .pyc files created, and that's 17MB in this case. I think shipping 17MB is a reasonable trade-off here...
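Here's a rough sketch of that kind of comparison, in case anyone wants to repeat it; the exact commands from the thread aren't reproduced here, and `panel` just stands in for any heavy package:

```python
# Sketch: compare first/second import times with and without writing .pyc files.
import subprocess
import sys
import time

def timed_import(module: str, write_pyc: bool) -> float:
    """Import `module` in a fresh interpreter and return the wall-clock time.
    Passing -B is equivalent to setting PYTHONDONTWRITEBYTECODE=1."""
    args = [sys.executable] + ([] if write_pyc else ["-B"]) + ["-c", f"import {module}"]
    start = time.perf_counter()
    subprocess.run(args, check=True)
    return time.perf_counter() - start

# (For a clean comparison, delete any existing __pycache__ directories for the
# package before each scenario.)
for write_pyc in (True, False):
    first, second = (timed_import("panel", write_pyc) for _ in range(2))
    print(f"write_pyc={write_pyc}: first={first:0.1f}s second={second:0.1f}s")
```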
Thanks for the writeup Yuvi! Turns out my intuition was wrong (as it often is). Your conclusion to not strip the .pyc files sounds right to me.
Yay ty @jcrist :) Would you consider adding a note to your blog post about the tradeoff for startup time with .pyc files?
To mark this issue as completed, I'd want to bump the LEAP hub image to include all the fixes we have in place. I opened pangeo-data/pangeo-docker-images#433 to include xarray-leaflet. I don't think bumping jupyter_server is required, but it would be good to get panel in there.
@yuvipanda amazing work and writeup!! I checked the checkboxes in the top post for what was completed, and communicated back to Julius, as the LEAP hub now uses the latest docker image, which includes xarray-leaflet etc.
Is this actually needed to close this one? |
In https://2i2c.freshdesk.com/a/tickets/365, it was reported that some LEAP hub users were intermittently failing to start their servers. Looking at the hub logs, I discovered a number of:
at relevant times.
Upon looking at the user's pod logs (using techniques described in #2046), I see at the matching timestamp:
This means the pod started, but the notebook process inside the pod hadn't started in time, and the 'Deleting pod' sent the appropriate signal to the user pod to shut down. So this isn't a spawn timeout but an http timeout.
In the user logs, the line preceding the 'received signal 15' is:
If I look at the logs from a successful startup, I see:
So right after:
you see:
However, it took 12 seconds to arrive! That is a very long time, and the notebook server is not 'ready' until well after that. It is quite likely that sometimes it takes longer than 30s, and hence the timeout happens.
Ideally, all of this should take under a second, so it is unclear why things are taking as long as they are. My suspicion is that a notebook extension is doing some blocking code execution on startup, causing this - but I'm not sure how exactly to look for that.
Items to resolve support ticket:
- panel 0.14.3 released, and updated in pangeo-docker-images
- Observed to be resolved
- Done in Update LEAP hub images to latest pangeo image #2183

Items to resolve to close this: