hetzner-2i2c federation tracking #3173
Comments
#3174 increases replicas.

When I switched GESIS back to prime, some CSS and images stopped loading, I'm guessing because …
Increasing replicas does not seem to have eliminated the push timeout. We could try increasing the docker client API timeout if we could set the config `DockerEngine.extra_init_args={"timeout": 120}`, but the repo2docker CLI doesn't support general traitlets config (it tries to, but I don't think there is a way to accept traitlets args alongside repo2docker's positional args). The alternative is to pass a repo2docker_config file, which binderhub doesn't currently support, since it would have to be mounted via an additional volume. So there's really no way to pass repo2docker traitlets config in binderhub right now.
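For context, a minimal sketch of what such a repo2docker_config file could contain, if binderhub had a way to mount one into the build pods; the class and trait names are the ones mentioned above, while the file name and the mounting mechanism are assumptions:

```python
# repo2docker_config.py -- a standard traitlets config file (hypothetical here,
# since binderhub currently has no way to mount it into the build pod).
c = get_config()  # noqa: F821 -- injected by traitlets when the file is loaded

# Give the Docker client used for builds and pushes a longer API timeout.
c.DockerEngine.extra_init_args = {"timeout": 120}
```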
Various build failures for common images:

- one which I believe is OOMKiller; may need to bump the builder memory (sketch below)
- one with no other useful output (nice); possibly just needs an update, not sure, I don't know anything about R
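If the OOMKilled builds keep showing up, the knob to bump would presumably be binderhub's build memory settings. A sketch with hypothetical values follows; the option names are standard BinderHub traitlets, but the numbers are not taken from this deployment:

```python
# binderhub traitlets config (e.g. via the helm chart's extraConfig);
# `c` is the config object provided in that context. Values are hypothetical.
c.BinderHub.build_memory_request = "2G"  # memory requested for each build pod
c.BinderHub.build_memory_limit = "4G"    # hard cap; builds above this get OOMKilled
```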
The docker build cache also doesn't seem to be used at all: every build seems to start from scratch, no matter what. I'm not sure what's going on there, since with a warm cache retrying these failed pushes should be super quick.
Ah, apparently the traitlets arg problem is not a general issue, it is specific to the option I want to pass. If there is no space, it works:
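(The snippet was lost from this copy; presumably it was the same option with the space removed from the dict, something like the following.)

```
--DockerEngine.extra_init_args={"timeout":120}
```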
I'll give that a try.
Increasing the timeout doesn't seem to have solved the timeouts, it just increased the amount of time before the timeout is reported. It appears something is stuck, but I have no idea what, or how to debug it. I'm not sure it's a slowness issue either, since the same images consistently fail. Maybe it's a layer size limit somewhere.
The jupyterlab-demo image appears to be killed because the mamba solve actually uses too much memory. I believe this is because the repo uses an outdated version of everything and pins many direct dependencies but no transitive dependencies, meaning mamba has to work through a super complex solve, changing the version of every installed package in the env, on top of installing a very large env.
I think something might be a little funky in the registry, because it's consistently filling up at around 3GB/minute, which seems like a pretty wild rate of increase and is causing the image cleaner to be invoked a lot more often than one might expect.

Makes me think some cache isn't being re-used, but I'm not sure. It doesn't seem like the number of builds we are running should be eating up that much space that quickly.
jupyterhub/binderhub#1913 adds the option we need to skip cordoning during image cleaning |
Creating a single tracking issue, since there are likely to be various small issues.

After adding hetzner-2i2c to the federation (#3169), it started to get a lot of traffic. I probably should have started with a smaller weight and quota to ease it in, as coming in with a high weight and prime=True leads to a whole lot of builds with a … (note: prime=True means the quota is ignored if all federation members are full, so this is the member that gets traffic when everyone is overwhelmed. Probably not the best place to start!)
Problems encountered with the early builds:
`failed to create fsnotify watcher: too many open files` in logs

I addressed this following what I found in issues like kairos-io/kairos#2071, by running the corresponding sysctl commands and adding the same two lines to a new file, /etc/sysctl.d/99-mybinder.conf, so the change should survive a restart. This seemed to immediately reduce the fsnotify errors. (TODO: is this the right place to put it, or is there a k3s-specific config option for it?)
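The exact commands weren't preserved in this copy; based on the linked kairos issue they were presumably the inotify limits, along these lines (the specific values are an assumption, not copied from the original):

```sh
# Raise the inotify limits immediately (values assumed from kairos-io/kairos#2071):
sysctl -w fs.inotify.max_user_instances=8192
sysctl -w fs.inotify.max_user_watches=524288

# The same two settings, persisted in /etc/sysctl.d/99-mybinder.conf:
#   fs.inotify.max_user_instances=8192
#   fs.inotify.max_user_watches=524288
```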
timeout pushing to local registry

The localhost registry appears to be timing out due to load. The registry is quite single-threaded, I believe, so concurrent pushes may fail with timeouts if it is overwhelmed. This should alleviate as the build cache warms up. I'll investigate increasing the number of registry replicas, in case that helps.