hetzner-2i2c federation tracking #3173

Open
minrk opened this issue Jan 21, 2025 · 13 comments

minrk commented Jan 21, 2025

Creating a single tracking issue, since there are likely to be various small issues.

After adding hetzner-2i2c to the federation (#3169), it started to get a lot of traffic. I probably should have started with a smaller weight and quota to ease it in, as coming in with a high weight and prime=True leads to a whole lot of builds at once. (Note: prime=True means quota is ignored if all federation members are full, so this is the one that gets traffic when everyone is overwhelmed. Probably not best to start there!)
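To make the weight, quota, and prime interaction concrete, here is a minimal sketch of the selection rule as described in the note above. It is illustrative only and is not the actual federation redirector code: the `Member` class, `pick_member` function, and all weights and quotas are made-up values for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Member:
    """Hypothetical stand-in for one federation member."""
    name: str
    weight: int
    quota: int      # concurrent pods allowed before the member counts as full
    running: int    # pods currently running
    prime: bool = False

    @property
    def full(self) -> bool:
        return self.running >= self.quota


def pick_member(members, rng=random):
    """Pick by weight among members under quota; fall back to the prime member."""
    available = [m for m in members if not m.full]
    if available:
        weights = [m.weight for m in available]
        return rng.choices(available, weights=weights, k=1)[0]
    # Everyone is full: the prime member takes the traffic, ignoring its quota.
    for m in members:
        if m.prime:
            return m
    raise RuntimeError("all members are full and none is prime")


# Assumed numbers, chosen so that every member is at quota.
members = [
    Member("hetzner-2i2c", weight=60, quota=100, running=100, prime=True),
    Member("gesis", weight=40, quota=50, running=50),
]
print(pick_member(members).name)
```

With every member at quota, the prime member is always chosen, which is why a new member that starts as prime absorbs the overflow from the whole federation.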

Problems encountered with the early builds:

failed to create fsnotify watcher: too many open files in logs

I addressed this following what I found in issues like kairos-io/kairos#2071 by running:

sysctl -w fs.inotify.max_user_instances=8192
sysctl -w fs.inotify.max_user_watches=524288

and adding those two lines to a new file /etc/sysctl.d/99-mybinder.conf so they should survive a restart. This seemed to immediately reduce the fsnotify errors. (TODO: is this the right place to put it, or is there a k3s-specific config option for it?)
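To confirm the limits actually took effect (for example after a reboot), something like the following could be run on the node. It is a small sketch, assuming a Linux host where the inotify limits are exposed under /proc/sys/fs/inotify; the function name and the `base` parameter are only there for illustration and testing.

```python
from pathlib import Path

# Target values taken from the sysctl commands above.
TARGETS = {
    "max_user_instances": 8192,
    "max_user_watches": 524288,
}


def check_inotify_limits(base="/proc/sys/fs/inotify", targets=TARGETS):
    """Return {name: (current, wanted)} for every limit below its target."""
    too_low = {}
    for name, wanted in targets.items():
        path = Path(base) / name
        try:
            current = int(path.read_text().strip())
        except (OSError, ValueError):
            # Not Linux, or the file is unreadable: report the value as unknown.
            current = None
        if current is None or current < wanted:
            too_low[name] = (current, wanted)
    return too_low


if __name__ == "__main__":
    for name, (current, wanted) in check_inotify_limits().items():
        print(f"fs.inotify.{name} = {current}, want >= {wanted}")
```

An empty result means both limits are at or above the values set here.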

timeout pushing to local registry

The localhost registry appears to be timing out due to load. The registry is quite single-threaded, I believe, so concurrent pushes may fail with timeouts if it is overwhelmed. This should ease as the build cache warms up. I'll investigate increasing the number of registry replicas, in case that helps.

minrk commented Jan 21, 2025

#3174 increases replicas

When I switched GESIS back to prime, some CSS and images stopped loading, I'm guessing because /binder/ is being added to the path twice for the proxied requests or something. Something to fix in the federation redirect, I suspect. #3174 restores 2i2c to prime so it can serve static resources, but keeps the quota and weight low to limit sudden traffic.

minrk commented Jan 21, 2025

Increasing replicas does not seem to have eliminated the push timeout.

We could try increasing the docker client api timeout if we could set the config:

DockerEngine.extra_init_args={"timeout": 120}

but the repo2docker CLI doesn't support general traitlets config (it tries to, but the repo and cmd args swallow all unrecognized CLI args, so traitlets_args can only ever be empty).

I don't think there is a way to get traitlets args and have positional args for repo and cmd at the same time; they are mutually exclusive.

The alternative is to pass it via a repo2docker_config file, which binderhub doesn't currently support, since it would need to be via an additional volume. So there's really no way to pass repo2docker traitlets config in binderhub right now.

minrk commented Jan 21, 2025

Various build failures for common images:

The command '/bin/sh -c TIMEFORMAT='time: %3R' bash -c 'time ${MAMBA_EXE} env update -p ${NB_PYTHON_PREFIX} --file ".binder/environment.yml" && time ${MAMBA_EXE} clean --all -f -y && ${MAMBA_EXE} list -p ${NB_PYTHON_PREFIX} '' returned a non-zero code: 137

which I believe is the OOM killer. May need to bump the builder memory.
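For reference, the 137 can be decoded with the usual shell convention that an exit status above 128 means "terminated by signal (status - 128)". The helper below is a small sketch with a made-up function name. Signal 9 is SIGKILL on Linux, which is what the OOM killer sends, though SIGKILL alone does not prove it was the OOM killer; the kernel log would confirm that.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a shell-style exit status; values above 128 mean 'killed by a signal'."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))  # 137 - 128 = 9
```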

The command '/bin/sh -c apt-get update > /dev/null && apt-get install --yes --no-install-recommends         libclang-dev         libzmq3-dev > /dev/null && wget --quiet -O /tmp/r-4.3.deb     https://cdn.rstudio.com/r/ubuntu-$(. /etc/os-release && echo $VERSION_ID | sed 's/\.//')/pkgs/r-4.3_1_amd64.deb && apt install --yes --no-install-recommends /tmp/r-4.3.deb > /dev/null && rm /tmp/r-4.3.deb && apt-get -qq purge && apt-get -qq clean && rm -rf /var/lib/apt/lists/* && ln -s /opt/R/4.3/bin/R /usr/local/bin/R && ln -s /opt/R/4.3/bin/Rscript /usr/local/bin/Rscript && R --version' returned a non-zero code: 8

No other useful output (nice). Possibly it just needs an update; I'm not sure, as I don't know anything about R.

minrk commented Jan 21, 2025

The docker build cache also doesn't seem to be used at all, as every build seems to start from scratch, no matter what. I'm not sure what's going on there, as these failed pushes should be super quick.

minrk commented Jan 21, 2025

ah, apparently the traitlets arg is not a general issue, it is specific to the option I want to pass. If there is no space, it works:

'--DockerEngine.extra_init_args={"timeout":120}'

I'll give that a try
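A small sketch of building that argument programmatically, so the JSON never picks up a space after the colon or comma. The helper name is made up for illustration; only the option name and value come from the comment above.

```python
import json


def traitlets_dict_arg(option: str, value: dict) -> str:
    """Format a dict-valued traitlets option as a single CLI argument.

    Compact separators keep json.dumps from inserting spaces between items.
    String values that themselves contain spaces are not handled here.
    """
    compact = json.dumps(value, separators=(",", ":"))
    return f"--{option}={compact}"


arg = traitlets_dict_arg("DockerEngine.extra_init_args", {"timeout": 120})
print(arg)
```

The default `json.dumps` output would be `{"timeout": 120}`, with a space, which is the form that did not work.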

minrk commented Jan 21, 2025

Increasing the timeout doesn't seem to have solved the timeouts, just increased the amount of time before the timeout is reported. It appears something is stuck, but I have no idea what or how to debug it. I'm not sure it's a slowness issue either, since the same images consistently fail. Maybe it's a layer size limit somewhere.

minrk commented Jan 21, 2025

The jupyterlab-demo image appears to be killed because the mamba solve actually uses too much memory. I believe this is because the repo uses outdated versions of everything and pins many direct dependencies, but no transitive dependencies, meaning mamba has to go through a super complex solve that changes the version of every installed package in the env, plus installing a very large env.

@yuvipanda

#3179
#3181
#3182

these all helped, but ultimately I think the core of the issue is:

#3183

with those in place, there's no build backup. I've increased the quota to 250! But let's continue to keep an eye on it.

minrk commented Jan 22, 2025

I think something might be a little funky in the registry: it's consistently filling up at around 3GB/minute, which seems like a pretty wild rate of increase and is causing the image cleaner to be invoked a lot more often than one might expect.
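To put that rate in perspective, here is a quick back-of-the-envelope calculation. The 3 GB/minute figure is from the observation above; the 300 GB of free space is a hypothetical number picked only for the example, not the actual disk size.

```python
def minutes_until_full(free_gb: float, fill_rate_gb_per_min: float = 3.0) -> float:
    """Minutes until free_gb is consumed at a constant fill rate."""
    if fill_rate_gb_per_min <= 0:
        raise ValueError("fill rate must be positive")
    return free_gb / fill_rate_gb_per_min


rate = 3.0  # GB/minute, the figure reported above
print(f"{rate * 60:.0f} GB/hour")
# 300 GB is an assumed amount of free space, for illustration only.
print(f"{minutes_until_full(300, rate):.0f} minutes to fill 300 GB")
```

At a constant 3 GB/minute that is 180 GB per hour, so the cleaner would need to reclaim space on that order every hour just to keep up.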

minrk commented Jan 22, 2025

Makes me think some cache might not be getting re-used, I'm not sure. But it doesn't seem like the number of builds we are running should be eating up that much space that quickly.

minrk commented Jan 23, 2025

jupyterhub/binderhub#1913 adds the option we need to skip cordoning during image cleaning

minrk commented Jan 23, 2025

hetzner is reporting over 100 requests waiting on a build (meaning open EventStream connections while a build is running). I'm guessing this is a bug in the metric not always decrementing the counter and not a true number, but it could be another bug in the event stream code.

Image
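If the guess about the counter is right, the failure mode would look something like the sketch below: a gauge incremented when a request starts waiting, with the decrement skipped when the handler exits through an exception (for example, a client disconnecting mid-build). This is a hypothetical illustration with a stand-in `Gauge` class and made-up handler names, not the actual binderhub event stream code.

```python
from contextlib import contextmanager


class Gauge:
    """Minimal stand-in for a metrics gauge (hypothetical, for illustration)."""

    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1

    def dec(self):
        self.value -= 1


@contextmanager
def track_waiting(gauge):
    """Count one waiting request and always uncount it on the way out."""
    gauge.inc()
    try:
        yield
    finally:
        gauge.dec()


def leaky_handler(gauge):
    gauge.inc()
    raise ConnectionError("client disconnected mid-build")
    gauge.dec()  # never reached, so the gauge drifts upward


def safe_handler(gauge):
    with track_waiting(gauge):
        raise ConnectionError("client disconnected mid-build")


def run(handler):
    """Run one handler against a fresh gauge and return the final count."""
    gauge = Gauge()
    try:
        handler(gauge)
    except ConnectionError:
        pass
    return gauge.value


print(run(leaky_handler), run(safe_handler))
```

The leaky version leaves the gauge at 1 after the request is gone, while the context-manager version returns it to 0. If the metric is a prometheus_client Gauge, its `track_inprogress()` helper does the same bookkeeping.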
