hetzner-2i2c federation tracking #3173

minrk · 2025-01-21T10:23:03Z

Creating a single tracking issue since there are likely to be various small issues

After adding hetzner-2i2c to the federation (#3169), it started to get a lot of traffic. I probably should have started with a smaller weight and quota to ease it in, as coming in with a high weight and prime=True leads to a whole lot of builds. (Note: prime=True means quota is ignored if all federation members are full, so this is the one that gets traffic when everyone is overwhelmed. Probably not best to start there!)
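To make the weight/quota/prime interaction concrete, here is a minimal sketch of the behaviour described above. It is an illustrative model only, not the actual federation-redirect code: the `Member` class, its fields, and `pick_member` are hypothetical names, and weighted random choice is an assumption about how weight is applied.

```python
import random
from dataclasses import dataclass


@dataclass
class Member:
    """Hypothetical model of one federation member."""
    name: str
    weight: int          # relative share of traffic
    quota: int           # maximum concurrent user pods
    running: int         # pods currently running (illustrative value)
    prime: bool = False  # takes overflow traffic when everyone is full


def pick_member(members, rng=random):
    """Pick a member for the next launch (assumed selection logic)."""
    # Members that still have spare quota compete according to weight.
    available = [m for m in members if m.running < m.quota]
    if available:
        weights = [m.weight for m in available]
        return rng.choices(available, weights=weights, k=1)[0]
    # Everyone is full: the prime member is used regardless of its quota.
    for m in members:
        if m.prime:
            return m
    return None
```

In this model, a new member with both a high weight and prime=True receives most of the normal traffic and all of the overflow, which matches the burst described above.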

Problems encountered with the early builds:

`failed to create fsnotify watcher: too many open files` in logs

I addressed this following what I found in issues like kairos-io/kairos#2071 by running:

sysctl -w fs.inotify.max_user_instances=8192
sysctl -w fs.inotify.max_user_watches=524288

and adding those two settings to a new file /etc/sysctl.d/99-mybinder.conf so they should survive a restart. This seemed to immediately reduce the fsnotify errors. (TODO: is this the right place to put it, or is there a k3s-specific config option for it?)
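A quick way to check whether a node is still below these limits is to read the values back from /proc/sys. This is a minimal sketch: `read_sysctl` and `below_target` are hypothetical helper names, and the `root` parameter exists only so the check can be pointed at a different directory.

```python
from pathlib import Path

# Target values taken from the sysctl commands above.
DESIRED = {
    "fs.inotify.max_user_instances": 8192,
    "fs.inotify.max_user_watches": 524288,
}


def read_sysctl(key, root="/proc/sys"):
    """Read an integer sysctl value, e.g. fs.inotify.max_user_watches."""
    path = Path(root) / key.replace(".", "/")
    return int(path.read_text().strip())


def below_target(desired=DESIRED, root="/proc/sys"):
    """Return {key: current_value} for every setting below its target."""
    result = {}
    for key, target in desired.items():
        current = read_sysctl(key, root)
        if current < target:
            result[key] = current
    return result
```

An empty dict from `below_target()` after a reboot would suggest the /etc/sysctl.d file was picked up.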

timeout pushing to local registry

The localhost registry appears to be timing out due to load. The registry is quite single-threaded, I believe, so concurrent pushes may fail with timeouts if it is overwhelmed. This should ease as the build cache warms up. I'll investigate increasing the number of registry replicas, in case that helps.


minrk · 2025-01-21T10:38:32Z

#3174 increases replicas

When I switched GESIS back to prime, some CSS and images stopped loading, I'm guessing because /binder/ is being added to the path twice for the proxied requests or something. Something to fix in the federation redirect, I suspect. #3174 restores 2i2c to prime so it can serve static resources, but keeping the quota and weight low to limit sudden traffic.

minrk · 2025-01-21T11:29:20Z

Increasing replicas does not seem to have eliminated the push timeout.

We could try increasing the docker client api timeout if we could set the config:

DockerEngine.extra_init_args={"timeout": 120}

but the repo2docker cli doesn't support general traitlets config (it tries to, but the repo and cmd args swallow all unrecognized cli args, so traitlets_args can only ever be empty).

I don't think there is a way to get traitlets args and have positional args for repo and cmd at the same time; they are mutually exclusive.

The alternative is to pass via repo2docker_config file, which binderhub doesn't currently support, since it would need to be via an additional volume. So there's really no way to pass repo2docker traitlet config in binderhub right now.

minrk · 2025-01-21T11:32:10Z

Various build failures for common images:

jupyterlab-demo keeps failing with

The command '/bin/sh -c TIMEFORMAT='time: %3R' bash -c 'time ${MAMBA_EXE} env update -p ${NB_PYTHON_PREFIX} --file ".binder/environment.yml" && time ${MAMBA_EXE} clean --all -f -y && ${MAMBA_EXE} list -p ${NB_PYTHON_PREFIX} '' returned a non-zero code: 137

which I believe is the OOM killer. We may need to bump the builder memory.
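For reference, exit codes above 128 conventionally mean "terminated by signal (code - 128)", so 137 corresponds to signal 9. A small sketch of that decoding follows (`describe_exit_code` is a hypothetical helper). SIGKILL is consistent with the OOM killer but does not prove it on its own.

```python
import signal


def describe_exit_code(code):
    """Decode a shell-style exit code into a short description."""
    if code > 128:
        try:
            sig = signal.Signals(code - 128)
        except ValueError:
            # Not a known signal number on this platform.
            return f"exited with status {code}"
        return f"killed by {sig.name}"
    return f"exited with status {code}"


print(describe_exit_code(137))  # killed by SIGKILL
```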

R example

The command '/bin/sh -c apt-get update > /dev/null && apt-get install --yes --no-install-recommends         libclang-dev         libzmq3-dev > /dev/null && wget --quiet -O /tmp/r-4.3.deb     https://cdn.rstudio.com/r/ubuntu-$(. /etc/os-release && echo $VERSION_ID | sed 's/\.//')/pkgs/r-4.3_1_amd64.deb && apt install --yes --no-install-recommends /tmp/r-4.3.deb > /dev/null && rm /tmp/r-4.3.deb && apt-get -qq purge && apt-get -qq clean && rm -rf /var/lib/apt/lists/* && ln -s /opt/R/4.3/bin/R /usr/local/bin/R && ln -s /opt/R/4.3/bin/Rscript /usr/local/bin/Rscript && R --version' returned a non-zero code: 8

no other useful output (nice). Possibly just needs an update, not sure, I don't know anything about R.

minrk · 2025-01-21T11:33:58Z

The docker build cache also doesn't seem to be used at all, as every build seems to start from scratch, no matter what. I'm not sure what's going on there, as these failed pushes should be super quick.

minrk · 2025-01-21T11:38:17Z

ah, apparently the traitlets arg is not a general issue, it is specific to the option I want to pass. If there is no space, it works:

'--DockerEngine.extra_init_args={"timeout":120}'

I'll give that a try
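If the argument is generated programmatically, compact JSON separators avoid the space. This is a minimal sketch; `traitlets_cli_arg` is a hypothetical helper, not part of repo2docker or binderhub.

```python
import json


def traitlets_cli_arg(trait, value):
    """Build a --Class.trait=value argument with no spaces in the value."""
    # separators=(",", ":") drops the spaces json.dumps adds by default.
    compact = json.dumps(value, separators=(",", ":"))
    return f"--{trait}={compact}"


arg = traitlets_cli_arg("DockerEngine.extra_init_args", {"timeout": 120})
print(arg)  # --DockerEngine.extra_init_args={"timeout":120}
```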

minrk · 2025-01-21T12:35:18Z

increasing timeout doesn't seem to have solved the timeouts, just increased the amount of time before the timeout is reported. It appears something is stuck, but I have no idea what or how to debug. I'm not sure it's a slowness issue either, since the same images consistently fail. Maybe it's a layer size limit somewhere.

minrk · 2025-01-21T12:37:14Z

The jupyterlab-demo image appears to be killed because the mamba solve actually uses too much memory. I believe this is because the repo uses an outdated version of everything and pins many direct dependencies, but no transitive dependencies, meaning mamba has to go through a super complex solve of changing every version of every installed package in the env, plus installing a very large env.

yuvipanda · 2025-01-22T05:23:49Z

#3179
#3181
#3182

these all helped, but ultimately I think the core of the issue is:

#3183

with those in place, there's no build backup. I've increased the quota to 250! But let's continue to keep an eye on it.

yuvipanda · 2025-01-22T06:38:28Z

lots more at https://jupyter.zulipchat.com/#narrow/channel/469744-jupyterhub/topic/2i2c.20joining.20mybinder.2Eorg.20federation

minrk · 2025-01-22T09:54:09Z

I think something might be a little funky in the registry because it's consistently filling up at around 3GB/minute. That seems like a pretty wild rate of increase, and it is causing the image cleaner to be invoked a lot more often than one might expect.
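To put that rate in perspective, here is the arithmetic as a short sketch. The 500 GB of free space in the last line is a made-up figure for illustration, not the actual disk size.

```python
FILL_RATE_GB_PER_MIN = 3.0  # rate observed above


def gb_per_hour(rate_per_min=FILL_RATE_GB_PER_MIN):
    return rate_per_min * 60


def gb_per_day(rate_per_min=FILL_RATE_GB_PER_MIN):
    return rate_per_min * 60 * 24


def minutes_until_full(free_gb, rate_per_min=FILL_RATE_GB_PER_MIN):
    """Minutes before free_gb of space is consumed at the given rate."""
    return free_gb / rate_per_min


print(gb_per_hour())                       # 180.0
print(gb_per_day())                        # 4320.0
print(round(minutes_until_full(500), 1))   # 166.7 (hypothetical 500 GB free)
```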

minrk · 2025-01-22T09:56:25Z

Makes me think some cache might not be getting reused, I'm not sure. But it doesn't seem like the number of builds we are running should be eating up that much space that quickly.

minrk · 2025-01-23T10:10:15Z

jupyterhub/binderhub#1913 adds the option we need to skip cordoning during image cleaning

minrk · 2025-01-23T10:24:26Z

hetzner is reporting over 100 requests waiting on a build (meaning open EventStream connections while a build is running). I'm guessing this is a bug in the metric not always decrementing the counter and not a true number, but it could be another bug in the event stream code.
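As an illustration of the kind of bug suspected here, a gauge that is incremented without a guaranteed decrement drifts upward whenever a handler exits through an error or a dropped connection. The sketch below is not binderhub's code: `Gauge` is a small in-process stand-in for a metrics gauge, and `track_waiting` is a hypothetical helper showing the try/finally pattern that keeps the count honest.

```python
from contextlib import contextmanager


class Gauge:
    """Minimal stand-in for a metrics gauge (not a real metrics client)."""

    def __init__(self):
        self.value = 0

    def inc(self):
        self.value += 1

    def dec(self):
        self.value -= 1


@contextmanager
def track_waiting(gauge):
    """Count one waiting request for the duration of the with-block."""
    gauge.inc()
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, so the gauge cannot leak.
        gauge.dec()
```

If the real counter is decremented only on the success path, a count well above the number of live connections would be the expected symptom.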

minrk mentioned this issue Jan 21, 2025

ease 2i2c in more #3174

Merged
