Add a 2i2c federation member on Hetzner #3169

Merged: 25 commits into jupyterhub:main on Jan 21, 2025
Conversation

@yuvipanda commented Jan 17, 2025

https://2i2c.mybinder.org/ is up now!

https://github.com/2i2c-org/2i2c-org.github.io/pull/356/files#diff-7244b57e647732dd6a8f006bdf63943e1dcb813fa1a085073522ccf40e2cdfc6 has more context - that's also an announcement blog post. It came together quickly.

This is a single-node k3s cluster running on Hetzner. It's not yet as large as we'd like it to be, which is a CCX63 on https://www.hetzner.com/cloud: 48 vCPUs and 192 GB of RAM. With k3s, we can override the maximum number of pods on a node, and given the current memory guarantee of 450 MB per user, we can put approximately 400 pods on this one node! That works out to less than $1 / month per user of capacity, which is pretty good.
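
(Rough check with just the numbers above: 192 GB ÷ 450 MB ≈ 425 pods, so ~400 once some headroom is reserved for system components; at Hetzner's listed CCX63 monthly price, that comes out to under $1 per concurrent-user slot per month.)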

Still need to figure out:

  • Access for everyone else on the team
  • Resize the server to be big, and set up k3s again there from scratch + document (I simply followed the quickstart with traefik disabled)
  • Test prometheus (works fine: https://prometheus.2i2c.mybinder.org/)
  • Test Grafana
  • Add dashboards to Grafana
  • Add 2i2c to the list of supporters
  • Add this installation to the rotation with an appropriate weight
  • Network Policy doesn't seem to allow dind to talk to the ingress controller - need to figure that out (see the sketch after this list)
  • Make sure the maximum number of pods can be tweaked to be high enough to use the whole node
  • Add a little bit more documentation (but don't expect it to be perfect)
  • Fix the helm templates for the registry PVC so we only provision it when necessary (and not when using an object store as backing)
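
For the dind / ingress item, the fix is probably an egress rule along these lines. This is a minimal sketch only - the namespace, pod labels, and port below are assumptions, not values from the actual chart:

```yaml
# Hypothetical egress policy letting dind pods reach the ingress controller,
# which fronts the in-cluster registry. All selectors here are assumed.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dind-egress-to-ingress
  namespace: binder                # assumed namespace for the dind pods
spec:
  podSelector:
    matchLabels:
      app: dind                    # assumed label on the dind pods
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # assumed controller namespace
      ports:
        - protocol: TCP
          port: 443                # HTTPS, since the registry is exposed via the ingress
```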

I'm excited to try this out and see how it goes.

It currently uses quay.io for image storage, but we can eventually move to a local docker registry backed by Hetzner's new S3 service: https://www.hetzner.com/storage/object-storage/

Update: the registry is now a local deployment of CNCF Distribution (aka the docker registry), deployed via the chart in here. It's exposed as an Ingress for HTTPS (otherwise everything complains), but is only accessible with a strong password. We could try to restrict it at the ingress level so it can only be pulled from the local network, or figure out a custom cert situation - although that cert would need to be trusted both by binderhub (for push) and k8s (for pull), and it's nicer to let Let's Encrypt handle it. The images are currently stored on disk, which is fine to start, because the Hetzner instance type we'll end up using has about 960 GB of fast SSD space. Unfortunately the images are 'doubled' anyway, since we push and pull to the same disk (lol), but that's still better than pushing to quay.io and then pulling back. We can move this to the Hetzner S3 storage when it gets a little bigger.
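
For reference, switching Distribution between the two backends is just a swap of its storage config; a hedged sketch (the bucket, region, and endpoint below are placeholders, not the deployed values):

```yaml
# Sketch of a CNCF Distribution config for an S3-compatible backend
# such as Hetzner object storage. All values below are placeholders.
storage:
  s3:
    accesskey: <access-key>        # injected from a secret in practice
    secretkey: <secret-key>
    region: us-east-1              # S3-compatible stores often accept a dummy region
    regionendpoint: https://fsn1.your-objectstorage.com   # assumed endpoint format
    bucket: binder-registry        # hypothetical bucket name
# The filesystem backend it replaces is just:
#   storage:
#     filesystem:
#       rootdirectory: /var/lib/registry
```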

Thanks to @choldgraf, @colliand, @jmunroe and others at 2i2c for supporting me through this.

yuvipanda and others added 3 commits January 17, 2025 15:12
@yuvipanda yuvipanda requested review from manics and minrk January 17, 2025 23:28
@choldgraf commented Jan 18, 2025

It was so fast to spin this up! (Just tried binder-examples/requirements)

@yuvipanda commented:

I'm going to try to put the registry in-cluster as well, let's see.

@minrk commented Jan 18, 2025

This is awesome! I won't have time to monitor a deployment rollout until Monday, but this looks great.

@manics left a comment:

This is great! Do you want to merge this now?

Is your plan to figure out the automated deployment in a future PR, after you work out how to set up external K8s API access?

> This is awesome! I won't have time to monitor a deployment rollout until Monday, but this looks great.

I think it's fine to merge now and revert if necessary - this PR doesn't add a Hetzner GitHub deployment workflow, it only modifies the redirector to direct builds and make Hetzner the prime host.

```yaml
            claimName: registry
      containers:
        - name: registry
          image: registry:2.8.3
```
A reviewer commented:

In a future PR we should add this to the watch-dependencies workflow

```yaml
strategy:
  fail-fast: false
  matrix:
    include:
      - name: repo2docker
        registry: quay.io
        repository: jupyterhub/repo2docker
```

@manics commented Jan 18, 2025

Can you check the websocket timeout?
https://2i2c.mybinder.org/v2/gh/jupyterhub/jupyter-remote-desktop-proxy/HEAD?urlpath=desktop
The desktop reproducibly goes blank after ~70 seconds

@yuvipanda commented:

I've a tattoo appt all day, I'll get back to this once that's done!

@yuvipanda commented:

@manics I wonder if that's an nginx timeout that needs tuning.
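
If it's the nginx proxy timeout (the ~70 s blank-out roughly matches nginx's default 60 s proxy_read_timeout for idle websockets), the usual knob on ingress-nginx is per-Ingress annotations like these - a sketch, assuming the chart lets us set them:

```yaml
# Hypothetical: raise proxy timeouts (in seconds) on the Ingress that
# fronts user servers, so idle websockets aren't cut off after 60 s.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
```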

@yuvipanda commented:

@manics hmm, I can't seem to reproduce the timeout! There's no external load balancer here, just our nginx. Is it still happening to you?

@yuvipanda commented:

The current node is really small - only a CCX23. I'm waiting for a quota increase to make it bigger. I think it's worth waiting for that to happen, as a CCX23 has only 16 GB of RAM.

@rgaiacs commented Jan 18, 2025

This looks good to me. Thanks 2i2c! <3

@yuvipanda commented:

I've added an encrypted SSH key to give other team members SSH access!

@yuvipanda commented:

I've sent email invites to the Hetzner project to @minrk and @manics. @rgaiacs, if you share an email address with me, I can send one to you too!

@yuvipanda commented:

Poking to see if I can switch the registry to object storage already, while we wait for the quota increase. This would bring the instance much closer to being purely zero-state.

(Commit message: "Also actually make the registry read the config file - it was not doing that before.")
@yuvipanda commented:

Alright, now we use the Hetzner object storage as the storage backend for the registry! And so we run 2 replicas of the registry as well (the object store means no single-writer disk tying us to one replica) :)

I'm also leaving, in a comment, the small bit of config change required to keep using the filesystem backend. The goal here is to make it as easy as possible for people to join the federation - and with this, that's down to just '1 VM'.

@yuvipanda commented:

I started adding k3s docs!

@manics commented Jan 19, 2025

Are you missing a commit? secrets/hetzner-2i2c.yml is identical to secrets/ovh2-kubeconfig.yml

@manics commented Jan 19, 2025

I've got SSH access; copying the k3s.yaml kubeconfig file and editing the server IP gives me K8s access!
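
For anyone else doing this: k3s writes its kubeconfig to /etc/rancher/k3s/k3s.yaml with the API server pointed at the local loopback, so after copying it you just edit the server line (the IP below is a placeholder):

```yaml
# Local ~/.kube/config copied from /etc/rancher/k3s/k3s.yaml on the node.
clusters:
  - cluster:
      certificate-authority-data: <unchanged>
      server: https://203.0.113.10:6443   # was https://127.0.0.1:6443
    name: default
```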

@yuvipanda commented:

@manics ah yes - I was missing a commit. Added now.

@yuvipanda commented:

I validated that we can change the number of pods on a node by following https://stackoverflow.com/a/65899273. This node is currently set to a max of 250 pods, although it can't actually support that many (it is smol). This will be added to the documentation on how to set up k3s.
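
On k3s this boils down to passing a kubelet argument; roughly like this via the k3s config file (the path is k3s's standard one, but treat the contents as a sketch rather than the deployed file):

```yaml
# /etc/rancher/k3s/config.yaml - forward max-pods to the kubelet.
kubelet-arg:
  - "max-pods=250"
```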

A reviewer commented:

From the changes in deploy.py, I think this wants to be secrets/hetzner-2i2c-kubeconfig.yml

@rgaiacs commented Jan 20, 2025

@yuvipanda you can send the invitation to [email protected].

@yuvipanda commented:

The quota increase was approved! I've taken the existing node offline and am bringing up a new one. It will be done tonight.

@yuvipanda commented:

@rgaiacs done!

@yuvipanda commented:

Created the new server from scratch, rebuilt everything, and it's all good to go! I'm going to sleep though, so I'm happy for someone else to merge it (SSH keys are updated so you can debug if necessary); if not, I'll try to find time.

@minrk commented Jan 21, 2025

Awesome! Dealing with a plumber now, but I'll give it a go when I'm free in an hour or two, unless someone else is ready first.

@minrk commented Jan 21, 2025

/test-this-pr

just to make sure the registry validates and doesn't deploy to staging

@jupyterhub-bot commented:

This Pull Request is now being tested 🎉 See the test progress in GitHub Actions.

@jupyterhub-bot commented:

Job status: success
Branch 'test-this-pr/3169' has been deleted

@minrk commented Jan 21, 2025

Giving this a try!

@minrk merged commit a06b8b5 into jupyterhub:main on Jan 21, 2025
6 checks passed