Streamline deployment of GESIS stage server #3090

Closed
rgaiacs wants to merge 133 commits

rgaiacs
Collaborator

@rgaiacs rgaiacs commented Sep 6, 2024

This is related to #2797

The configuration in the ansible folder is working and GitLab CI at .gitlab-ci.yml is also working.

I'm trying to complete the Kubernetes cluster configuration in the Helm chart.

@rgaiacs rgaiacs self-assigned this Sep 6, 2024
@rgaiacs
Collaborator Author

rgaiacs commented Sep 6, 2024

@manics @sgibson91 @minrk could you help me understand which Helm chart configuration is being loaded by mistake? The binder pod crashes with the following log:

Loading /etc/binderhub/config/values.yaml
Loading extra config: 01-eventlog
Loading extra config: 01-template-variables
Loading extra config: 02-badge-base-url
Loading extra config: 02-event-loop-metric
[BinderHub] starting!
[BinderHub] WARNING | BinderHub.build_node_selector is deprecated, use KubernetesBuildExecutor.node_selector
[BinderHub] WARNING | BinderHub.build_docker_host is deprecated, use KubernetesBuildExecutor.docker_host
[W 240906 15:36:29 _metadata:139] Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
[W 240906 15:36:32 _metadata:139] Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: timed out
[W 240906 15:36:35 _metadata:139] Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
[W 240906 15:36:35 _default:338] Authentication failed using Compute Engine authentication due to unavailable metadata server.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.11/site-packages/binderhub/__main__.py", line 4, in <module>
    main()
  File "/usr/local/lib/python3.11/site-packages/traitlets/config/application.py", line 1074, in launch_instance
    app.initialize(argv)
  File "/usr/local/lib/python3.11/site-packages/binderhub/app.py", line 913, in initialize
    self.event_log = EventLog(parent=self)
                     ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/binderhub/events.py", line 51, in __init__
    self.handlers = self.handlers_maker(self)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 18, in _make_eventsink_handler
  File "/usr/local/lib/python3.11/site-packages/google/cloud/logging_v2/client.py", line 122, in __init__
    super(Client, self).__init__(
  File "/usr/local/lib/python3.11/site-packages/google/cloud/client/__init__.py", line 320, in __init__
    _ClientProjectMixin.__init__(self, project=project, credentials=credentials)
  File "/usr/local/lib/python3.11/site-packages/google/cloud/client/__init__.py", line 268, in __init__
    project = self._determine_default(project)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/cloud/client/__init__.py", line 287, in _determine_default
    return _determine_default_project(project)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/cloud/_helpers/__init__.py", line 152, in _determine_default_project
    _, project = google.auth.default()
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/google/auth/_default.py", line 691, in default
    raise exceptions.DefaultCredentialsError(_CLOUD_SDK_MISSING_CREDENTIALS)
google.auth.exceptions.DefaultCredentialsError: Your default credentials were not found. To set up Application Default Credentials, see https://cloud.google.com/docs/authentication/external/set-up-adc for more information.

GESIS runs the BinderHub server on bare metal.

Member

@manics manics left a comment

mybinder.org members send events to Google Stackdriver/Google Cloud Operations:

extraConfig:
  # Send Events to StackDriver on Google Cloud
  # This doesn't need any extra permissions, since the GKE nodes have
  # permission to write to StackDriver by default. We don't block access
  # to cloud metadata in binderhub pod, so this should 'just work'.
  01-eventlog: |
    import os
    import google.cloud.logging
    import google.cloud.logging.handlers
    from traitlets.log import get_logger
    # importing google cloud configures a root log handler,
    # which prevents tornado's pretty-logging
    import logging
    logging.getLogger().handlers = []
    class JSONCloudLoggingHandler(google.cloud.logging.handlers.CloudLoggingHandler):
        def emit(self, record):
            record.name = None
            super().emit(record)
    def _make_eventsink_handler(el):
        client = google.cloud.logging.Client()
        # These events are not parsed as JSON in stackdriver, so give it a different name
        # for now. Should be fixed in https://github.com/googleapis/google-cloud-python/pull/6293
        name = os.environ.get("EVENT_LOG_NAME") or "binderhub-events-text"
        get_logger().info("Sending event logs to %s/logs/%s", client.project, name)
        return [JSONCloudLoggingHandler(client, name=name)]
    c.EventLog.handlers_maker = _make_eventsink_handler

If you haven't disabled this in your existing deployment, you should have a secret eventsArchiver.serviceAccountKey.
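For a deployment that may not have Google credentials at all (a bare-metal server, for example), one option is to let the handler factory degrade gracefully instead of crashing BinderHub at startup. The following is only a minimal sketch and not what mybinder.org-deploy does: `make_safe_handlers_maker`, `client_factory` and `handler_factory` are hypothetical names, and it assumes that BinderHub's EventLog accepts an empty list of handlers.

```python
import logging

log = logging.getLogger("eventlog-config")


def make_safe_handlers_maker(client_factory, handler_factory):
    """Wrap a handlers_maker so a failing client disables the sink instead of crashing.

    client_factory: zero-argument callable that builds the logging client
        (for example google.cloud.logging.Client).
    handler_factory: callable that turns a client into a logging.Handler.
    Both parameters are hypothetical helpers for this sketch.
    """

    def _handlers_maker(event_log):
        try:
            client = client_factory()
        except Exception as exc:  # e.g. DefaultCredentialsError on bare metal
            log.warning("Event sink disabled, client could not be created: %s", exc)
            return []
        return [handler_factory(client)]

    return _handlers_maker


# Demonstration with stand-ins, so the sketch runs without the Google libraries.
def _failing_client():
    raise RuntimeError("Your default credentials were not found.")


maker = make_safe_handlers_maker(_failing_client, lambda client: logging.NullHandler())
handlers = maker(None)
print(handlers)  # []
```

In an extraConfig snippet the result would be assigned to c.EventLog.handlers_maker in place of _make_eventsink_handler. Catching every exception is deliberately broad here; a real configuration would probably narrow it to the credential error.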

I noticed some of your Ansible roles include configuration values that are specific to GESIS. I think we should move those into a GESIS-specific subfolder in case we want to use Ansible for other members in future.

@@ -0,0 +1,146 @@
"""Script to identify when Docker-in-Docker stop working."""
Member

The workaround of removing an incorrect DinD directory was added into BinderHub in
jupyterhub/binderhub#1828

metadata:
  name: {{ .Release.Name }}
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
Member

Is this needed? Isn't / the default?

@@ -0,0 +1,24 @@
{{- $ingressType := index .Values "ingress-nginx" "controller" "service" "type" }}
{{- if eq $ingressType "ClusterIP" }}
Member

Do you have two layers of ingress? Since this is specific to your setup I think we should make it an explicit flag rather than auto-configuring based on an internal config value. This will make maintenance easier across the multiple deployments.

@rgaiacs
Collaborator Author

rgaiacs commented Sep 10, 2024

Thanks @manics for the reply and comments. I was able to disable the attempt to contact Google Cloud with:

analyticsPublisher:
  enabled: false

The problem that I have is that all persistent volume claims are pending.

kubectl get -n gesis pvc
NAME                                        STATUS    VOLUME         CAPACITY   ACCESS MODES   STORAGECLASS   AGE
binderhub-grafana                           Pending                                                           24h
binderhub-harbor-jobservice                 Pending                                                           4d16h
binderhub-harbor-registry                   Pending                                                           4d16h
binderhub-prometheus-server                 Pending                                            standard       24h
data-binderhub-harbor-redis-0               Bound     alertmanager   5Gi        RWO                           4d16h
data-binderhub-harbor-trivy-0               Pending                                                           4d16h
database-data-binderhub-harbor-database-0   Pending                                                           4d16h
hub-db-dir                                  Pending                                                           24h

I know that I need to declare a correct persistent volume but I can't find where the persistent volume is declared for OVH or CurveNote. @manics can you point me to the persistent volume declaration? Thanks!
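On a cluster without a dynamic volume provisioner, a claim generally stays Pending until a PersistentVolume with a matching size, access mode and storage class exists. As a rough illustration only, the sketch below builds a statically provisioned hostPath volume manifest and prints it as JSON, which kubectl apply -f accepts. The volume name, path and size are hypothetical, hostPath volumes are only suitable for single-node testing, and the sketch says nothing about how OVH or CurveNote handle storage.

```python
import json


def static_pv_manifest(name, size, host_path, storage_class=""):
    """Build a PersistentVolume manifest (as a dict) backed by a directory on the node."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteOnce"],
            "persistentVolumeReclaimPolicy": "Retain",
            # An empty string means "no storage class".
            "storageClassName": storage_class,
            "hostPath": {"path": host_path},
        },
    }


# Hypothetical values, chosen only to make the example concrete.
manifest = static_pv_manifest("example-hub-db", "1Gi", "/mnt/example/hub-db")
print(json.dumps(manifest, indent=2))
```

A claim that requests a class, such as standard for binderhub-prometheus-server in the listing above, would only bind to a volume that declares the same class, so storage_class would have to be set accordingly.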

@rgaiacs
Collaborator Author

rgaiacs commented Sep 10, 2024

I have the main pods running.

kubectl get -n gesis pods
NAME                                                     READY   STATUS             RESTARTS   AGE
binder-7c84c576c-2689p                                   1/1     Running            0          80m
binderhub-cryptnono-c9hrj                                2/2     Running            0          128m
binderhub-cryptnono-dgr4g                                2/2     Running            0          128m
binderhub-cryptnono-hqpzf                                2/2     Running            0          128m
binderhub-cryptnono-pbqlx                                2/2     Running            0          128m
binderhub-dind-ntxvs                                     1/1     Running            0          80m
binderhub-grafana-9d48bc74-qtn4x                         1/1     Running            0          62m
binderhub-image-cleaner-6zc9v                            1/1     Running            0          80m
binderhub-ingress-nginx-controller-6fdbf98688-j29w2      1/1     Running            0          47m
binderhub-ingress-nginx-defaultbackend-5d698c868-qh5zx   1/1     Running            0          128m
binderhub-kube-state-metrics-8547b9d4dd-rr4tw            1/1     Running            0          128m
binderhub-prometheus-node-exporter-4dv2s                 1/1     Running            0          128m
binderhub-prometheus-node-exporter-c8bv7                 1/1     Running            0          128m
binderhub-prometheus-node-exporter-gkxcf                 1/1     Running            0          128m
binderhub-prometheus-node-exporter-wfk7h                 1/1     Running            0          128m
binderhub-prometheus-server-7c59dd5d85-fwbqm             2/2     Running            0          128m
hub-6564cd475f-nxltz                                     1/1     Running            0          13m
minesweeper-bf58z                                        0/1     ImagePullBackOff   0          128m
minesweeper-fkjd6                                        0/1     ImagePullBackOff   0          128m
minesweeper-t2fs8                                        0/1     ImagePullBackOff   0          128m
proxy-f5b566ddc-j7l9l                                    1/1     Running            0          80m
proxy-patches-85b5998bdb-9mjw9                           1/1     Running            0          128m
static-6f64c6bc8-ndn2t                                   1/1     Running            0          128m
user-scheduler-55df956bcf-6b4m6                          1/1     Running            0          80m
user-scheduler-55df956bcf-db79g                          1/1     Running            0          80m

Ingress

The ingress is not working. The goal here is to have http://notebooks-test.gesis.org answered by the NGINX Ingress pod. @manics can you help me?

ping -c 1 notebooks-test.gesis.org
PING notebooks-test.gesis.org (194.95.75.20) 56(84) bytes of data.
64 bytes from svko-css-backup-node.gesis.intra (194.95.75.20): icmp_seq=1 ttl=61 time=2.26 ms

--- notebooks-test.gesis.org ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.261/2.261/2.261/0.000 ms

kubectl -n gesis describe ingress binderhub
Name:             binderhub
Labels:           app.kubernetes.io/managed-by=Helm
Namespace:        gesis
Address:          10.100.230.222
Ingress Class:    <none>
Default backend:  <default>
TLS:
  kubelego-tls-binder-binderhub terminates notebooks-test.gesis.org
Rules:
  Host                      Path  Backends
  ----                      ----  --------
  notebooks-test.gesis.org  
                            /   binder:80 (10.244.255.21:8585)
Annotations:                kubernetes.io/ingress.class: nginx
                            kubernetes.io/tls-acme: true
                            meta.helm.sh/release-name: binderhub
                            meta.helm.sh/release-namespace: gesis
Events:
  Type    Reason  Age   From                      Message
  ----    ------  ----  ----                      -------
  Normal  Sync    54m   nginx-ingress-controller  Scheduled for sync
  Normal  Sync    54m   nginx-ingress-controller  Scheduled for sync
  Normal  Sync    53m   nginx-ingress-controller  Scheduled for sync

kubectl -n gesis describe service binderhub-ingress-nginx-controller
Name:              binderhub-ingress-nginx-controller
Namespace:         gesis
Labels:            app.kubernetes.io/component=controller
                   app.kubernetes.io/instance=binderhub
                   app.kubernetes.io/managed-by=Helm
                   app.kubernetes.io/name=ingress-nginx
                   app.kubernetes.io/part-of=ingress-nginx
                   app.kubernetes.io/version=1.11.2
                   helm.sh/chart=ingress-nginx-4.11.2
Annotations:       meta.helm.sh/release-name: binderhub
                   meta.helm.sh/release-namespace: gesis
Selector:          app.kubernetes.io/component=controller,app.kubernetes.io/instance=binderhub,app.kubernetes.io/name=ingress-nginx
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.100.230.222
IPs:               10.100.230.222
Port:              http  80/TCP
TargetPort:        http/TCP
Endpoints:         10.244.65.205:80
Port:              https  443/TCP
TargetPort:        https/TCP
Endpoints:         10.244.65.205:443
Session Affinity:  None
Events:            <none>

minesweeper

The image name is wrong. It is trying to pull jupyterhub/mybinder.org-minesweeper:set-by-chartpress.

kubectl -n gesis describe pod minesweeper-bf58z
Name:             minesweeper-bf58z
Namespace:        gesis
Priority:         0
Service Account:  minesweeper
Node:             svko-css-backup-node/194.95.75.20
Start Time:       Tue, 10 Sep 2024 14:27:52 +0200
Labels:           app=binder
                  component=minesweeper
                  controller-revision-hash=767d8795cc
                  heritage=Helm
                  name=minesweeper
                  pod-template-generation=1
                  release=binderhub
Annotations:      checksum/configmap: 7a857debb16fa8bcb22a5de6418a5ff319c9e06f4cfc010705caec539b9614cc
                  cni.projectcalico.org/containerID: a3415f68c66691989387a7ea9bc5c6dd5cfa8039affee823adfd0a9b8f0b7263
                  cni.projectcalico.org/podIP: 10.244.65.206/32
                  cni.projectcalico.org/podIPs: 10.244.65.206/32
Status:           Pending
IP:               10.244.65.206
IPs:
  IP:           10.244.65.206
Controlled By:  DaemonSet/minesweeper
Containers:
  minesweeper:
    Container ID:  
    Image:         jupyterhub/mybinder.org-minesweeper:set-by-chartpress
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      python
      /srv/minesweeper/minesweeper.py
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  250Mi
    Requests:
      cpu:     100m
      memory:  100Mi
    Environment:
      NODE_NAME:   (v1:spec.nodeName)
      NAMESPACE:  gesis
    Mounts:
      /etc/minesweeper from config (ro)
      /srv/minesweeper from src (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5wbfq (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  src:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      minesweeper-src
    Optional:  false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      minesweeper-config
    Optional:  false
  kube-api-access-5wbfq:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 hub.jupyter.org/dedicated=user:NoSchedule
                             hub.jupyter.org_dedicated=user:NoSchedule
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type    Reason   Age                     From     Message
  ----    ------   ----                    ----     -------
  Normal  BackOff  4m51s (x545 over 129m)  kubelet  Back-off pulling image "jupyterhub/mybinder.org-minesweeper:set-by-chartpress"

@manics
Member

manics commented Sep 10, 2024

Can you try running an ephemeral pod in the same namespace, and exec something like curl -v http://binderhub-ingress-nginx-controller/ from the pod? That should return a 404 from the Nginx controller default backend. You might need to add the internal service port. Note the existing pods may be restricted by NetworkPolicies, so best to create a new pod. I often use https://gist.github.com/manics/67efaed42d25cc1f830e0d5566652b03 as netshoot includes several useful tools for troubleshooting networks.

Then try curl -v --header 'Host: notebooks-test.gesis.org' http://binderhub-ingress-nginx-controller/ from the pod which should fool the ingress controller into thinking you've requested notebooks-test.gesis.org.

If that works, it means the controller and your internal BinderHub/JupyterHub ingress are (probably!) working, and the problem is likely in the path between the external internet and the internal ingress.
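If curl is not available in the debugging pod, the same two probes can be approximated with Python's standard library. This is only a sketch: `build_probe` and `probe` are hypothetical helper names, and unlike plain curl, urllib follows redirects, so a redirect to HTTPS would change what is actually tested.

```python
import urllib.error
import urllib.request


def build_probe(url, host=None):
    """Create a GET request, optionally overriding the Host header like curl --header."""
    request = urllib.request.Request(url, method="GET")
    if host is not None:
        request.add_header("Host", host)
    return request


def probe(url, host=None, timeout=5):
    """Return the HTTP status code; a 404 from the default backend is still an answer."""
    try:
        with urllib.request.urlopen(build_probe(url, host), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


# Building the request does not touch the network, so this part runs anywhere.
request = build_probe(
    "http://binderhub-ingress-nginx-controller/", host="notebooks-test.gesis.org"
)
print(request.get_header("Host"))

# Inside a pod in the gesis namespace one might then call:
#   probe("http://binderhub-ingress-nginx-controller/")
#   probe("http://binderhub-ingress-nginx-controller/", host="notebooks-test.gesis.org")
```

A connection failure raises urllib.error.URLError rather than returning a status, which by itself distinguishes "service unreachable" from "controller answered with an error".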

@manics
Member

manics commented Sep 10, 2024

For the chartpress tag problem you'll need to first run chartpress --skip-build to update the set-by-chartpress placeholders:

- name: "Stage 3: Run chartpress to update values.yaml"
  run: |
    chartpress ${{ matrix.chartpress_args || '--skip-build' }}

The actual building and pushing of the container images is done in the staging workflow, and since chartpress deterministically generates the tag based on git commit hash it's fine to rerun it to update the tags.
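Before deploying, it can help to confirm that no placeholder survived the chartpress run. The check below is a small sketch, not part of chartpress or of this repository; `find_placeholders` is a hypothetical helper and the sample values layout is made up for illustration.

```python
def find_placeholders(values_text, placeholder="set-by-chartpress"):
    """Return (line_number, line) pairs that still contain the chartpress placeholder."""
    return [
        (number, line.strip())
        for number, line in enumerate(values_text.splitlines(), start=1)
        if placeholder in line
    ]


# Hypothetical excerpt of a values file that chartpress has not processed yet.
sample = """\
minesweeper:
  image: jupyterhub/mybinder.org-minesweeper:set-by-chartpress
"""
print(find_placeholders(sample))
```

An empty list suggests that every tag was rewritten; any remaining entry points to an image that would fail to pull, as the minesweeper pods did above.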

@rgaiacs
Collaborator Author

rgaiacs commented Sep 11, 2024

Thanks @manics for the reply. I will look into chartpress. I believe the traffic problem is caused by the load balancer, so I am looking at MetalLB.

@rgaiacs
Collaborator Author

rgaiacs commented Sep 25, 2024

@manics can I have a bit of help with the pre-commit CI? Anything that I could do for it to reformat the code automatically?

@sgibson91
Member

@rgaiacs You can run pre-commit run -a locally and commit/push the result. I think prettier specifically doesn't write in CI for reasons that are documented somewhere but I will need to find the link.

@rgaiacs
Collaborator Author

rgaiacs commented Jan 7, 2025

I'm closing this because, after some discussion with @arnim, we concluded it will be better for us at GESIS to handle the Kubernetes deployment to our bare-metal server in a separate Git repository.

Thanks for all the help!
