TLS certificates are not pushed to the workload after POD sandbox changed #436

Open
Gmerold opened this issue Jan 14, 2025 · 0 comments

Gmerold commented Jan 14, 2025

Bug Description

Hi Team,

I've encountered an interesting (kinda edge) case while testing Charmed Aether SD-Core. If the POD sandbox changes, Traefik restarts without fetching the TLS certificates from the relation data. As a result, SD-Core's GUI becomes unavailable and we're getting an internal server error in Traefik (more details below).
From the Traefik charm's code I see that the certs are pushed to the workload container in 2 cases:

  1. Certs updated in the relation data
  2. Config changed (only if stored state hash changes)

The problem is that neither of these happens when the sandbox changes (Juju logs attached below).
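
To make the gap concrete, here is a toy model of the behaviour described above. It is only a sketch and not the charm's actual code: `FakeContainer`, `FakeCharm`, the handler names, the `example-ca.crt` file name and the `push_on_pebble_ready` switch are all hypothetical, and no Juju or `ops` code is involved. It assumes that a sandbox change re-creates the container with an empty certs directory and that only the pebble-ready handler runs afterwards.

```python
"""Toy model of the reported behaviour; no Juju or ops code is involved."""

CA_DIR = "/usr/local/share/ca-certificates"


class FakeContainer:
    """Hypothetical stand-in for the workload container's filesystem."""

    def __init__(self):
        self.files = {}

    def push(self, path, content):
        self.files[path] = content

    def recreate(self):
        # Assumption: a sandbox change kills and re-creates the container,
        # so files pushed at runtime are gone afterwards.
        self.files = {}


class FakeCharm:
    """Hypothetical charm with the two cert-pushing triggers from the report."""

    def __init__(self, container, relation_certs, push_on_pebble_ready=False):
        self.container = container
        self.relation_certs = relation_certs  # certs held in relation data
        self.push_on_pebble_ready = push_on_pebble_ready
        self._stored_config_hash = None

    def _push_certs(self):
        for name, pem in self.relation_certs.items():
            self.container.push(f"{CA_DIR}/{name}", pem)

    def on_certs_changed(self):
        # Trigger 1: certs updated in the relation data.
        self._push_certs()

    def on_config_changed(self, config_hash):
        # Trigger 2: config changed, but only if the stored hash differs.
        if config_hash != self._stored_config_hash:
            self._stored_config_hash = config_hash
            self._push_certs()

    def on_pebble_ready(self):
        # Assumed mitigation (not taken from the charm): re-push the certs
        # whenever the workload container comes back up.
        if self.push_on_pebble_ready:
            self._push_certs()


def simulate_sandbox_change(push_on_pebble_ready):
    """Push certs once, re-create the container, then fire pebble-ready."""
    container = FakeContainer()
    charm = FakeCharm(
        container, {"example-ca.crt": "PEM DATA"}, push_on_pebble_ready
    )
    charm.on_certs_changed()   # certs land in the container
    container.recreate()       # sandbox changed
    charm.on_pebble_ready()    # the only workload-related event afterwards
    return sorted(container.files)


if __name__ == "__main__":
    print("without re-push:", simulate_sandbox_change(False))
    print("with re-push:   ", simulate_sandbox_change(True))
```

In this model the certs directory stays empty after the sandbox change unless pebble-ready also pushes the certs. Whether that is the right place for a fix in the real charm is for the maintainers to decide.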

There can be different reasons for the POD sandbox to go down. In my case it was suspending my laptop for the night (locally-running MicroK8s).

After restarting the Traefik Pod, everything comes back to normal, because in this case config changed gets fired.

Cheers,
Bartek

To Reproduce

As mentioned in the description, a change of the POD sandbox can happen for a bunch of reasons (e.g. insufficient resources), but the easiest way to reproduce the problem is this:

  1. Follow the Charmed Aether SD-Core's Getting started tutorial using your laptop as a host
  2. Suspend the laptop (might wanna give it like 5 minutes in suspension)
  3. Wake up the laptop, wait for the apps to come up (PODs running, Juju apps active/idle)
  4. Open Charmed Aether SD-Core NMS web page and see the Internal Server Error
  5. With jhack, see that the TLS cert is present in the relation data
  6. Inside the traefik container see that there are no certs under /usr/local/share/ca-certificates

Environment

Required tools and versions are described in the Charmed Aether SD-Core's Getting started tutorial

Relevant log output

Symptom from the Traefik Pod's `describe` output:
`Normal  SandboxChanged  100s   kubelet  Pod sandbox changed, it will be killed and re-created.`

Effect of the issue when trying to access SD-Core NMS web page:

2025-01-13T09:17:28.445Z [traefik] time="2025-01-13T09:17:28Z" level=debug msg="'500 Internal Server Error' caused by: tls: failed to verify certificate: x509: certificate signed by unknown authority"

Juju logs for Traefik starting after the POD sandbox change, showing that config changed is not happening:

unit-traefik-0: 08:51:33 INFO juju.cmd running containerAgent [3.6.1 cdb5fe45b78a4701a8bc8369c5a50432358afbd3 gc go1.23.4]
unit-traefik-0: 08:51:33 INFO juju.cmd.containeragent.unit start "unit"
unit-traefik-0: 08:51:33 INFO juju.worker.upgradesteps upgrade steps for 3.6.1 have already been run.
unit-traefik-0: 08:51:33 INFO juju.worker.probehttpserver starting http server on 127.0.0.1:65301
unit-traefik-0: 08:51:33 ERROR juju.worker.dependency "api-caller" manifold worker returned unexpected error: [1353c1] "unit-traefik-0" cannot open api: unable to connect to API: dial tcp 10.152.183.149:17070: connect: connection refused
unit-traefik-0: 08:51:38 ERROR juju.worker.dependency "api-caller" manifold worker returned unexpected error: [1353c1] "unit-traefik-0" cannot open api: unable to connect to API: dial tcp 10.152.183.149:17070: connect: connection refused
unit-traefik-0: 08:51:43 INFO juju.api cannot resolve "controller-service.controller-microk8s-localhost.svc.cluster.local": lookup controller-service.controller-microk8s-localhost.svc.cluster.local: operation was canceled
unit-traefik-0: 08:51:43 INFO juju.api connection established to "wss://10.152.183.149:17070/model/1353c1e2-6fb4-4669-8f77-3712b9b64faa/api"
unit-traefik-0: 08:51:43 INFO juju.worker.apicaller [1353c1] "unit-traefik-0" successfully connected to "10.152.183.149:17070"
unit-traefik-0: 08:51:43 INFO juju.worker.migrationminion migration migration phase is now: NONE
unit-traefik-0: 08:51:43 INFO juju.worker.logger logger worker started
unit-traefik-0: 08:51:43 WARNING juju.worker.proxyupdater unable to set snap core settings [proxy.http= proxy.https= proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
unit-traefik-0: 08:51:43 INFO juju.agent.tools ensure jujuc symlinks in /var/lib/juju/tools/unit-traefik-0
unit-traefik-0: 08:51:43 INFO juju.worker.leadership traefik/0 promoted to leadership of traefik
unit-traefik-0: 08:51:43 INFO juju.worker.caasupgrader abort check blocked until version event received
unit-traefik-0: 08:51:43 INFO juju.worker.caasupgrader unblocking abort check
unit-traefik-0: 08:51:43 INFO juju.worker.uniter unit "traefik/0" started
unit-traefik-0: 08:51:43 INFO juju.worker.uniter hooks are retried true
unit-traefik-0: 08:51:43 INFO juju.worker.uniter reboot detected; triggering implicit start hook to notify charm
unit-traefik-0: 08:51:44 INFO unit.traefik/0.juju-log Running legacy hooks/start.
(Removed warnings about deprecation of calling ops.main.main())
unit-traefik-0: 08:51:47 INFO juju.worker.uniter.operation ran "start" hook (via hook dispatching script: dispatch)
unit-traefik-0: 08:51:49 INFO unit.traefik/0.juju-log Kubernetes service 'traefik' patched successfully
(Removed warnings about deprecation of calling ops.main.main())
unit-traefik-0: 08:51:54 INFO juju.worker.uniter.operation ran "traefik-pebble-ready" hook (via hook dispatching script: dispatch)
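
For anyone who wants to check a longer log for the same pattern, the hooks that ran can be pulled out with a small script. This is just a sketch: the `hooks_run` helper and the regular expression are my own, written against the `ran "<hook>" hook` lines shown above, and the sample is the two such lines from this log.

```python
import re

# Matches uniter lines such as:
#   ... juju.worker.uniter.operation ran "start" hook (via hook dispatching script: dispatch)
HOOK_RE = re.compile(r'juju\.worker\.uniter\.operation ran "([^"]+)" hook')


def hooks_run(log_text):
    """Return the hook names in the order they appear in the log text."""
    return HOOK_RE.findall(log_text)


# The two "ran ... hook" lines copied from the log above.
SAMPLE = '''\
unit-traefik-0: 08:51:47 INFO juju.worker.uniter.operation ran "start" hook (via hook dispatching script: dispatch)
unit-traefik-0: 08:51:54 INFO juju.worker.uniter.operation ran "traefik-pebble-ready" hook (via hook dispatching script: dispatch)
'''

if __name__ == "__main__":
    hooks = hooks_run(SAMPLE)
    print("hooks:", hooks)
    print("config-changed ran:", "config-changed" in hooks)
```

For the log above this finds only `start` and `traefik-pebble-ready`, which matches the observation that config changed never runs after the sandbox change.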


Additional context

No response