Leaf clusters lose connection during TLS Multiplex migration on the root. #49447

Open · programmerq opened this issue Nov 25, 2024 · 0 comments
Labels: bug, c-q7j (Internal Customer Reference), tls-routing (Issues related to TLS routing), trusted-cluster

Expected Behavior

The leaf cluster should successfully reconnect to the root cluster after the root proxy listener mode is changed to multiplex. The connection should be maintained across root proxy restarts, using the configured protocols.

Regarding migration from separate listeners to TLS multiplexing, the migration guide in the docs says:

Turning multiplexing on will not affect existing connections of trusted clusters, reverse tunnel agents and tsh/ssh clients. As long as the legacy listeners are enabled (see step 7 below), all clients will keep connecting in the backwards compatibility mode until restarted or relogged in/reconfigured in case of tsh/ssh as described below.
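As a point of reference, the intermediate state described there corresponds roughly to a static configuration in which the auth service advertises multiplex mode while the proxy keeps its legacy listeners defined. A minimal sketch, using plain teleport.yaml fields rather than the chart-rendered output:

# Sketch of the intermediate migration state, assuming a plain teleport.yaml:
# the auth service advertises multiplex, but the proxy keeps its legacy
# listeners open so existing trusted-cluster and reverse-tunnel connections
# can keep using them until they are reconfigured.
version: v2
auth_service:
  proxy_listener_mode: multiplex
proxy_service:
  web_listen_addr: 0.0.0.0:443
  tunnel_listen_addr: 0.0.0.0:3024   # legacy reverse tunnel port used by the leaf
  listen_addr: 0.0.0.0:3023          # legacy SSH proxy port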

Current Behavior

After switching the root cluster's proxy listener mode from separate to multiplex, the leaf cluster can no longer connect. Once the root proxy pods are restarted, the leaf cluster fails to reconnect. A packet capture shows the leaf proxy sending a TLS ClientHello with an ALPN value of teleport-reversetunnel, but the root proxy rejects it even though, per the migration guide, it should still be accepted while the legacy listeners are enabled.

Bug Details

Teleport Version

  • The customer originally reported the issue on a recent 15.x version.
  • Reproduced in version 16.4.8

Recreation Steps

  1. Deploy a root cluster with proxy_listener_mode set to separate.
  2. Deploy a leaf cluster and configure it to join the root cluster.
  3. Verify successful connection and functionality between root and leaf clusters.
  4. Change root cluster's proxy_listener_mode to multiplex.
  5. Confirm that the leaf cluster reconnects successfully upon deployment completion.
  6. Restart root proxy pods.
  7. Observe the leaf cluster's failure to reconnect.
Detailed repro steps:

Root cluster helm values:

clusterName: rhiza.example.com
enterprise: true
enterpriseImage: public.ecr.aws/gravitational/teleport-ent-distroless
log:
  level: DEBUG
  format: json
authentication:
  type: local
  secondFactor: "on"
  webauthn:
    rp_id: rhiza.example.com
  connectorName: local
  device_trust:
    mode: "off"
operator:
  enabled: true
  installCRDs: never
podMonitor:
  enabled: true
proxyListenerMode: separate
chartMode: standalone
persistence:
  enabled: true
  volumeSize: 10Gi
highAvailability:
  replicaCount: 2
  certManager:
    enabled: true
    issuerName: "letsencrypt-prod"
    issuerKind: ClusterIssuer
service:
  type: LoadBalancer
auth:
  highAvailability:
    replicaCount: 1
  teleportConfig:
    auth_service:
      #proxy_listener_mode: multiplex
      proxy_listener_mode: separate
  extraArgs:
    - "-d"
proxy:
  annotations:
    service:
      external-dns.alpha.kubernetes.io/hostname: rhiza.example.com,*.rhiza.example.com
resources:
  requests:
    cpu: 50m
    memory: "512Mi"
  limits:
    cpu: "2"
    memory: "512Mi"

Leaf cluster helm values:

clusterName: folio.example.com
enterprise: true
enterpriseImage: public.ecr.aws/gravitational/teleport-ent-distroless
log:
  level: DEBUG
  format: json
authentication:
  type: local
  secondFactor: "on"
  webauthn:
    rp_id: folio.example.com
  connectorName: local
  device_trust:
    mode: "off"
operator:
  enabled: true
  installCRDs: never
podMonitor:
  enabled: true
proxyListenerMode: separate
#proxyListenerMode: multiplex
chartMode: standalone
persistence:
  enabled: true
  volumeSize: 10Gi
highAvailability:
  replicaCount: 2
  certManager:
    enabled: true
    issuerName: "letsencrypt-prod"
    issuerKind: ClusterIssuer
service:
  type: LoadBalancer
auth:
  highAvailability:
    replicaCount: 1
  extraArgs:
    - "-d"
proxy:
  annotations:
    service:
      external-dns.alpha.kubernetes.io/hostname: folio.example.com,*.folio.example.com
resources:
  requests:
    cpu: 50m
    memory: "512Mi"
  limits:
    cpu: "2"
    memory: "512Mi"

Once the root cluster was up, I created a user account and assigned myself the access, auditor, and editor roles.
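For reference, the resulting user resource looks roughly like the sketch below; the username is illustrative and not from the repro, only the role assignment matters here:

kind: user
version: v2
metadata:
  name: alice   # illustrative username
spec:
  roles: ["access", "auditor", "editor"]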

I created this token on the root:

kind: token
metadata:
  name: leaftoken
spec:
  join_method: token
  roles:
  - Trusted_cluster
version: v2

On the leaf, I created this trusted_cluster object:

kind: trusted_cluster
version: v2
metadata:
  name: rhiza
spec:
  enabled: true
  token: leaftoken
  tunnel_addr: rhiza.example.com:3024
  web_proxy_addr: rhiza.example.com:443
  role_map:
  - remote: access
    local: ["access"]
  - remote: auditor
    local: ["auditor"]
  - remote: editor
    local: ["editor"]

At this point, the leaf should join successfully, and tctl get rc on the root cluster should list the leaf.

At this point, the migration to TLS multiplexing can start by updating the root cluster values file. Instead of switching the entire chart to multiplex mode, an override in the teleportConfig section keeps the separate listeners while still activating TLS multiplexing.

--- rootvalues.yaml
+++ rootvalues.yaml
@@ -54,7 +54,7 @@ spec:
       replicaCount: 1
     teleportConfig:
       auth_service:
-        proxy_listener_mode: separate
+        proxy_listener_mode: multiplex
     extraArgs:
     - "-d"
   proxy:

The root proxy pods are not restarted by this change, and the leaf clusters appear to stay connected during this step; I initially mistook this for a failure to reproduce the reported symptom. Once the root proxy pods are restarted, however, the issue is triggered and leaf clusters can no longer establish a connection.

Once the trusted_cluster object is re-created, the leaf cluster successfully rejoins. For a root cluster with dozens or hundreds of leaf clusters, this loss of functionality during the intermediate migration step is a problem: the graceful migration path should keep working until each leaf cluster can be updated to use the multiplexed port.
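For completeness, once a given leaf is ready to move to the multiplexed port, its trusted_cluster resource would presumably be updated so both addresses point at the web port. A sketch, reusing the addresses from the repro above with the role_map trimmed:

kind: trusted_cluster
version: v2
metadata:
  name: rhiza
spec:
  enabled: true
  token: leaftoken
  # With TLS routing, the reverse tunnel goes over the multiplexed web port
  # instead of the legacy 3024 listener.
  tunnel_addr: rhiza.example.com:443
  web_proxy_addr: rhiza.example.com:443
  role_map:
  - remote: access
    local: ["access"]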

Debug Logs

root cluster:

{"caller":"alpnproxy/proxy.go:335","component":"alpn:proxy","level":"warning","message":"Failed to handle client connection: failed to find ALPN handler based on received client supported protocols [\"teleport-reversetunnel\"]","timestamp":"2024-11-22T23:02:57Z"}

leaf cluster:

{"caller":"reversetunnel/agentpool.go:279","component":"proxy:agent","error":"EOF","level":"debug","message":"Failed to connect agent.","timestamp":"2024-11-22T23:03:04Z","trace.fields":{"localCluster":"folio.example.com","targetCluster":"rhiza.example.com"}}