
Random TLS certificate verification failure when calling the percona xtradb cluster validating webhook #1675

Open
konoox opened this issue Mar 15, 2024 · 2 comments
konoox commented Mar 15, 2024

Report

Random TLS certificate verification failure when calling the percona xtradb cluster validating webhook

More about the problem

When we deploy the pxc operator in cluster-wide mode (watchAllNamespaces=true) with more than one replica (replicaCount>1), TLS certificate verification fails at random when the validating webhook is called. These errors appear when a user tries to apply or edit the CR definition of a pxc cluster, and in the operator logs during reconciliation. The logged error is:

"Internal error occurred: failed calling webhook "validationwebhook.pxc.percona.com": failed to call webhook: Post "https://percona-xtradb-cluster-operator.namespace.svc:443/validate-percona-xtradbcluster?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "Root CA")

After some investigation, I noticed that the CA bundle configured in the validating webhook changes each time a pxc-operator replica pod takes over leadership, and only that leader pod has a TLS certificate matching the bundle.

This can be checked by extracting the caBundle from the validating webhook and the tls.crt from the pxc-operator leader pod, then verifying the signature with openssl:

```
kubectl get validatingwebhookconfiguration percona-xtradbcluster-webhook -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d > ca-bundle.crt
kubectl exec pxc-operator-6bc5fb656b-2grl7 -- cat /tmp/k8s-webhook-server/serving-certs/tls.crt > leader-tls.crt
openssl verify -CAfile ca-bundle.crt leader-tls.crt
leader-tls.crt: OK
```

But if we extract the tls.crt from another pxc-operator replica pod, the verification fails:

```
openssl verify -CAfile ca-bundle.crt replica-tls.crt
error 7 at 0 depth lookup:certificate signature failure
139771764295568:error:0407008A:rsa routines:RSA_padding_check_PKCS1_type_1:invalid padding:rsa_pk1.c:116:
139771764295568:error:04067072:rsa routines:RSA_EAY_PUBLIC_DECRYPT:padding check failed:rsa_eay.c:761:
139771764295568:error:0D0C5006:asn1 encoding routines:ASN1_item_verify:EVP lib:a_verify.c:249:
```

And if we delete the leader pod, the caBundle configured in the validating webhook changes to match the certificate of the new leader.
Since the percona-xtradb-cluster-operator Service routes to any of the pxc-operator replica pods, this explains why the error appears at random: it occurs whenever the webhook call lands on a non-leader replica.
This is also confirmed by the fact that the problem disappears when we scale the operator down to a single replica.
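The mismatch can be reproduced locally without a cluster. The sketch below (file names and subjects are illustrative, not the operator's actual ones) generates two independent self-signed "Root CA"s, as two operator replicas each generating their own CA would, signs one serving certificate with the first CA, and verifies it against both. Because both CAs share the subject "Root CA", openssl finds a candidate issuer either way, but the signature check fails against the CA that did not sign the cert, matching the "certificate signature failure" above.

```shell
#!/bin/sh
# Self-contained illustration of the failure mode (assumed mechanism:
# each replica generates its own self-signed CA). Not the operator's code.
set -e
tmp=$(mktemp -d)
cd "$tmp"

# CA generated by "replica 1" (the leader whose CA lands in the webhook caBundle)
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca1.key -out ca1.crt \
  -subj "/CN=Root CA" -days 1
# CA generated by "replica 2", same subject but different key
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca2.key -out ca2.crt \
  -subj "/CN=Root CA" -days 1

# Serving certificate signed by CA 1 only
openssl req -newkey rsa:2048 -nodes -keyout tls.key -out tls.csr \
  -subj "/CN=percona-xtradb-cluster-operator.namespace.svc"
openssl x509 -req -in tls.csr -CA ca1.crt -CAkey ca1.key -CAcreateserial \
  -out tls.crt -days 1

openssl verify -CAfile ca1.crt tls.crt          # tls.crt: OK
openssl verify -CAfile ca2.crt tls.crt 2>&1 || true  # certificate signature failure
```

This is exactly the situation a client hits when the Service routes its request to a replica whose certificate was signed by a CA other than the one in the webhook's caBundle.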

Steps to reproduce

  1. Deploy the pxc operator in cluster-wide mode with more than one replica (helm values watchAllNamespaces=true and replicaCount>1); the more replicas, the easier the bug is to reproduce.
  2. Deploy a pxc cluster with any valid configuration.
  3. Wait a while and check the operator logs: TLS verification failures should appear at random during reconciliation. You can also check that the signature is valid against only one of the operator pods' certificates.

Versions

  1. Kubernetes - v1.27.6
  2. Operator - Percona Operator for MySQL based on Percona XtraDB Cluster 1.13.0

Anything else?

No response

konoox added the bug label Mar 15, 2024
Elyytscha commented Sep 12, 2024

We have the exact same behaviour with just one operator pod.

2024-09-12T08:20:50.978Z	ERROR	Update status	{"controller": "pxc-controller", "namespace": "helpdesk", "name": "sys-stat-db-cluster", "reconcileID": "00dbb033-c1fa-46c8-8683-f217ed05fd9d", "error": "write status: Internal error occurred: failed calling webhook \"validationwebhook.pxc.percona.com\": failed to call webhook: Post \"https://percona-xtradb-cluster-operator.pxc-operator.svc:443/validate-percona-xtradbcluster?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Root CA\")", "errorVerbose": "Internal error occurred: failed calling webhook \"validationwebhook.pxc.percona.com\": failed to call webhook: Post \"https://percona-xtradb-cluster-operator.pxc-operator.svc:443/validate-percona-xtradbcluster?timeout=10s\": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"Root CA\")\nwrite 
status\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).writeStatus\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/status.go:158\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).updateStatus\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/status.go:43\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile.func1\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:204\ngithub.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile\n\t/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:327\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:261\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"}
github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile.func1
	/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:206
github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc.(*ReconcilePerconaXtraDBCluster).Reconcile
	/go/src/github.com/percona/percona-xtradb-cluster-operator/pkg/controller/pxc/controller.go:327
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:261
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:222
kind: Deployment
apiVersion: apps/v1
metadata:
  name: percona-xtradb-cluster-operator
  namespace: pxc-operator
  annotations:
    deployment.kubernetes.io/revision: '1'
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: operator
      app.kubernetes.io/instance: percona-xtradb-cluster-operator
      app.kubernetes.io/name: percona-xtradb-cluster-operator
      app.kubernetes.io/part-of: percona-xtradb-cluster-operator
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: operator
        app.kubernetes.io/instance: percona-xtradb-cluster-operator
        app.kubernetes.io/name: percona-xtradb-cluster-operator
        app.kubernetes.io/part-of: percona-xtradb-cluster-operator
    spec:
      containers:
        - resources:
            limits:
              cpu: 200m
              memory: 500Mi
            requests:
              cpu: 100m
              memory: 20Mi
          terminationMessagePath: /dev/termination-log
          name: percona-xtradb-cluster-operator
          command:
            - percona-xtradb-cluster-operator
          livenessProbe:
            httpGet:
              path: /metrics
              port: metrics
              scheme: HTTP
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: LOG_STRUCTURED
              value: 'false'
            - name: LOG_LEVEL
              value: INFO
            - name: WATCH_NAMESPACE
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.name
            - name: OPERATOR_NAME
              value: percona-xtradb-cluster-operator
            - name: DISABLE_TELEMETRY
              value: 'false'
          ports:
            - name: metrics
              containerPort: 8080
              protocol: TCP
          imagePullPolicy: Always
          terminationMessagePolicy: File
          image: 'perconalab/percona-xtradb-cluster-operator:main'
      restartPolicy: Always
      terminationGracePeriodSeconds: 600
      dnsPolicy: ClusterFirst
      serviceAccountName: percona-xtradb-cluster-operator
      serviceAccount: percona-xtradb-cluster-operator
      securityContext: {}
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 25%
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600

wuxinchao011 commented:
me too
