webhook calls fail with "remote error: tls: bad certificate" and prevent management of Provisioners #4709

Closed
BryanStenson-okta opened this issue Sep 27, 2023 · 2 comments

@BryanStenson-okta

Description

Observed Behavior:

On a new EKS cluster (without anything previously installed), we're seeing "tls: bad certificate" errors when trying to register a Provisioner, which breaks our scale-out. :(

Karpenter: v0.27.6
EKS: v1.25.12-eks-2d98532
ArgoCD: v2.8.3+77556d9

Sample logs:

$ kubectl logs karpenter-provisioners-5cd99796cf-lrnbs -f
{"level":"info","ts":1695796080.7865567,"logger":"fallback","caller":"injection/injection.go:63","msg":"Starting informers..."}
2023/09/27 06:28:03 Registering 2 clients
2023/09/27 06:28:03 Registering 2 informer factories
2023/09/27 06:28:03 Registering 3 informers
2023/09/27 06:28:03 Registering 5 controllers
{"level":"INFO","time":"2023-09-27T06:28:03.864Z","logger":"controller","message":"Starting server","commit":"5a2fe84-dirty","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"INFO","time":"2023-09-27T06:28:03.867Z","logger":"controller","message":"Starting server","commit":"5a2fe84-dirty","kind":"health probe","addr":"[::]:8081"}
I0927 06:28:03.970155 1 leaderelection.go:248] attempting to acquire leader lease infra/karpenter-leader-election...
{"level":"INFO","time":"2023-09-27T06:28:04.009Z","logger":"controller","message":"Starting informers...","commit":"5a2fe84-dirty"}
2023/09/27 06:28:05 http: TLS handshake error from 10.42.172.163:50520: remote error: tls: bad certificate
...

I've x-posted this here, as possibly related/identical: knative/pkg#2560 (comment)
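
A rough way to check for the CA mismatch behind this error is to compare the CA cert the webhook generated with the caBundle patched into the webhook configuration. The secret name, key name, webhook configuration name, and "infra" namespace below are assumptions based on the default chart and our install -- adjust for yours:

# CA cert the webhook generated for itself (secret and key names assumed from the chart / knative/pkg)
kubectl -n infra get secret karpenter-cert -o jsonpath='{.data.ca-cert\.pem}' \
  | base64 -d | openssl x509 -noout -subject -dates

# CA bundle injected into the webhook configuration (what the API server trusts)
kubectl get validatingwebhookconfiguration validation.webhook.karpenter.sh \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' \
  | base64 -d | openssl x509 -noout -subject -dates

If these two differ, or the caBundle is empty or stale, the API server rejects the handshake with exactly this "bad certificate" error.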

Expected Behavior:
no errors :)

Reproduction Steps (Please include YAML):

Mostly the stock Helm chart (v0.27.0), with the following overrides (some details redacted here):

logLevel: info
logEncoding: json

revisionHistoryLimit: 3

controller:
  image:
    repository: {{ include "ecrPrefix" . }}/mirror/public.ecr.aws/karpenter/controller
    tag: v0.27.6
    digest: sha256:21848a7d84ad33a02d930ad1c233fe7403920507c9e0681c9e7780dc9c34fca4

  resources:
    requests:
      cpu: 100m
      memory: 300Mi

  env:
    # needed to call AWS pricing api
    - name: HTTP_PROXY
      value: http://squid.infra:3128
    - name: HTTPS_PROXY
      value: http://squid.infra:3128
    - name: NO_PROXY
      value: sqs.{{ .Values.context.aws_region }}.amazonaws.com,eks.{{ .Values.context.aws_region }}.amazonaws.com,sts.{{ .Values.context.aws_region }}.amazonaws.com,ec2.{{ .Values.context.aws_region }}.amazonaws.com,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16

settings:
  aws:
    clusterName: {{ .Values.context.eks_cluster_name }}
    defaultInstanceProfile: {{ .Values.context.eks_cluster_name }}-node-karpenter
    interruptionQueueName: karpenter-{{ .Values.context.eks_cluster_name }}-instance-interruption
  tags:
    Terraformed: false
    KarpenterManaged: true
  batchMaxDuration: "120s"
  batchIdleDuration: "30s"

extraVolumeMounts:
  - name: aws-iam-token
    mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
    readOnly: true

extraVolumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token

replicas: 2

serviceAccount:
  name: karpenter-{{ .Values.context.slice_name }}
  annotations:
    eks.amazonaws.com/role-arn: {{ include "awsRoleARNPrefix" . }}/karpenter-{{ .Values.context.slice_name }}

dnsPolicy: ClusterFirst

tolerations:
  - key: "bootstrapper"
    operator: "Exists"

podAnnotations:
  ad.datadoghq.com/controller.check_names: '["openmetrics"]'
  ad.datadoghq.com/controller.init_configs: '[{}]'
  # https://docs.datadoghq.com/agent/guide/template_variables/
  ad.datadoghq.com/controller.instances: '[{ "openmetrics_endpoint":"http://%%host%%:8080/metrics","namespace":"infra","metrics":["^karpenter.*"] }]'
  ad.datadoghq.com/webhook.check_names: '["openmetrics"]'
  ad.datadoghq.com/webhook.init_configs: '[{}]'
  # https://docs.datadoghq.com/agent/guide/template_variables/
  ad.datadoghq.com/webhook.instances: '[{ "openmetrics_endpoint":"http://%%host%%:8080/metrics","namespace":"infra","metrics":["^karpenter.*"] }]'

Versions:

  • Chart Version: v0.27.0
  • Karpenter Version: v0.27.6
  • Kubernetes Version (kubectl version): v1.25.12-eks-2d98532

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@BryanStenson-okta added the bug label Sep 27, 2023
@BryanStenson-okta

We're using ArgoCD to install the karpenter Helm chart, so our very ugly workaround is (rough commands are sketched after the note below):

  1. disable ArgoCD "auto-sync" for karpenter
  2. delete all validating and mutating k8s webhooks
  3. manually apply the desired Provisioner to the cluster (kubectl apply -f foo.yaml)
  4. re-enable ArgoCD "auto-sync" for karpenter
  5. observe karpenter restores the deleted k8s webhooks
  6. rolling restart of the karpenter pods (kubectl rollout restart deployment karpenter)

This allows us to get a Provisioner into the cluster (by skipping any validation), so the karpenter controller can build out nodes.

NOTE: any edits/updates of the Provisioner fail -- due to the same error: http: TLS handshake error from 10.42.172.163:50520: remote error: tls: bad certificate
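
For reference, steps 2, 3, and 6 look roughly like this; the webhook configuration names, namespace, and deployment name are from our install (chart v0.27) and may differ in yours:

# Step 2: delete the Karpenter webhook configurations
# (verify the names first: kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep karpenter)
kubectl delete validatingwebhookconfiguration validation.webhook.karpenter.sh validation.webhook.config.karpenter.sh
kubectl delete mutatingwebhookconfiguration defaulting.webhook.karpenter.sh

# Step 3: apply the Provisioner while nothing is left to reject it
kubectl apply -f foo.yaml

# Step 6: once ArgoCD has re-created the webhooks, restart the controller
kubectl -n infra rollout restart deployment karpenter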

@BryanStenson-okta changed the title from "remote error: tls: bad certificate" to "webhook calls fail with 'remote error: tls: bad certificate' and prevent management of Provisioners" Sep 27, 2023

tzneal commented Sep 29, 2023

I'm going to close this as a duplicate of #2902; we're working on removing the webhooks entirely, which should resolve this.

@tzneal closed this as completed Sep 29, 2023