webhook calls fail with "remote error: tls: bad certificate" and prevent management of Provisioners #4709

Closed
BryanStenson-okta opened this issue Sep 27, 2023 · 2 comments

@BryanStenson-okta

Description

Observed Behavior:

On a new EKS cluster (without anything previously installed), we're seeing "tls: bad certificate" errors when trying to register a Provisioner, which breaks our scale-out. :(

Karpenter: v0.27.6
EKS: v1.25.12-eks-2d98532
ArgoCD: v2.8.3+77556d9

Sample logs:

$ kubectl logs karpenter-provisioners-5cd99796cf-lrnbs -f
{"level":"info","ts":1695796080.7865567,"logger":"fallback","caller":"injection/injection.go:63","msg":"Starting informers..."}
2023/09/27 06:28:03 Registering 2 clients
2023/09/27 06:28:03 Registering 2 informer factories
2023/09/27 06:28:03 Registering 3 informers
2023/09/27 06:28:03 Registering 5 controllers
{"level":"INFO","time":"2023-09-27T06:28:03.864Z","logger":"controller","message":"Starting server","commit":"5a2fe84-dirty","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
{"level":"INFO","time":"2023-09-27T06:28:03.867Z","logger":"controller","message":"Starting server","commit":"5a2fe84-dirty","kind":"health probe","addr":"[::]:8081"}
I0927 06:28:03.970155 1 leaderelection.go:248] attempting to acquire leader lease infra/karpenter-leader-election...
{"level":"INFO","time":"2023-09-27T06:28:04.009Z","logger":"controller","message":"Starting informers...","commit":"5a2fe84-dirty"}
2023/09/27 06:28:05 http: TLS handshake error from 10.42.172.163:50520: remote error: tls: bad certificate
...

I've x-posted this here, as possibly related/identical: knative/pkg#2560 (comment)
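
A rough way to check for the CA mismatch behind this error is to compare the CA cert the webhook generated with the caBundle patched into the webhook configuration. The secret name, key name, webhook configuration name, and "infra" namespace below are assumptions based on the default chart and our install -- adjust for yours:

# CA cert the webhook generated for itself (secret and key names assumed from the chart / knative/pkg)
kubectl -n infra get secret karpenter-cert -o jsonpath='{.data.ca-cert\.pem}' \
  | base64 -d | openssl x509 -noout -subject -dates

# CA bundle injected into the webhook configuration (what the API server trusts)
kubectl get validatingwebhookconfiguration validation.webhook.karpenter.sh \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' \
  | base64 -d | openssl x509 -noout -subject -dates

If these two differ, or the caBundle is empty or stale, the API server rejects the handshake with exactly this "bad certificate" error.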

Expected Behavior:
no errors :)

Reproduction Steps (Please include YAML):

Mostly the stock Helm chart (v0.27.0), with the following overrides (some details redacted here):

logLevel: info
logEncoding: json

revisionHistoryLimit: 3

controller:
  image:
    repository: {{ include "ecrPrefix" . }}/mirror/public.ecr.aws/karpenter/controller
    tag: v0.27.6
    digest: sha256:21848a7d84ad33a02d930ad1c233fe7403920507c9e0681c9e7780dc9c34fca4

  resources:
    requests:
      cpu: 100m
      memory: 300Mi

  env:
    # needed to call AWS pricing api
    - name: HTTP_PROXY
      value: http://squid.infra:3128
    - name: HTTPS_PROXY
      value: http://squid.infra:3128
    - name: NO_PROXY
      value: sqs.{{ .Values.context.aws_region }}.amazonaws.com,eks.{{ .Values.context.aws_region }}.amazonaws.com,sts.{{ .Values.context.aws_region }}.amazonaws.com,ec2.{{ .Values.context.aws_region }}.amazonaws.com,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16

settings:
  aws:
    clusterName: {{ .Values.context.eks_cluster_name }}
    defaultInstanceProfile: {{ .Values.context.eks_cluster_name }}-node-karpenter
    interruptionQueueName: karpenter-{{ .Values.context.eks_cluster_name }}-instance-interruption
  tags:
    Terraformed: false
    KarpenterManaged: true
  batchMaxDuration: "120s"
  batchIdleDuration: "30s"

extraVolumeMounts:
  - name: aws-iam-token
    mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
    readOnly: true

extraVolumes:
  - name: aws-iam-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: sts.amazonaws.com
          expirationSeconds: 86400
          path: token

replicas: 2

serviceAccount:
  name: karpenter-{{ .Values.context.slice_name }}
  annotations:
    eks.amazonaws.com/role-arn: {{ include "awsRoleARNPrefix" . }}/karpenter-{{ .Values.context.slice_name }}

dnsPolicy: ClusterFirst

tolerations:
  - key: "bootstrapper"
    operator: "Exists"

podAnnotations:
  ad.datadoghq.com/controller.check_names: '["openmetrics"]'
  ad.datadoghq.com/controller.init_configs: '[{}]'
  # https://docs.datadoghq.com/agent/guide/template_variables/
  ad.datadoghq.com/controller.instances: '[{ "openmetrics_endpoint":"http://%%host%%:8080/metrics","namespace":"infra","metrics":["^karpenter.*"] }]'
  ad.datadoghq.com/webhook.check_names: '["openmetrics"]'
  ad.datadoghq.com/webhook.init_configs: '[{}]'
  # https://docs.datadoghq.com/agent/guide/template_variables/
  ad.datadoghq.com/webhook.instances: '[{ "openmetrics_endpoint":"http://%%host%%:8080/metrics","namespace":"infra","metrics":["^karpenter.*"] }]'

Versions:

  • Chart Version: v0.27.0
  • Karpenter Version: v0.27.6
  • Kubernetes Version (kubectl version): v1.25.12-eks-2d98532

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@BryanStenson-okta added the bug label Sep 27, 2023
@BryanStenson-okta

We're using ArgoCD to install the karpenter Helm chart, so our very ugly workaround is (rough commands are sketched after the note below):

  1. disable ArgoCD "auto-sync" for karpenter
  2. delete all validating and mutating k8s webhooks
  3. manually apply the desired Provisioner to the cluster (kubectl apply -f foo.yaml)
  4. re-enable ArgoCD "auto-sync" for karpenter
  5. observe karpenter restores the deleted k8s webhooks
  6. rolling restart of the karpenter pods (kubectl rollout restart deployment karpenter)

This allows us to get a Provisioner into the cluster (by skipping any validation), so the karpenter controller can build out nodes.

NOTE: any edits/updates of the Provisioner fail -- due to the same error: http: TLS handshake error from 10.42.172.163:50520: remote error: tls: bad certificate
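
For reference, steps 2, 3, and 6 look roughly like this; the webhook configuration names, namespace, and deployment name are from our install (chart v0.27) and may differ in yours:

# Step 2: delete the Karpenter webhook configurations
# (verify the names first: kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep karpenter)
kubectl delete validatingwebhookconfiguration validation.webhook.karpenter.sh validation.webhook.config.karpenter.sh
kubectl delete mutatingwebhookconfiguration defaulting.webhook.karpenter.sh

# Step 3: apply the Provisioner while nothing is left to reject it
kubectl apply -f foo.yaml

# Step 6: once ArgoCD has re-created the webhooks, restart the controller
kubectl -n infra rollout restart deployment karpenter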

@BryanStenson-okta changed the title from "remote error: tls: bad certificate" to "webhook calls fail with 'remote error: tls: bad certificate' and prevent management of Provisioners" Sep 27, 2023

tzneal commented Sep 29, 2023

I'm going to close this as a duplicate of #2902; we're working on removing the webhooks entirely, which should resolve this.

@tzneal closed this as completed Sep 29, 2023