
Karpenter conversion webhooks do not work on chart version 1.0.2 #6982

Closed
dcherniv opened this issue Sep 11, 2024 · 10 comments
Labels
bug Something isn't working triage/needs-investigation Issues that need to be investigated before triaging

Comments

@dcherniv

Description

Observed Behavior:
karpenter Chart version: 1.0.2
karpenter-crd Chart version: 1.0.2
with the webhook settings configured in the values as follows:

karpenter-crd:
  webhook:
    enabled: false
karpenter:
  enabled: true
  webhook:
    enabled: false

The webhooks fail with the error below, with no indication as to why:

{"level":"ERROR","time":"2024-09-11T20:30:53.969Z","logger":"controller","message":"Reconciler error","commit":"b897114","controller":"nodeclaim.podevents","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","namespace":"istio-system"},"namespace":"istio-system","name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","reconcileID":"86d25014-bd4b-4b5d-852f-a5c4d50fe7c4","error":"Internal error occurred: failed calling webhook \"validation.webhook.karpenter.sh\": failed to call webhook: the server rejected our request for an unknown reason"}

Expected Behavior:
Webhooks to work in 1.0.2

Reproduction Steps (Please include YAML):
#6847 (comment)

Versions:

  • Chart Version: 1.0.2
  • Kubernetes Version (kubectl version): 1.29+ (EKS)
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jonathan-innis
Contributor

Can you share if there is anything in the logs that indicates why the webhook rejected the request? I'm also curious if this even made it to the webhook or if something is getting in the way of the network traffic to the service/pod.
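
A quick first check is to see which Karpenter webhook configurations the apiserver is actually calling and whether the service behind them has ready endpoints. A minimal sketch; the namespace and label here assume a default install, so adjust to yours:

  kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep -i karpenter
  kubectl get endpoints -n kube-system -l app.kubernetes.io/name=karpenter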

@jonathan-innis jonathan-innis added triage/needs-investigation Issues that need to be investigated before triaging and removed needs-triage Issues that need to be triaged labels Sep 12, 2024
@dcherniv
Author

@jonathan-innis nothing beyond this.
{"level":"ERROR","time":"2024-09-11T20:30:53.969Z","logger":"controller","message":"Reconciler error","commit":"b897114","controller":"nodeclaim.podevents","controllerGroup":"","controllerKind":"Pod","Pod":{"name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","namespace":"istio-system"},"namespace":"istio-system","name":"istio-gateway-internal-secret-init-84df655d8b-sdxn4","reconcileID":"86d25014-bd4b-4b5d-852f-a5c4d50fe7c4","error":"Internal error occurred: failed calling webhook "validation.webhook.karpenter.sh": failed to call webhook: the server rejected our request for an unknown reason"}
This is from Karpenter logs.
I'm assuming this was trying to validate a NodeClaim that didn't have the group field set.
I'm reading the docs for the migration to v1. Can you confirm that 0.37.3 doesn't mutate resources?
It's worth noting that by the time I got this error, all the resources I have defined for NodePools etc. in my chart had already been upgraded to the v1 spec and deployed before the Karpenter 1.0.2 chart.
My path was:

  1. Upgrade to 0.37.3
  2. Upgrade all manifests to v1 spec on 0.37.3
  3. Deploy Karpenter 1.0.2

@jonathan-innis jonathan-innis added triage/needs-information Marks that the issue still needs more information to properly triage and removed triage/needs-investigation Issues that need to be investigated before triaging labels Sep 12, 2024
@jonathan-innis
Contributor

Upgrade all manifests to v1 spec on 0.37.3

When you run through this part of your upgrade, did this also have the v1 storage version? Even with the resources on the v1 spec, if they aren't stored on the correct storage version, there can be issues during the upgrade.
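
One way to check the stored versions on the cluster; the CRD names below assume the AWS provider defaults:

  kubectl get crd nodepools.karpenter.sh nodeclaims.karpenter.sh ec2nodeclasses.karpenter.k8s.aws \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.storedVersions}{"\n"}{end}'

If v1beta1 still shows up in storedVersions, objects were written before the v1 storage version took effect.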

@jonathan-innis
Contributor

Also, from what it looks like, there is something on the network path that is blocking this traffic from getting through. If there are no errors on the webhook side, that would indicate to me that something is preventing the call from the apiserver from reaching the pod's service endpoint.
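
To rule that out, it can help to confirm where the webhook's clientConfig points and that the service resolves to a ready endpoint. The webhook and service names below are typical defaults; verify what actually exists on your cluster:

  kubectl get validatingwebhookconfiguration validation.webhook.karpenter.sh \
    -o jsonpath='{.webhooks[0].clientConfig}'
  kubectl get service/karpenter endpoints/karpenter -n kube-system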

@dcherniv
Author

@jonathan-innis Found a couple of folks who encountered the same issue: #6847 (comment)
and #6879 (comment)
Deleting the validating webhooks altogether resolves this. I think they are carried over from 0.37.3 and are not needed on 1.0.2; can you confirm? It's weird that they are not cleaned up automatically upon upgrade.
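
For anyone else hitting this in the meantime, the manual workaround looks like the following. The names are typical of a pre-1.0 install; list first and only delete what the grep actually shows:

  kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep -i karpenter
  kubectl delete validatingwebhookconfiguration validation.webhook.karpenter.sh validation.webhook.config.karpenter.sh
  kubectl delete mutatingwebhookconfiguration defaulting.webhook.karpenter.sh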

@jonathan-innis
Contributor

Agreed, this looks like an issue with the interaction that Karpenter has with Argo. We definitely need to look at this since we shouldn't be leaving behind MutatingWebhookConfigurations and ValidatingWebhookConfigurations after the upgrade.

@jonathan-innis jonathan-innis added triage/needs-investigation Issues that need to be investigated before triaging and removed triage/needs-information Marks that the issue still needs more information to properly triage labels Sep 16, 2024
@jonathan-innis
Contributor

So, I think that we tracked it down. There's an open issue here that specifically describes this interaction between Knative and ArgoCD. What it comes down to is that ArgoCD refuses to delete objects with ownerReferences -- even if it was the original creator of the object.

Because Knative injects an ownerReference onto the ValidatingWebhookConfiguration and MutatingWebhookConfiguration as part of its reconciliation flow, ArgoCD ignores these objects when pruning and leaves them behind on the cluster. I'm trying out what happens when we don't have these owner references by patching the Knative package we are using (#7042) to see if that resolves the issue (I suspect it should).
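
To see the injected ownerReference on an affected cluster (the webhook name here is the usual default):

  kubectl get validatingwebhookconfiguration validation.webhook.karpenter.sh \
    -o jsonpath='{.metadata.ownerReferences}'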

Code link to where ArgoCD does this check: https://github.com/argoproj/gitops-engine/blob/master/pkg/cache/cluster.go#L1167

@neoakris

neoakris commented Sep 20, 2024

Could this be part of the issue?
https://github.com/aws/karpenter-provider-aws/blob/main/charts/karpenter/Chart.yaml
https://github.com/aws/karpenter-provider-aws/blob/v1.0.2/charts/karpenter/Chart.yaml
^-- v1.0.x of the Helm chart should use v1 CRDs, but from what I can tell it's referencing v1beta1 CRDs.
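
One way to check which versions an installed CRD actually serves and stores (nodepools.karpenter.sh as an example):

  kubectl get crd nodepools.karpenter.sh \
    -o jsonpath='{range .spec.versions[*]}{.name}{" served="}{.served}{" storage="}{.storage}{"\n"}{end}'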

Also, in the future, once these upstream issues get resolved, it'd be nice if the EKS Blueprints Karpenter addon could get some love:
aws-quickstart/cdk-eks-blueprints#1078
I tried doing a fresh install of the Karpenter Helm chart 1.0.2 (via EKS Blueprints for CDK), where there should be no need to convert CRDs or migrate old versions of the app. Even a fresh install doesn't work in that case: from what I can tell it installs older beta versions of the CRDs, and I think that might be the fault of the upstream Helm chart, based on the references in the link above.

@jonathan-innis
Contributor

We just released new patch versions of the pre-v1.0 minor versions that fix this issue so that these configuration resources aren't leaked. Please use one of the following versions prior to going to 1.0.x, since these versions remove the ownerReference that causes Argo to leak the resources and fail the upgrade (see the sketch after this list):

  • v0.33.10
  • v0.34.11
  • v0.35.10
  • v0.36.7
  • v0.37.5
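
An illustrative upgrade sketch; the release names, namespace, and flags are assumptions, so adjust to your install, and step the karpenter-crd chart in lockstep if you use it:

  helm upgrade --install karpenter-crd oci://public.ecr.aws/karpenter/karpenter-crd --version 0.37.5 -n kube-system
  helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter --version 0.37.5 -n kube-system

After verifying the stale webhook configurations are gone, proceed to 1.0.x with the same commands and --version 1.0.2.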

@jonathan-innis
Contributor

Closing since this should be resolved now.
