Resource conflict regression from v0.60.0 #1037
Comments
Thank you for creating the issue @universam1!
🙏🏻
One of the principles for kapp is that it guarantees that it will only apply the changes that have been approved by the user. If we want to retry on this particular error, it would mean getting a confirmation from the user again, which might not be a great user experience. It would be ideal to retry the kapp deploy from outside, i.e. via a pipeline or a controller like kapp-controller.
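For context, the kapp-controller route mentioned here would look roughly like the App custom resource below, which reconciles (and therefore re-attempts) the deploy on every sync. This is only a minimal sketch; the Git URL, service account, namespace, and sync period are placeholders, not values taken from this issue.

```yaml
# Sketch of a kapp-controller App that keeps re-applying the manifests.
# URL, service account and sync period are placeholders.
apiVersion: kappctrl.k14s.io/v1alpha1
kind: App
metadata:
  name: core-services
  namespace: kapp-controller
spec:
  serviceAccountName: core-services-sa
  # Re-reconcile (and hence retry a failed deploy) roughly every minute.
  syncPeriod: 1m
  fetch:
  - git:
      url: https://example.com/org/core-services-manifests
      ref: origin/main
  template:
  - ytt: {}
  deploy:
  - kapp: {}
```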
Maybe! The scenario is a brand-new, vanilla EKS cluster: right after CloudFormation reports success, we call kapp to deploy the core services. Those core services include updates to existing DaemonSets. Apparently (though this is not clear) EKS might have delayed deployments that happen during the kapp run.
Since this is a transient error, it is quite hard to reproduce, but I'll try.
Probably not. It is hard to test since we are in a CI pipeline here, but it does seem to succeed after retrying. However, we cannot retry in CI because the one-time, session-zero credentials to EKS make it an "either it succeeded or it didn't" situation.
Please consider that we are not in an interactive session here but in a CI pipeline, running non-interactively. I agree that in an interactive session it makes sense to require another user confirmation, but here, in headless mode in CI, kapp should have an option to enforce the desired state!
Yeah, that could be the reason.
I see, thanks. If we can check both the original diff and the recalculated diff, it would help us determine the exact fields due to which the diff is changing, and we could probably add rebase rules to ignore those fields.
Curious to know how you were able to pinpoint the exact version of kapp with the issue.
I agree that such an option would be useful, and I have seen a few similar requests in the past. I think it would be good to first determine the root cause and see if a rebase rule would help; otherwise we can think about the best way to retry in such cases.
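To make the rebase-rule idea concrete: if the root cause turns out to be server-populated defaults such as imagePullPolicy or defaultMode (the kind of fields that show up in the diffs below), a kapp Config along these lines could copy the cluster's values instead of diffing against them. This is only a sketch under that assumption; the path shown is an illustrative example, and the real rules would target whichever fields the recalculated diff identifies.

```yaml
# Sketch of a kapp Config with a rebase rule that keeps the cluster's value
# for a server-defaulted field, so it no longer shows up in the diff.
# The path below (imagePullPolicy on Deployment containers) is an assumed
# example, not taken from a confirmed root cause in this issue.
apiVersion: kapp.k14s.io/v1alpha1
kind: Config
rebaseRules:
- path: [spec, template, spec, containers, {allIndexes: true}, imagePullPolicy]
  type: copy
  sources: [existing, new]
  resourceMatchers:
  - apiVersionKindMatcher: {apiVersion: apps/v1, kind: Deployment}
```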
I have more results from testing, and the problem is not scoped to managed resources. It also happens for resources that are solely owned by kapp! And it is reproducible. Let me attach examples below.
See the following examples; those resources are kapp-owned and not touched by any other operator. This is the output of a 3rd retry (I was able to implement CI job retries)!
Original diff:
@@ update deployment/skipper-ingress (apps/v1) namespace: kube-system @@
...
205,205 spec:
206 - progressDeadlineSeconds: 600
207,206 replicas: 2
208 - revisionHistoryLimit: 10
209,207 selector:
210,208 matchLabels:
...
215,213 maxUnavailable: 0
216 - type: RollingUpdate
217,214 template:
218,215 metadata:
219 - creationTimestamp: null
220,216 labels:
221,217 application: skipper-ingress
...
289,285 image: registry.opensource.zalan.do/teapot/skipper:v0.21.223
290 - imagePullPolicy: IfNotPresent
291,286 name: skipper
292,287 ports:
...
294,289 name: ingress-port
295 - protocol: TCP
296,290 - containerPort: 9998
297,291 name: redirect-port
298 - protocol: TCP
299,292 - containerPort: 9911
300,293 name: metrics-port
301 - protocol: TCP
302,294 readinessProbe:
303 - failureThreshold: 3
304,295 httpGet:
305,296 path: /kube-system/healthz
...
308,299 initialDelaySeconds: 5
309 - periodSeconds: 10
310 - successThreshold: 1
311,300 timeoutSeconds: 1
312,301 resources:
...
315,304 memory: 200Mi
316 - terminationMessagePath: /dev/termination-log
317 - terminationMessagePolicy: File
318,305 volumeMounts:
319,306 - mountPath: /etc/skipper-cert
...
329,316 name: skipper-init
330 - dnsPolicy: ClusterFirst
331,317 priorityClassName: system-cluster-critical
332 - restartPolicy: Always
333 - schedulerName: default-scheduler
334 - securityContext: {}
335 - serviceAccount: skipper-ingress
336,318 serviceAccountName: skipper-ingress
337 - terminationGracePeriodSeconds: 30
338,319 tolerations:
339,320 - effect: NoExecute
...
346,327 secret:
347 - defaultMode: 420
348,328 secretName: skipper-cert
349,329 - name: vault-tls
350,330 secret:
351 - defaultMode: 420
352,331 secretName: vault-tls
353,332 - name: oidc-secret-file
354,333 secret:
355 - defaultMode: 420
356,334 secretName: skipper-oidc-secret
357,335 - configMap:
@@ update poddisruptionbudget/skipper-ingress (policy/v1) namespace: kube-system @@
...
2, 2 metadata:
3 - annotations: {}
4, 3 creationTimestamp: "2024-12-02T08:22:15Z"
5, 4 generation: 1
@@ update prometheus/k8s (monitoring.coreos.com/v1) namespace: monitoring @@
...
2, 2 metadata:
3 - annotations: {}
4, 3 creationTimestamp: "2024-12-02T08:22:20Z"
5, 4 generation: 1
Recalculated diff:
Error:
- update deployment/skipper-ingress (apps/v1) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource deployment/skipper-ingress (apps/v1) namespace: kube-system: API server says: Operation cannot be fulfilled on deployments.apps "skipper-ingress": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
207,207 - progressDeadlineSeconds: 600
209,208 - revisionHistoryLimit: 10
217,215 - type: RollingUpdate
220,217 - creationTimestamp: null
291,287 - imagePullPolicy: IfNotPresent
296,291 - protocol: TCP
299,293 - protocol: TCP
302,295 - protocol: TCP
304,296 - failureThreshold: 3
310,301 - periodSeconds: 10
311,301 - successThreshold: 1
317,306 - terminationMessagePath: /dev/termination-log
318,306 - terminationMessagePolicy: File
331,318 - dnsPolicy: ClusterFirst
333,319 - restartPolicy: Always
334,319 - schedulerName: default-scheduler
335,319 - securityContext: {}
336,319 - serviceAccount: skipper-ingress
338,320 - terminationGracePeriodSeconds: 30
348,329 - defaultMode: 420
352,332 - defaultMode: 420
356,335 - defaultMode: 420
- update poddisruptionbudget/skipper-ingress (policy/v1) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource poddisruptionbudget/skipper-ingress (policy/v1) namespace: kube-system: API server says: Operation cannot be fulfilled on poddisruptionbudgets.policy "skipper-ingress": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
3, 3 - annotations: {}
- update horizontalpodautoscaler/skipper-ingress (autoscaling/v2) namespace: kube-system: Failed to update due to resource conflict (approved diff no longer matches): Updating resource horizontalpodautoscaler/skipper-ingress (autoscaling/v2) namespace: kube-system: API server says: Operation cannot be fulfilled on horizontalpodautoscalers.autoscaling "skipper-ingress": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
95, 95 - selectPolicy: Max
102,101 - selectPolicy: Max
- update prometheus/k8s (monitoring.coreos.com/v1) namespace: monitoring: Failed to update due to resource conflict (approved diff no longer matches): Updating resource prometheus/k8s (monitoring.coreos.com/v1) namespace: monitoring: API server says: Operation cannot be fulfilled on prometheuses.monitoring.coreos.com "k8s": the object has been modified; please apply your changes to the latest version and try again (reason: Conflict): Recalculated diff:
3, 3 - annotations: {}
189,188 - evaluationInterval: 30s
205,203 - portName: web
233,230 - scrapeInterval: 30s
We were running v0.58.0 in production. Upon upgrading to v0.63.3, all integration pipelines failed, consistently. In order to determine the problematic release, I built versions of our CI tooling with every kapp minor version between those two and discovered that the latest working version is v0.59.4. We are now running that version in production.
BTW, I was able to implement a kapp retry in our CI tool nevertheless. However, even that fails consistently, and even with 3 retries we are unable to converge successfully! It just fails on other resources. So there is a fundamental regression.
Thanks a lot for the details @universam1!
Thank you @praveenrewar for your help! Happy to assist, let me know where I can help!
@praveenrewar One interesting detail when comparing the logs is that the working versions of kapp output a lot of retryable errors. Example logs:
8:09:40AM: create issuer/vault-secrets-webhook-ca (cert-manager.io/v1) namespace: vault
8:09:40AM: ^ Retryable error: Creating resource issuer/vault-secrets-webhook-ca (cert-manager.io/v1) namespace: vault: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:40AM: create certificate/vault-secrets-webhook-webhook-tls (cert-manager.io/v1) namespace: vault
8:09:40AM: ^ Retryable error: Creating resource certificate/vault-secrets-webhook-webhook-tls (cert-manager.io/v1) namespace: vault: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:40AM: create issuer/vault-secrets-webhook-selfsign (cert-manager.io/v1) namespace: vault
8:09:40AM: ^ Retryable error: Creating resource issuer/vault-secrets-webhook-selfsign (cert-manager.io/v1) namespace: vault: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:40AM: create certificate/vault-secrets-webhook-ca (cert-manager.io/v1) namespace: vault
8:09:40AM: ^ Retryable error: Creating resource certificate/vault-secrets-webhook-ca (cert-manager.io/v1) namespace: vault: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:45AM: create clusterissuer/selfsigned-issuer (cert-manager.io/v1) cluster
8:09:45AM: ^ Retryable error: Creating resource clusterissuer/selfsigned-issuer (cert-manager.io/v1) cluster: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:45AM: create certificate/hubble-server-certs (cert-manager.io/v1) namespace: kube-system
8:09:45AM: ^ Retryable error: Creating resource certificate/hubble-server-certs (cert-manager.io/v1) namespace: kube-system: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:45AM: create certificate/hubble-relay-client-certs (cert-manager.io/v1) namespace: kube-system
8:09:45AM: ^ Retryable error: Creating resource certificate/hubble-relay-client-certs (cert-manager.io/v1) namespace: kube-system: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:09:45AM: create certificate/cilium-selfsigned-ca (cert-manager.io/v1) namespace: cert-manager
8:09:45AM: ^ Retryable error: Creating resource certificate/cilium-selfsigned-ca (cert-manager.io/v1) namespace: cert-manager: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:10:29AM: create certificate/aws-load-balancer-serving-cert (cert-manager.io/v1) namespace: kube-system
8:10:29AM: ^ Retryable error: Creating resource certificate/aws-load-balancer-serving-cert (cert-manager.io/v1) namespace: kube-system: API server says: Internal error occurred: failed calling webhook "webhook.cert-manager.io": failed to call webhook: Post "https://cert-manager-webhook.cert-manager.svc:443/mutate?timeout=10s": no endpoints available for service "cert-manager-webhook" (reason: InternalError)
8:10:34AM: create issuer/self-signer (cert-manager.io/v1) namespace: kube-system
It might be that the conflict is happening before these retryable errors.
What steps did you take:
We are unable to use any version newer than v0.59.4 with app-deploy, failing with resource conflict (approved diff no longer matches). By method of elimination, we have tested the following versions:
0.63.3: FAIL
0.62.1: FAIL
0.61.0: FAIL
0.60.2: FAIL
0.60.0: FAIL
0.59.4: SUCCESS
What happened:
We are deploying the full cluster config from scratch via kapp app-deploy, ~800 resources in total, within a single app. This works amazingly well with kapp, way better than Helm!
However, since v0.60.0 we encounter this error on the first apply:
My assumption is that a webhook or a controller might be interfering with kapp on certain fields here.
However, we need to be able to configure the EKS cluster via kapp even under such a temporary clash.
What did you expect:
Kapp to retry
@praveenrewar
Vote on this request
This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" up to the right of this comment to vote.
👍 "I would like to see this addressed as soon as possible"
👎 "There are other more important things to focus on right now"
We are also happy to receive and review Pull Requests if you want to help working on this issue.