
Ignore failed pods in tests because of graceful node shutdown for Kubernetes 1.22+ #679

Merged

merged 2 commits into capactio:main from ignore-shutdown-pods-in-e2e-tests on Mar 21, 2022

Conversation

@pkosiec (Member) commented on Mar 18, 2022

Description

Changes proposed in this pull request:

  • Ignore failed pods in tests because of graceful node shutdown for Kubernetes 1.22+

Notes

Currently, our E2E tests ignore Pods with the reason "Shutdown" because of the graceful node shutdown feature. That makes sense: in the node shutdown manager from K8s 1.21, the reason is indeed exactly that one: https://github.com/kubernetes/kubernetes/blob/v1.21.6/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L40-L42
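For context, a minimal sketch of such a filter (the helper name and structure are illustrative, not the repository's actual test code; only the "Shutdown" reason comes from the 1.21 kubelet source linked above):

	package e2e

	import (
		"strings"

		corev1 "k8s.io/api/core/v1"
	)

	// isGracefulShutdownPod reports whether a failed Pod was evicted by the
	// kubelet's graceful node shutdown manager. On Kubernetes 1.21 the status
	// reason set by the manager is the single constant "Shutdown".
	// Illustrative helper, not the actual test code.
	func isGracefulShutdownPod(pod corev1.Pod) bool {
		return pod.Status.Phase == corev1.PodFailed &&
			strings.EqualFold(pod.Status.Reason, "Shutdown")
	}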

Since 15.03.2022, we have been observing Pod failures on our long-running cluster:

Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.6-gke.1503", GitCommit:"2c7bbda09a9b7ca78db230e099cf90fe901d3df8", GitTreeState:"clean", BuildDate:"2022-02-18T03:17:45Z", GoVersion:"go1.16.9b7", Compiler:"gc", Platform:"linux/amd64"}

The Pods have the following statuses:

Status:         Failed
Reason:         NodeShutdown
Message:        Pod was rejected as the node is shutting down.

or

Status:         Failed
Reason:         Terminated
Message:        Pod was terminated in response to imminent node shutdown.

These failures are also related to the graceful node shutdown feature, and this is expected behavior.

However, what's really odd is that these reasons and messages are the "newer" ones - they changed in 1.22:
https://github.com/kubernetes/kubernetes/blob/v1.22.0/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L39-L42 - the PR which introduced them: kubernetes/kubernetes#102840
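Putting the two versions side by side (the 1.22 strings are exactly the ones in the Pod statuses above; the V121/V122 suffixes are added here only for the comparison - in the kubelet source the constants are simply nodeShutdownReason and friends):

	// Kubernetes 1.21, pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L40-L42:
	const nodeShutdownReasonV121 = "Shutdown"

	// Kubernetes 1.22+, same file, after kubernetes/kubernetes#102840:
	const (
		nodeShutdownReasonV122             = "Terminated"
		nodeShutdownMessageV122            = "Pod was terminated in response to imminent node shutdown."
		nodeShutdownNotAdmittedReasonV122  = "NodeShutdown"
		nodeShutdownNotAdmittedMessageV122 = "Pod was rejected as the node is shutting down."
	)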

Again, compare with the code and docs from 1.21 linked above.

I couldn't find anything useful in the GCP GKE release notes or in the GCP GKE issue tracker.

I stopped the short investigation at this point, as:

  • I wanted to create an issue in the GCP issue tracker. Issue created: https://issuetracker.google.com/issues/225186478
  • This change will be useful anyway once we migrate to K8s 1.22+: Bump Go and Kubernetes dependencies #611
  • I don't think continuing this on our own has high priority, apart from creating the issue. While it would be worth knowing "why", at least we know "what".
    • If we want to continue, I can create a dedicated task for it. Idea: we could reconfigure our node pool, enable SSH, connect to a node, and inspect the kubelet configuration - but I'm not sure that's possible. Even if it is, I'm not sure we'd find anything useful.

Testing

I tested the PR against our GCP cluster:

  1. Authenticate to the cluster (gcloud container clusters get-credentials ${CLUSTER_NAME} --region ${REGION})
  2. Add _ "k8s.io/client-go/plugin/pkg/client/auth/gcp" import in test/e2e/cluster_test.go
  3. (Optional) Comment line 41 (fields.OneTermNotEqualSelector("metadata.namespace", "kube-system"),), to run this test also against kube-system, where we have such Pods.
    • UPDATE: I disabled periodic tests for a while, as we can notice such terminated pods in the capact-system NS. You don't need to comment this line.
  4. Run the Cluster check test, see that it passes
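For reference, here's a minimal sketch of how failed Pods can be listed with the field selector from step 3 (assuming a client-go clientset; this paraphrases the test setup rather than quoting it):

	package e2e

	import (
		"context"

		corev1 "k8s.io/api/core/v1"
		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"k8s.io/apimachinery/pkg/fields"
		"k8s.io/client-go/kubernetes"
	)

	// listFailedPods lists failed Pods in all namespaces except kube-system.
	// Commenting out the metadata.namespace selector (step 3 above) widens
	// the scan to kube-system as well. Illustrative sketch, not the actual
	// test code.
	func listFailedPods(ctx context.Context, cs kubernetes.Interface) (*corev1.PodList, error) {
		selector := fields.AndSelectors(
			fields.OneTermEqualSelector("status.phase", string(corev1.PodFailed)),
			fields.OneTermNotEqualSelector("metadata.namespace", "kube-system"),
		)
		return cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
			FieldSelector: selector.String(),
		})
	}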

You can run this test again to see what happens with the previous check - comment out lines 110-115:

	if strings.EqualFold(status.Reason, nodeShutdownReason) && strings.EqualFold(status.Message, nodeShutdownMessage) {
		return true
	}
	if strings.EqualFold(status.Reason, nodeShutdownNotAdmittedReason) {
		return true
	}

and run it. You'll see that it fails with the same messages as https://github.com/capactio/capact/runs/5595654956?check_suite_focus=true

@pkosiec pkosiec added bug Something isn't working area/ci Relates to CI labels Mar 18, 2022
@pkosiec pkosiec added this to the 0.7.0 milestone Mar 18, 2022
@mszostok (Member) left a comment:
lgtm 👍

@pkosiec pkosiec merged commit 0d531eb into capactio:main Mar 21, 2022
@pkosiec pkosiec deleted the ignore-shutdown-pods-in-e2e-tests branch March 21, 2022 13:25