
Ignore failed pods in tests because of graceful node shutdown for Kubernetes 1.22+ #679

Merged

merged 2 commits into capactio:main from ignore-shutdown-pods-in-e2e-tests on Mar 21, 2022

Conversation

@pkosiec (Member) commented on Mar 18, 2022

Description

Changes proposed in this pull request:

  • Ignore failed pods in tests because of graceful node shutdown for Kubernetes 1.22+

Notes

Currently, our E2E tests ignore Pods with the reason "Shutdown" because of the graceful node shutdown feature. That makes sense: in the node shutdown manager from K8s 1.21, the reason is indeed exactly that one: https://github.com/kubernetes/kubernetes/blob/v1.21.6/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L40-L42
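For context, a minimal sketch of such a filter (the helper name and structure are illustrative, not the repository's actual test code; only the "Shutdown" reason comes from the 1.21 kubelet source linked above):

	package e2e

	import (
		"strings"

		corev1 "k8s.io/api/core/v1"
	)

	// isGracefulShutdownPod reports whether a failed Pod was evicted by the
	// kubelet's graceful node shutdown manager. On Kubernetes 1.21 the status
	// reason set by the manager is the single constant "Shutdown".
	// Illustrative helper, not the actual test code.
	func isGracefulShutdownPod(pod corev1.Pod) bool {
		return pod.Status.Phase == corev1.PodFailed &&
			strings.EqualFold(pod.Status.Reason, "Shutdown")
	}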

Since 15.03.2022, we have been observing Pod failures on our long-running cluster:

Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.6-gke.1503", GitCommit:"2c7bbda09a9b7ca78db230e099cf90fe901d3df8", GitTreeState:"clean", BuildDate:"2022-02-18T03:17:45Z", GoVersion:"go1.16.9b7", Compiler:"gc", Platform:"linux/amd64"}

The Pods have the following statuses:

Status:         Failed
Reason:         NodeShutdown
Message:        Pod was rejected as the node is shutting down.

or

Status:         Failed
Reason:         Terminated
Message:        Pod was terminated in response to imminent node shutdown.

These failures are also related to the graceful node shutdown feature, and this is expected behavior.

However, what's really odd is that these reasons and messages are the "newer" ones - they changed in 1.22:
https://github.com/kubernetes/kubernetes/blob/v1.22.0/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L39-L42 - the PR which introduced them: kubernetes/kubernetes#102840
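Putting the two versions side by side (the 1.22 strings are exactly the ones in the Pod statuses above; the V121/V122 suffixes are added here only for the comparison - in the kubelet source the constants are simply nodeShutdownReason and friends):

	// Kubernetes 1.21, pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L40-L42:
	const nodeShutdownReasonV121 = "Shutdown"

	// Kubernetes 1.22+, same file, after kubernetes/kubernetes#102840:
	const (
		nodeShutdownReasonV122             = "Terminated"
		nodeShutdownMessageV122            = "Pod was terminated in response to imminent node shutdown."
		nodeShutdownNotAdmittedReasonV122  = "NodeShutdown"
		nodeShutdownNotAdmittedMessageV122 = "Pod was rejected as the node is shutting down."
	)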

Again, compare with the code and docs from 1.21 linked above.

I couldn't find anything useful in the GCP GKE release notes or in the GCP GKE issue tracker.

I stopped the short investigation at this point, as:

  • I wanted to create an issue in the GCP issue tracker. Issue created: https://issuetracker.google.com/issues/225186478
  • This change will be useful anyway once we migrate to K8s 1.22+: Bump Go and Kubernetes dependencies #611
  • I don't think continuing this on our own has high priority, apart from creating the issue. While it would be worth knowing "why", at least we know "what".
    • If we want to continue, I can create a dedicated task for it. Idea: we could reconfigure our node pool, enable SSH, connect to a node, and inspect the kubelet configuration - but I'm not sure that's possible. Even if it is, I'm not sure we'd find anything useful.

Testing

I tested the PR against our GCP cluster:

  1. Authenticate to the cluster (gcloud container clusters get-credentials ${CLUSTER_NAME} --region ${REGION})
  2. Add _ "k8s.io/client-go/plugin/pkg/client/auth/gcp" import in test/e2e/cluster_test.go
  3. (Optional) Comment line 41 (fields.OneTermNotEqualSelector("metadata.namespace", "kube-system"),), to run this test also against kube-system, where we have such Pods.
    • UPDATE: I disabled periodic tests for a while, as we can notice such terminated pods in the capact-system NS. You don't need to comment this line.
  4. Run the Cluster check test, see that it passes
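For reference, here's a minimal sketch of how failed Pods can be listed with the field selector from step 3 (assuming a client-go clientset; this paraphrases the test setup rather than quoting it):

	package e2e

	import (
		"context"

		corev1 "k8s.io/api/core/v1"
		metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
		"k8s.io/apimachinery/pkg/fields"
		"k8s.io/client-go/kubernetes"
	)

	// listFailedPods lists failed Pods in all namespaces except kube-system.
	// Commenting out the metadata.namespace selector (step 3 above) widens
	// the scan to kube-system as well. Illustrative sketch, not the actual
	// test code.
	func listFailedPods(ctx context.Context, cs kubernetes.Interface) (*corev1.PodList, error) {
		selector := fields.AndSelectors(
			fields.OneTermEqualSelector("status.phase", string(corev1.PodFailed)),
			fields.OneTermNotEqualSelector("metadata.namespace", "kube-system"),
		)
		return cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
			FieldSelector: selector.String(),
		})
	}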

You can run this test again to see what happens with the previous check - comment out lines 110-115:

	if strings.EqualFold(status.Reason, nodeShutdownReason) && strings.EqualFold(status.Message, nodeShutdownMessage) {
		return true
	}
	if strings.EqualFold(status.Reason, nodeShutdownNotAdmittedReason) {
		return true
	}

and run it. You'll see that it fails with the same messages as https://github.com/capactio/capact/runs/5595654956?check_suite_focus=true

@pkosiec pkosiec added bug Something isn't working area/ci Relates to CI labels Mar 18, 2022
@pkosiec pkosiec added this to the 0.7.0 milestone Mar 18, 2022
@mszostok (Member) left a comment:
lgtm 👍

@pkosiec pkosiec merged commit 0d531eb into capactio:main Mar 21, 2022
@pkosiec pkosiec deleted the ignore-shutdown-pods-in-e2e-tests branch March 21, 2022 13:25