Ignore failed pods in tests because of graceful node shutdown for Kubernetes 1.22+ #679
Description
Changes proposed in this pull request:
Notes
Currently, our E2E tests ignore Pods with the reason "Shutdown" because of the graceful node shutdown feature. This makes sense, because the node shutdown manager in K8s 1.21 indeed uses that reason: https://github.com/kubernetes/kubernetes/blob/v1.21.6/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L40-L42
Since 2022-03-15 we have been observing Pod failures on our long-running cluster. The Pods have one of the following statuses:
or
These failures are also caused by graceful node shutdown, and that is perfectly normal behavior:
However, what's really odd is that these reasons and messages are the "newer" ones: they changed in 1.22, see https://github.com/kubernetes/kubernetes/blob/v1.22.0/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L39-L42 (introduced by kubernetes/kubernetes#102840).
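To make the test's exception cover both kubelet generations, the check needs to accept the 1.21 reason as well as the 1.22+ one. A minimal, stdlib-only Go sketch of such a predicate is below; the helper name is hypothetical, and the reason/message strings are taken from the linked kubelet sources (1.21 uses reason "Shutdown"; 1.22+ uses reason "Terminated" with a message mentioning node shutdown), so verify them against your cluster version:

```go
package main

import (
	"fmt"
	"strings"
)

// isGracefulShutdownFailure reports whether a failed Pod's status looks like a
// graceful-node-shutdown eviction. It is a simplified sketch: the real E2E
// test would read these fields from a corev1.PodStatus via client-go.
func isGracefulShutdownFailure(reason, message string) bool {
	switch reason {
	case "Shutdown": // Kubernetes <= 1.21 node shutdown manager
		return true
	case "Terminated": // Kubernetes >= 1.22 node shutdown manager
		// Distinguish shutdown evictions from other terminations by message.
		return strings.Contains(message, "node shutdown")
	default:
		return false
	}
}

func main() {
	fmt.Println(isGracefulShutdownFailure("Shutdown", "Node is shutting, evicting pods"))
	fmt.Println(isGracefulShutdownFailure("Terminated", "Pod was terminated in response to imminent node shutdown."))
	fmt.Println(isGracefulShutdownFailure("Error", "container crashed"))
}
```

Matching on both reasons keeps the test green regardless of which Kubernetes minor version the node pool runs during an upgrade.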
Again, see the code and docs from 1.21:
I couldn't find anything useful in the GCP GKE release notes or in the GCP GKE issue tracker.
I stopped the short investigation at this point, as:
I want to create an issue in the GCP issue tracker. Issue created: https://issuetracker.google.com/issues/225186478
Testing
I tested the PR against our GCP cluster:
gcloud container clusters get-credentials ${CLUSTER_NAME} --region ${REGION}
- Uncomment the `_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"` import in `test/e2e/cluster_test.go`.
- Comment out the line `fields.OneTermNotEqualSelector("metadata.namespace", "kube-system"),` to run this test also against the `kube-system` NS, where we have such Pods. You don't need to comment this line, though, as we have such Pods in the `capact-system` NS as well.
- Run the `Cluster check` test and see that it passes.
- To see what happens with the previous check, comment out lines 110-115 and run the test again. You'll see that it fails with the same messages as https://github.com/capactio/capact/runs/5595654956?check_suite_focus=true
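The field selector mentioned in the steps above excludes `kube-system` Pods from the cluster check. As a stdlib-only illustration of what that filter does (the real test uses the `k8s.io/apimachinery/pkg/fields` API, not this hypothetical helper):

```go
package main

import "fmt"

// pod is a minimal stand-in for the Kubernetes Pod metadata used here;
// the real test works with client-go types and server-side field selectors.
type pod struct {
	Name      string
	Namespace string
}

// excludeNamespace mimics fields.OneTermNotEqualSelector("metadata.namespace", ns):
// it keeps only Pods whose namespace differs from ns.
func excludeNamespace(pods []pod, ns string) []pod {
	var out []pod
	for _, p := range pods {
		if p.Namespace != ns {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	pods := []pod{
		{Name: "kube-dns-abc", Namespace: "kube-system"},
		{Name: "capact-engine-xyz", Namespace: "capact-system"},
	}
	for _, p := range excludeNamespace(pods, "kube-system") {
		fmt.Println(p.Namespace + "/" + p.Name)
	}
}
```

Commenting out that selector in the test widens the check to every namespace, which is why the `kube-system` shutdown Pods then show up in the results.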