flaky e2e jobs after EKS migration #2898
The job failed at 18:32:07; the pod failed at 18:33:23.
According to the log, it failed after the 2-minute timeout (PodStartShortTimeout). While waiting on that timeout, the kubelet logged errors like the following:
renamed the issue to be more general as there seem to be a lot more flaky jobs. my suspicion right now is resource constraints, i.e. VM CPU / resource sharing limits at EKS. @dims @ameukam unfortunately, i don't think this is a clean signal for kubeadm. this of course is under my assumption above about "resource constraints", in case someone finds another reason. let me know.
@neolit123 feel free to revert to the k8s-infra cluster if the current signal is not good enough for you.
@neolit123 my personal preference, as you know, would be to work through these issues, but I totally understand if you wish to just revert. I'd hate for us to not be able to use the resources we have in AWS (we are probably going to consume 1/3 of the budget this year and let the rest go unused, as it does not roll over to next year). We could try bumping CPU/memory limits if you think that will help?
Looking at https://prow.k8s.io/?job=*kubeadm-kinder*&state=failure, I think we were still tweaking things yesterday for the inotify stuff, so I'd only focus on things that failed after those changes landed as well. Please do compare it to the jobs that succeeded as well (https://prow.k8s.io/?job=*kubeadm-kinder*&state=success), but in the end it's your call. Thanks for considering the above arguments.
one concern about clean signal is that we plan to start working on a new api for kubeadm in 1.28 and later in 1.29. we could leave the latest jobs (1.28) in the k8s-infra cluster and the rest (the majority in numbers) running on EKS, until we resolve the issues.
but as mentioned above, to me this does seem like a VM resource sharing/limitation problem, based on the randomness of the flakes. while on the Google prow we never found a solution for that; it just happened. if the assumption is true, we need to ensure EKS can guarantee the needed resources. what can be done as an experiment, e.g. to bump VM resources, and who can help us debug?
these values were migrated from the k8s-infra prow to eks. it can be done as a test, but IIRC if the cloud provider infra is busy we won't be getting the requested values.
Thanks @dims for pinging me on this issue! 👋 Let me try to unpack it.
Those EKS VMs are beefy and heavily underutilized. To put some concrete numbers on it: we have 20 nodes in the cluster, and the node with the highest memory usage is at only 6%. CPU usage is in a similar state, ranging from a few percent up to 30% on some nodes. Also, those machines are
I heavily agree with @dims here, but after all, this is your call. I think we should work to solve those issues; using the EKS cluster should be the default if a job doesn't depend on GCP resources. We managed to solve most of the failures and flakes within a reasonable timeframe, and I think we can do the same here, but we need someone from SCL to help us. If you decide to revert, please leave some of those jobs as canaries, so that we can work on fixing those issues and track whether anything is getting better.
Can you please provide some context on how those jobs work? That's going to help us figure out which direction to go.
these jobs create a containerd-in-docker kubeadm cluster (similar to kind), usually with 3 control plane nodes and 2 worker nodes. then they perform some kubeadm-specific tests and optionally run the official k8s e2e test suite with tests that support parallel mode.
the inotify change yesterday did seem to improve the flakes, and given some of these jobs run only 2 times a day, we can give it more time, i guess.
looking at https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm, some of the flakes seem to be related to running the k8s e2e suite; those are likely due to timeouts. there are a good number of cases where the prow pod gets deleted, like here:
a note to kubeadm maintainers, and something not EKS related IMO (docker exec issue?), is that there are also some cases where
which we have seen before but it was never understood, and the fix was to retry all dir / file creations in our clustering tool.
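A minimal Go sketch of that retry-on-creation workaround, assuming a generic helper; the function name, attempt count, and delay below are hypothetical and not kinder's actual code:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// withRetry re-runs fn a few times with a short fixed delay, to paper over
// transient filesystem / "docker exec" hiccups in the nested-container setup.
func withRetry(attempts int, delay time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("still failing after %d attempts: %w", attempts, err)
}

func main() {
	// Example: wrap a directory creation that occasionally fails on the
	// flaky runs described above.
	err := withRetry(5, 500*time.Millisecond, func() error {
		return os.MkdirAll("/tmp/kinder-example/etc/kubernetes", 0o755)
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```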
Agree. Most of the tests passed in the last run, and
BTW, the warning message from the AWS runs:
can we solve this missing kernel config problem? here is how system-validators searches for it, and usually users don't see the problem with normal distros. without the kernel config, kubeadm cannot tell if this node is k8s / kubelet compliant (kernel features).
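For reference, a rough Go sketch of the "check a list of known locations" approach that this kernel-config validation takes. The exact path list lives in k8s.io/system-validators and may differ; the paths and helper names below are approximations for illustration only:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"os/exec"
	"strings"
)

// candidatePaths lists typical kernel config locations for a given kernel
// release (e.g. "5.15.0-1039-aws"). Treat this list as approximate.
func candidatePaths(release string) []string {
	return []string{
		"/proc/config.gz",         // present when CONFIG_IKCONFIG_PROC=y
		"/boot/config-" + release, // Debian/Ubuntu style
		"/usr/src/linux-" + release + "/.config",
		"/usr/src/linux/.config",
		"/lib/modules/" + release + "/config",
	}
}

// findKernelConfig returns the first readable config, transparently
// decompressing /proc/config.gz.
func findKernelConfig(release string) ([]byte, error) {
	for _, p := range candidatePaths(release) {
		data, err := os.ReadFile(p)
		if err != nil {
			continue
		}
		if strings.HasSuffix(p, ".gz") {
			zr, zerr := gzip.NewReader(bytes.NewReader(data))
			if zerr != nil {
				continue
			}
			unzipped, zerr := io.ReadAll(zr)
			if zerr != nil {
				continue
			}
			data = unzipped
		}
		return data, nil
	}
	return nil, fmt.Errorf("kernel config not found in any known location")
}

func main() {
	out, _ := exec.Command("uname", "-r").Output()
	cfg, err := findKernelConfig(strings.TrimSpace(string(out)))
	if err != nil {
		fmt.Println("validation would warn here:", err)
		return
	}
	fmt.Printf("found kernel config (%d bytes)\n", len(cfg))
}
```

If none of these paths is readable inside the test container, the validator has nothing to parse, which is why the warning shows up on the EKS nodes and why mounting the host's config into one of these locations (discussed below) should make it go away.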
@xmudrii @pkprzekwas i can confirm that
@BenTheElder @michelle192837 @xmcqueen how do we handle this in existing prow instances?
@dims @neolit123 Here's my understanding of the situation. system-validators are:
The issue is that
Which is correct, because Ubuntu doesn't store the kernel config there, but in
To my knowledge, this is an Ubuntu-specific thing, and we don't use Ubuntu on the existing Prow instances.
i think that's what kind is doing. one alternative on the kubeadm side is to just ignore the warning, but i don't think that's a good idea. |
we in fact use ubuntu instances at least for the community-owned clusters. I think it's ok to have those kernel modules exposed to our prowjobs in RO; the security risk is very low, but IMHO it's probably not worth it. Our ultimate goal is to have all the prowjobs running on k8s-infra (the community infrastructure), which was already the case before the migration to EKS.
I think not exposing those kernel modules can be a problem for kubeadm; we should probably have some coverage on that side too, but @neolit123 can weigh in on this.
yes, please try to mount the config in one of the known paths. it will be used to pass validation on required/optional kernel features.
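Since Prow presets just add pieces to the job's pod spec, here is a hedged sketch (expressed with the corev1 Go types rather than the Prow YAML, to stay in one language) of the kind of hostPath volume and read-only mount such a preset could add. The volume name and the choice of /boot are illustrative assumptions, not an agreed-upon config:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// The /boot directory must already exist on the node.
	hostPathDir := corev1.HostPathDirectory

	// Volume exposing the node's /boot, where Debian/Ubuntu keep
	// config-$(uname -r).
	vol := corev1.Volume{
		Name: "host-boot", // hypothetical name
		VolumeSource: corev1.VolumeSource{
			HostPath: &corev1.HostPathVolumeSource{
				Path: "/boot",
				Type: &hostPathDir,
			},
		},
	}

	// Read-only mount at the same path inside the test container, so the
	// validators can find /boot/config-<kernel-release>.
	mount := corev1.VolumeMount{
		Name:      "host-boot",
		MountPath: "/boot",
		ReadOnly:  true,
	}

	fmt.Printf("volume: %+v\nmount: %+v\n", vol, mount)
}
```

The pod shares the node's kernel, so whatever release uname -r reports inside the container matches the config file exposed from the host.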
@ameukam What do you think about creating a new preset (e.g.
Recent flakes are some job timeouts like the one below:
The flakes will not trigger an alert to https://groups.google.com/a/kubernetes.io/g/sig-cluster-lifecycle-kubeadm-alerts/ as the next try will pass.
also what i see as the most common failure.
We're investigating some stability issues that might cause this error; see kubernetes/k8s.io#5473 for more details.
+1 for this preset.
I checked the current status and found just one flake: https://testgrid.k8s.io/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-1-28-on-1-27
Before closing this issue, I want to confirm with @neolit123 @SataQiu @pacoxu: is that a known issue?
I remember this being a very common failure when docker hangs while deleting a container.
/close
@pacoxu: Closing this issue. In response to this:
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-1-25
https://k8s-testgrid.appspot.com/sig-cluster-lifecycle-kubeadm#kubeadm-kinder-1-27
Both are flaky, but I cannot see anything suspicious in the logs.
It's this e2e test that is flaky in 1-25 (not always):
[sig-node] Variable Expansion should fail substituting values in a volume subpath with absolute path [Slow] [Conformance]
but other e2e tests run successfully in the end.