What would you like to be added:
The default kubelet configuration should set dedicated systemReserved values for CPU and memory.
Something like the following should be added to the kubelet-config.json:
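As a rough sketch, the fragment might look like the block below. The values are illustrative assumptions, not sizing recommendations from this issue, and would need tuning per instance type. Only the systemReserved field of the KubeletConfiguration is shown; the rest of kubelet-config.json is assumed to stay unchanged.

```json
{
  "systemReserved": {
    "cpu": "100m",
    "memory": "500Mi"
  }
}
```

Setting systemReserved lowers the node's allocatable resources, so pods are capped below the full capacity of the machine and some headroom is left for system daemons.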
Why is this needed:
We are experiencing nodes going from Ready to NotReady when the node has high memory pressure caused by pods with unset memory limits.
We expect the kubelet to kill pods that over-allocate, or to evict the over-committed pods when other pods arrive with requests.
Instead, in these high memory pressure cases the entire node seems to die when using the default EKS AMI configuration. The kubelet doesn't report back to the API server, and we also cannot connect to the nodes with SSM or Instance Connect.
Adding systemReserved configuration through --kubelet-extra-args might be an acceptable workaround, but it seems like something that should be configured by default on the nodes, so that even if the kubelet becomes unresponsive, services in the system.slice keep working and one can troubleshoot the node.