
False-positive "Container has the same readiness and liveness probe" if failureThreshold differs #339

Open
AndiDog opened this issue Jan 4, 2021 · 12 comments



AndiDog commented Jan 4, 2021

Which version of kube-score are you using?

kube-score version: 1.10.0, commit: 95faa2a, built: 2020-11-07T14:17:50Z

What did you do?

readinessProbe:
  tcpSocket:
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:
  tcpSocket:
    port: http
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 12

in a Deployment.

What did you expect to see?

No error for this part, because a differing failureThreshold is a perfectly valid way to differentiate the liveness probe, for example to kill a container only if it hasn't been ready for an extended period of time.

What did you see instead?

[CRITICAL] Pod Probes
[...]
Container has the same readiness and liveness probe
zegl (Owner) commented Jan 5, 2021

Hey,

As I see it, this is by design, to avoid cascading failures. See README_PROBES.md for more information about why this check exists.

If you don't agree with this, and have examples of why you think that kube-score's recommendations can be harmful, please post them here as a comment, and we can have a chat about it.

AndiDog (Author) commented Jan 5, 2021

Indeed, I'd ask for my production example above to be treated as fine. failureThreshold: 3 in the readiness probe takes the pod out of rotation if it cannot receive traffic right now (in this case, if the port isn't open yet or anymore), while the liveness probe with failureThreshold: 12 takes care of killing the container if it hasn't become ready again within a longer time window. This has proven reasonable for applications that are generally expected to never get stuck, but that should still have a sane probe pattern attached even though no regular failure scenario has been observed. If a pod ever does get stuck (e.g. a deadlock or similar), the relevant containers get restarted after some time, which avoids the situation where one of the (typically) three pods in a deployment remains stuck indefinitely.

In my example, I'm talking about a simple FastCGI server container that only exposes that one port for probing. Forcing the use of a "different" probe can lead to worse outcomes: a developer could think "if I need to use a different type of probe, let's test whether the external database is reachable", which in turn creates a much higher risk of cascading failures.

zegl (Owner) commented May 7, 2021

Ok, I just saw that you submitted a PR that changes the behaviour of Pod Probes, and I'm not convinced.

The reasoning behind why kube-score tries to make sure that different probes are used is to make sure that the user understands what they are doing, and to avoid a misconfiguration that leads to unnecessary downtime.

Having two probes that differ only in initialDelaySeconds, periodSeconds, or failureThreshold effectively means that the only difference in behaviour is how the probe deals with time. And this is not what you want to do.

A key difference between a Readiness and a Liveness Probe is that the Readiness Probe is used when safely draining connections from a Pod, to make sure that it's unregistered from all Services/Endpoints/LBs/etc. before the service is stopped. It's fairly normal to implement a 120s (or more) draining period after receiving SIGTERM, during which the Readiness Probe is failing, but the service will still happily keep receiving new requests, until all clients have reacted to the Pod shutting down (by polling the DNS records, or similar). This draining, however, cannot happen on the Liveness Probe, as restarting the container in the middle of draining connections defeats its purpose.
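To make that concrete, a drain-friendly spec might look roughly like the sketch below. The 150s grace period, the 120s preStop sleep, and the /healthz path are illustrative assumptions (a real application might instead start failing its readiness endpoint after SIGTERM), not values taken from this issue:

spec:
  terminationGracePeriodSeconds: 150   # must cover the drain window below
  containers:
    - name: app
      readinessProbe:
        httpGet:
          path: /healthz               # hypothetical readiness endpoint
          port: http
        periodSeconds: 5
        failureThreshold: 3
      # Deliberately no identical livenessProbe: restarting the container
      # while it is still draining connections would defeat the drain.
      lifecycle:
        preStop:
          exec:
            # Keep serving while clients/LBs notice the shutdown
            # (assumes a sleep binary exists in the image).
            command: ["sleep", "120"]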

kube-score tries to make it as easy as possible for its users to spot this type of misconfiguration. Configuring an application to have two different probes, for the two different purposes, is never hard to do, and it makes the user spend some time reflecting on their probes and on what the consequences can be if they are misconfigured. And if they can't implement a new probe (for example, when running some third-party software), this check is easy to disable on a per-application basis with an annotation (but it hopefully still made you think!).

I might be wrong, but I'd like to see some examples of where having different time configurations has been the right thing to do, before taking the time to review your PR.

Thanks,
Gustav

@greenmaid

Hello,

In fact, there is a link in README_PROBES.md to "Liveness probes are dangerous" (srcco.de). There I read:

if you use Liveness Probe, don’t set the same specification for Liveness and Readiness Probe
you can use a Liveness Probe with the same health check, but a higher failureThreshold (e.g. mark as not-ready after 3 attempts and fail Liveness Probe after 10 attempts)
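In manifest terms, that advice translates to something like the following sketch (the /healthz path and the exact timings are illustrative assumptions):

readinessProbe:
  httpGet:
    path: /healthz        # hypothetical shared health endpoint
    port: http
  periodSeconds: 5
  failureThreshold: 3     # marked not-ready after ~15s of failures
livenessProbe:
  httpGet:
    path: /healthz        # same check as readiness...
    port: http
  periodSeconds: 5
  failureThreshold: 10    # ...but the container is only restarted after ~50s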

AndiDog (Author) commented Aug 28, 2021

Yes @greenmaid, that's exactly the main use case. So @zegl, any concerns around that?

vrusinov commented Feb 21, 2022

And if they can't implement a new probe (if you're running some third party software), this check is easy to be disabled on a per-application basis, with an annotation

The problem with this is that there is a single pod-probes check, and ignoring it disables all probe checks. I'd like to be able to ignore just the "has the same readiness and liveness probe" part.
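For reference, the per-object opt-out mentioned above looks roughly like this, assuming the kube-score/ignore annotation from the kube-score README; as noted, it can only skip the whole pod-probes check, not just the identical-probe part:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                        # hypothetical name
  annotations:
    kube-score/ignore: pod-probes     # currently skips every probe check for this object
spec:
  # ... rest of the Deployment unchanged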

rparree commented Aug 12, 2022

Any developments on this? I agree with @AndiDog and would like to "accept" the same probe with more lenient timing, just as the documentation explains.

@bgoareguer

The Kubernetes documentation on probes indicates that "A common pattern for liveness probes is to use the same low-cost HTTP endpoint as for readiness probes, but with a higher failureThreshold".

So I would also agree with updating kube-score accordingly.

mbyio commented Mar 6, 2024

In my use case, we have a service that sometimes gets stuck and stops responding to HTTP. I want to use the liveness probe to reboot automatically when this happens. I can't modify the code for legal reasons so I can't change the HTTP path etc. In my case, it makes sense to use the same probe for readiness and liveness, and set the liveness to a higher failure threshold.

Sure, it would be clearer if I could have different paths/probes. It would also be better if the app didn't get stuck. But I don't think that is related to the actual Kubernetes manifest, which is what kube-score is checking. That's the main reason I feel this check should be updated.

I agree with the commenters - if the liveness probe has a higher failure threshold than readiness, then it should be pretty safe, and it should be allowed by default. I could see some people wanting to opt-in to an even stricter check though, that's fine as long as it is optional.

@derek-gfs

This should really be changed, since as it stands today it contradicts the Kubernetes documentation.

@gruzchik (Contributor)

Hello guys! I have opened the following PR for this case: https://github.com/zegl/kube-score/pull/635/files. It separates the usual probe checks (existence of readiness/liveness probes etc.) from the check for the "same readiness and liveness probe". The identical-probe check still runs by default, but with this code it gets its own ID, so the "same readiness/liveness" rule can be excluded with the '--ignore-test pod-probes-identical' option.
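With that change, opting out would presumably look like this (a sketch based on the check ID from the PR; the manifest file name is just an example):

# Run all checks except the identical readiness/liveness check
kube-score score --ignore-test pod-probes-identical deployment.yaml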

@zegl, could you add your opinion and review this PR, please?

@AndiDog, @greenmaid, @vrusinov, @rparree, @bgoareguer, @mbyio, @derek-gfs: please add your reactions or comments here.

gruzchik (Contributor) commented Dec 4, 2024

Hi @AndiDog, do you have any concerns regarding this update? From my side, it is not convenient to use the same readiness and liveness probes everywhere, which is why I split this rule out so that it can be excluded. On the other hand, I can see that it is useful for some users, so it can be optional.

If you have any corrections regarding the syntax, feel free to update the PR or post them here.
