From some conversations on Discord - it seems like this probe might be a bit too aggressive:
Hi @everyone, I have a small question about the Persistence healthchecks. I think they have changed and now clean up their snapshots (https://github.com/petabridge/akkadotnet-healthcheck). That cleanup sometimes fails with a 404. At the same time the healthcheck seems to fail; I'm unsure whether that is because the delete failed or because the creation failed, but it brings down the container running Akka.NET. Is there a way to add fault tolerance for this? If I add fault tolerance to the container healthchecks, all healthchecks get that extra tolerance, which might not be wanted.
Aaronontheweb — 04/24/2024 8:27 AM
cc @Arkatufus - we just made a bunch of bug fixes to these because they were throwing off false positives at startup @kupo1309
do you have a lot of load at startup or something @kupo1309 ? Or does this probe just fail eventually later
kupo1309 — 04/24/2024 8:43 AM
no this is after days/weeks of running, it seems
so my guess is that it is in fact a transient issue in the azurestorage
we are using 1.5.18
Arkatufus — 04/24/2024 9:13 AM
@kupo1309 you can add a layer of resiliency on top of it, like, it needs to fail twice or 3 times in a row before being killed?
kupo1309 — 04/24/2024 9:17 AM
it's 3 by default indeed
I am upping it to 10, but it does seem like it failed a few times, then it got disassociated, and then it got restarted.
TL;DR: we might need this probe to fail several times in a row before we mark the node as unhealthy. Failing at the first sign of trouble seems to compound the problems that busy systems are already having.
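For illustration, the "fail several times in a row" idea could look roughly like the sketch below. It is a language-agnostic Python sketch, not the Akka.HealthCheck API: `ConsecutiveFailureGate`, `threshold`, and `record` are hypothetical names, and the threshold of 3 is only an example value.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureGate:
    """Reports unhealthy only after `threshold` raw probe failures in a row.

    Hypothetical helper for illustration; not part of any library.
    """
    threshold: int = 3  # example value, an assumption
    failures: int = 0   # current run of consecutive failures

    def record(self, probe_succeeded: bool) -> bool:
        """Record one raw probe result; return True while still reported healthy."""
        if probe_succeeded:
            # Any success resets the run, so isolated transient errors are absorbed.
            self.failures = 0
        else:
            self.failures += 1
        return self.failures < self.threshold


gate = ConsecutiveFailureGate(threshold=3)
# Two transient failures, a recovery, then three failures in a row.
raw_results = [True, False, False, True, False, False, False]
reported = [gate.record(result) for result in raw_results]
```

With this shape, only the final result (the third consecutive failure) is reported as unhealthy. Keeping the counter inside the persistence probe, rather than raising the container-level failure threshold, would leave the tolerance of all other healthchecks unchanged.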