-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a way to detect and introspect check failures #103
Comments
Per discussion at our Frankfurt sprint, we'd like to use the existing changes/tasks system instead. Obviously we don't want to record every check result as a Pebble change object (the default is every 10s), but we could record changes in check state. So when the check goes from up to down we'd record a Ideally we'd start logging failures internally as soon as a check starts failing, and then only when the number of failures reaches the threshold (on the up to down transition) would we record a change and include that failure log in the task. The change summary could be the first failure log. Then when the check goes from down to up we'd record another change. Something like this:
|
Notes from meeting just now:
|
Per comment on #119 (comment), we're planning to do this with events (a new concept) instead, and have started spec JU048 to design that. |
Further update (and spec): we discussed this with Gustavo at the London scriptlets mini-sprint (16 Sep 2022), and decided on the following:
Below is how we mapped out changes and tasks for a check going into error and then recovering. We'd have two task states: success (task kind "perform check") and failing/recovering (task kind "recover check").
|
PR to fix this (implement health checks via changes and tasks): #409 |
Per spec JU073, this reimplements the health check manager to to use Changes and Tasks do drive the check operations. There are two change kinds, each of which has a single task of the same kind: * `perform-check`: used for driving the check while it's "up". The change (and task) finish when the number of failures hits the threshold, at which point it goes into Error status. Each check failure records a task log. * `recover-check`: used for driving the check while it's "down". The change (and task) finish when the check starts succeeding again, at which point it goes into Done status. Again, each check failure records a task log. We also add a new `change-id` field to the /v1/checks responses, allowing a client and the CLI to look up the change for each check (primarily to get the task logs for failing checks). The `pebble checks` command displays this as follows: ``` Check Level Status Failures Change chk1 - up 0/3 1 chk2 - down 1/1 2 (cannot perform check: blah blah error) chk3 alive down 42/3 3 (this is a long truncated error messag... run "pebble tasks 3" for more) ``` We (@hpidcock, @jameinel, and I) debated whether it should switch from perform-check to recover-check after the first failure or only when it hits the threshold (which is how I had the code originally). We decided in the end that it's better and more obvious to switch when it hits the threshold, as otherwise the check status is out of sync with the change kind, and you need a third state. We figured in a Juju context, for example, your alive/ready checks would normally be separate from the change-update checks you want the charm to be notified about. And if you do want them to be the same check, you can easily add a similar check with a different threshold. Here's an example of that flow (taken from the spec): ``` initial state: change 1 - Doing task 1: perform-check - Doing on first error: change 1 - Error task 1: perform-check - Error (contains first failure log(s) - up to 10) change 2 - Doing task 2: recover-check - Doing on second (or subsequent) errors: change 1 - Error task 1: perform-check - Error change 2 - Doing task 2: recover-check - Doing (contains last failure log(s) - up to 10) now on success: change 1 - Error task 1: perform-check - Error change 2 - Done task 2: recover-check - Done (keeps last failure log(s) - up to 10) change 3 - Doing task 3: perform-check - Doing ``` As far as the checks system concerns, this is an implementation detail. However, the use of changes and tasks mean that the status of a check operation and check failures can be introspected (via the existing changes API). Fixes #103.
Per thread at #86 (comment), I previously had
LastError
andErrorDetails
fields on theCheckInfo
API struct. However, this was a bit ad-hoc and didn't fit our existing constructs like task logs. I've pulled those out of that PR for now, but we should design a better approach.It would seem quite heavy to just use Pebble "changes" and "tasks" for checks. If we did that, I think each check would have to be a change (with one task?) and that would flood the changes list. Besides, these aren't really state changes, they're checks. ("Change represents a tracked modification to the system state.")
So instead I suggest (idea from @niemeyer) using the same "log" concept that a task has, which is a
Log []string
field, and adding that to theCheckInfo
object. We could keep the last 10 like aTask
does. Each log string would be a timestamp-formatted string with a prefix likeERROR
and a message, like so:This field would be in the
CheckInfo
struct for a particular named check, so no further context is needed. The log would contain check failure errors and the details that was previously inErrorDetails
(truncate HTTP body forhttp
checks, last few line of command output forexec
checks).The
/v1/checks
API would return this information as a list of strings. The CLIpebble checks
could return the (possibly truncated) first log line by default, and the full log if you specifypebble checks --verbose
. Something like this:It's an open question whether the log would be cleared/reset when the check became healthy again. Probably not, as it's a log/history?
The text was updated successfully, but these errors were encountered: