
Add a way to detect and introspect check failures #103

Closed
benhoyt opened this issue Feb 11, 2022 · 5 comments · Fixed by #409
Labels
Feature A feature request

Comments

benhoyt commented Feb 11, 2022

Per thread at #86 (comment), I previously had LastError and ErrorDetails fields on the CheckInfo API struct. However, this was a bit ad-hoc and didn't fit our existing constructs like task logs. I've pulled those out of that PR for now, but we should design a better approach.

It would seem quite heavy to just use Pebble "changes" and "tasks" for checks. If we did that, I think each check would have to be a change (with one task?) and that would flood the changes list. Besides, these aren't really state changes, they're checks. ("Change represents a tracked modification to the system state.")

So instead I suggest (idea from @niemeyer) using the same "log" concept that a task has, which is a Log []string field, and adding that to the CheckInfo object. We could keep the last 10 like a Task does. Each log string would be a timestamp-formatted string with a prefix like ERROR and a message, like so:

2022-02-22T11:40:53+13:00 ERROR timed out after 100ms: context deadline exceeded
2022-02-22T11:50:53+13:00 ERROR timed out after 100ms: context deadline exceeded

This field would be in the CheckInfo struct for a particular named check, so no further context is needed. The log would contain check failure errors and the details that were previously in ErrorDetails (truncated HTTP body for http checks, last few lines of command output for exec checks).
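
For illustration, here's a rough sketch of how the proposed field might sit on CheckInfo. Only the Log field is the proposal here; the other fields are just placeholders standing in for the existing check fields, so treat the exact shape as an assumption rather than a design:

```go
package client

// CheckInfo sketch -- illustrative only, not the final design.
type CheckInfo struct {
	Name      string `json:"name"`
	Level     string `json:"level"`
	Status    string `json:"status"`
	Failures  int    `json:"failures"`
	Threshold int    `json:"threshold"`

	// Log would hold the last N (say 10) failure entries, each formatted
	// as "<RFC3339 timestamp> ERROR <message>", mirroring task logs.
	Log []string `json:"log,omitempty"`
}
```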

The /v1/checks API would return this information as a list of strings. The CLI pebble checks could return the (possibly truncated) first log line by default, and the full log if you specify pebble checks --verbose. Something like this:

$ pebble checks
Check  Level  Status  Failures  Log
chk1   -      up      0/3
chk2   alive  down    2/2       2022-02-22T11:40:53+13:00 ERROR timed out after 100ms...

$ pebble checks --verbose
Check  Level  Status  Failures
chk1   -      up      0/3
chk2   alive  down    2/2
    2022-02-22T11:40:53+13:00 ERROR timed out after 100ms: context deadline exceeded 
    2022-02-22T11:50:53+13:00 ERROR timed out after 100ms: context deadline exceeded 

It's an open question whether the log would be cleared/reset when the check became healthy again. Probably not, as it's a log/history?

benhoyt commented Mar 9, 2022

Per discussion at our Frankfurt sprint, we'd like to use the existing changes/tasks system instead. Obviously we don't want to record every check result as a Pebble change object (the default is every 10s), but we could record changes in check state. So when the check goes from up to down we'd record a check-down change (it's a change in state, after all), and then when it goes from down to up we'd record a check-up change. Because there's already a bit of hysteresis involved with the failure threshold (default is 3), this shouldn't cause too many changes to be recorded even if the check is flapping up and down.

Ideally we'd start logging failures internally as soon as a check starts failing, and then only when the number of failures reaches the threshold (on the up to down transition) would we record a change and include that failure log in the task. The change summary could be the first failure log. Then when the check goes from down to up we'd record another change.

Something like this:

$ pebble changes
ID   Status  Spawn                   Ready                   Summary
193  Done    today at 11:48 CET      today at 11:48 CET      Check "nginx" down: timed out after 100ms...
194  Done    today at 14:05 CET      today at 14:05 CET      Check "nginx" up

$ pebble tasks 193
Status  Spawn                   Ready                   Summary
Done    today at 11:48 CET      today at 11:48 CET      Check "nginx" down

......................................................................
2022-03-09T11:48:33+13:00 ERROR timed out after 100ms: context deadline exceeded 
2022-03-09T11:48:43+13:00 ERROR timed out after 100ms: context deadline exceeded 
2022-03-09T11:48:53+13:00 ERROR timed out after 100ms: context deadline exceeded 

benhoyt commented Mar 29, 2022

Notes from meeting just now:

  • Similar to issue #104 (Add a way to introspect failing/restarting services).
  • The design struggle here is that it's somewhat different from snapd: there we record what we intend to do and persist it to disk across restarts, here it's more an in-memory record. Let's be careful to be consistent that a change represents an action (that has taken place or will take place), and similarly for tasks. A change might be "Auto-refresh N snaps"; tasks might be "Ensure prerequisites are available", "Run service command X", "Stop snap", and so on. Come up with a way to phrase it so it feels like something being done just now.
  • Perhaps along the lines of "Recover from check failure".
  • Maybe make the on-check-failure action a task inside the change. (Then again, maybe not -- that's configured on the service so doesn't belong here?)
  • Probably multiple tasks, similar to issue #104.

benhoyt changed the title from "Add API field to see more detailed info about check failures" to "Add a way to detect and introspect check failures" on Jun 20, 2022

benhoyt commented Jun 20, 2022

Per comment on #119 (comment), we're planning to do this with events (a new concept) instead, and have started spec JU048 to design that.

benhoyt commented Jan 25, 2023

Further update (and spec): we discussed this with Gustavo at the London scriptlets mini-sprint (16 Sep 2022), and decided on the following:

  • We want to invert what Ben did previously when connecting changes+tasks to check failure and service failure. There, we created changes and tasks from the check/service code; instead, we want to make the actual checks happen as tasks ("do" method etc).
  • We want a "pointer" to the check's latest change from the check so that when you type "pebble checks" it can direct you to "pebble changes". (Or we could query changes dynamically if that works better.)
  • We don't record each subsequent failure as a task – there'd be too many – just the first and last failure logs, as shown below.
  • Be careful with the pruning of aborted tasks – we want a way to say "this task is long-running" or "can't be aborted".
  • Probably want to increase our 24-hour prune time anyway – that's quite aggressive in a Pebble context. Perhaps 7 days.

Below is how we mapped out changes and tasks for a check going into error and then recovering. There would be two task kinds: "perform check" while the check is passing, and "recover check" while it's failing/recovering.

initial state:
  change 1 - Doing
    task 1: perform check - Doing

on first error:
  change 1 - Error
    task 1: perform check - Error (contains first failure log)
  change 2 - Doing
    task 2: recover check - Doing

on second (or subsequent) errors:
  change 1 - Error
    task 1: perform check - Error
  change 2 - Doing
    task 2: recover check - Doing (contains last failure log)

now on success:
  change 1 - Error
    task 1: perform check - Error
  change 2 - Done
    task 2: recover check - Done (keeps last failure log)
  change 3 - Doing
    task 3: perform check - Doing
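
To make the mapping above concrete, here's a minimal, self-contained sketch that walks through the same transitions. All names are hypothetical and this is not Pebble's actual code, just the spec's state flow written out:

```go
package main

import "fmt"

type task struct {
	id     int
	kind   string // "perform check" or "recover check"
	status string // "Doing", "Error" or "Done"
	log    []string
}

type checkRunner struct {
	tasks   []*task
	current *task
}

func (r *checkRunner) spawn(kind string) {
	t := &task{id: len(r.tasks) + 1, kind: kind, status: "Doing"}
	r.tasks = append(r.tasks, t)
	r.current = t
}

// record applies one check result, following the mapping above.
func (r *checkRunner) record(err error) {
	switch {
	case err == nil && r.current.kind == "recover check":
		// Check healthy again: recover check is Done (keeping its last
		// failure log) and a fresh perform check starts.
		r.current.status = "Done"
		r.spawn("perform check")
	case err != nil && r.current.kind == "perform check":
		// First error: perform check goes to Error with the failure log,
		// and a recover check takes over.
		r.current.log = append(r.current.log, err.Error())
		r.current.status = "Error"
		r.spawn("recover check")
	case err != nil && r.current.kind == "recover check":
		// Subsequent errors: keep only the most recent failure log.
		r.current.log = []string{err.Error()}
	}
}

func main() {
	r := &checkRunner{}
	r.spawn("perform check") // initial state: change 1 / task 1 in Doing
	r.record(fmt.Errorf("timed out after 100ms: context deadline exceeded"))
	r.record(fmt.Errorf("timed out after 100ms: context deadline exceeded"))
	r.record(nil) // recovery
	for _, t := range r.tasks {
		fmt.Printf("task %d: %s - %s %v\n", t.id, t.kind, t.status, t.log)
	}
}
```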

benhoyt commented Apr 18, 2024

PR to fix this (implement health checks via changes and tasks): #409

benhoyt added a commit that referenced this issue Apr 24, 2024
Per spec JU073, this reimplements the health check manager to use
Changes and Tasks to drive the check operations. There are two change
kinds, each of which has a single task of the same kind:

* `perform-check`: used for driving the check while it's "up". The
change (and task) finish when the number of failures hits the threshold,
at which point it goes into Error status. Each check failure records a
task log.
* `recover-check`: used for driving the check while it's "down". The
change (and task) finish when the check starts succeeding again, at
which point it goes into Done status. Again, each check failure records
a task log.

We also add a new `change-id` field to the /v1/checks responses,
allowing a client and the CLI to look up the change for each check
(primarily to get the task logs for failing checks). The `pebble checks`
command displays this as follows:

```
Check  Level  Status  Failures  Change
chk1   -      up      0/3       1
chk2   -      down    1/1       2 (cannot perform check: blah blah error)
chk3   alive  down    42/3      3 (this is a long truncated error messag... run "pebble tasks 3" for more)
```
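
For context, a minimal sketch of reading the new field straight off the API over the unix socket (the socket path and the `{"result": ...}` response envelope are assumptions here; the Go client package or the CLI is the supported route):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net"
	"net/http"
)

func main() {
	const socket = "/var/lib/pebble/default/.pebble.socket" // assumed default

	// HTTP client that dials the Pebble unix socket instead of TCP.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", socket)
			},
		},
	}

	resp, err := client.Get("http://localhost/v1/checks")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Decode only the fields we care about; names here mirror the API
	// fields described above, with change-id being the new addition.
	var body struct {
		Result []struct {
			Name     string `json:"name"`
			Status   string `json:"status"`
			Failures int    `json:"failures"`
			ChangeID string `json:"change-id"`
		} `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		log.Fatal(err)
	}
	for _, c := range body.Result {
		fmt.Printf("%s: %s (%d failures), change %s\n", c.Name, c.Status, c.Failures, c.ChangeID)
	}
}
```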

We (@hpidcock, @jameinel, and I) debated whether it should switch from
perform-check to recover-check after the first failure or only when it
hits the threshold (which is how I had the code originally). We decided
in the end that it's better and more obvious to switch when it hits the
threshold, as otherwise the check status is out of sync with the change
kind, and you need a third state. We figured in a Juju context, for
example, your alive/ready checks would normally be separate from the
change-update checks you want the charm to be notified about. And if you
do want them to be the same check, you can easily add a similar check
with a different threshold.
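
For example, two checks against the same endpoint with different thresholds could look roughly like this in a layer (names and URL are made up):

```
checks:
    svc-alive:
        override: replace
        level: alive
        period: 10s
        threshold: 3
        http:
            url: http://127.0.0.1:8080/health
    svc-notify:
        override: replace
        period: 10s
        threshold: 1
        http:
            url: http://127.0.0.1:8080/health
```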

Here's an example of that flow (taken from the spec):

```
initial state:
  change 1 - Doing
    task 1: perform-check - Doing

on first error:
  change 1 - Error
    task 1: perform-check - Error (contains first failure log(s) - up to 10)
  change 2 - Doing
    task 2: recover-check - Doing

on second (or subsequent) errors:
  change 1 - Error
    task 1: perform-check - Error
  change 2 - Doing
    task 2: recover-check - Doing (contains last failure log(s) - up to 10)

now on success:
  change 1 - Error
    task 1: perform-check - Error
  change 2 - Done
    task 2: recover-check - Done (keeps last failure log(s) - up to 10)
  change 3 - Doing
    task 3: perform-check - Doing
```

As far as the checks system is concerned, this is an implementation
detail. However, the use of changes and tasks means that the status of
a check operation and check failures can be introspected (via the
existing changes API).

Fixes #103.