
Add a way to detect and introspect check failures #103

Closed
benhoyt opened this issue Feb 11, 2022 · 5 comments · Fixed by #409
Labels
Feature A feature request

Comments

benhoyt commented Feb 11, 2022

Per thread at #86 (comment), I previously had LastError and ErrorDetails fields on the CheckInfo API struct. However, this was a bit ad-hoc and didn't fit our existing constructs like task logs. I've pulled those out of that PR for now, but we should design a better approach.

It would seem quite heavy to just use Pebble "changes" and "tasks" for checks. If we did that, I think each check would have to be a change (with one task?) and that would flood the changes list. Besides, these aren't really state changes, they're checks. ("Change represents a tracked modification to the system state.")

So instead I suggest (idea from @niemeyer) using the same "log" concept that a task has, which is a Log []string field, and adding that to the CheckInfo object. We could keep the last 10 like a Task does. Each log string would be a timestamp-formatted string with a prefix like ERROR and a message, like so:

2022-02-22T11:40:53+13:00 ERROR timed out after 100ms: context deadline exceeded
2022-02-22T11:50:53+13:00 ERROR timed out after 100ms: context deadline exceeded

This field would be in the CheckInfo struct for a particular named check, so no further context is needed. The log would contain check failure errors and the details that were previously in ErrorDetails (truncated HTTP body for http checks, last few lines of command output for exec checks).
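
For illustration, here's a rough sketch of how the proposed field might sit on CheckInfo. Only the Log field is the proposal here; the other fields are just placeholders standing in for the existing check fields, so treat the exact shape as an assumption rather than a design:

```go
package client

// CheckInfo sketch -- illustrative only, not the final design.
type CheckInfo struct {
	Name      string `json:"name"`
	Level     string `json:"level"`
	Status    string `json:"status"`
	Failures  int    `json:"failures"`
	Threshold int    `json:"threshold"`

	// Log would hold the last N (say 10) failure entries, each formatted
	// as "<RFC3339 timestamp> ERROR <message>", mirroring task logs.
	Log []string `json:"log,omitempty"`
}
```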

The /v1/checks API would return this information as a list of strings. The CLI pebble checks could return the (possibly truncated) first log line by default, and the full log if you specify pebble checks --verbose. Something like this:

$ pebble checks
Check  Level  Status  Failures  Log
chk1   -      up      0/3
chk2   alive  down    2/2       2022-02-22T11:40:53+13:00 ERROR timed out after 100ms...

$ pebble checks --verbose
Check  Level  Status  Failures
chk1   -      up      0/3
chk2   alive  down    2/2
    2022-02-22T11:40:53+13:00 ERROR timed out after 100ms: context deadline exceeded 
    2022-02-22T11:50:53+13:00 ERROR timed out after 100ms: context deadline exceeded 

It's an open question whether the log would be cleared/reset when the check became healthy again. Probably not, as it's a log/history?

benhoyt commented Mar 9, 2022

Per discussion at our Frankfurt sprint, we'd like to use the existing changes/tasks system instead. Obviously we don't want to record every check result as a Pebble change object (the default is every 10s), but we could record changes in check state. So when the check goes from up to down we'd record a check-down change (it's a change in state, after all), and then when it goes from down to up we'd record a check-up change. Because there's already a bit of hysteresis involved with the failure threshold (default is 3), this shouldn't cause too many changes to be recorded even if the check is flapping up and down.

Ideally we'd start logging failures internally as soon as a check starts failing, and then only when the number of failures reaches the threshold (on the up to down transition) would we record a change and include that failure log in the task. The change summary could be the first failure log. Then when the check goes from down to up we'd record another change.

Something like this:

$ pebble changes
ID   Status  Spawn                   Ready                   Summary
193  Done    today at 11:48 CET      today at 11:48 CET      Check "nginx" down: timed out after 100ms...
194  Done    today at 14:05 CET      today at 14:05 CET      Check "nginx" up

$ pebble tasks 193
Status  Spawn                   Ready                   Summary
Done    today at 11:48 CET      today at 11:48 CET      Check "nginx" down

......................................................................
2022-03-09T11:48:33+13:00 ERROR timed out after 100ms: context deadline exceeded 
2022-03-09T11:48:43+13:00 ERROR timed out after 100ms: context deadline exceeded 
2022-03-09T11:48:53+13:00 ERROR timed out after 100ms: context deadline exceeded 

benhoyt commented Mar 29, 2022

Notes from meeting just now:

  • Similar to issue #104 (Add a way to introspect failing/restarting services).
  • The design struggle here is that it's somewhat different from snapd: there we record what we intend to do and persist it to disk across restarts, here it's more an in-memory record. Let's be careful to be consistent that a change represents an action (that has taken place or will take place), and similarly for tasks. A change might be "Auto-refresh N snaps"; tasks might be "Ensure prerequisites are available", "Run service command X", "Stop snap", and so on. Come up with a way to phrase it so it feels like something being done just now.
  • Perhaps along the lines of "Recover from check failure".
  • Maybe make the on-check-failure action a task inside the change. (Then again, maybe not -- that's configured on the service so doesn't belong here?)
  • Probably multiple tasks, similar to issue #104.

benhoyt changed the title from "Add API field to see more detailed info about check failures" to "Add a way to detect and introspect check failures" on Jun 20, 2022

benhoyt commented Jun 20, 2022

Per comment on #119 (comment), we're planning to do this with events (a new concept) instead, and have started spec JU048 to design that.

benhoyt commented Jan 25, 2023

Further update (and spec): we discussed this with Gustavo at the London scriptlets mini-sprint (16 Sep 2022), and decided on the following:

  • We want to invert what Ben did previously when connecting changes+tasks to check failure and service failure. There, we created changes and tasks from the check/service code; instead, we want to make the actual checks happen as tasks ("do" method etc).
  • We want a "pointer" to the check's latest change from the check so that when you type "pebble checks" it can direct you to "pebble changes". (Or we could query changes dynamically if that works better.)
  • We don't record each subsequent failure as a task – there'd be too many – just the first and last failure logs, as shown below.
  • Be careful with the pruning of aborted tasks – we want a way to say "this task is long-running" or "can't be aborted".
  • Probably want to increase our 24-hour prune time anyway – that's quite aggressive in a Pebble context. Perhaps 7 days.

Below is how we mapped out changes and tasks for a check going into error and then recovering. There would be two task kinds: "perform check" while the check is passing, and "recover check" while it's failing/recovering.

initial state:
  change 1 - Doing
    task 1: perform check - Doing

on first error:
  change 1 - Error
    task 1: perform check - Error (contains first failure log)
  change 2 - Doing
    task 2: recover check - Doing

on second (or subsequent) errors:
  change 1 - Error
    task 1: perform check - Error
  change 2 - Doing
    task 2: recover check - Doing (contains last failure log)

now on success:
  change 1 - Error
    task 1: perform check - Error
  change 2 - Done
    task 2: recover check - Done (keeps last failure log)
  change 3 - Doing
    task 3: perform check - Doing
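
To make the mapping above concrete, here's a minimal, self-contained sketch that walks through the same transitions. All names are hypothetical and this is not Pebble's actual code, just the spec's state flow written out:

```go
package main

import "fmt"

type task struct {
	id     int
	kind   string // "perform check" or "recover check"
	status string // "Doing", "Error" or "Done"
	log    []string
}

type checkRunner struct {
	tasks   []*task
	current *task
}

func (r *checkRunner) spawn(kind string) {
	t := &task{id: len(r.tasks) + 1, kind: kind, status: "Doing"}
	r.tasks = append(r.tasks, t)
	r.current = t
}

// record applies one check result, following the mapping above.
func (r *checkRunner) record(err error) {
	switch {
	case err == nil && r.current.kind == "recover check":
		// Check healthy again: recover check is Done (keeping its last
		// failure log) and a fresh perform check starts.
		r.current.status = "Done"
		r.spawn("perform check")
	case err != nil && r.current.kind == "perform check":
		// First error: perform check goes to Error with the failure log,
		// and a recover check takes over.
		r.current.log = append(r.current.log, err.Error())
		r.current.status = "Error"
		r.spawn("recover check")
	case err != nil && r.current.kind == "recover check":
		// Subsequent errors: keep only the most recent failure log.
		r.current.log = []string{err.Error()}
	}
}

func main() {
	r := &checkRunner{}
	r.spawn("perform check") // initial state: change 1 / task 1 in Doing
	r.record(fmt.Errorf("timed out after 100ms: context deadline exceeded"))
	r.record(fmt.Errorf("timed out after 100ms: context deadline exceeded"))
	r.record(nil) // recovery
	for _, t := range r.tasks {
		fmt.Printf("task %d: %s - %s %v\n", t.id, t.kind, t.status, t.log)
	}
}
```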

benhoyt commented Apr 18, 2024

PR to fix this (implement health checks via changes and tasks): #409

benhoyt added a commit that referenced this issue Apr 24, 2024
Per spec JU073, this reimplements the health check manager to use
Changes and Tasks to drive the check operations. There are two change
kinds, each of which has a single task of the same kind:

* `perform-check`: used for driving the check while it's "up". The
change (and task) finish when the number of failures hits the threshold,
at which point it goes into Error status. Each check failure records a
task log.
* `recover-check`: used for driving the check while it's "down". The
change (and task) finish when the check starts succeeding again, at
which point it goes into Done status. Again, each check failure records
a task log.

We also add a new `change-id` field to the /v1/checks responses,
allowing a client and the CLI to look up the change for each check
(primarily to get the task logs for failing checks). The `pebble checks`
command displays this as follows:

```
Check  Level  Status  Failures  Change
chk1   -      up      0/3       1
chk2   -      down    1/1       2 (cannot perform check: blah blah error)
chk3   alive  down    42/3      3 (this is a long truncated error messag... run "pebble tasks 3" for more)
```
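
For context, a minimal sketch of reading the new field straight off the API over the unix socket (the socket path and the `{"result": ...}` response envelope are assumptions here; the Go client package or the CLI is the supported route):

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"net"
	"net/http"
)

func main() {
	const socket = "/var/lib/pebble/default/.pebble.socket" // assumed default

	// HTTP client that dials the Pebble unix socket instead of TCP.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", socket)
			},
		},
	}

	resp, err := client.Get("http://localhost/v1/checks")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Decode only the fields we care about; names here mirror the API
	// fields described above, with change-id being the new addition.
	var body struct {
		Result []struct {
			Name     string `json:"name"`
			Status   string `json:"status"`
			Failures int    `json:"failures"`
			ChangeID string `json:"change-id"`
		} `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		log.Fatal(err)
	}
	for _, c := range body.Result {
		fmt.Printf("%s: %s (%d failures), change %s\n", c.Name, c.Status, c.Failures, c.ChangeID)
	}
}
```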

We (@hpidcock, @jameinel, and I) debated whether it should switch from
perform-check to recover-check after the first failure or only when it
hits the threshold (which is how I had the code originally). We decided
in the end that it's better and more obvious to switch when it hits the
threshold, as otherwise the check status is out of sync with the change
kind, and you need a third state. We figured in a Juju context, for
example, your alive/ready checks would normally be separate from the
change-update checks you want the charm to be notified about. And if you
do want them to be the same check, you can easily add a similar check
with a different threshold.
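
For example, two checks against the same endpoint with different thresholds could look roughly like this in a layer (names and URL are made up):

```
checks:
    svc-alive:
        override: replace
        level: alive
        period: 10s
        threshold: 3
        http:
            url: http://127.0.0.1:8080/health
    svc-notify:
        override: replace
        period: 10s
        threshold: 1
        http:
            url: http://127.0.0.1:8080/health
```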

Here's an example of that flow (taken from the spec):

```
initial state:
  change 1 - Doing
    task 1: perform-check - Doing

on first error:
  change 1 - Error
    task 1: perform-check - Error (contains first failure log(s) - up to 10)
  change 2 - Doing
    task 2: recover-check - Doing

on second (or subsequent) errors:
  change 1 - Error
    task 1: perform-check - Error
  change 2 - Doing
    task 2: recover-check - Doing (contains last failure log(s) - up to 10)

now on success:
  change 1 - Error
    task 1: perform-check - Error
  change 2 - Done
    task 2: recover-check - Done (keeps last failure log(s) - up to 10)
  change 3 - Doing
    task 3: perform-check - Doing
```

As far as the checks system is concerned, this is an implementation
detail. However, the use of changes and tasks means that the status of
a check operation and check failures can be introspected (via the
existing changes API).

Fixes #103.