Checks/health API and "pebble checks" CLI command #86

benhoyt · 2021-11-25T03:23:06Z

This adds the API and CLI parts of the new custom health checks work, specifically:

An API endpoint at /v1/checks that returns detailed check information, allowing filtering by check level and name.
A Checks method on the Go client that hits this new API.
A CLI command pebble checks that shows info from the checks API in a tabulated textual list.
A plain HTTP API (over TCP rather than Unix socket) to serve the "guest" endpoints, such as the new /v1/health endpoint intended for use in Kubernetes probes. Use the --http command line option to start the guest HTTP API server.
Tests for all of the above.

* Make exec show *last* 1KB instead of first * Include first 1KB of response for http check errors

niemeyer

This is looking quite nice, thanks Ben. There are a few comments below that will need some syncing. I don't know what's more helpful on your side: splitting it up in smaller parts and merging them more quickly (there are clearly lots of independent pieces here, per the PR description), or going back and forth on these points for a bit so it can be merged all at once. Both work for me, but I'd suggest on future cases breaking it down a bit further so it feels faster and the quick wins perhaps a bit more rewarding on your end.

Either way, thanks for these changes. Clearly good work as a whole.

niemeyer · 2022-01-31T18:18:34Z

client/checks.go

+	Healthy      bool       `json:"healthy"`
+	Failures     int        `json:"failures,omitempty"`
+	LastError    string     `json:"last-error,omitempty"`
+	ErrorDetails string     `json:"error-details,omitempty"`


We didn't spend much time on the exposure of such information, and we'd probably benefit from an exchange at this point since we're making this public via APIs, so won't be able to regret from now on. For example, we have an entire state tracking system, that includes tasks, metadata, logging, timing, persistence, execution, etc, all nicely associated with specific events. Yet we've been side-stepping it for our recent features. It sounds like this is relevant here, as there's a conflict between presenting the current state ("System is healthy") vs. understanding what happened in the past ("We had 5 failures, the last one was 10 minutes ago, and here is the log for that last one"). This API is a nice entry point into checks, but we'd probably benefit from not redoing the entire state tracking system again here.

As an aside that probably goes away if we address that point, the field naming should probably be considered a bit further here before putting it into a stable API. Note:

Failures (fine, matches the plan)

LastError (not Failure?)

ErrorDetails (not Last?)

Instead of just renaming it, though, and having a loosely defined idea of what is "the error" and "more details of the error", what if we had just a single "Log" entry, for insance? That's similar to what we have in our state system already, IIRC, and seems to work well.

Instead of agreeing on all these details right away, perhaps we should just strip down this PR a bit by dropping the two error string fields for now. We can come back into it after the foundation is in, and have a look at making use of the state system for the tracking of previous issues.

Yeah, fair enough. I've removed LastError and ErrorDetails for now and opened #103 to track a better approach.

I'll dig into this a bit further and update that issue.

client/checks.go

client/checks_test.go

niemeyer · 2022-01-31T18:47:22Z

client/services.go

+	Name     string         `json:"name"`
+	Startup  ServiceStartup `json:"startup"`
+	Current  ServiceStatus  `json:"current"`
+	Restarts int            `json:"restarts"`


Similar to the other comment, we need some documentation explaining what this field does as it's not obvious from context. When is it this value zeroed over the service lifetime?

Similar to the earlier point, I'm also wondering about exposing something we don't quite understand the use case for yet. In particular, saying the service got restarted 19823 times vs. 19824 is not super useful. What is the admin trying to do, and how do we accommodate it in the best way possible?

To be clear, it's okay that we have this data internally, but it'd be nice to clarify/polish it a bit before we expose it in the API.

Similar to above, I've removed ServiceInfo.Restarts for now and opened #104 to track discussion. I actually think this field will be quite useful for debugging. Care to read my comments there and add your thoughts?

cmd/pebble/cmd_checks.go

internal/daemon/api_checks.go

internal/daemon/api_services.go

niemeyer · 2022-01-31T21:04:19Z

internal/daemon/daemon.go

+		httpAPI := httpapi.NewAPI(d.overlord.CheckManager())
+		d.httpServer = &http.Server{Handler: httpAPI}
+		d.tomb.Go(func() error {
+			logger.Debugf("Starting HTTP API on http://localhost%s", d.httpServer.Addr)


This is not true I believe. :4000 is the same as 0.0.0.0:4000, which is not the same as 127.0.0.1:4000.

I'm also slightly concerned about the introduction of this logic without us having discussed what we want in the medium term. Per conversations we had too many months ago (you were probably not around, Ben), Pebble was designed to eventually support the same protocol over other transports in addition to its unix socket. It feels like this is bolted in quickly and without much thinking around how to name this API, or how to integrate this server, in a way that faciliates that work in the near future.

To be clear, I'm not suggesting we need to do much work now to support that future reality, but we should at least understand how we want to design its entry points so that we can move there when the time comes.

Yeah, fair enough. See comments below.

niemeyer · 2022-01-31T21:09:58Z

internal/httpapi/httpapi.go

+		writeError(w, http.StatusMethodNotAllowed, "method not allowed")
+	})
+
+	s.router.HandleFunc("/v1/health", s.getHealth).Methods("GET")


Similar idea here: this is being bolted on ignoring everything else we already have in Pebble. Note that the stock daemon in Pebble already supports multiple sockets, where each socket supports different endpoints, and potentially even different parameters depending on how the actual handshake with the protocol was made.

It may be useful to understand how this is used by snapd: we have two different sockets, one that is trusted, and one that is untrusted. The untrusted socket will never be able to access certain priviliged APIs, while the trusted socket may access certain APIs, but only if it's being accessed by root. In other words, the untrusted socket cannot see the privileged APIs even if it is accessing the socket as root, because inside the container being root is not actually a sign of privilege as far as snapd is concerned.

So, coming back to our use case here, this is super useful isn't it? We have exactly the same scenario: one socket that is privileged, and one socket that is open and unprivileged. So we should be able to use exactly the same mechanics already in place to dispatch requests, instead of having this completely out of band API.

Makes sense?

Good call -- I hadn't looked at that closely enough. I moved it from a totally separate httpapi into the regular daemon package and used the existing GuestOK auth field -- meaning world-accessible. The "guest API" is what's exposed via pebble -http=<address> now. This made things quite a lot simpler as it reused a bunch of the existing code (-100 lines or so).

For now, at any rate, without further discussion. We'll probably reintroduce something similar later via a Log field.

Will probably reinstate later (in some form) after a bit more discussion.

niemeyer

Thanks for the changes, Ben. Another pass, and probably the final one. Mostly trivials, with the main point remaining being about the UI on of the CLI.

internal/daemon/api_health.go

internal/daemon/daemon.go

client/checks.go

niemeyer · 2022-02-14T14:46:05Z

cmd/pebble/cmd_checks.go

+	w := tabWriter()
+	defer w.Flush()
+
+	fmt.Fprintln(w, "Check\tLevel\tHealthy\tFailures")


A few comments on the UI here.

The Level parameter is being shown empty in some of the examples below. Parsing and even just reading becomes a bit unfriendly with empty columns. The way we typically overcome this is by using "-" on the empty values.

I was thinking about sending the threshold along, so that the client can present something like 0/5, 3/5, or even 14/5, on that Failures field, so not only we know the number of failures currently going, but also when it will (or did) blow up. If we want this, we want to do it now as otherwise the field will become backwards incompatible with future updates. What do you think?

"true" and "false" are slightly less friendly than an actual status name that would give a clear meaning to the line being read, without one having to "parse" the idea of "healthy is false".

Some playing:

Name Level Status Failures chk1 - down 5/3 chk2 - up 1/3

Potentially, we might even adapt the API itself accordingly, so that instead of having just a true/false field, "Health" becomes a string. That opens up the possibility of us introducing other statuses in the future. One easy example is "error", for instance (neither down, nor up: it's broken).

What do you think?

I agree on all of these, thanks. I've made these changes and re-pushed.

cmd/pebble/cmd_services.go

niemeyer · 2022-02-14T15:25:58Z

internal/daemon/api_checks.go

+		return statusBadRequest(`level must be "alive" or "ready"`)
+	}
+
+	names := r.URL.Query()["names"]


In snapd we most commonly use the form where the list is separated by commas (see CommaSeparatedList call sites there), and we seem to be going in the same direction here (see v1GetServices). We should be careful to be pick one convention or the other, but not both, unless there's a more specific reason to diverge.

Hmmm, I see I've diverged here from /v1/services. We use both the comma-separate approach and specifying the query parameter multiple times. The current state of multi-valued query parameters is:

GET /v1/files?action=read: path can be specified multiple times (param name is singular)

GET /v1/logs: services can be specified multiple times (param name is plural)

GET /v1/services: names is a CommaSeparatedList

I believe quite strongly that specifying the query parameter multiple times is superior -- it's already an RFC-spec supported way to a enable multi-valued parameters, and there are no escaping issues with comma. Comma-separated lists, on the other hand, don't support having , in the value ... unless we were to define a way to escape, and then that would be a mini spec in itself, and one that only we use.

For example, in the /v1/files endpoint, a path might well have commas in it. Normal query string escaping applies, so these are automatically handled correctly.

So apart from the fact that even in existing code the multi-param approach is more common than the comma-separated-list approach, I believe it's saner and will avoid escaping issues down the track, so I think we should continue with the multi-param approach.

Minor detail: we should also probably settle on singular vs plural for the parameter name -- I lean towards plural as it gives you the hint that multiple of these can be specified (then again, this is quite debatable, you could say that you should use singular as each time you only specify one).

So I'll stick with the multi-param approach here. I'm happy to open an issue for /v1/services to allow names to be specified more than once too, to make that consistent moving forward (it would still split each param on comma for backwards-compatibility).

(...) so I think we should continue with the multi-param approach.

In terms of continuation, we opted to use comma-separated values in snapd many years ago, and the same convention was being used in Pebble to begin with. So I'm just asking for us to precisely respect the continuation of choices that have been previously made and that are in use today in Pebble already, so we're not all over the place.

In terms of the value of the feature, if you have to play with this API by hand (curl, browser, whatever else) it's significantly more convenient to use comma-separated values than to pass multiple parameters using the HTTP query way. This is the case that matters the most for this particular reason, since otherwise we'll be using code that can trivially do it either way. I believe this was the reason the choice was made in snapd back then, and we're not being so creative here in terms of industry practices.

I'm happy to open an issue for /v1/services to allow names to be

Sure, we can support both ways too by having a single function that handles that, but let's please fix /v1/logs too so that it does support the comma convention that was established in snapd and in Pebble itself earlier on. We don't have to support commas in cases that may require escaping, though. It's my belief that both in snapd and in pebble so far we only use commas in cases where the keys have a well defined namespace that would never include a comma.

we should also probably settle on singular vs plural for the parameter name

Agreed, but the case at hand is a bit confusing because the list action in /v1/files indeed support just one single path. So what would be the proposal? Have different parameters for list and read, or make it plural for a case that can be singular, or make it singular for a case that may be plural? Doesn't look like there's an obvious way out.

Okay, per your comment below, I've added a strutil.MultiCommaSeparatedList function to support both methods for /v1/checks?names=..., /v1/health?names=..., /v1/logs?services=..., and /v1/services?names=.... That's all the existing cases except for the /v1/files one which needs to remain as is in case there is a , in the path.

Also make Level show as "-" if not set

* remove Service.restarts (for now at least) * rename Check.failures to .threshold * change CheckInfo.healthy bool to .status str * remove CheckInfo.last_error and .error_details (for now at least) * add CheckInfo.threshold

niemeyer

Assuming the comment above regarding the parameter convention is handled in a way that we're both happy with, LGTM. Given the commentary so far, it seems that the middle ground would be implementing a single function that when given a parameter can extract the list from the HTTP parameter, and comma-split the individual elements, so that both conventions are supported at the same time. Per the comment, we should only do that in cases where the namespace is clearly defined and commas would never be part of one of the values.

Use in new /v1/checks and /v1/health endpoints for "names", also for existing /v1/services?names=... and /v1/logs?services=... params. Commas will never appear in these fields, so using CommaSeparatedList is okay here.

benhoyt · 2022-02-20T22:09:42Z

Assuming the comment above regarding the parameter convention is handled in a way that we're both happy with, LGTM. Given the commentary so far, it seems that the middle ground would be implementing a single function that when given a parameter can extract the list from the HTTP parameter, and comma-split the individual elements, so that both conventions are supported at the same time. Per the comment, we should only do that in cases where the namespace is clearly defined and commas would never be part of one of the values.

Seems like a reasonable compromise -- done! Going to merge this now. :-)

#13568 This adds Kubernetes probes in the sidecar pod-spec definition to hit the new Pebble `/v1/health` HTTP endpoint (which aggregates the health of checks). The liveness probe hits `/v1/health?level=alive` and the readiness probe hits `/v1/health?level=ready`. Both endpoints return 200 `{"healthy": true}` if no Pebble checks have been defined -- the charmer must explicitly define checks with `level: alive` and/or `level: ready` to use the feature. If the level=alive check is not healthy (and there's no level=ready check defined), that implies level=ready is unhealthy too -- the `/v1/health` aggregation takes this semantic meaning of the levels into account. The intent of this change is to allow charmers to make robust, production-ready charms, with K8s restarting pods if they go unhealthy. See the [spec](https://docs.google.com/document/d/1d6-h3UAt2VPUSvlkVF30l8iuDW8raRNkHbp5M6NUo1A/edit#heading=h.5rw2t2gzo2xf). The Pebble code changes for health checks and the `/v1/health` API are at: * canonical/pebble#85 * canonical/pebble#86 ## QA steps These steps create a new microk8s controller, deploy the `snappass-test` charm, and play with the health checks integration: ``` # Deploy a sidecar charm app make microk8s-operator-update # to update Docker operator image juju bootstrap microk8s juju add-model snappass juju deploy snappass-test juju status # visit http://<address>:5000 to view snappass web app microk8s kubectl describe pods -n snappass snappass-test-0 # view K8s pod details # Add a Pebble layer to ignore "redis-server" exiting but also an "alive" check # write contents of layer.yaml echo " services: redis: override: merge on-success: ignore on-failure: ignore checks: redis-up: override: replace level: alive tcp: port: 6379 " >layer.yaml # copy layer to charm container microk8s kubectl cp -n snappass -c charm ./layer.yaml snappass-test-0:/layer.yaml # run "pebble add" to add layer microk8s kubectl exec -n snappass snappass-test-0 -c charm -- /bin/sh -c 'PEBBLE_SOCKET=/charm/containers/redis/pebble.socket /charm/bin/pebble add l1 /layer.yaml' # run "pebble replan" to pick up service changes microk8s kubectl exec -n snappass snappass-test-0 -c charm -- /bin/sh -c 'PEBBLE_SOCKET=/charm/containers/redis/pebble.socket /charm/bin/pebble replan' # Kill redis-server so check will go unhealthy after 3 attempts sudo killall redis-server # be careful :-) # monitor Pebble check state microk8s kubectl exec -n snappass snappass-test-0 -c charm -- /bin/sh -c 'PEBBLE_SOCKET=/charm/containers/redis/pebble.socket /charm/bin/pebble checks' # After ~30 seconds it should go unhealthy and the K8s probe will fail, terminating the container # monitor K8s status with: microk8s kubectl describe pods -n snappass snappass-test-0 # Other useful commands # run "pebble plan" to show updated plan microk8s kubectl exec -n snappass snappass-test-0 -c charm -- /bin/sh -c 'PEBBLE_SOCKET=/charm/containers/redis/pebble.socket /charm/bin/pebble plan' # hit /v1/health endpoint manually microk8s kubectl exec -n snappass snappass-test-0 -c charm -- curl 'http://localhost:38813/v1/health?level=alive' # run "pebble services" to show service status microk8s kubectl exec -n snappass snappass-test-0 -c charm -- /bin/sh -c 'PEBBLE_SOCKET=/charm/containers/redis/pebble.socket /charm/bin/pebble services' # show Pebble logs via kubectl (can use "-f" to follow) microk8s kubectl logs -n snappass snappass-test-0 -c redis ``` ## Documentation changes See [here](https://juju.is/docs/sdk/pebble#heading--service-auto-restart). Diff at benhoyt/juju-docs#1.

benhoyt added 2 commits November 25, 2021 16:16

Custom health checks: manager, checkers, and interface with servstate

edd30f6

Custom health checks API and "pebble checks" CLI command

49871c3

This was referenced Nov 25, 2021

Custom health checks API and "pebble checks" CLI command benhoyt/pebble#3

Closed

Custom health checks #84

Closed

benhoyt added 20 commits November 25, 2021 16:55

Add copyright headers, comment fixes, minor refactoring

958c196

Test for on-check-failure logic

f345a83

Add new checks fields to README config docs

ea5e590

Add most plan tests (merge not working correctly yet)

101f59e

Move field validation and defaults to after merging; Fix TestReadDir

2154b5f

Fix http/tcp/exec check merging; factor out merge logic into functions

fa1021b

Test out doing Copy and Merge via reflection-based libraries

739e9d2

Check off (so to speak) custom health checks TODO

23b3e2c

Readme fixes per Harry's comments; fix typo in comment

bb51432

Few other tweaks from self code review

4653fd9

A few more details on "level" docs in readme

91e6140

Improve error-details

5eb9484

* Make exec show *last* 1KB instead of first * Include first 1KB of response for http check errors

Add "restarts" (total number of auto-restarts) to API and CLI

f87c342

Add "level" to /v1/checks response and CheckInfo

b00b81d

Fix "level" tests

e975046

Add simple HTTP API server for "public" endpoints like /v1/health

85b731e

Fix CheckManager.Checks so that ready implies alive when filtering

c80f90b

Fix /v1/health level filtering; Add httpapi tests

4545fad

Note that ready implies alive in client and CLI docs

097d2e6

Add end-to-end daemon test for HTTP API

5450459

benhoyt requested review from niemeyer and hpidcock November 30, 2021 05:09

benhoyt marked this pull request as ready for review November 30, 2021 05:09

benhoyt changed the title ~~Checks API and "pebble checks" CLI command~~ Checks/health API and "pebble checks" CLI command Nov 30, 2021

benhoyt mentioned this pull request Dec 3, 2021

Add signals API to send a signal to one or more services #92

Merged

Move "alive implies ready" logic to only /v1/health and fix it

d31f6f4

rwcarlsen mentioned this pull request Jan 28, 2022

Add checks config fields and get_checks API canonical/operator#668

Merged

niemeyer requested changes Jan 31, 2022

View reviewed changes

benhoyt added 8 commits February 11, 2022 11:29

Remove LastError and ErrorDetails from public interfaces

9d56f94

For now, at any rate, without further discussion. We'll probably reintroduce something similar later via a Log field.

Document fields for Checks client structs; don't send level when empty

55176b3

Reformat CheckInfo literal formatting

43018bf

Remove Restarts from public interfaces for now

378da9a

Will probably reinstate later (in some form) after a bit more discussion.

Improve "pebble checks" help text

fd1859b

Standardize on proper punctuation for cmd output

df09756

Move /v1/health into daemon to reuse routing logic and GuestOK auth

43699dd

Remove changes to service_test.go

258a3b0

This was referenced Feb 11, 2022

Add a way to detect and introspect check failures #103

Closed

Add a way to introspect failing/restarting services #104

Open

niemeyer requested changes Feb 14, 2022

View reviewed changes

benhoyt added 5 commits February 15, 2022 10:54

Don't call the new HTTP API a "guest API"

041fcf6

Add Threshold to checks API so we can show (eg) "1/3" failures

90491bc

Also make Level show as "-" if not set

Replace Healthy bool with Status string in API and CLI

b4cb5eb

Use "fake func" approach for getChecks for consistency with other tests

6ef2fd7

Make /v1/checks return [] instead of null when there's no checks

94d9ddc

benhoyt mentioned this pull request Feb 15, 2022

Add docs about auto-restart and checks benhoyt/juju-docs#1

Merged

niemeyer approved these changes Feb 18, 2022

View reviewed changes

Use new strutil.MultiCommaSeparatedList for multi query params

2936aba

Use in new /v1/checks and /v1/health endpoints for "names", also for existing /v1/services?names=... and /v1/logs?services=... params. Commas will never appear in these fields, so using CommaSeparatedList is okay here.

benhoyt merged commit a922aaf into canonical:master Feb 20, 2022

benhoyt deleted the checks-api-cli branch February 20, 2022 22:11

This was referenced Feb 20, 2022

Update Pebble to latest (including health checks and API) juju/juju#13767

Closed

Integrate K8s probes to hit Pebble /v1/health endpoint juju/juju#13568

Merged

benhoyt mentioned this pull request Aug 11, 2022

Add last failure info to checks API and CLI command #134

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Checks/health API and "pebble checks" CLI command #86

Checks/health API and "pebble checks" CLI command #86

benhoyt commented Nov 25, 2021 •

edited

Loading

niemeyer left a comment

niemeyer Jan 31, 2022

benhoyt Feb 11, 2022

benhoyt Feb 11, 2022

niemeyer Jan 31, 2022

benhoyt Feb 11, 2022

niemeyer Jan 31, 2022

benhoyt Feb 11, 2022

niemeyer Jan 31, 2022

benhoyt Feb 11, 2022 •

edited

Loading

niemeyer left a comment

niemeyer Feb 14, 2022 •

edited

Loading

benhoyt Feb 15, 2022

niemeyer Feb 14, 2022

benhoyt Feb 14, 2022

niemeyer Feb 18, 2022

benhoyt Feb 20, 2022

niemeyer left a comment

benhoyt commented Feb 20, 2022

Checks/health API and "pebble checks" CLI command #86

Checks/health API and "pebble checks" CLI command #86

Conversation

benhoyt commented Nov 25, 2021 • edited Loading

niemeyer left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

benhoyt Feb 11, 2022 • edited Loading

Choose a reason for hiding this comment

niemeyer left a comment

Choose a reason for hiding this comment

niemeyer Feb 14, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

niemeyer left a comment

Choose a reason for hiding this comment

benhoyt commented Feb 20, 2022

benhoyt commented Nov 25, 2021 •

edited

Loading

benhoyt Feb 11, 2022 •

edited

Loading

niemeyer Feb 14, 2022 •

edited

Loading