-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Checks/health API and "pebble checks" CLI command #86
Conversation
* Make exec show *last* 1KB instead of first * Include first 1KB of response for http check errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking quite nice, thanks Ben. There are a few comments below that will need some syncing. I don't know what's more helpful on your side: splitting it up in smaller parts and merging them more quickly (there are clearly lots of independent pieces here, per the PR description), or going back and forth on these points for a bit so it can be merged all at once. Both work for me, but I'd suggest on future cases breaking it down a bit further so it feels faster and the quick wins perhaps a bit more rewarding on your end.
Either way, thanks for these changes. Clearly good work as a whole.
client/checks.go
Outdated
Healthy bool `json:"healthy"` | ||
Failures int `json:"failures,omitempty"` | ||
LastError string `json:"last-error,omitempty"` | ||
ErrorDetails string `json:"error-details,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We didn't spend much time on the exposure of such information, and we'd probably benefit from an exchange at this point since we're making this public via APIs, so won't be able to regret from now on. For example, we have an entire state tracking system, that includes tasks, metadata, logging, timing, persistence, execution, etc, all nicely associated with specific events. Yet we've been side-stepping it for our recent features. It sounds like this is relevant here, as there's a conflict between presenting the current state ("System is healthy") vs. understanding what happened in the past ("We had 5 failures, the last one was 10 minutes ago, and here is the log for that last one"). This API is a nice entry point into checks, but we'd probably benefit from not redoing the entire state tracking system again here.
As an aside that probably goes away if we address that point, the field naming should probably be considered a bit further here before putting it into a stable API. Note:
- Failures (fine, matches the plan)
- LastError (not Failure?)
- ErrorDetails (not Last?)
Instead of just renaming it, though, and having a loosely defined idea of what is "the error" and "more details of the error", what if we had just a single "Log" entry, for insance? That's similar to what we have in our state system already, IIRC, and seems to work well.
Instead of agreeing on all these details right away, perhaps we should just strip down this PR a bit by dropping the two error string fields for now. We can come back into it after the foundation is in, and have a look at making use of the state system for the tracking of previous issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, fair enough. I've removed LastError
and ErrorDetails
for now and opened #103 to track a better approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll dig into this a bit further and update that issue.
client/services.go
Outdated
Name string `json:"name"` | ||
Startup ServiceStartup `json:"startup"` | ||
Current ServiceStatus `json:"current"` | ||
Restarts int `json:"restarts"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the other comment, we need some documentation explaining what this field does as it's not obvious from context. When is it this value zeroed over the service lifetime?
Similar to the earlier point, I'm also wondering about exposing something we don't quite understand the use case for yet. In particular, saying the service got restarted 19823 times vs. 19824 is not super useful. What is the admin trying to do, and how do we accommodate it in the best way possible?
To be clear, it's okay that we have this data internally, but it'd be nice to clarify/polish it a bit before we expose it in the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to above, I've removed ServiceInfo.Restarts
for now and opened #104 to track discussion. I actually think this field will be quite useful for debugging. Care to read my comments there and add your thoughts?
internal/daemon/daemon.go
Outdated
httpAPI := httpapi.NewAPI(d.overlord.CheckManager()) | ||
d.httpServer = &http.Server{Handler: httpAPI} | ||
d.tomb.Go(func() error { | ||
logger.Debugf("Starting HTTP API on http://localhost%s", d.httpServer.Addr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not true I believe. :4000 is the same as 0.0.0.0:4000, which is not the same as 127.0.0.1:4000.
I'm also slightly concerned about the introduction of this logic without us having discussed what we want in the medium term. Per conversations we had too many months ago (you were probably not around, Ben), Pebble was designed to eventually support the same protocol over other transports in addition to its unix socket. It feels like this is bolted in quickly and without much thinking around how to name this API, or how to integrate this server, in a way that faciliates that work in the near future.
To be clear, I'm not suggesting we need to do much work now to support that future reality, but we should at least understand how we want to design its entry points so that we can move there when the time comes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, fair enough. See comments below.
internal/httpapi/httpapi.go
Outdated
writeError(w, http.StatusMethodNotAllowed, "method not allowed") | ||
}) | ||
|
||
s.router.HandleFunc("/v1/health", s.getHealth).Methods("GET") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar idea here: this is being bolted on ignoring everything else we already have in Pebble. Note that the stock daemon in Pebble already supports multiple sockets, where each socket supports different endpoints, and potentially even different parameters depending on how the actual handshake with the protocol was made.
It may be useful to understand how this is used by snapd: we have two different sockets, one that is trusted, and one that is untrusted. The untrusted socket will never be able to access certain priviliged APIs, while the trusted socket may access certain APIs, but only if it's being accessed by root. In other words, the untrusted socket cannot see the privileged APIs even if it is accessing the socket as root, because inside the container being root is not actually a sign of privilege as far as snapd is concerned.
So, coming back to our use case here, this is super useful isn't it? We have exactly the same scenario: one socket that is privileged, and one socket that is open and unprivileged. So we should be able to use exactly the same mechanics already in place to dispatch requests, instead of having this completely out of band API.
Makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call -- I hadn't looked at that closely enough. I moved it from a totally separate httpapi
into the regular daemon
package and used the existing GuestOK
auth field -- meaning world-accessible. The "guest API" is what's exposed via pebble -http=<address>
now. This made things quite a lot simpler as it reused a bunch of the existing code (-100 lines or so).
For now, at any rate, without further discussion. We'll probably reintroduce something similar later via a Log field.
Will probably reinstate later (in some form) after a bit more discussion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes, Ben. Another pass, and probably the final one. Mostly trivials, with the main point remaining being about the UI on of the CLI.
cmd/pebble/cmd_checks.go
Outdated
w := tabWriter() | ||
defer w.Flush() | ||
|
||
fmt.Fprintln(w, "Check\tLevel\tHealthy\tFailures") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments on the UI here.
-
The Level parameter is being shown empty in some of the examples below. Parsing and even just reading becomes a bit unfriendly with empty columns. The way we typically overcome this is by using "-" on the empty values.
-
I was thinking about sending the threshold along, so that the client can present something like
0/5
,3/5
, or even14/5
, on that Failures field, so not only we know the number of failures currently going, but also when it will (or did) blow up. If we want this, we want to do it now as otherwise the field will become backwards incompatible with future updates. What do you think? -
"true" and "false" are slightly less friendly than an actual status name that would give a clear meaning to the line being read, without one having to "parse" the idea of "healthy is false".
Some playing:
Name Level Status Failures
chk1 - down 5/3
chk2 - up 1/3
Potentially, we might even adapt the API itself accordingly, so that instead of having just a true/false field, "Health" becomes a string. That opens up the possibility of us introducing other statuses in the future. One easy example is "error", for instance (neither down, nor up: it's broken).
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree on all of these, thanks. I've made these changes and re-pushed.
internal/daemon/api_checks.go
Outdated
return statusBadRequest(`level must be "alive" or "ready"`) | ||
} | ||
|
||
names := r.URL.Query()["names"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In snapd we most commonly use the form where the list is separated by commas (see CommaSeparatedList call sites there), and we seem to be going in the same direction here (see v1GetServices). We should be careful to be pick one convention or the other, but not both, unless there's a more specific reason to diverge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, I see I've diverged here from /v1/services
. We use both the comma-separate approach and specifying the query parameter multiple times. The current state of multi-valued query parameters is:
GET /v1/files?action=read
:path
can be specified multiple times (param name is singular)GET /v1/logs
:services
can be specified multiple times (param name is plural)GET /v1/services
:names
is aCommaSeparatedList
I believe quite strongly that specifying the query parameter multiple times is superior -- it's already an RFC-spec supported way to a enable multi-valued parameters, and there are no escaping issues with comma. Comma-separated lists, on the other hand, don't support having ,
in the value ... unless we were to define a way to escape, and then that would be a mini spec in itself, and one that only we use.
For example, in the /v1/files
endpoint, a path
might well have commas in it. Normal query string escaping applies, so these are automatically handled correctly.
So apart from the fact that even in existing code the multi-param approach is more common than the comma-separated-list approach, I believe it's saner and will avoid escaping issues down the track, so I think we should continue with the multi-param approach.
Minor detail: we should also probably settle on singular vs plural for the parameter name -- I lean towards plural as it gives you the hint that multiple of these can be specified (then again, this is quite debatable, you could say that you should use singular as each time you only specify one).
So I'll stick with the multi-param approach here. I'm happy to open an issue for /v1/services
to allow names
to be specified more than once too, to make that consistent moving forward (it would still split each param on comma for backwards-compatibility).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(...) so I think we should continue with the multi-param approach.
In terms of continuation, we opted to use comma-separated values in snapd many years ago, and the same convention was being used in Pebble to begin with. So I'm just asking for us to precisely respect the continuation of choices that have been previously made and that are in use today in Pebble already, so we're not all over the place.
In terms of the value of the feature, if you have to play with this API by hand (curl, browser, whatever else) it's significantly more convenient to use comma-separated values than to pass multiple parameters using the HTTP query way. This is the case that matters the most for this particular reason, since otherwise we'll be using code that can trivially do it either way. I believe this was the reason the choice was made in snapd back then, and we're not being so creative here in terms of industry practices.
I'm happy to open an issue for /v1/services to allow names to be
Sure, we can support both ways too by having a single function that handles that, but let's please fix /v1/logs too so that it does support the comma convention that was established in snapd and in Pebble itself earlier on. We don't have to support commas in cases that may require escaping, though. It's my belief that both in snapd and in pebble so far we only use commas in cases where the keys have a well defined namespace that would never include a comma.
we should also probably settle on singular vs plural for the parameter name
Agreed, but the case at hand is a bit confusing because the list action in /v1/files indeed support just one single path. So what would be the proposal? Have different parameters for list and read, or make it plural for a case that can be singular, or make it singular for a case that may be plural? Doesn't look like there's an obvious way out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, per your comment below, I've added a strutil.MultiCommaSeparatedList
function to support both methods for /v1/checks?names=...
, /v1/health?names=...
, /v1/logs?services=...
, and /v1/services?names=...
. That's all the existing cases except for the /v1/files
one which needs to remain as is in case there is a ,
in the path.
Also make Level show as "-" if not set
* remove Service.restarts (for now at least) * rename Check.failures to .threshold * change CheckInfo.healthy bool to .status str * remove CheckInfo.last_error and .error_details (for now at least) * add CheckInfo.threshold
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming the comment above regarding the parameter convention is handled in a way that we're both happy with, LGTM. Given the commentary so far, it seems that the middle ground would be implementing a single function that when given a parameter can extract the list from the HTTP parameter, and comma-split the individual elements, so that both conventions are supported at the same time. Per the comment, we should only do that in cases where the namespace is clearly defined and commas would never be part of one of the values.
Use in new /v1/checks and /v1/health endpoints for "names", also for existing /v1/services?names=... and /v1/logs?services=... params. Commas will never appear in these fields, so using CommaSeparatedList is okay here.
Seems like a reasonable compromise -- done! Going to merge this now. :-) |
#13568 This adds Kubernetes probes in the sidecar pod-spec definition to hit the new Pebble `/v1/health` HTTP endpoint (which aggregates the health of checks). The liveness probe hits `/v1/health?level=alive` and the readiness probe hits `/v1/health?level=ready`. Both endpoints return 200 `{"healthy": true}` if no Pebble checks have been defined -- the charmer must explicitly define checks with `level: alive` and/or `level: ready` to use the feature. If the level=alive check is not healthy (and there's no level=ready check defined), that implies level=ready is unhealthy too -- the `/v1/health` aggregation takes this semantic meaning of the levels into account. The intent of this change is to allow charmers to make robust, production-ready charms, with K8s restarting pods if they go unhealthy. See the [spec](https://docs.google.com/document/d/1d6-h3UAt2VPUSvlkVF30l8iuDW8raRNkHbp5M6NUo1A/edit#heading=h.5rw2t2gzo2xf). The Pebble code changes for health checks and the `/v1/health` API are at: * canonical/pebble#85 * canonical/pebble#86 ## QA steps These steps create a new microk8s controller, deploy the `snappass-test` charm, and play with the health checks integration: ``` # Deploy a sidecar charm app make microk8s-operator-update # to update Docker operator image juju bootstrap microk8s juju add-model snappass juju deploy snappass-test juju status # visit http://<address>:5000 to view snappass web app microk8s kubectl describe pods -n snappass snappass-test-0 # view K8s pod details # Add a Pebble layer to ignore "redis-server" exiting but also an "alive" check # write contents of layer.yaml echo " services: redis: override: merge on-success: ignore on-failure: ignore checks: redis-up: override: replace level: alive tcp: port: 6379 " >layer.yaml # copy layer to charm container microk8s kubectl cp -n snappass -c charm ./layer.yaml snappass-test-0:/layer.yaml # run "pebble add" to add layer microk8s kubectl exec -n snappass snappass-test-0 -c charm -- /bin/sh -c 'PEBBLE_SOCKET=/charm/containers/redis/pebble.socket /charm/bin/pebble add l1 /layer.yaml' # run "pebble replan" to pick up service changes microk8s kubectl exec -n snappass snappass-test-0 -c charm -- /bin/sh -c 'PEBBLE_SOCKET=/charm/containers/redis/pebble.socket /charm/bin/pebble replan' # Kill redis-server so check will go unhealthy after 3 attempts sudo killall redis-server # be careful :-) # monitor Pebble check state microk8s kubectl exec -n snappass snappass-test-0 -c charm -- /bin/sh -c 'PEBBLE_SOCKET=/charm/containers/redis/pebble.socket /charm/bin/pebble checks' # After ~30 seconds it should go unhealthy and the K8s probe will fail, terminating the container # monitor K8s status with: microk8s kubectl describe pods -n snappass snappass-test-0 # Other useful commands # run "pebble plan" to show updated plan microk8s kubectl exec -n snappass snappass-test-0 -c charm -- /bin/sh -c 'PEBBLE_SOCKET=/charm/containers/redis/pebble.socket /charm/bin/pebble plan' # hit /v1/health endpoint manually microk8s kubectl exec -n snappass snappass-test-0 -c charm -- curl 'http://localhost:38813/v1/health?level=alive' # run "pebble services" to show service status microk8s kubectl exec -n snappass snappass-test-0 -c charm -- /bin/sh -c 'PEBBLE_SOCKET=/charm/containers/redis/pebble.socket /charm/bin/pebble services' # show Pebble logs via kubectl (can use "-f" to follow) microk8s kubectl logs -n snappass snappass-test-0 -c redis ``` ## Documentation changes See [here](https://juju.is/docs/sdk/pebble#heading--service-auto-restart). Diff at benhoyt/juju-docs#1.
This adds the API and CLI parts of the new custom health checks work, specifically:
/v1/checks
that returns detailed check information, allowing filtering by check level and name.Checks
method on the Go client that hits this new API.pebble checks
that shows info from the checks API in a tabulated textual list./v1/health
endpoint intended for use in Kubernetes probes. Use the--http
command line option to start the guest HTTP API server.