Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checks/health API and "pebble checks" CLI command #86

Merged
merged 83 commits into from
Feb 20, 2022

Conversation

benhoyt
Copy link
Contributor

@benhoyt benhoyt commented Nov 25, 2021

This adds the API and CLI parts of the new custom health checks work, specifically:

  • An API endpoint at /v1/checks that returns detailed check information, allowing filtering by check level and name.
  • A Checks method on the Go client that hits this new API.
  • A CLI command pebble checks that shows info from the checks API in a tabulated textual list.
  • A plain HTTP API (over TCP rather than Unix socket) to serve the "guest" endpoints, such as the new /v1/health endpoint intended for use in Kubernetes probes. Use the --http command line option to start the guest HTTP API server.
  • Tests for all of the above.

@benhoyt benhoyt requested review from niemeyer and hpidcock November 30, 2021 05:09
@benhoyt benhoyt marked this pull request as ready for review November 30, 2021 05:09
@benhoyt benhoyt changed the title Checks API and "pebble checks" CLI command Checks/health API and "pebble checks" CLI command Nov 30, 2021
Copy link
Contributor

@niemeyer niemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking quite nice, thanks Ben. There are a few comments below that will need some syncing. I don't know what's more helpful on your side: splitting it up in smaller parts and merging them more quickly (there are clearly lots of independent pieces here, per the PR description), or going back and forth on these points for a bit so it can be merged all at once. Both work for me, but I'd suggest on future cases breaking it down a bit further so it feels faster and the quick wins perhaps a bit more rewarding on your end.

Either way, thanks for these changes. Clearly good work as a whole.

client/checks.go Outdated
Healthy bool `json:"healthy"`
Failures int `json:"failures,omitempty"`
LastError string `json:"last-error,omitempty"`
ErrorDetails string `json:"error-details,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We didn't spend much time on the exposure of such information, and we'd probably benefit from an exchange at this point since we're making this public via APIs, so won't be able to regret from now on. For example, we have an entire state tracking system, that includes tasks, metadata, logging, timing, persistence, execution, etc, all nicely associated with specific events. Yet we've been side-stepping it for our recent features. It sounds like this is relevant here, as there's a conflict between presenting the current state ("System is healthy") vs. understanding what happened in the past ("We had 5 failures, the last one was 10 minutes ago, and here is the log for that last one"). This API is a nice entry point into checks, but we'd probably benefit from not redoing the entire state tracking system again here.

As an aside that probably goes away if we address that point, the field naming should probably be considered a bit further here before putting it into a stable API. Note:

  • Failures (fine, matches the plan)
  • LastError (not Failure?)
  • ErrorDetails (not Last?)

Instead of just renaming it, though, and having a loosely defined idea of what is "the error" and "more details of the error", what if we had just a single "Log" entry, for insance? That's similar to what we have in our state system already, IIRC, and seems to work well.

Instead of agreeing on all these details right away, perhaps we should just strip down this PR a bit by dropping the two error string fields for now. We can come back into it after the foundation is in, and have a look at making use of the state system for the tracking of previous issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, fair enough. I've removed LastError and ErrorDetails for now and opened #103 to track a better approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll dig into this a bit further and update that issue.

client/checks.go Outdated Show resolved Hide resolved
client/checks.go Outdated Show resolved Hide resolved
client/checks_test.go Outdated Show resolved Hide resolved
Name string `json:"name"`
Startup ServiceStartup `json:"startup"`
Current ServiceStatus `json:"current"`
Restarts int `json:"restarts"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the other comment, we need some documentation explaining what this field does as it's not obvious from context. When is it this value zeroed over the service lifetime?

Similar to the earlier point, I'm also wondering about exposing something we don't quite understand the use case for yet. In particular, saying the service got restarted 19823 times vs. 19824 is not super useful. What is the admin trying to do, and how do we accommodate it in the best way possible?

To be clear, it's okay that we have this data internally, but it'd be nice to clarify/polish it a bit before we expose it in the API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above, I've removed ServiceInfo.Restarts for now and opened #104 to track discussion. I actually think this field will be quite useful for debugging. Care to read my comments there and add your thoughts?

cmd/pebble/cmd_checks.go Outdated Show resolved Hide resolved
internal/daemon/api_checks.go Outdated Show resolved Hide resolved
internal/daemon/api_services.go Outdated Show resolved Hide resolved
httpAPI := httpapi.NewAPI(d.overlord.CheckManager())
d.httpServer = &http.Server{Handler: httpAPI}
d.tomb.Go(func() error {
logger.Debugf("Starting HTTP API on http://localhost%s", d.httpServer.Addr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not true I believe. :4000 is the same as 0.0.0.0:4000, which is not the same as 127.0.0.1:4000.

I'm also slightly concerned about the introduction of this logic without us having discussed what we want in the medium term. Per conversations we had too many months ago (you were probably not around, Ben), Pebble was designed to eventually support the same protocol over other transports in addition to its unix socket. It feels like this is bolted in quickly and without much thinking around how to name this API, or how to integrate this server, in a way that faciliates that work in the near future.

To be clear, I'm not suggesting we need to do much work now to support that future reality, but we should at least understand how we want to design its entry points so that we can move there when the time comes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, fair enough. See comments below.

writeError(w, http.StatusMethodNotAllowed, "method not allowed")
})

s.router.HandleFunc("/v1/health", s.getHealth).Methods("GET")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar idea here: this is being bolted on ignoring everything else we already have in Pebble. Note that the stock daemon in Pebble already supports multiple sockets, where each socket supports different endpoints, and potentially even different parameters depending on how the actual handshake with the protocol was made.

It may be useful to understand how this is used by snapd: we have two different sockets, one that is trusted, and one that is untrusted. The untrusted socket will never be able to access certain priviliged APIs, while the trusted socket may access certain APIs, but only if it's being accessed by root. In other words, the untrusted socket cannot see the privileged APIs even if it is accessing the socket as root, because inside the container being root is not actually a sign of privilege as far as snapd is concerned.

So, coming back to our use case here, this is super useful isn't it? We have exactly the same scenario: one socket that is privileged, and one socket that is open and unprivileged. So we should be able to use exactly the same mechanics already in place to dispatch requests, instead of having this completely out of band API.

Makes sense?

Copy link
Contributor Author

@benhoyt benhoyt Feb 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call -- I hadn't looked at that closely enough. I moved it from a totally separate httpapi into the regular daemon package and used the existing GuestOK auth field -- meaning world-accessible. The "guest API" is what's exposed via pebble -http=<address> now. This made things quite a lot simpler as it reused a bunch of the existing code (-100 lines or so).

Copy link
Contributor

@niemeyer niemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, Ben. Another pass, and probably the final one. Mostly trivials, with the main point remaining being about the UI on of the CLI.

internal/daemon/api_health.go Outdated Show resolved Hide resolved
internal/daemon/daemon.go Outdated Show resolved Hide resolved
internal/daemon/daemon.go Outdated Show resolved Hide resolved
internal/daemon/daemon.go Outdated Show resolved Hide resolved
internal/daemon/daemon.go Outdated Show resolved Hide resolved
client/checks.go Outdated Show resolved Hide resolved
client/checks.go Show resolved Hide resolved
w := tabWriter()
defer w.Flush()

fmt.Fprintln(w, "Check\tLevel\tHealthy\tFailures")
Copy link
Contributor

@niemeyer niemeyer Feb 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments on the UI here.

  1. The Level parameter is being shown empty in some of the examples below. Parsing and even just reading becomes a bit unfriendly with empty columns. The way we typically overcome this is by using "-" on the empty values.

  2. I was thinking about sending the threshold along, so that the client can present something like 0/5, 3/5, or even 14/5, on that Failures field, so not only we know the number of failures currently going, but also when it will (or did) blow up. If we want this, we want to do it now as otherwise the field will become backwards incompatible with future updates. What do you think?

  3. "true" and "false" are slightly less friendly than an actual status name that would give a clear meaning to the line being read, without one having to "parse" the idea of "healthy is false".

Some playing:

Name  Level  Status  Failures
chk1  -      down    5/3
chk2  -      up      1/3

Potentially, we might even adapt the API itself accordingly, so that instead of having just a true/false field, "Health" becomes a string. That opens up the possibility of us introducing other statuses in the future. One easy example is "error", for instance (neither down, nor up: it's broken).

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on all of these, thanks. I've made these changes and re-pushed.

cmd/pebble/cmd_services.go Show resolved Hide resolved
return statusBadRequest(`level must be "alive" or "ready"`)
}

names := r.URL.Query()["names"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In snapd we most commonly use the form where the list is separated by commas (see CommaSeparatedList call sites there), and we seem to be going in the same direction here (see v1GetServices). We should be careful to be pick one convention or the other, but not both, unless there's a more specific reason to diverge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, I see I've diverged here from /v1/services. We use both the comma-separate approach and specifying the query parameter multiple times. The current state of multi-valued query parameters is:

  1. GET /v1/files?action=read: path can be specified multiple times (param name is singular)
  2. GET /v1/logs: services can be specified multiple times (param name is plural)
  3. GET /v1/services: names is a CommaSeparatedList

I believe quite strongly that specifying the query parameter multiple times is superior -- it's already an RFC-spec supported way to a enable multi-valued parameters, and there are no escaping issues with comma. Comma-separated lists, on the other hand, don't support having , in the value ... unless we were to define a way to escape, and then that would be a mini spec in itself, and one that only we use.

For example, in the /v1/files endpoint, a path might well have commas in it. Normal query string escaping applies, so these are automatically handled correctly.

So apart from the fact that even in existing code the multi-param approach is more common than the comma-separated-list approach, I believe it's saner and will avoid escaping issues down the track, so I think we should continue with the multi-param approach.

Minor detail: we should also probably settle on singular vs plural for the parameter name -- I lean towards plural as it gives you the hint that multiple of these can be specified (then again, this is quite debatable, you could say that you should use singular as each time you only specify one).

So I'll stick with the multi-param approach here. I'm happy to open an issue for /v1/services to allow names to be specified more than once too, to make that consistent moving forward (it would still split each param on comma for backwards-compatibility).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(...) so I think we should continue with the multi-param approach.

In terms of continuation, we opted to use comma-separated values in snapd many years ago, and the same convention was being used in Pebble to begin with. So I'm just asking for us to precisely respect the continuation of choices that have been previously made and that are in use today in Pebble already, so we're not all over the place.

In terms of the value of the feature, if you have to play with this API by hand (curl, browser, whatever else) it's significantly more convenient to use comma-separated values than to pass multiple parameters using the HTTP query way. This is the case that matters the most for this particular reason, since otherwise we'll be using code that can trivially do it either way. I believe this was the reason the choice was made in snapd back then, and we're not being so creative here in terms of industry practices.

I'm happy to open an issue for /v1/services to allow names to be

Sure, we can support both ways too by having a single function that handles that, but let's please fix /v1/logs too so that it does support the comma convention that was established in snapd and in Pebble itself earlier on. We don't have to support commas in cases that may require escaping, though. It's my belief that both in snapd and in pebble so far we only use commas in cases where the keys have a well defined namespace that would never include a comma.

we should also probably settle on singular vs plural for the parameter name

Agreed, but the case at hand is a bit confusing because the list action in /v1/files indeed support just one single path. So what would be the proposal? Have different parameters for list and read, or make it plural for a case that can be singular, or make it singular for a case that may be plural? Doesn't look like there's an obvious way out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, per your comment below, I've added a strutil.MultiCommaSeparatedList function to support both methods for /v1/checks?names=..., /v1/health?names=..., /v1/logs?services=..., and /v1/services?names=.... That's all the existing cases except for the /v1/files one which needs to remain as is in case there is a , in the path.

benhoyt added a commit to benhoyt/operator that referenced this pull request Feb 15, 2022
* remove Service.restarts (for now at least)
* rename Check.failures to .threshold
* change CheckInfo.healthy bool to .status str
* remove CheckInfo.last_error and .error_details (for now at least)
* add CheckInfo.threshold
Copy link
Contributor

@niemeyer niemeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the comment above regarding the parameter convention is handled in a way that we're both happy with, LGTM. Given the commentary so far, it seems that the middle ground would be implementing a single function that when given a parameter can extract the list from the HTTP parameter, and comma-split the individual elements, so that both conventions are supported at the same time. Per the comment, we should only do that in cases where the namespace is clearly defined and commas would never be part of one of the values.

Use in new /v1/checks and /v1/health endpoints for "names", also for
existing /v1/services?names=... and /v1/logs?services=... params.

Commas will never appear in these fields, so using CommaSeparatedList
is okay here.
@benhoyt
Copy link
Contributor Author

benhoyt commented Feb 20, 2022

Assuming the comment above regarding the parameter convention is handled in a way that we're both happy with, LGTM. Given the commentary so far, it seems that the middle ground would be implementing a single function that when given a parameter can extract the list from the HTTP parameter, and comma-split the individual elements, so that both conventions are supported at the same time. Per the comment, we should only do that in cases where the namespace is clearly defined and commas would never be part of one of the values.

Seems like a reasonable compromise -- done! Going to merge this now. :-)

@benhoyt benhoyt merged commit a922aaf into canonical:master Feb 20, 2022
@benhoyt benhoyt deleted the checks-api-cli branch February 20, 2022 22:11
jujubot added a commit to juju/juju that referenced this pull request Feb 22, 2022
#13568

This adds Kubernetes probes in the sidecar pod-spec definition to hit the new Pebble `/v1/health` HTTP endpoint (which aggregates the health of checks). The liveness probe hits `/v1/health?level=alive` and the readiness probe hits `/v1/health?level=ready`. Both endpoints return 200 `{"healthy": true}` if no Pebble checks have been defined -- the charmer must explicitly define checks with `level: alive` and/or `level: ready` to use the feature.

If the level=alive check is not healthy (and there's no level=ready check defined), that implies level=ready is unhealthy too -- the `/v1/health` aggregation takes this semantic meaning of the levels into account.

The intent of this change is to allow charmers to make robust, production-ready charms, with K8s restarting pods if they go unhealthy. See the [spec](https://docs.google.com/document/d/1d6-h3UAt2VPUSvlkVF30l8iuDW8raRNkHbp5M6NUo1A/edit#heading=h.5rw2t2gzo2xf).

The Pebble code changes for health checks and the `/v1/health` API are at:

* canonical/pebble#85
* canonical/pebble#86

## QA steps

These steps create a new microk8s controller, deploy the `snappass-test` charm, and play with the health checks integration:

```
# Deploy a sidecar charm app
make microk8s-operator-update # to update Docker operator image
juju bootstrap microk8s
juju add-model snappass
juju deploy snappass-test
juju status # visit http://<address>:5000 to view snappass web app
microk8s kubectl describe pods -n snappass snappass-test-0 # view K8s pod details

# Add a Pebble layer to ignore "redis-server" exiting but also an "alive" check
# write contents of layer.yaml
echo "
services:
 redis:
 override: merge
 on-success: ignore
 on-failure: ignore

checks:
 redis-up:
 override: replace
 level: alive
 tcp:
 port: 6379
" >layer.yaml
# copy layer to charm container
microk8s kubectl cp -n snappass -c charm ./layer.yaml snappass-test-0:/layer.yaml
# run "pebble add" to add layer
microk8s kubectl exec -n snappass snappass-test-0 -c charm -- /bin/sh -c 'PEBBLE_SOCKET=/charm/containers/redis/pebble.socket /charm/bin/pebble add l1 /layer.yaml'
# run "pebble replan" to pick up service changes
microk8s kubectl exec -n snappass snappass-test-0 -c charm -- /bin/sh -c 'PEBBLE_SOCKET=/charm/containers/redis/pebble.socket /charm/bin/pebble replan'

# Kill redis-server so check will go unhealthy after 3 attempts
sudo killall redis-server # be careful :-)
# monitor Pebble check state
microk8s kubectl exec -n snappass snappass-test-0 -c charm -- /bin/sh -c 'PEBBLE_SOCKET=/charm/containers/redis/pebble.socket /charm/bin/pebble checks'

# After ~30 seconds it should go unhealthy and the K8s probe will fail, terminating the container
# monitor K8s status with:
microk8s kubectl describe pods -n snappass snappass-test-0

# Other useful commands
# run "pebble plan" to show updated plan
microk8s kubectl exec -n snappass snappass-test-0 -c charm -- /bin/sh -c 'PEBBLE_SOCKET=/charm/containers/redis/pebble.socket /charm/bin/pebble plan'
# hit /v1/health endpoint manually
microk8s kubectl exec -n snappass snappass-test-0 -c charm -- curl 'http://localhost:38813/v1/health?level=alive'
# run "pebble services" to show service status
microk8s kubectl exec -n snappass snappass-test-0 -c charm -- /bin/sh -c 'PEBBLE_SOCKET=/charm/containers/redis/pebble.socket /charm/bin/pebble services'
# show Pebble logs via kubectl (can use "-f" to follow)
microk8s kubectl logs -n snappass snappass-test-0 -c redis
```

## Documentation changes

See [here](https://juju.is/docs/sdk/pebble#heading--service-auto-restart). Diff at benhoyt/juju-docs#1.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants