Docs: how to run services reliably and update service autorestart to service lifecycle. #541

IronCore864 · 2024-12-22T04:50:28Z

According to our discussion (internal doc, sorry), this is the first piece of the next few how-to guides for Pebble.

In this PR:

- The “service autorestart” doc is renamed to “service lifecycle” and expanded into a more detailed reference.
- A new how-to guide, "How to run services reliably", covers health checks in detail. It is placed after “how to install” as the second how-to doc.
- Per David's suggestion, the guide lists some causes of unreliability in the modern microservice world and explains which of them health checks and Pebble can mitigate.
- Per David's suggestion, a few words on "what health checks are not" are added so that users don't misuse the feature to run cron jobs.

See more details and outline in the above discussion doc.

Preview: https://canonical-pebble--541.com.readthedocs.build/en/541/how-to/run-services-reliably/

dwilding · 2024-12-24T03:45:59Z

Thank you for doing this! Before I work on a more detailed review, I have a few ideas about the overall contents:

1. "Health checks" section - I think this section should have more prominence. So how about making each topic a separate subsection, instead of using bullets?

   Also in this section, I think we ought to make it obvious up-front that "health checks" are a combination of Pebble's feature + what the service author chooses to implement.

   Then within each topic we can say what we recommend the service to do. For example, in the topic "Identifying Dependent Service Failures", I really like what you wrote:

   > A service’s health check can include checks to ensure it can connect to its database, message queue, or other required services.

   This is nice & clear advice about how to approach health checks on the service side.

2. "Configuring health checks in Pebble" and "Restarting service on health check failure" sections - I'd probably combine these into a single section that is fully focused on how to configure a health check of HTTP type and restart the service when the health check fails.

   Since this is a how-to doc, it's OK to omit the different options for health checks. I think it's better to give a specific scenario, then link to the reference pages for people to understand the different options.

3. "Demo service" and "Putting it all together" sections - I feel these are extremely helpful, and the health-check-sample-service idea is neat! Since these sections are more guided, I'd consider moving them to a tutorial instead.

IronCore864 · 2024-12-31T00:30:31Z

I've refactored the doc according to your first two points (the "Health checks" section and the combined section on configuring an HTTP health check). I haven't done anything about the third point yet, mostly because the "Demo service" and "Putting it all together" sections aren't long enough for a tutorial on their own. Should we move them anyway?

IronCore864 · 2025-01-02T02:52:06Z

Per discussion elsewhere, we decided to remove the "Demo service" and "Putting it all together" sections. We will add a tutorial in the future using content from these sections.

docs/how-to/run-services-reliably.md

dwilding · 2025-01-03T02:27:19Z

Looking at "How to run services reliably" with fresh eyes, I think we should go further in refactoring the doc to make the actionable info stand out. I recommend that we drop "Service reliability in the modern microservice world" as a separate section, keeping the info as context within the section that follows.

The nature of this topic is going to require some explanatory content, but I think we can compress it down somewhat.

I've taken what you wrote and put together this structure - what do you think?

# How to run services reliably

You can use Pebble's [health checks](../reference/health-checks) feature to perform checks on services and restart services if the checks fail. To use health checks effectively, you should consider:

- How services monitor their own health and make that information available to Pebble
- How Pebble is configured to fetch health information and respond to unhealthy services

This guide demonstrates how to use health checks to address common service reliability challenges.

## Return health information from services

As you implement health checks within services, consider typical causes of unreliability. You can monitor for unhealthy conditions and expose that information for Pebble to consume.

A common way to expose health information is to use an HTTP endpoint. For an example of how to configure Pebble to check an HTTP endpoint, see [](run_services_reliably_use_http_checks).

### Detect resource exhaustion

A single microservice that consumes excessive resources (CPU, memory, disk I/O, and so on) can degrade its own performance and availability, and can also affect other services that depend on it.

Recommendation: Implement a health check that signals an unhealthy state if resource consumption exceeds predefined thresholds.
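
To make the recommendation above concrete, here is a minimal sketch of a service-side health endpoint, written in Python with only the standard library. It is an illustration rather than anything taken from the Pebble docs: the `/health` path, port 8080, the 90% disk threshold, and every function name are assumptions chosen for the example. The endpoint returns 200 while disk usage is under the threshold and 503 otherwise, on the assumption that the HTTP check treats a non-2xx response as a failure.

```python
import json
import shutil
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical threshold: report unhealthy once the disk is more than 90% full.
DISK_USAGE_LIMIT = 0.90


def disk_usage_fraction(path="/"):
    """Return the fraction of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def health_status(fraction, limit=DISK_USAGE_LIMIT):
    """Map a usage fraction to an HTTP status code and a JSON-serialisable body."""
    healthy = fraction <= limit
    code = 200 if healthy else 503
    return code, {"healthy": healthy, "disk_usage": round(fraction, 3)}


class HealthHandler(BaseHTTPRequestHandler):
    """Serve GET /health; any other path returns 404."""

    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        code, body = health_status(disk_usage_fraction())
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Run the health endpoint until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Calling `serve()` starts the endpoint. Keeping the threshold logic in `health_status`, separate from the handler, makes it easy to unit test and to extend with other resources such as memory or CPU.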

### Identify dependent service failures

_Use a similar structure of brief context followed by recommendations_
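
As one possible illustration of a dependency check, the Python sketch below tests whether the service can open a TCP connection to each of its dependencies. The function names and the example hosts and ports are hypothetical. A successful TCP connection only shows that something is listening, so a real service might prefer a protocol-level probe such as a trivial database query.

```python
import socket


def can_connect(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def dependencies_healthy(dependencies, timeout=1.0):
    """Check every (name, host, port) dependency and return a name -> bool map."""
    return {name: can_connect(host, port, timeout) for name, host, port in dependencies}


# Example call with hypothetical dependencies:
# dependencies_healthy([("database", "db.internal", 5432), ("queue", "mq.internal", 5672)])
```

A health endpoint could report an unhealthy state whenever any value in the returned map is `False`.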

<!-- more subsections here -->

(run_services_reliably_use_http_checks)=
## Use HTTP-based health checks in Pebble

<!-- details here -->

docs/reference/service-lifecycle.md

dwilding · 2025-01-03T03:56:18Z

I've finished adding comments on specific parts. There's a main refactoring suggestion here.

IronCore864 · 2025-01-04T05:14:12Z


I haven't done this refactor yet because it seems to be a very significant rework. I have given it some thought, and the current logic is:

First, introduce health checks. Then, explain the scenarios in which they help, which leads to Pebble's health check feature, how to configure it, and why it improves reliability. This paints a picture of exactly which problems the document helps solve, which I think a how-to guide should do.

The suggested structure focuses more on Pebble and its features and doesn't lay out much background first; instead, the background is interwoven with the paragraphs that follow. That is less clear to me, so I hesitated to do the refactoring. Maybe we should get more input from @benhoyt.

All other suggestions are adopted.

dwilding · 2025-01-06T04:55:29Z

Waiting for Ben's input sounds good to me too. If we end up not doing the refactoring, the part that I feel most needs to be emphasized somewhere is that "health checks" are a combination of:

- How services monitor their own health and make that information available to Pebble
- How Pebble is configured to fetch health information and respond to unhealthy services

I think this distinction is important context for the advice in the doc.

IronCore864 added 2 commits December 22, 2024 12:44

docs: change service autorestart to service lifecycle

8c2204c

docs: how to run service reliably

5af3962

IronCore864 requested a review from dwilding December 24, 2024 01:46

IronCore864 marked this pull request as ready for review December 24, 2024 01:46

chore: refactor after review

85efc22

IronCore864 requested a review from benhoyt January 2, 2025 08:01

chore: remove demo service and tutorial-like content

b1cc37d