Add a guide to metrics for monitoring Teleport #46645

Merged (2 commits) on Oct 9, 2024

@ptgott ptgott commented Sep 16, 2024

Closes #40664

This change turns the Metrics guide in admin-guides into a conceptual guide to the most important metrics for monitoring a Teleport cluster.

Because Agent metrics are not equally comprehensive across Teleport services, and to limit the scope of this change, this guide focuses on self-hosted clusters.

To make this a conceptual guide instead of a reference, this change removes the reference table from the admin-guides metrics page. There is already a table in the dedicated metrics reference guide.

Note that, while the new metrics guide is specific to self-hosted clusters, this change does not move the guide to the subsection of Admin Guides related to self-hosting Teleport. Doing so would split diagnostics content across two subsections of Admin Guides, one for diagnostics-related guides and one for self-hosted-specific diagnostics, which could be confusing. We may also want to add Agent-specific metrics eventually.

Finally, this change does not include alert thresholds for the metrics it describes. We can define these in a subsequent change.


🤖 Vercel preview here: https://docs-hwwmn4qux-goteleport.vercel.app/docs/ver/preview

@ptgott ptgott force-pushed the paul.gottschling/40664-monitoring branch from 7f35bb1 to 7443fd8 Compare September 17, 2024 17:39

🤖 Vercel preview here: https://docs-3vr59qbq9-goteleport.vercel.app/docs/ver/preview

The backend throughput metrics discussed in the previous section map on to
latency metrics. Whenever the Auth Service increments one of the throughput
metrics, it reports one of the corresponding latency metrics. See the table
below for which throughput metrics miap to which latency metrics. Each metric

Suggested change:
- below for which throughput metrics miap to which latency metrics. Each metric
+ below for which throughput metrics map to which latency metrics. Each metric


Fixed in b6b8d4f

@ptgott ptgott force-pushed the paul.gottschling/40664-monitoring branch from 7443fd8 to b6b8d4f Compare September 18, 2024 20:22
@strideynet strideynet self-requested a review September 19, 2024 22:41
@ptgott ptgott force-pushed the paul.gottschling/40664-monitoring branch from b6b8d4f to 6a52832 Compare September 23, 2024 18:22

🤖 Vercel preview here: https://docs-et5246prw-goteleport.vercel.app/docs/ver/preview

@ptgott ptgott force-pushed the paul.gottschling/40664-monitoring branch from 6a52832 to 6f602be Compare September 27, 2024 17:27

🤖 Vercel preview here: https://docs-civ0o1ngr-goteleport.vercel.app/docs/ver/preview

@ptgott ptgott force-pushed the paul.gottschling/40664-monitoring branch from 6f602be to 4d69a4f Compare September 30, 2024 13:19

github-actions bot commented Oct 2, 2024

🤖 Vercel preview here: https://docs-g60stgyxr-goteleport.vercel.app/docs/ver/preview

ptgott commented Oct 3, 2024

@evanfreed I've added new information based on your feedback. Checking to make sure it's accurate. Thanks!


@evanfreed evanfreed left a comment

Looks good, one spelling find

docs/pages/admin-guides/management/diagnostics/metrics.mdx (outdated, resolved)
ptgott added 2 commits October 9, 2024 11:56
- Describe `backend_write_requests_failed_precondition_total`
- Include the precondition metric in the write availability formula.
- Turn the `registered_servers` discussion into a discussion of Teleport
  instance version, since it's not possible to group this metric by
  service and subtract the count of Auth Service/Proxy Service instances
  from the count of all registered services.
@ptgott ptgott force-pushed the paul.gottschling/40664-monitoring branch from dd0d4f9 to da46d40 Compare October 9, 2024 15:56

github-actions bot commented Oct 9, 2024

🤖 Vercel preview here: https://docs-qjwtm0d7k-goteleport.vercel.app/docs/ver/preview


github-actions bot commented Oct 9, 2024

🤖 Vercel preview here: https://docs-jj68b6zag-goteleport.vercel.app/docs/ver/preview

@ptgott ptgott added this pull request to the merge queue Oct 9, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 9, 2024
@ptgott ptgott added this pull request to the merge queue Oct 9, 2024
Merged via the queue into master with commit 7b38d5b Oct 9, 2024
40 checks passed
@ptgott ptgott deleted the paul.gottschling/40664-monitoring branch October 9, 2024 18:39
@public-teleport-github-review-bot

@ptgott See the table below for backport results.

| Branch | Result |
|---|---|
| branch/v14 | Create PR |
| branch/v15 | Create PR |
| branch/v16 | Create PR |

mvbrock pushed a commit that referenced this pull request Oct 16, 2024
* Add a guide to metrics for monitoring Teleport

* Respond to evanfreed feedback

Successfully merging this pull request may close these issues.

"Operating Teleport in Production: Monitoring" Guide
4 participants