RFD 197 - Prometheus metrics guidelines #51139

hugoShaka · 2025-01-16T21:37:13Z

Following the addition of the internal metric registry in Teleport, here's a mini RFD about how to add metrics in teleport.

Rendered version.

hugoShaka · 2025-01-16T21:40:51Z

rfd/0197-prometheus-metrics.md

+
+#### Do
+
+- Use `teleport.MetricsNamespace` as the namespace.


Do we want this to be true for every executable? Should tbot metrics be prefixed by tbot?

I can see this being nice from a product PoV - as a customer I can always rely on Teleport-related stuff landing in the "teleport" namespace.

The counterpoint is that other binaries get worse subsystems - we'd have to do "tbot-sub1", "tbot-sub2", etc. It's not really a problem if "tbot" never sub-divides its metrics, for example.

Maybe, as a rule of thumb, use the default namespace but the rule could be lifted case-by-case (for example for a significantly "large" 3rd binary).

rosstimothy

Are there any gotchas in converting existing metrics from the global registry to a local registry?

rfd/0197-prometheus-metrics.md

rosstimothy · 2025-01-16T22:16:04Z

rfd/0197-prometheus-metrics.md

+
+#### Do
+
+- Use `teleport.MetricsNamespace` as the namespace.


Do we have any strategies for migrating legacy metrics which do not have a namespace? Should we register the same metric with and without the namespace?

Currently I did not suggest a migration strategy because this is a pretty disruptive change. I don't like double-registering metrics very much because it increases cardinality.

I guess we could pull a metric breaking change in a major version.

Would it be terrible to double register the same metric in major version N and announce that the non-namespaced variants will be removed in version N+2? That would give people ~8 months of notice to adjust to use the correct metrics.

I think that's acceptable, maybe we can gate the behaviour behind a TELEPORT_UNSTABLE_ env var or something, so if someone has an issue with this they can disable the duplication and use only the new or old metrics.

I think it's worth checking which metrics would be affected and how many labels they have, it could be that the problem is small-ish.

Do we have metrics in the "wrong" namespaces that we would like to fix too?

codingllama

Thanks, Hugo!

LGTM.

codingllama · 2025-01-20T19:53:14Z

rfd/0197-prometheus-metrics.md

+
+- Take the local in-process registry as an argument in your service constructor/main routine, like you would receive a
+  logger, and register your metrics against it.
+- Pass the registry as a `promtetheus.Registerer`


Why by value? That seems to go against prometheus.NewRegistry, for example.

*prometheus.Registry implements the prometheus.Registerer, prometheus.Gatherer, and prometheus.Collector interfaces.

We should prefer passing the prometheus.Registerer interface over *prometheus.Registry for 2 reasons:

the code registering metrics has no legitimate reason to be allowed to collect metrics

this leaves room for custom registry implementations, which might be very useful in tests

The GoDoc also agrees:

Registerer is the interface for the part of a registry in charge of registering and unregistering. Users of custom registries should use Registerer as type for registration purposes (rather than the Registry type directly). In that way, they are free to use custom Registerer implementation (e.g. for testing purposes).

Note: here the term custom is ambiguous, I understand the "custom registries" as "not the default global registry, but still a *prometherus.Registry", and "custom Registerer implementation" as "the register we developed".

My bad, I was confusing prometheus.Registry (the struct) and prometheus.Registerer (the interface). Makes sense now.

Maybe call out the difference:

Suggested change

- Pass the registry as a `promtetheus.Registerer`

- Pass the registry as a `prometheus.Registerer` (instead of `*prometheus.Registry`)

codingllama · 2025-01-20T19:54:21Z

rfd/0197-prometheus-metrics.md

+
+#### Do
+
+- Store metrics in a private struct in your package


I'm also curious about this one. If the correct registry is used then why does the scope matter? I would assume double-registration is by name, so the variable scope wouldn't make a difference?

Here's an example of how using package-scoped metrics lead to inconsistent metrics if the package is used several times in different registries: https://go.dev/play/p/lrOD4O7Hm6m

Sorry, this does not build in the playground, I put the output in the comments.

So because the underlying gauge is the same its value will "leak" to both registries, whereas if it were a variable it would be separate per-registry/service, correct?

I suppose this is mainly a concern for testing but no objections on my part.

I suppose this is mainly a concern for testing but no objections on my part.

This might become a large problem in testing if we have code doing fun things with the metrics (i.e. Unregistering old ones to avoid infinite cardinality explosion). This is the case for the the autoupdate rollout controller. In this case, leaking registries will break the metric cleanup logic.

This is also an issue for services/libs that can be started multiple times (e.g. embeddedtbot, or hosted plugins), or even utils libs that might be used by different services.

codingllama · 2025-01-20T19:55:57Z

rfd/0197-prometheus-metrics.md

+
+#### Do
+
+- Store metrics in a private struct in your package


I do like the suggestion for a metrics-focused struct. Is the intent that this gets added to the service as a field? I think it's worth showing it in the example too.

Maybe incentivize naming the metrics type after the service too:

type fooMetrics struct { fooCounter *prometheus.CounterVec barGauge *prometheus.Gauge } type fooService struct { metrics *fooMetrics }

codingllama · 2025-01-20T20:01:59Z

rfd/0197-prometheus-metrics.md

+
+#### Do
+
+- Use `teleport.MetricsNamespace` as the namespace.


I think it's worth checking which metrics would be affected and how many labels they have, it could be that the problem is small-ish.

Do we have metrics in the "wrong" namespaces that we would like to fix too?

codingllama · 2025-01-20T20:06:58Z

rfd/0197-prometheus-metrics.md

+
+#### Don't
+
+- Register against the global prometheus registry


Suggestion: I think "Don't" sections could be mainly bullets, so readers focus more on the "Do"s (and have less risk of copying the Don't code).

codingllama · 2025-01-20T20:09:19Z

rfd/0197-prometheus-metrics.md

+
+#### Do
+
+- Use `teleport.MetricsNamespace` as the namespace.


I can see this being nice from a product PoV - as a customer I can always rely on Teleport-related stuff landing in the "teleport" namespace.

The counterpoint is that other binaries get worse subsystems - we'd have to do "tbot-sub1", "tbot-sub2", etc. It's not really a problem if "tbot" never sub-divides its metrics, for example.

Maybe, as a rule of thumb, use the default namespace but the rule could be lifted case-by-case (for example for a significantly "large" 3rd binary).

codingllama · 2025-01-20T20:11:11Z

rfd/0197-prometheus-metrics.md

+
+- Put user input directly in the labels
+- Create large (1k+) metric combinations
+


Should we talk about metric conflicts? Global vs local, local vs local, in what situations it errors vs it gets "summed", etc?

What happens if a gauge conflicts?

Anything else we need to know?

RFD 197 - Prometheus metrics guidelines

11be523

hugoShaka requested review from zmb3, codingllama and rosstimothy January 16, 2025 21:37

github-actions bot requested review from atburke and tigrato January 16, 2025 21:37

github-actions bot added rfd Request for Discussion size/md labels Jan 16, 2025

hugoShaka commented Jan 16, 2025

View reviewed changes

rosstimothy reviewed Jan 16, 2025

View reviewed changes

Fix indent and trailing spaces

164e16e

hugoShaka force-pushed the rfd/0197-prometheus-metrics branch from 52d7674 to 164e16e Compare January 16, 2025 22:34

Add section about lint

92a08ed

codingllama reviewed Jan 20, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFD 197 - Prometheus metrics guidelines #51139

RFD 197 - Prometheus metrics guidelines #51139

hugoShaka commented Jan 16, 2025

hugoShaka Jan 16, 2025

codingllama Jan 20, 2025

rosstimothy left a comment

rosstimothy Jan 16, 2025

hugoShaka Jan 16, 2025 •

edited

Loading

rosstimothy Jan 17, 2025

hugoShaka Jan 17, 2025

codingllama Jan 20, 2025

codingllama left a comment

codingllama Jan 20, 2025

hugoShaka Jan 21, 2025

codingllama Jan 21, 2025

codingllama Jan 20, 2025

hugoShaka Jan 21, 2025

codingllama Jan 21, 2025

hugoShaka Jan 21, 2025 •

edited

Loading

codingllama Jan 20, 2025

codingllama Jan 20, 2025

codingllama Jan 20, 2025

codingllama Jan 20, 2025

codingllama Jan 20, 2025

	- Pass the registry as a `promtetheus.Registerer`
	- Pass the registry as a `prometheus.Registerer` (instead of `*prometheus.Registry`)


		- Put user input directly in the labels
		- Create large (1k+) metric combinations

RFD 197 - Prometheus metrics guidelines #51139

Are you sure you want to change the base?

RFD 197 - Prometheus metrics guidelines #51139

Conversation

hugoShaka commented Jan 16, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rosstimothy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hugoShaka Jan 16, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

codingllama left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hugoShaka Jan 21, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hugoShaka Jan 16, 2025 •

edited

Loading

hugoShaka Jan 21, 2025 •

edited

Loading