-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFD 197 - Prometheus metrics guidelines #51139
base: master
Are you sure you want to change the base?
Conversation
|
||
#### Do | ||
|
||
- Use `teleport.MetricsNamespace` as the namespace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want this to be true for every executable? Should tbot metrics be prefixed by tbot
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see this being nice from a product PoV - as a customer I can always rely on Teleport-related stuff landing in the "teleport" namespace.
The counterpoint is that other binaries get worse subsystems - we'd have to do "tbot-sub1", "tbot-sub2", etc. It's not really a problem if "tbot" never sub-divides its metrics, for example.
Maybe, as a rule of thumb, use the default namespace but the rule could be lifted case-by-case (for example for a significantly "large" 3rd binary).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any gotchas in converting existing metrics from the global registry to a local registry?
|
||
#### Do | ||
|
||
- Use `teleport.MetricsNamespace` as the namespace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any strategies for migrating legacy metrics which do not have a namespace? Should we register the same metric with and without the namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently I did not suggest a migration strategy because this is a pretty disruptive change. I don't like double-registering metrics very much because it increases cardinality.
I guess we could pull a metric breaking change in a major version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be terrible to double register the same metric in major version N and announce that the non-namespaced variants will be removed in version N+2? That would give people ~8 months of notice to adjust to use the correct metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's acceptable, maybe we can gate the behaviour behind a TELEPORT_UNSTABLE_ env var or something, so if someone has an issue with this they can disable the duplication and use only the new or old metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth checking which metrics would be affected and how many labels they have, it could be that the problem is small-ish.
Do we have metrics in the "wrong" namespaces that we would like to fix too?
52d7674
to
164e16e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Hugo!
LGTM.
|
||
- Take the local in-process registry as an argument in your service constructor/main routine, like you would receive a | ||
logger, and register your metrics against it. | ||
- Pass the registry as a `promtetheus.Registerer` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why by value? That seems to go against prometheus.NewRegistry, for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*prometheus.Registry
implements the prometheus.Registerer
, prometheus.Gatherer
, and prometheus.Collector
interfaces.
We should prefer passing the prometheus.Registerer
interface over *prometheus.Registry
for 2 reasons:
- the code registering metrics has no legitimate reason to be allowed to collect metrics
- this leaves room for custom registry implementations, which might be very useful in tests
The GoDoc also agrees:
Registerer is the interface for the part of a registry in charge of registering and unregistering. Users of custom registries should use Registerer as type for registration purposes (rather than the Registry type directly). In that way, they are free to use custom Registerer implementation (e.g. for testing purposes).
Note: here the term custom is ambiguous, I understand the "custom registries" as "not the default global registry, but still a *prometherus.Registry
", and "custom Registerer implementation" as "the register we developed".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I was confusing prometheus.Registry (the struct) and prometheus.Registerer (the interface). Makes sense now.
Maybe call out the difference:
- Pass the registry as a `promtetheus.Registerer` | |
- Pass the registry as a `prometheus.Registerer` (instead of `*prometheus.Registry`) |
|
||
#### Do | ||
|
||
- Store metrics in a private struct in your package |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also curious about this one. If the correct registry is used then why does the scope matter? I would assume double-registration is by name, so the variable scope wouldn't make a difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example of how using package-scoped metrics lead to inconsistent metrics if the package is used several times in different registries: https://go.dev/play/p/lrOD4O7Hm6m
Sorry, this does not build in the playground, I put the output in the comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So because the underlying gauge is the same its value will "leak" to both registries, whereas if it were a variable it would be separate per-registry/service, correct?
I suppose this is mainly a concern for testing but no objections on my part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose this is mainly a concern for testing but no objections on my part.
This might become a large problem in testing if we have code doing fun things with the metrics (i.e. Unregistering old ones to avoid infinite cardinality explosion). This is the case for the the autoupdate rollout controller. In this case, leaking registries will break the metric cleanup logic.
This is also an issue for services/libs that can be started multiple times (e.g. embeddedtbot
, or hosted plugins), or even utils libs that might be used by different services.
|
||
#### Do | ||
|
||
- Store metrics in a private struct in your package |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do like the suggestion for a metrics-focused struct. Is the intent that this gets added to the service as a field? I think it's worth showing it in the example too.
Maybe incentivize naming the metrics type after the service too:
type fooMetrics struct {
fooCounter *prometheus.CounterVec
barGauge *prometheus.Gauge
}
type fooService struct {
metrics *fooMetrics
}
|
||
#### Do | ||
|
||
- Use `teleport.MetricsNamespace` as the namespace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth checking which metrics would be affected and how many labels they have, it could be that the problem is small-ish.
Do we have metrics in the "wrong" namespaces that we would like to fix too?
|
||
#### Don't | ||
|
||
- Register against the global prometheus registry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: I think "Don't" sections could be mainly bullets, so readers focus more on the "Do"s (and have less risk of copying the Don't code).
|
||
#### Do | ||
|
||
- Use `teleport.MetricsNamespace` as the namespace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see this being nice from a product PoV - as a customer I can always rely on Teleport-related stuff landing in the "teleport" namespace.
The counterpoint is that other binaries get worse subsystems - we'd have to do "tbot-sub1", "tbot-sub2", etc. It's not really a problem if "tbot" never sub-divides its metrics, for example.
Maybe, as a rule of thumb, use the default namespace but the rule could be lifted case-by-case (for example for a significantly "large" 3rd binary).
|
||
- Put user input directly in the labels | ||
- Create large (1k+) metric combinations | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we talk about metric conflicts? Global vs local, local vs local, in what situations it errors vs it gets "summed", etc?
What happens if a gauge conflicts?
Anything else we need to know?
Following the addition of the internal metric registry in Teleport, here's a mini RFD about how to add metrics in teleport.
Rendered version.