Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFD 197 - Prometheus metrics guidelines #51139

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

hugoShaka
Copy link
Contributor

Following the addition of the internal metric registry in Teleport, here's a mini RFD about how to add metrics in teleport.

Rendered version.

@github-actions github-actions bot requested review from atburke and tigrato January 16, 2025 21:37
@github-actions github-actions bot added rfd Request for Discussion size/md labels Jan 16, 2025

#### Do

- Use `teleport.MetricsNamespace` as the namespace.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this to be true for every executable? Should tbot metrics be prefixed by tbot?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see this being nice from a product PoV - as a customer I can always rely on Teleport-related stuff landing in the "teleport" namespace.

The counterpoint is that other binaries get worse subsystems - we'd have to do "tbot-sub1", "tbot-sub2", etc. It's not really a problem if "tbot" never sub-divides its metrics, for example.

Maybe, as a rule of thumb, use the default namespace but the rule could be lifted case-by-case (for example for a significantly "large" 3rd binary).

Copy link
Contributor

@rosstimothy rosstimothy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any gotchas in converting existing metrics from the global registry to a local registry?

rfd/0197-prometheus-metrics.md Outdated Show resolved Hide resolved
rfd/0197-prometheus-metrics.md Show resolved Hide resolved
rfd/0197-prometheus-metrics.md Show resolved Hide resolved

#### Do

- Use `teleport.MetricsNamespace` as the namespace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any strategies for migrating legacy metrics which do not have a namespace? Should we register the same metric with and without the namespace?

Copy link
Contributor Author

@hugoShaka hugoShaka Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently I did not suggest a migration strategy because this is a pretty disruptive change. I don't like double-registering metrics very much because it increases cardinality.

I guess we could pull a metric breaking change in a major version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be terrible to double register the same metric in major version N and announce that the non-namespaced variants will be removed in version N+2? That would give people ~8 months of notice to adjust to use the correct metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's acceptable, maybe we can gate the behaviour behind a TELEPORT_UNSTABLE_ env var or something, so if someone has an issue with this they can disable the duplication and use only the new or old metrics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth checking which metrics would be affected and how many labels they have, it could be that the problem is small-ish.

Do we have metrics in the "wrong" namespaces that we would like to fix too?

@hugoShaka hugoShaka force-pushed the rfd/0197-prometheus-metrics branch from 52d7674 to 164e16e Compare January 16, 2025 22:34
Copy link
Contributor

@codingllama codingllama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Hugo!

LGTM.


- Take the local in-process registry as an argument in your service constructor/main routine, like you would receive a
logger, and register your metrics against it.
- Pass the registry as a `promtetheus.Registerer`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why by value? That seems to go against prometheus.NewRegistry, for example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*prometheus.Registry implements the prometheus.Registerer, prometheus.Gatherer, and prometheus.Collector interfaces.

We should prefer passing the prometheus.Registerer interface over *prometheus.Registry for 2 reasons:

  • the code registering metrics has no legitimate reason to be allowed to collect metrics
  • this leaves room for custom registry implementations, which might be very useful in tests

The GoDoc also agrees:

Registerer is the interface for the part of a registry in charge of registering and unregistering. Users of custom registries should use Registerer as type for registration purposes (rather than the Registry type directly). In that way, they are free to use custom Registerer implementation (e.g. for testing purposes).

Note: here the term custom is ambiguous, I understand the "custom registries" as "not the default global registry, but still a *prometherus.Registry", and "custom Registerer implementation" as "the register we developed".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I was confusing prometheus.Registry (the struct) and prometheus.Registerer (the interface). Makes sense now.

Maybe call out the difference:

Suggested change
- Pass the registry as a `promtetheus.Registerer`
- Pass the registry as a `prometheus.Registerer` (instead of `*prometheus.Registry`)


#### Do

- Store metrics in a private struct in your package
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also curious about this one. If the correct registry is used then why does the scope matter? I would assume double-registration is by name, so the variable scope wouldn't make a difference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example of how using package-scoped metrics lead to inconsistent metrics if the package is used several times in different registries: https://go.dev/play/p/lrOD4O7Hm6m

Sorry, this does not build in the playground, I put the output in the comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So because the underlying gauge is the same its value will "leak" to both registries, whereas if it were a variable it would be separate per-registry/service, correct?

I suppose this is mainly a concern for testing but no objections on my part.

Copy link
Contributor Author

@hugoShaka hugoShaka Jan 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this is mainly a concern for testing but no objections on my part.

This might become a large problem in testing if we have code doing fun things with the metrics (i.e. Unregistering old ones to avoid infinite cardinality explosion). This is the case for the the autoupdate rollout controller. In this case, leaking registries will break the metric cleanup logic.

This is also an issue for services/libs that can be started multiple times (e.g. embeddedtbot, or hosted plugins), or even utils libs that might be used by different services.


#### Do

- Store metrics in a private struct in your package
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do like the suggestion for a metrics-focused struct. Is the intent that this gets added to the service as a field? I think it's worth showing it in the example too.

Maybe incentivize naming the metrics type after the service too:

type fooMetrics struct {
  fooCounter *prometheus.CounterVec
  barGauge *prometheus.Gauge
}

type fooService struct {
  metrics *fooMetrics
}


#### Do

- Use `teleport.MetricsNamespace` as the namespace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth checking which metrics would be affected and how many labels they have, it could be that the problem is small-ish.

Do we have metrics in the "wrong" namespaces that we would like to fix too?


#### Don't

- Register against the global prometheus registry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: I think "Don't" sections could be mainly bullets, so readers focus more on the "Do"s (and have less risk of copying the Don't code).


#### Do

- Use `teleport.MetricsNamespace` as the namespace.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see this being nice from a product PoV - as a customer I can always rely on Teleport-related stuff landing in the "teleport" namespace.

The counterpoint is that other binaries get worse subsystems - we'd have to do "tbot-sub1", "tbot-sub2", etc. It's not really a problem if "tbot" never sub-divides its metrics, for example.

Maybe, as a rule of thumb, use the default namespace but the rule could be lifted case-by-case (for example for a significantly "large" 3rd binary).


- Put user input directly in the labels
- Create large (1k+) metric combinations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we talk about metric conflicts? Global vs local, local vs local, in what situations it errors vs it gets "summed", etc?

What happens if a gauge conflicts?

Anything else we need to know?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfd Request for Discussion size/md
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants