
Investigation: Alert Routing in Multi-Provider Setup #3801

Closed · 3 tasks
Rotfuks opened this issue Dec 17, 2024 · 18 comments
@Rotfuks
Contributor

Rotfuks commented Dec 17, 2024

Feedback from Jose:

On many of our PrometheusRules we page the team depending on the provider, using a variable like providerTeam that changes depending on the provider. This doesn't work for MCs supporting multiple providers. For example, we (Phoenix) are getting paged for vsphere clusters on wallaby just because wallaby is a capz MC. Has this been discussed? Are there potential solutions we can apply to page the right team?

See: https://gigantic.slack.com/archives/CLPMFRVU6/p1734009831450799
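For context, that pattern looks roughly like the sketch below (hypothetical; only the providerTeam name is taken from the feedback above, the exact templating is illustrative):

# Hypothetical PrometheusRule labels block: the team label is templated per installation,
# so on a capz MC every such alert pages the capz team, even when it fires for a vsphere WC.
labels:
  team: {{ .Values.providerTeam }}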

We need to define a process for alert routing for giantswarm teams with multi-provider setups.

Todo

  • "Document" the process of how alert routing for giantswarm teams (in multi-provider setups) currently works for review
  • List the alerts that are tricky in multi-provider setups having issues with routing to the right team
  • Review the current process and Alerts and Investigate how we could improve the Alert routing

Outcome

  • We know how we can improve the situation.
@github-project-automation github-project-automation bot moved this to Inbox 📥 in Roadmap Dec 17, 2024
@Rotfuks Rotfuks changed the title Alert Routing in Multi-Provider Setup Investigation: Alert Routing in Multi-Provider Setup Dec 17, 2024
@QuentinBisson

cc @JosephSalisbury so we had an issue :)

@QuentinBisson

Monitoring for KaaS components in a multi-provider setup works like this:

  • The observability operator detects the cluster provider based on the CAPI CR used for the cluster and configures the metrics agent's external labels
  • The agent sends the cluster metrics to Mimir
  • The Mimir ruler evaluates the alerts and sends them to Alertmanager with the team label set based on the provider

This usually works fine for WCs, as the cluster provider is quite static.
The real issue is with metrics from the MC that target WCs, like capi_cluster_info: this metric comes from the cluster-api-monitoring component running on the MC, so the provider in the metric is the MC provider and not the WC provider. That is where the main issues come from.
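To illustrate the failure mode with made-up values: a metric about a vsphere WC that is scraped by cluster-api-monitoring on a capz MC ends up carrying the MC's provider as an external label:

# scraped on the MC, so the external label says capz,
# even though "test-wc" actually runs on vsphere (values are illustrative)
capi_cluster_info{cluster_id="test-wc", provider="capz", ...}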

IMO we should get rid of this provider label set as an external label in our metrics and create a new metric that contains the provider label for a cluster.

In that case, whenever we need the provider label in a query, we will always be able to join on that metric.

As an example, we would have the following metrics:

# WC api server metric
kubernetes_info{cluster_id="test-wc", kubeVersion="1.29.1",  ...}
# MC api server metric
kubernetes_info{cluster_id="test-mc", kubeVersion="1.31.1",  ...}
## CAPI cluster metrics
capi_cluster_info{cluster_id="test-wc", provider="vsphere",  ...}
capi_cluster_info{cluster_id="test-mc", provider="capa",  ...}

Then we can easily join with:

kubernetes_info{cluster_id="test-wc", kubeVersion="1.29.1",  ...} * on (cluster_id) group_left() capi_cluster_info

Now, I agree this makes queries harder to read if we have multiple joins, so we could try to have this referential metric contain more information and thus reduce the number of joins. But this is probably the best solution there is, and we would also have a smaller metric footprint in the agents if we get rid of some of that clutter.

This would also remove the need for us to detect the CAPI provider in our operator, as it would all be provided by the cluster-api-monitoring app, which is KaaS-related.

@giantswarm/team-atlas before I ask about this to kaas, what do you think?

@QuentinBisson QuentinBisson self-assigned this Jan 13, 2025
@architectbot architectbot added the team/atlas Team Atlas label Jan 13, 2025
@Rotfuks
Contributor Author

Rotfuks commented Jan 14, 2025

I like the idea, I think it's quite clean :)

@TheoBrigitte
Member

I don't like the idea of using joins; this will add another layer of complexity on top of already complex queries.
I think it would be best if we could set the correct provider label for each WC.

@QuentinBisson

I don't see how adding one join will make queries harder, as it will also reduce the by clauses.

The main issue is that we cannot retrieve the provider label from the WC metric on the MC, and we need to join it anyway. So making this the default when we need it looks like a better fit to me than building the provider label all over the place when it's only useful outside of Alertmanager.
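As an illustrative sketch (metric names reused from the examples above), the join also lets aggregations drop provider from their by() clauses:

# relying on the external label: provider has to be carried through every by() clause
sum(kubernetes_info) by (cluster_id, provider)

# with the join: aggregate on cluster_id only, then attach the provider once
sum(kubernetes_info) by (cluster_id)
  * on(cluster_id) group_left(provider) capi_cluster_info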

@hervenicol

hervenicol commented Jan 14, 2025

That logic fits the metrics metadata idea, which sadly does not exist yet in Prometheus and Mimir.
However, with the current state of PromQL, the syntax for joining this "metadata" is quite ugly and clutters the queries IMO.

So, I'd love to split this proposal in 2 parts:

  • WC metrics coming from the WCs: currently they have the proper provider set, so let's not touch them for the moment.
  • WC metrics coming from the MC: these ones have the MC's provider set, which can be a problem. They'd be better off with no provider set, so we're not misled.

The problem is that the provider is set by the agent (Alloy) via external labels. So I'm not sure we can exclude a target from these common external labels?
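For reference, this is roughly how those external labels are set in Alloy (a minimal sketch; the URL and label values are illustrative): they are attached on the remote_write component, so they apply to every sample the agent ships, not per target.

prometheus.remote_write "default" {
  endpoint {
    url = "https://mimir.example.test/api/v1/push"
  }

  // applied to all samples sent by this agent, regardless of which target they were scraped from
  external_labels = {
    cluster_id = "wallaby",
    provider   = "capz",
  }
}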

@QuentinBisson

@hervenicol no, we cannot exclude it from the external labels unless we use a custom agent for those targets only.

And I'm not sure how removing the provider from the metric would solve the alerting issues, as we still need to find the team responsible for a given component, right?

@hervenicol

And I'm not sure how removing the provider from the metric would solve the alerting issues, as we still need to find the team responsible for a given component, right?

It avoids being misled by an existing-but-wrong provider label.
Which means we know for which metrics we need a join to add the provider label from cluster information.

@QuentinBisson

But then in that case we might as well remove it from everywhere right?

@QuentinBisson

@T-Kukawka so after discussing it with the team, the easy solution is that you update your alerts to extract the correct provider like so:

app_operator_app_info * on(cluster_id) group_left(provider) sum(kubernetes_build_info) by (cluster_id, provider)

This query matches the metric on the left with the metric on the right on cluster_id and extracts the provider from the metric on the right.
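As a minimal sketch of what that looks like inside a PrometheusRule (the alert name, team names, and mapping are hypothetical, not taken from an actual rule; only the expr join is from the query above):

- alert: AppInfoProviderRoutingExample
  # condition omitted; this only illustrates the join plus provider-based routing
  expr: |
    app_operator_app_info
      * on(cluster_id) group_left(provider)
        sum(kubernetes_build_info) by (cluster_id, provider)
  labels:
    # provider now comes from the joined WC metric, so team routing keys off the right value
    team: '{{ if eq $labels.provider "vsphere" }}team-a{{ else }}team-b{{ end }}'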

cc @JosephSalisbury

@Rotfuks Rotfuks self-assigned this Jan 15, 2025
@T-Kukawka
Contributor

@yulianedyalkova is this something team Tenet can do? Most of such paging alerts are actually in Tenet territory.

@JosephSalisbury
Contributor

@QuentinBisson is it possible to make this more transparent to the other teams? this joining solution is kind of tough and feels like it patches an underlying problem

if i understand correctly, the problem is that the provider label can refer to either:

  • the provider of the cluster the metric is scraped from (i.e. the source)
  • the provider of the cluster the metric is about (i.e. some kind of target)

is it possible to make this clearer in the schema? does it make more sense to have two labels? is there some other way to make this clearer?

@QuentinBisson

This solution is definitely a patch yes and you understood the problem fine :)

There are multiple reasons why 2 labels are not a good option:

  • Cardinality and storage cost of the extra metrics for 3-4 alerts seems like a bad choice
  • Having the provider of the cluster the metric is about means that shared exporters like app-operator need to be aware of the actual provider of the CAPI cluster, which they should not need to know anything about. Whenever we add a new provider, this would require a new release, and that is really bad.

TBH, I really don't understand why people do not like joins in the company when this is one of the most basic concepts in PromQL :D

@Rotfuks
Contributor Author

Rotfuks commented Jan 15, 2025

Not just in PromQL :D Coming from backend development, SQL and even GraphQL, this approach just sounded like a case of "why haven't we done this until now?" :D

@JosephSalisbury
Contributor

JosephSalisbury commented Jan 15, 2025

actually fair

"for 3-4 alerts" - if it's less than a handful of alerts, having something that needs a larger comment in those alerts should be reasonable

and if we keep the provider ambiguity in mind when we update the schema in future, that's cool

@QuentinBisson

@yulianedyalkova I've tested the join on wallaby and it does not work in some cases (for instance, the lille-dev cluster there is not reporting any kubernetes_build_info metric).

So the join needs to happen on the capi_cluster_info metric, but then this metric needs to have a provider label set to the correct value. So I think we need a recording rule that can derive the correct provider based on the infrastructure CR :(
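A rough sketch of what such a recording rule could look like, assuming capi_cluster_info exposes the infrastructure kind in a label (the label name, record name, and provider mapping below are assumptions, not confirmed):

groups:
  - name: capi.provider
    rules:
      - record: capi_cluster_info:provider
        # map the infrastructure CR kind to a provider label; extend with one branch per provider
        expr: |
          label_replace(capi_cluster_info{infrastructure_reference_kind="VSphereCluster"}, "provider", "vsphere", "cluster_id", ".*")
            or
          label_replace(capi_cluster_info{infrastructure_reference_kind="AzureCluster"}, "provider", "capz", "cluster_id", ".*")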

@QuentinBisson

I can't promise that all alerts are solved, but here is the rough idea: giantswarm/prometheus-rules#1474

@QuentinBisson

This PR is now merged. I will assume that the remaining alerts can be fixed over time ;) Closing this.

@github-project-automation github-project-automation bot moved this from Inbox 📥 to Validation ☑️ in Roadmap Jan 21, 2025
