
Investigation: Alert Routing in Multi-Provider Setup #3801

Closed · 3 tasks
Rotfuks opened this issue Dec 17, 2024 · 18 comments
@Rotfuks
Contributor

Rotfuks commented Dec 17, 2024

Feedback from Jose:

On many of our PrometheusRules we page the team depending on the provider, using a variable like providerTeam that changes depending on the provider. This doesn't work for MCs supporting multiple providers. For example, we (Phoenix) are getting paged for vsphere clusters on wallaby just because wallaby is a capz MC. Has this been discussed? Are there potential solutions we can apply to page the right team?

See: https://gigantic.slack.com/archives/CLPMFRVU6/p1734009831450799
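For context, that pattern looks roughly like the sketch below (hypothetical; only the providerTeam name is taken from the feedback above, the exact templating is illustrative):

# Hypothetical PrometheusRule labels block: the team label is templated per installation,
# so on a capz MC every such alert pages the capz team, even when it fires for a vsphere WC.
labels:
  team: {{ .Values.providerTeam }}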

We need to define a process for alert routing for giantswarm teams with multi-provider setups.

Todo

  • "Document" the process of how alert routing for giantswarm teams (in multi-provider setups) currently works for review
  • List the alerts that are tricky in multi-provider setups having issues with routing to the right team
  • Review the current process and Alerts and Investigate how we could improve the Alert routing

Outcome

  • We know how we can improve the situation.
@github-project-automation github-project-automation bot moved this to Inbox 📥 in Roadmap Dec 17, 2024
@Rotfuks Rotfuks changed the title Alert Routing in Multi-Provider Setup Investigation: Alert Routing in Multi-Provider Setup Dec 17, 2024
@QuentinBisson

cc @JosephSalisbury so we had an issue :)

@QuentinBisson

Monitoring for KaaS components in a multi-provider setup works like this:

  • The observability operator detects the cluster provider based on the CAPI CR used for the cluster and configures the metrics agent's external labels
  • The agent sends the cluster metrics to Mimir
  • The Mimir ruler evaluates the alerts and sends them to Alertmanager with the team label set based on the provider

This usually works fine for WCs, as the cluster provider is quite static.
The real issue is with metrics from the MC that target WCs, like capi_cluster_info: this metric comes from the cluster-api-monitoring component running on the MC, so the provider in the metric is the MC provider and not the WC provider. That is where the main issues come from.
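To illustrate the failure mode with made-up values: a metric about a vsphere WC that is scraped by cluster-api-monitoring on a capz MC ends up carrying the MC's provider as an external label:

# scraped on the MC, so the external label says capz,
# even though "test-wc" actually runs on vsphere (values are illustrative)
capi_cluster_info{cluster_id="test-wc", provider="capz", ...}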

IMO we should get rid of this provider label set as an external label in our metrics and create a new metric that contains the provider label for a cluster.

In that case, whenever we need the provider label in a query, we will always be able to join on that metric.

As an example, we would have the following metrics:

# WC api server metric
kubernetes_info{cluster_id="test-wc", kubeVersion="1.29.1",  ...}
# MC api server metric
kubernetes_info{cluster_id="test-mc", kubeVersion="1.31.1",  ...}
## CAPI cluster metrics
capi_cluster_info{cluster_id="test-wc", provider="vsphere",  ...}
capi_cluster_info{cluster_id="test-mc", provider="capa",  ...}

Then we can easily join with:

kubernetes_info{cluster_id="test-wc", kubeVersion="1.29.1",  ...} * on (cluster_id) group_left() capi_cluster_info

Now, I agree this makes queries harder to read if we have multiple joins, so we could try to have this referential metric contain more information and thus reduce the number of joins. But this is probably the best solution there is, and we would also have a smaller metric footprint in the agents if we get rid of some of that clutter.

This would also remove the need for us to detect the CAPI provider in our operator, as it would all be provided by the cluster-api-monitoring app, which is KaaS-related.

@giantswarm/team-atlas before I ask about this to kaas, what do you think?

@QuentinBisson QuentinBisson self-assigned this Jan 13, 2025
@architectbot architectbot added the team/atlas Team Atlas label Jan 13, 2025
@Rotfuks
Contributor Author

Rotfuks commented Jan 14, 2025

I like the idea, I think it's quite clean :)

@TheoBrigitte
Member

I don't like the idea of using joins; this will add another layer of complexity on top of already complex queries.
I think it would be best if we could set the correct provider label for each WC.

@QuentinBisson

I don't see how adding one join will make queries harder, as it will also reduce the by clauses.

The main issue is that we cannot retrieve the provider label from the WC metric on the MC, and we need to join it anyway. So making this the default when we need it looks like a better fit to me than building the provider label all over the place when it's only useful outside of Alertmanager.
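As an illustrative sketch (metric names reused from the examples above), the join also lets aggregations drop provider from their by() clauses:

# relying on the external label: provider has to be carried through every by() clause
sum(kubernetes_info) by (cluster_id, provider)

# with the join: aggregate on cluster_id only, then attach the provider once
sum(kubernetes_info) by (cluster_id)
  * on(cluster_id) group_left(provider) capi_cluster_info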

@hervenicol

hervenicol commented Jan 14, 2025

That logic fits the metrics metadata idea, which sadly does not exist yet in Prometheus and Mimir.
However, with the current state of PromQL, the syntax for joining this "metadata" is quite ugly and clutters the queries IMO.

So, I'd love to split this proposal in 2 parts:

  • WC metrics coming from the WCs: currently they have the proper provider set, so let's not touch them for the moment.
  • WC metrics coming from the MC: these ones have the MC's provider set, which can be a problem. They'd be better off with no provider set, so we're not misled.

The problem is that the provider is set by the agent (Alloy) via external labels. So I'm not sure we can exclude a target from these common external labels?
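For reference, this is roughly how those external labels are set in Alloy (a minimal sketch; the URL and label values are illustrative): they are attached on the remote_write component, so they apply to every sample the agent ships, not per target.

prometheus.remote_write "default" {
  endpoint {
    url = "https://mimir.example.test/api/v1/push"
  }

  // applied to all samples sent by this agent, regardless of which target they were scraped from
  external_labels = {
    cluster_id = "wallaby",
    provider   = "capz",
  }
}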

@QuentinBisson

@hervenicol no, we cannot exclude it from the external labels unless we use a custom agent for those targets only.

And I'm not sure how removing the provider from the metric would solve the alerting issues, as we still need to find the team responsible for a given component, right?

@hervenicol

And I'm not sure how removing the provider from the metric would solve the alerting issues, as we still need to find the team responsible for a given component, right?

It avoids being misled by an existing-but-wrong provider label.
Which means we know for which metrics we need a join to add the provider label from cluster information.

@QuentinBisson

But then in that case we might as well remove it from everywhere right?

@QuentinBisson

@T-Kukawka so after discussing it with the team, the easy solution is that you update your alerts to extract the correct provider like so:

app_operator_app_info * on(cluster_id) group_left(provider) sum(kubernetes_build_info) by (cluster_id, provider)

This query matches the metric on the left with the metric on the right on cluster_id and extracts the provider from the metric on the right.
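As a minimal sketch of what that looks like inside a PrometheusRule (the alert name, team names, and mapping are hypothetical, not taken from an actual rule; only the expr join is from the query above):

- alert: AppInfoProviderRoutingExample
  # condition omitted; this only illustrates the join plus provider-based routing
  expr: |
    app_operator_app_info
      * on(cluster_id) group_left(provider)
        sum(kubernetes_build_info) by (cluster_id, provider)
  labels:
    # provider now comes from the joined WC metric, so team routing keys off the right value
    team: '{{ if eq $labels.provider "vsphere" }}team-a{{ else }}team-b{{ end }}'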

cc @JosephSalisbury

@Rotfuks Rotfuks self-assigned this Jan 15, 2025
@T-Kukawka
Contributor

@yulianedyalkova is this something team Tenet can do? Most of such paging alerts are actually in Tenet territory.

@JosephSalisbury
Contributor

@QuentinBisson is it possible to make this more transparent to the other teams? this joining solution is kind of tough and feels like it patches an underlying problem

if i understand correctly, the problem is that the provider label can refer to either:

  • the provider of the cluster the metric is scraped from (i.e. the source)
  • the provider of the cluster the metric is about (i.e. some kind of target)

is it possible to make this clearer in the schema? does it make more sense to have two labels? is there some other way to make this clearer?

@QuentinBisson

This solution is definitely a patch yes and you understood the problem fine :)

There are multiple reasons why 2 labels are not a good option:

  • Cardinality and storage cost of the extra metrics for 3-4 alerts seems like a bad choice
  • Having the provider of the cluster the metric is about means that shared exporters like app-operator need to be aware of the actual provider of the CAPI cluster, which they should not need to know anything about. Whenever we add a new provider, this would require a new release, and that is really bad.

TBH, I really don't understand why people do not like joins in the company when this is one of the most basic concepts in PromQL :D

@Rotfuks
Contributor Author

Rotfuks commented Jan 15, 2025

Not just in PromQL :D Coming from backend development, SQL and even GraphQL, this approach just sounded like a case of "why haven't we done this until now?" :D

@JosephSalisbury
Contributor

JosephSalisbury commented Jan 15, 2025

actually fair

"for 3-4 alerts" - if it's less than a handful of alerts, having something that needs a larger comment in those alerts should be reasonable

and if we keep the provider ambiguity in mind when we update the schema in future, that's cool

@QuentinBisson

@yulianedyalkova I've tested the join on wallaby and it does not work in some cases (for instance, the lille-dev cluster there is not reporting any kubernetes_build_info metric).

So the join needs to happen on the capi_cluster_info metric, but then this metric needs to have a provider label set to the correct value. So I think we need a recording rule that can derive the correct provider based on the infrastructure CR :(
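A rough sketch of what such a recording rule could look like, assuming capi_cluster_info exposes the infrastructure kind in a label (the label name, record name, and provider mapping below are assumptions, not confirmed):

groups:
  - name: capi.provider
    rules:
      - record: capi_cluster_info:provider
        # map the infrastructure CR kind to a provider label; extend with one branch per provider
        expr: |
          label_replace(capi_cluster_info{infrastructure_reference_kind="VSphereCluster"}, "provider", "vsphere", "cluster_id", ".*")
            or
          label_replace(capi_cluster_info{infrastructure_reference_kind="AzureCluster"}, "provider", "capz", "cluster_id", ".*")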

@QuentinBisson

I can't promise that all alerts are solved, but here is the rough idea: giantswarm/prometheus-rules#1474

@QuentinBisson

This PR is now merged. I will assume that the remaining alerts can be fixed over time ;) Closing this.

@github-project-automation github-project-automation bot moved this from Inbox 📥 to Validation ☑️ in Roadmap Jan 21, 2025
