Investigation: Alert Routing in Multi-Provider Setup #3801
Comments
cc @JosephSalisbury so we had an issue :) |
Monitoring for kaas components in multi-provider setup works like this:
This usually works fine for WCs as the cluster provider is quite static. IMO, we should get rid of this provider label set as an external label in our metrics and create a new metric that contains the provider label for a cluster. That way, whenever we need the provider label in a query, we can always join on that metric. As an example, we would have the following metrics:
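For illustration, something like this (metric names reuse `capi_cluster_info` and `kubernetes_build_info` from further down this thread; the label values are only examples):

```promql
# Referential metric: carries the provider label for each cluster
capi_cluster_info{cluster_id="demo01", provider="capa"}

# Component metric: no provider label, only the cluster it comes from
kubernetes_build_info{cluster_id="demo01"}
```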
Then we can easily join with:
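Roughly along these lines (a sketch, not necessarily the exact query):

```promql
# Match the component metric with the referential metric on cluster_id and
# copy the provider label over from the referential metric.
kubernetes_build_info
  * on(cluster_id) group_left(provider)
  capi_cluster_info
```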
Now I agree this makes queries harder to read if we have multiple joins, so we could try to make this referential metric carry more information to reduce the number of joins, but this is probably the best solution there is, and we would also have a smaller metric footprint in the agents if we get rid of some of that clutter. This would also remove the need for us to detect the CAPI provider in our operator, as it would all be provided by the cluster-api-monitoring app, which is kaas related. @giantswarm/team-atlas before I ask kaas about this, what do you think? |
I like the idea, I think it's quite clean :) |
I don't like the idea of using joins; this will add another layer of complexity on top of already complex queries. |
I don't see how adding one join makes queries harder, as it will also reduce the by clauses. The main issue is that we cannot retrieve the provider label from the WC metric in the MC, so we need to join it anyway; making this the default when we need it looks like a better fit to me than building the provider label all over the place when it's only useful outside of alertmanager |
That logic fits the metrics metadata idea, which sadly does not exist yet in prometheus and mimir. So, I'd love to split this proposal in 2 parts:
The problem is the provider is set by the agent (alloy) via external labels. So I'm not sure we can exclude a target from these common external labels? |
@hervenicol no we cannot exclude it from the external labels unless we use a custom agent for those only. And I'm not sure how removing the provider from the metric would solve the alerting issues as we still need to find the team responsible for a given component right? |
It avoids being misled by an existing-but-wrong provider label. |
But then in that case we might as well remove it from everywhere right? |
@T-Kukawka so after discussing it with the team, the easy solution is that you update your alerts to extract the correct provider like so:
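Something along these lines, where `my_component_metric` is only a placeholder for the alert's existing expression:

```promql
# Placeholder alert expression on the left; the provider label is pulled in
# from a metric scraped inside the workload cluster, matched on cluster_id.
my_component_metric
  * on(cluster_id) group_left(provider)
  kubernetes_build_info
```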
This query will match the metric on the left with the metric on the right on cluster_id and will extract the provider from the metric on the right |
@yulianedyalkova is this something team Tenet can do? Most of such paging alerts are actually in Tenet territory |
@QuentinBisson is it possible to make this more transparent to the other teams? This joining solution is kind of tough and feels like it patches an underlying problem. If I understand correctly, the problem is that the
is it possible to make this clearer in the schema? does it make more sense to have two labels? is there some other way to make this clearer? |
This solution is definitely a patch, yes, and you understood the problem fine :) There are multiple reasons why having 2 labels is not a good idea:
TBH, I really don't understand why people do not like joins in the company when this is one of the most basic concepts in PromQL :D |
Not just in PromQL :D Coming from Backend Development, SQL and even GraphQL, this approach just sounded like a case of "why don't we do it until now?" :D |
actually fair, "for 3-4 alerts" - if it's less than a handful of alerts, having something that needs a larger comment in those alerts should be reasonable, and we keep the provider ambiguity in mind; if we update the schema in the future, cool |
@yulianedyalkova I've tested the join on wallaby and it does not work in some cases (for instance, the lille-dev cluster there is not reporting any kubernetes_build_info metric). So the join needs to happen on the capi_cluster_info metric, but then this metric needs to have a provider label set to the correct value. So I think we need a recording rule that can get the correct provider based on the infrastructure CR :(
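A rough sketch of what such a recording rule expression could look like (purely illustrative: the `infrastructure_reference_kind` label on `capi_cluster_info` and the kind-to-provider mapping are assumptions, not the actual implementation):

```promql
# Hypothetical: derive the correct provider from the kind of the
# infrastructure CR referenced by the Cluster CR, recorded as a
# cluster_id -> provider mapping metric.
  label_replace(capi_cluster_info{infrastructure_reference_kind="AWSCluster"}, "provider", "capa", "", "")
or
  label_replace(capi_cluster_info{infrastructure_reference_kind="AzureCluster"}, "provider", "capz", "", "")
```
|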
I can't promise that all alerts are solved but here is the rough idea giantswarm/prometheus-rules#1474 |
This PR is now merged. I will assume that the remaining alerts can be fixed over time ;) Closing this |
Feedback from Jose:
See: https://gigantic.slack.com/archives/CLPMFRVU6/p1734009831450799
We need to create a process for how we do alert routing for giantswarm teams with multi-provider setups
Todo
Outcome