
Allow customisation of the name of the recording rules #1071

Open
marcemv90 opened this issue Feb 20, 2024 · 1 comment

marcemv90 commented Feb 20, 2024

Hi 👋,

First of all, thank you very much for creating Pyrra. We use it at our company and we fell in love with it from the very first moment it was up and running. Such a lovely and easy way of creating SLOs 😍.

Now let me give you some context.

Motivation

At our company we have a service that is deployed 3 times, each with a different configuration. The 3 instances of the service expose the same metrics, but the values are slightly different as the configurations are different between the 3.

The 3 instances use the same metric in their ServiceLevelObjective resources to count the total and the errors. Examples follow:

SLO of instance A:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  labels:
    pyrra.dev/slack_channel: my-slack-channel
    release: prometheus
  name: service-instance-a-http-errors
  namespace: my-namespace
spec:
  alerting:
    absent: true
    burnrates: true
  description: "SLO for Instance A of the service"
  indicator:
    ratio:
      errors:
        metric: my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-a"}
      grouping: null
      total:
        metric: my_service_http_requests_total{namespace="my-namespace", container="instance-a"}
  target: "99.9"
  window: 4w

SLO of instance B:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  labels:
    pyrra.dev/slack_channel: my-slack-channel
    release: prometheus
  name: service-instance-b-http-errors
  namespace: my-namespace
spec:
  alerting:
    absent: true
    burnrates: true
  description: "SLO for Instance B of the service"
  indicator:
    ratio:
      errors:
        metric: my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-b"}
      grouping: null
      total:
        metric: my_service_http_requests_total{namespace="my-namespace", container="instance-b"}
  target: "99.9"
  window: 4w

SLO of instance C:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  labels:
    pyrra.dev/slack_channel: my-slack-channel
    release: prometheus
  name: my-service-instance-c-http-errors
  namespace: my-namespace
spec:
  alerting:
    absent: true
    burnrates: true
  description: "SLO for Instance C of the service"
  indicator:
    ratio:
      errors:
        metric: my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-c"}
      grouping: null
      total:
        # This filters just by namespace, not by container
        metric: my_service_http_requests_total{namespace="my-namespace"}
  target: "99.9"
  window: 4w

Pyrra uses the names of the metrics in the queries to generate the names of the burn rate recording rules in Prometheus (removing the _total and _count suffixes, etc.). Since the 3 SLOs use the same source metric, Pyrra creates 3 sets of recording rules with the same name:

  • Rule group my-service-instance-a-http-errors

    • Recording rule my_service_http_requests:burnrate5m
    • Recording rule my_service_http_requests:burnrate30m
    • Recording rule my_service_http_requests:burnrate1h
    • Recording rule my_service_http_requests:burnrate2h
    • etc.
  • Rule group my-service-instance-b-http-errors

    • Recording rule my_service_http_requests:burnrate5m
    • Recording rule my_service_http_requests:burnrate30m
    • Recording rule my_service_http_requests:burnrate1h
    • Recording rule my_service_http_requests:burnrate2h
    • etc.
  • Rule group my-service-instance-c-http-errors

    • Recording rule my_service_http_requests:burnrate5m
    • Recording rule my_service_http_requests:burnrate30m
    • Recording rule my_service_http_requests:burnrate1h
    • Recording rule my_service_http_requests:burnrate2h
    • etc.
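To make the collision concrete, here is a rough sketch of what two of the generated rule groups could look like (hypothetical; the exact expressions and labels Pyrra emits may differ):

```yaml
# Hypothetical sketch: both groups record to the same rule name,
# derived from the shared source metric my_service_http_requests_total.
groups:
  - name: my-service-instance-a-http-errors
    rules:
      - record: my_service_http_requests:burnrate5m
        expr: |
          sum(rate(my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-a"}[5m]))
          /
          sum(rate(my_service_http_requests_total{namespace="my-namespace", container="instance-a"}[5m]))
        labels:
          slo: service-instance-a-http-errors
  - name: my-service-instance-b-http-errors
    rules:
      - record: my_service_http_requests:burnrate5m
        expr: |
          sum(rate(my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-b"}[5m]))
          /
          sum(rate(my_service_http_requests_total{namespace="my-namespace", container="instance-b"}[5m]))
        labels:
          slo: service-instance-b-http-errors
```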

The problem comes now. We use Mimir as long-term storage for Prometheus metrics, and Prometheus continuously fails to push some series via remote write with the following error:

ts=2024-02-20T10:22:54.985Z caller=dedupe.go:112 component=remote level=error remote_name=8f7417 url=https://mimir.mycompany.com/api/v1/push msg="non-recoverable error" count=500 exemplarCount=0 err="server returned HTTP status 400 Bad Request: failed pushing to ingester: user=my-user: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-02-20T10:21:12.398Z and is from series {__name__="my_service_http_requests:burnrate1d", cluster="my-cluster", container="instance-b", namespace="my-namespace", prometheus="monitoring/kube-prometheus-st-prometheus", slack_channel="my-slack-channel", slo="service-instance-b-http-errors", team="my-team"}"

It seems the recording rules get evaluated at the very same moment, and as they evaluate different things, the results differ. That causes a problem when remote-writing the series to long-term storage: the first series gets ingested just fine, but the rest fail because they try to push the same series with different values.

Proposed solution

I'm not a developer nor an expert in monitoring, so I hope this is not a silly suggestion, but to me, as an SRE, it makes sense to have the capability of customising the names of the Pyrra-generated recording rules, maybe using some annotation, like:

  • pyrra.dev/prometheus-slo-recording-rule-prefix
  • pyrra.dev/prometheus-slo-recording-rule-suffix
  • pyrra.dev/prometheus-slo-recording-rule-name
  • or whatever you think makes the most sense 😬
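
For illustration only, an SLO using such an annotation might look like this (hypothetical; none of these annotations exist in Pyrra today):

```yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: service-instance-a-http-errors
  namespace: my-namespace
  annotations:
    # Hypothetical annotation: would make Pyrra record
    # e.g. instance_a:my_service_http_requests:burnrate5m
    pyrra.dev/prometheus-slo-recording-rule-prefix: instance_a
spec:
  # ... same indicator/target/window as above ...
```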

I hope you can take this into consideration.

Thanks again for your great effort. Please let me know if you need us to provide more information.

Regards!

@metalmatze
Member

Hey! 👋
Thank you for the kind words, and we are happy to hear you enjoy using Pyrra! 🎉
Thank you for the detailed report too!

Right now I'm failing to understand why this breaks.
These metrics should be different due to their different labels. Time series are identified by their entire label set, including the __name__ metric name.

my_service_http_requests_total{namespace="my-namespace", container="instance-a"}
my_service_http_requests_total{namespace="my-namespace", container="instance-b"}
my_service_http_requests_total{namespace="my-namespace"}

The burn rate rules should generate something like the following:

my_service_http_requests:burnrate5m{namespace="my-namespace", container="instance-a"}
my_service_http_requests:burnrate5m{namespace="my-namespace", container="instance-b"}
my_service_http_requests:burnrate5m{namespace="my-namespace"}

These are all individual and distinct time series.
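One way to verify this would be to query the recorded series directly and compare their label sets (a sketch, using the slo label that appears in the error message above and assuming Pyrra attaches it to the recorded series):

```promql
# If these two queries return series with otherwise identical label sets,
# the recording rules are writing to the same series.
my_service_http_requests:burnrate5m{slo="service-instance-a-http-errors"}
my_service_http_requests:burnrate5m{slo="service-instance-b-http-errors"}
```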
That being said, I can see how the last metric only contains the namespace label and therefore the UI and alerting might incorrectly sum up all 3 series... Is this what you are having a problem with?
