
Allow customisation of the name of the recording rules #1071

Open
marcemv90 opened this issue Feb 20, 2024 · 1 comment

marcemv90 commented Feb 20, 2024

Hi 👋,

First of all, thank you very much for creating Pyrra. We use it at our company and we fell in love with it from the very first moment it was up and running. Such a lovely and easy way of creating SLOs 😍.

Now let me give you some context.

Motivation

At our company we have a service that is deployed 3 times, each with a different configuration. The 3 instances of the service expose the same metrics, but the values are slightly different as the configurations are different between the 3.

The 3 instances use the same metric in their ServiceLevelObjective resources to count the total and the errors. Examples follow:

SLO of instance A:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  labels:
    pyrra.dev/slack_channel: my-slack-channel
    release: prometheus
  name: service-instance-a-http-errors
  namespace: my-namespace
spec:
  alerting:
    absent: true
    burnrates: true
  description: "SLO for Instance A of the service"
  indicator:
    ratio:
      errors:
        metric: my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-a"}
      grouping: null
      total:
        metric: my_service_http_requests_total{namespace="my-namespace", container="instance-a"}
  target: "99.9"
  window: 4w

SLO of instance B:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  labels:
    pyrra.dev/slack_channel: my-slack-channel
    release: prometheus
  name: service-instance-b-http-errors
  namespace: my-namespace
spec:
  alerting:
    absent: true
    burnrates: true
  description: "SLO for Instance B of the service"
  indicator:
    ratio:
      errors:
        metric: my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-b"}
      grouping: null
      total:
        metric: my_service_http_requests_total{namespace="my-namespace", container="instance-b"}
  target: "99.9"
  window: 4w

SLO of instance C:

apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  labels:
    pyrra.dev/slack_channel: my-slack-channel
    release: prometheus
  name: my-service-instance-c-http-errors
  namespace: my-namespace
spec:
  alerting:
    absent: true
    burnrates: true
  description: "SLO for Instance C of the service"
  indicator:
    ratio:
      errors:
        metric: my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-c"}
      grouping: null
      total:
        # This filters just by namespace, not by container
        metric: my_service_http_requests_total{namespace="my-namespace"}
  target: "99.9"
  window: 4w

Pyrra uses the names of the metrics in the queries to generate the names of the burn rate recording rules in Prometheus (removing the _total and _count suffixes, etc.). Since the 3 SLOs use the same source metric, Pyrra creates 3 sets of recording rules with the same name:

  • Rule group my-service-instance-a-http-errors

    • Recording rule my_service_http_requests:burnrate5m
    • Recording rule my_service_http_requests:burnrate30m
    • Recording rule my_service_http_requests:burnrate1h
    • Recording rule my_service_http_requests:burnrate2h
    • etc.
  • Rule group my-service-instance-b-http-errors

    • Recording rule my_service_http_requests:burnrate5m
    • Recording rule my_service_http_requests:burnrate30m
    • Recording rule my_service_http_requests:burnrate1h
    • Recording rule my_service_http_requests:burnrate2h
    • etc.
  • Rule group my-service-instance-c-http-errors

    • Recording rule my_service_http_requests:burnrate5m
    • Recording rule my_service_http_requests:burnrate30m
    • Recording rule my_service_http_requests:burnrate1h
    • Recording rule my_service_http_requests:burnrate2h
    • etc.
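To make the collision concrete, here is a rough sketch of what two of the generated rule groups could look like (hypothetical; the exact expressions and labels Pyrra emits may differ):

```yaml
# Hypothetical sketch: both groups record to the same rule name,
# derived from the shared source metric my_service_http_requests_total.
groups:
  - name: my-service-instance-a-http-errors
    rules:
      - record: my_service_http_requests:burnrate5m
        expr: |
          sum(rate(my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-a"}[5m]))
          /
          sum(rate(my_service_http_requests_total{namespace="my-namespace", container="instance-a"}[5m]))
        labels:
          slo: service-instance-a-http-errors
  - name: my-service-instance-b-http-errors
    rules:
      - record: my_service_http_requests:burnrate5m
        expr: |
          sum(rate(my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-b"}[5m]))
          /
          sum(rate(my_service_http_requests_total{namespace="my-namespace", container="instance-b"}[5m]))
        labels:
          slo: service-instance-b-http-errors
```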

The problem comes now. We use Mimir as long-term storage for Prometheus metrics, and Prometheus continuously fails to push some series via remote write with the following error:

ts=2024-02-20T10:22:54.985Z caller=dedupe.go:112 component=remote level=error remote_name=8f7417 url=https://mimir.mycompany.com/api/v1/push msg="non-recoverable error" count=500 exemplarCount=0 err="server returned HTTP status 400 Bad Request: failed pushing to ingester: user=my-user: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-02-20T10:21:12.398Z and is from series {__name__="my_service_http_requests:burnrate1d", cluster="my-cluster", container="instance-b", namespace="my-namespace", prometheus="monitoring/kube-prometheus-st-prometheus", slack_channel="my-slack-channel", slo="service-instance-b-http-errors", team="my-team"}"

It seems the recording rules get evaluated at the very same moment, and as they evaluate different things, the results differ. That causes a problem when remote-writing the series to long-term storage: the first series gets ingested just fine, but the rest fail because they try to push the same series with different values.

Proposed solution

I'm not a developer nor an expert in monitoring, so I hope this is not a silly suggestion, but to me, as an SRE, it makes sense to have the capability of customising the names of the Pyrra-generated recording rules, maybe using some annotation, like:

  • pyrra.dev/prometheus-slo-recording-rule-prefix
  • pyrra.dev/prometheus-slo-recording-rule-suffix
  • pyrra.dev/prometheus-slo-recording-rule-name
  • or whatever you think makes the most sense 😬
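
For illustration only, an SLO using such an annotation might look like this (hypothetical; none of these annotations exist in Pyrra today):

```yaml
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  name: service-instance-a-http-errors
  namespace: my-namespace
  annotations:
    # Hypothetical annotation: would make Pyrra record
    # e.g. instance_a:my_service_http_requests:burnrate5m
    pyrra.dev/prometheus-slo-recording-rule-prefix: instance_a
spec:
  # ... same indicator/target/window as above ...
```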

I hope you can take this into consideration.

Thanks again for your great effort. Please let me know if you need us to provide more information.

Regards!

@metalmatze
Member

Hey! 👋
Thank you for the kind words, and we are happy to hear you enjoy using Pyrra! 🎉
Thank you for the detailed report too!

Right now I'm failing to understand why this breaks.
These metrics should be different due to their different labels. Time series are identified by their entire label set, including the __name__ metric name.

my_service_http_requests_total{namespace="my-namespace", container="instance-a"}
my_service_http_requests_total{namespace="my-namespace", container="instance-b"}
my_service_http_requests_total{namespace="my-namespace"}

The burn rate rules should generate something like the following:

my_service_http_requests:burnrate5m{namespace="my-namespace", container="instance-a"}
my_service_http_requests:burnrate5m{namespace="my-namespace", container="instance-b"}
my_service_http_requests:burnrate5m{namespace="my-namespace"}

These are all individual and distinct time series.
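One way to verify this would be to query the recorded series directly and compare their label sets (a sketch, using the slo label that appears in the error message above and assuming Pyrra attaches it to the recorded series):

```promql
# If these two queries return series with otherwise identical label sets,
# the recording rules are writing to the same series.
my_service_http_requests:burnrate5m{slo="service-instance-a-http-errors"}
my_service_http_requests:burnrate5m{slo="service-instance-b-http-errors"}
```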
That being said, I can see how the last metric only contains the namespace label and therefore the UI and alerting might incorrectly sum up all 3 series... Is this what you are having a problem with?
