Hi 👋,
First of all, thank you very much for creating Pyrra. We use it at our company and we have been in love with it since the very first moment it was up and running. Such a lovely and easy way of creating SLOs 😍.
Now, let me put you all in context.
Motivation
At our company we have a service that is deployed 3 times, each with a different configuration. The 3 instances of the service expose the same metrics, but the values are slightly different as the configurations are different between the 3.
The 3 instances use the same metric in their ServiceLevelObjective resources to count the total and the errors. Examples follow:
SLO of instance A:
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  labels:
    pyrra.dev/slack_channel: my-slack-channel
    release: prometheus
  name: service-instance-a-http-errors
  namespace: my-namespace
spec:
  alerting:
    absent: true
    burnrates: true
  description: "SLO for Instance A of the service"
  indicator:
    ratio:
      errors:
        metric: my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-a"}
      grouping: null
      total:
        metric: my_service_http_requests_total{namespace="my-namespace", container="instance-a"}
  target: "99.9"
  window: 4w
SLO of instance B:
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  labels:
    pyrra.dev/slack_channel: my-slack-channel
    release: prometheus
  name: service-instance-b-http-errors
  namespace: my-namespace
spec:
  alerting:
    absent: true
    burnrates: true
  description: "SLO for Instance B of the service"
  indicator:
    ratio:
      errors:
        metric: my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-b"}
      grouping: null
      total:
        metric: my_service_http_requests_total{namespace="my-namespace", container="instance-b"}
  target: "99.9"
  window: 4w
SLO of instance C:
apiVersion: pyrra.dev/v1alpha1
kind: ServiceLevelObjective
metadata:
  labels:
    pyrra.dev/slack_channel: my-slack-channel
    release: prometheus
  name: my-service-instance-c-http-errors
  namespace: my-namespace
spec:
  alerting:
    absent: true
    burnrates: true
  description: "SLO for Instance C of the service"
  indicator:
    ratio:
      errors:
        metric: my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-c"}
      grouping: null
      total:
        # This filters just by namespace, not by container
        metric: my_service_http_requests_total{namespace="my-namespace"}
  target: "99.9"
  window: 4w
Pyrra uses the name of the metric in the query to name the burnrate recording rules in Prometheus (removing the _total and _count suffixes, etc.). As the 3 SLOs use the same source metric, Pyrra creates 3 sets of recording rules with the same names:

Rule group my-service-instance-a-http-errors:
  my_service_http_requests:burnrate5m
  my_service_http_requests:burnrate30m
  my_service_http_requests:burnrate1h
  my_service_http_requests:burnrate2h

Rule group my-service-instance-b-http-errors:
  my_service_http_requests:burnrate5m
  my_service_http_requests:burnrate30m
  my_service_http_requests:burnrate1h
  my_service_http_requests:burnrate2h

Rule group my-service-instance-c-http-errors:
  my_service_http_requests:burnrate5m
  my_service_http_requests:burnrate30m
  my_service_http_requests:burnrate1h
  my_service_http_requests:burnrate2h
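For illustration, two of the colliding rules could look roughly like this. This is only a sketch assuming Pyrra's usual sum(rate(errors)) / sum(rate(total)) shape; the exact expressions and labels Pyrra generates may differ:

  groups:
    - name: service-instance-a-http-errors
      rules:
        - record: my_service_http_requests:burnrate5m
          expr: |
            sum(rate(my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-a"}[5m]))
            /
            sum(rate(my_service_http_requests_total{namespace="my-namespace", container="instance-a"}[5m]))
          labels:
            slo: service-instance-a-http-errors
    - name: service-instance-b-http-errors
      rules:
        # same record name as above, but a different expression
        - record: my_service_http_requests:burnrate5m
          expr: |
            sum(rate(my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-b"}[5m]))
            /
            sum(rate(my_service_http_requests_total{namespace="my-namespace", container="instance-b"}[5m]))
          labels:
            slo: service-instance-b-http-errors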
The problem comes now. As we have Mimir as long-term storage for Prometheus metrics, Prometheus continuously fails to push some metrics via remote write with the following error:
ts=2024-02-20T10:22:54.985Z caller=dedupe.go:112 component=remote level=error remote_name=8f7417 url=https://mimir.mycompany.com/api/v1/push msg="non-recoverable error" count=500 exemplarCount=0 err="server returned HTTP status 400 Bad Request: failed pushing to ingester: user=my-user: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-02-20T10:21:12.398Z and is from series {name="my_service_http_requests:burnrate1d", cluster="my-cluster", container="instance-b", namespace="my-namespace", prometheus="monitoring/kube-prometheus-st-prometheus", slack_channel="my-slack-channel", slo="service-instance-b-http-errors", team="my-team"}"
It seems the recording rules get evaluated at the very same moment, and as they evaluate different expressions, the results differ. That causes a problem when remote-writing the series to long-term storage: the first series gets ingested just fine, but the rest fail because they try to write the same series with different values.
Proposed solution
I'm not a developer nor an expert in monitoring, so I hope this is not a silly suggestion, but to me, as an SRE, it makes sense to have the capability of customising the names of the Pyrra-generated recording rules, maybe using some annotation like
pyrra.dev/prometheus-slo-recording-rule-prefix
pyrra.dev/prometheus-slo-recording-rule-suffix
pyrra.dev/prometheus-slo-recording-rule-name
or whatever you think makes the most sense 😬
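To make the idea concrete, this is a hypothetical example of how one of these annotations could look on the instance-a SLO; the annotation name and behaviour are purely the proposal above, not something Pyrra supports today:

  apiVersion: pyrra.dev/v1alpha1
  kind: ServiceLevelObjective
  metadata:
    name: service-instance-a-http-errors
    namespace: my-namespace
    annotations:
      # hypothetical: Pyrra would then emit my_service_instance_a_http_requests:burnrate5m etc.
      pyrra.dev/prometheus-slo-recording-rule-name: my_service_instance_a_http_requests
    labels:
      pyrra.dev/slack_channel: my-slack-channel
      release: prometheus
  spec:
    # ... same indicator, target and window as above ...

That way each instance's SLO would produce recording rules with a unique metric name, even though all three read from the same source metric.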
I hope you can take this into consideration.
Thanks again for your great effort. Please let me know if you need us to provide more information.
Regards!
Hey! 👋
Thank you for the kind words, and we are happy to hear you enjoy using Pyrra! 🎉
Thank you for the detailed report too!
Right now I'm failing to understand why this breaks.
These metrics should be different due to different labels. Time series are identified by their entire label set, including the __name__ metric name.
These are all individual and distinct time series.
That being said, I can see how the last metric only contains the namespace label and therefore the UI and alerting might incorrectly sum up all 3 series... Is this what you are having a problem with?
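To spell that out, here is a rough sketch (the exact expression Pyrra generates may differ) of the instance-c burnrate rule. Because the total only filters by namespace, the denominator would match the series of all three containers:

  # denominator has no container filter, so it sums instance-a, instance-b and instance-c
  - record: my_service_http_requests:burnrate5m
    expr: |
      sum(rate(my_service_http_requests_total{http_code=~"5.+", namespace="my-namespace", container="instance-c"}[5m]))
      /
      sum(rate(my_service_http_requests_total{namespace="my-namespace"}[5m]))
    labels:
      slo: my-service-instance-c-http-errors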