Inconsistent prometheus response from subsequent GET /metrics scrapes (e.g. ibmmq_queue_depth) #238

Comments
Most of the values returned by the published metrics are counts over the interval. Returning duplicate values when none have actually been reported by the queue manager would be wrong: it would lead to incorrect calculations such as the total number of messages. The more common situation we have to deal with here is where the scrape interval covers two sets of publications from the queue manager; clearing out the maps on each iteration makes the aggregation, where that's needed, more manageable.

While "depth" is an absolute value, and could in theory be duplicated without harm, trying to handle that as a special case would get very messy, and it would still be potentially misleading: if the real depth is varying rapidly, you wouldn't be able to trust it.

If you want to increase the sampling rate, there are tuning parameters for the queue manager which cause it to publish the metrics more frequently, and you could link that to your preferred scrape interval. In particular you can put
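The aggregate-and-clear behaviour described above can be sketched as follows. This is an illustrative simplification, not the actual mq_prometheus code; the type and queue names are hypothetical:

```go
package main

import "fmt"

// intervalCollector accumulates per-queue deltas between scrapes. A scrape
// interval spanning two publications sees their sum, and because the map is
// cleared on each collection, no value is ever reported twice.
type intervalCollector struct {
	deltas map[string]float64 // queue name -> messages since last scrape
}

func newIntervalCollector() *intervalCollector {
	return &intervalCollector{deltas: make(map[string]float64)}
}

// publish records one set of queue-manager statistics, aggregating across
// multiple publications within the same scrape interval.
func (c *intervalCollector) publish(queue string, n float64) {
	c.deltas[queue] += n
}

// collect returns the aggregated values and resets the map.
func (c *intervalCollector) collect() map[string]float64 {
	out := c.deltas
	c.deltas = make(map[string]float64)
	return out
}

func main() {
	c := newIntervalCollector()
	c.publish("DEV.QUEUE.1", 5)            // first publication
	c.publish("DEV.QUEUE.1", 3)            // second publication, same interval
	fmt.Println(c.collect()["DEV.QUEUE.1"]) // 8: summed, then cleared
	fmt.Println(c.collect()["DEV.QUEUE.1"]) // 0: nothing new published
}
```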
Hi @ibmmqmet, thanks for your response.

I'm a bit confused by the decision to use gauge metrics to return the value of a counter over an interval, rather than giving the absolute value of the counter and using a counter metric. That would allow the use of the rate operator in PromQL to determine the rate of change, or increase to determine the change, rather than directly reporting the increase over the scrape interval. My understanding is that only absolute values should be gauge metrics, while anything that counts occurrences of an event should be a counter.

If the sensitive metrics you were referring to were counters rather than gauges, then a duplicate value doesn't seem like it would be an issue. E.g. if "total message count" is a gauge holding the count over an interval, then yielding the same value on two subsequent scrapes could alter the outcome, while for a counter the same value being scraped twice would just report no change.

The problem I'm facing is not that I'd particularly like to increase the scrape interval, but that I have an HA setup for prometheus scraping: each instance does its own scrape of the metrics endpoint on its own schedule, with the samples being deduplicated by the associated labels, and depending on timing, the accepted sample may be one that lacks all the publication metrics because an earlier scrape caused a reset. We have worked around this by increasing the scrape interval, so
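The counter-vs-gauge distinction being argued here can be sketched as a small simulation. This is illustrative only (not the exporter's code); it contrasts a delta-reporting gauge that resets on scrape with a cumulative counter, as seen by two HA replicas scraping back-to-back:

```go
package main

import "fmt"

// resettingGauge reports the delta since the last scrape, then resets.
// A second scraper arriving shortly after the first sees nothing.
type resettingGauge struct{ delta float64 }

func (g *resettingGauge) add(n float64) { g.delta += n }
func (g *resettingGauge) scrape() float64 {
	v := g.delta
	g.delta = 0
	return v
}

// cumulativeCounter reports a monotonically increasing total. Duplicate
// scrapes are harmless, and rate()/increase() in PromQL recover the
// per-interval change from the stored samples.
type cumulativeCounter struct{ total float64 }

func (c *cumulativeCounter) add(n float64)   { c.total += n }
func (c *cumulativeCounter) scrape() float64 { return c.total }

func main() {
	g, c := &resettingGauge{}, &cumulativeCounter{}
	g.add(10)
	c.add(10)
	fmt.Println(g.scrape(), c.scrape()) // 10 10 (first HA replica)
	fmt.Println(g.scrape(), c.scrape()) // 0 10 (second replica: the gauge lost the value)
}
```

With the counter style, whichever replica's sample wins deduplication carries the same information; with the resetting gauge, the accepted sample depends on scrape timing.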
Hello @ibmmqmet,

Can you please elaborate on "MonitorPublishHeartBest"? I can't find it anywhere in the docs: https://www.ibm.com/docs/en/ibm-mq/9.4?topic=qmini-tuningparameters-stanza-file

Thank you.
It's actually
Thank you!
Affected version: v5.5.0, latest in master branch.

Metrics based on publications are not consistently present in the response from GET /metrics in the mq_prometheus program, depending on scrape interval. It is expected that all metrics would be present in the metrics endpoint response, regardless of scrape frequency.

If the endpoint is scraped twice in short succession, the second scrape may not contain the values for a large number of metrics, including ibmmq_queue_depth. The link between these missing metrics appears to be that they are all updated via publications.

Expected behaviour: if there is no new value for a gauge metric, the metrics endpoint should just return the latest value in its response, rather than omitting the value.
Use case: running prometheus scraping in an HA setup and not wanting to have to carefully coordinate scraping cycles for each replica so that there are never two scrapes within a single publication interval.
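The expected behaviour described above can be sketched as a last-value store. This is a hypothetical illustration of the requested semantics, not a proposed patch; the type and queue names are made up:

```go
package main

import "fmt"

// lastValueGauge keeps the most recently published value for each absolute
// metric (such as queue depth) and returns it on every scrape, instead of
// clearing it after the first scrape of a publication interval.
type lastValueGauge struct {
	values map[string]float64 // queue name -> last published depth
}

func newLastValueGauge() *lastValueGauge {
	return &lastValueGauge{values: make(map[string]float64)}
}

// publish overwrites the stored value whenever the queue manager publishes.
func (g *lastValueGauge) publish(queue string, depth float64) {
	g.values[queue] = depth
}

// scrape returns the latest known value without resetting it, so two HA
// replicas scraping back-to-back both see the same depth.
func (g *lastValueGauge) scrape(queue string) (float64, bool) {
	v, ok := g.values[queue]
	return v, ok
}

func main() {
	g := newLastValueGauge()
	g.publish("DEV.QUEUE.1", 42)
	v1, _ := g.scrape("DEV.QUEUE.1")
	v2, _ := g.scrape("DEV.QUEUE.1") // second scrape still sees the value
	fmt.Println(v1, v2)              // 42 42
}
```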
It seems that this comes down to the implementation of Collect in the exporter for mq_prometheus, where one of the first things it does is reset all the gauge metrics before processing any new publications. Scraping twice within a single publication interval will therefore produce inconsistent results.

Example to reproduce for queue depth:
More general example: