Alert queries should automatically wait for all queries #275
Comments
Full event-log for the failing 18:15 event.
Hi @sysadmin1139, I've picked this up to take a look. Based on what you've described, I suspect we're hitting a timeout either in Grafana or in Timestream. I've looked at the changes between 2.9.0 and 2.9.1 and I'm not seeing anything that would affect query runtime.
It turns out the 2.9.0 vs 2.9.1 theory was in error. I've since downgraded to 2.9.0 and the effect is still present, though somewhat reduced. If there is a timeout in Timestream, it isn't showing up in the CloudWatch error metrics. I also don't know how to tune the debug output to get more targeted data, if that would help to have.
My current working theory is that this is some kind of thundering-herd problem, where smearing the queries would be more effective than spacing out the avalanche. Today I tried making a second Timestream datasource, on the theory that it might change the contention problem. It did not.
Yeah, I agree that if it ends up being a timeout error we should give a more helpful message. Can you give me the query (with any confidential information removed) and the expressions you have set for the alert? It'll help me understand what's happening. Also, can you tell me what happens if you set the evaluation interval to be longer? It's possible that Grafana is cancelling the query after waiting the minute of the evaluation interval, which could result in a NoData without ever registering as an error on the Timestream side.
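For illustration, here is a minimal sketch of the kind of check being suggested here, assuming a backend built on grafana-plugin-sdk-go: if the query fails because Grafana's request context was cancelled (for example after the evaluation interval elapsed), report that explicitly rather than returning an empty response that alerting reads as NoData. The function and wiring below are hypothetical, not the plugin's actual code.

```go
package timestreamsketch

import (
	"context"
	"errors"
	"fmt"

	"github.com/grafana/grafana-plugin-sdk-go/backend"
	"github.com/grafana/grafana-plugin-sdk-go/data"
)

// responseFor wraps a Timestream query result for Grafana. If the query failed
// because the request context was cancelled or timed out (e.g. Grafana gave up
// after the alert evaluation interval), surface that as an explicit error
// instead of letting alerting interpret an empty response as NoData.
func responseFor(ctx context.Context, frames data.Frames, queryErr error) backend.DataResponse {
	if queryErr != nil {
		if ctxErr := ctx.Err(); errors.Is(ctxErr, context.Canceled) || errors.Is(ctxErr, context.DeadlineExceeded) {
			return backend.DataResponse{
				Error: fmt.Errorf("query cancelled before Timestream returned results: %w", queryErr),
			}
		}
		return backend.DataResponse{Error: queryErr}
	}
	return backend.DataResponse{Frames: frames}
}
```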
The logging I quoted in the original report showed query times of a few seconds. I can sometimes reproduce this behavior in the 'Edit alert' screen with the 'Preview' button. It gets no data. Refresh, data. Refresh, data. Refresh, no data.
I did not have that set on my exemplar bad alert. I've set it now and will let it bake for a few hours to see what happens.
It's been the long weekend and the alarm hasn't flapped once. I think we've found the root cause.
Great! It sounds like we have a workaround for now, but I'll move this to our backlog for making alert queries automatically wait. Bug Notes:
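For illustration, a minimal sketch of what "automatically wait for all queries" could look like at the AWS SDK level, assuming aws-sdk-go-v2's timestreamquery client. This is not the plugin's actual implementation: Timestream can return intermediate pages (sometimes with zero rows) plus a NextToken while a query is still running, so an evaluation that only looks at the first page could see NoData.

```go
package timestreamsketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/timestreamquery"
	"github.com/aws/aws-sdk-go-v2/service/timestreamquery/types"
)

// queryToCompletion keeps following NextToken until Timestream reports the
// result set is complete, accumulating every row along the way.
func queryToCompletion(ctx context.Context, client *timestreamquery.Client, sql string) ([]types.Row, error) {
	var rows []types.Row
	var next *string
	for {
		out, err := client.Query(ctx, &timestreamquery.QueryInput{
			QueryString: aws.String(sql),
			NextToken:   next,
		})
		if err != nil {
			return nil, err
		}
		rows = append(rows, out.Rows...)
		if out.NextToken == nil {
			return rows, nil // all pages received
		}
		next = out.NextToken
	}
}
```

Paging until NextToken is nil trades a little extra latency during alert evaluation for a complete result set, which appears to be what the workaround setting enables.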
Another thing we've noticed: that selector doesn't visually stay selected after you save the alarm and leave the page, but it is definitely still effective.
Thanks! I made a separate issue for that: #276
I've seen this particular stack trace a few times before, but wasn't able to reproduce it. It happens when previewing Timestream alert queries, but only sometimes. https://url/alerting/MHKT--ZIz/edit?returnTo=%2Falerting%2Fgrafana%2FMHKT--ZIz%2Fview%3ForgId%3D1
Closing this since we have a workaround, but we still plan to address #276.
What happened:
We updated to 2.9.1 last week, which was fine. Earlier this week we made some adjustments to our metrics ingestion which resulted in a larger volume entering the Timestream tables used in alerting. Alarms hitting the tables with the biggest increases started seeing sporadic NoData errors. Turning debug logging on didn't produce any smoking gun. The 18:15 evaluation triggered a NoData, while the 18:16 evaluation cleared it. This flips minute by minute, sometimes going as long as 6 minutes of success before another NoData.
The change in 9.5 to only fire NoData after the For period has elapsed would have kept this out of sight.
This feels like something internal to the Grafana process is getting overwhelmed.
What you expected to happen:
It should have behaved as it did before, running alert queries without issue.
How to reproduce it (as minimally and precisely as possible):
Screenshots
Title and series names are deliberately obscured, but this is charting SampleCount for a variety of Timestream tables. Pixel resolution at this scale is 5 minutes. The alert flapping started shortly after the increase in ingestion rate.
Anything else we need to know?:
Environment: