Alert queries should automatically wait for all queries #275
Comments
Full event-log for the failing 18:15 event.
Hi @sysadmin1139, I've picked this up to take a look. Based on what you've described, I suspect we're hitting a timeout either in Grafana or in Timestream. I've looked at the changes between 2.9.0 and 2.9.1 and I'm not seeing anything that would affect query runtime.
It turns out the 2.9.0 vs 2.9.1 theory was in error. I've since downgraded to 2.9.0 and the effect is still present, though somewhat reduced. If there is a timeout in Timestream, it isn't showing up in the CloudWatch error metrics. I also don't know how to tune the debug output to get more targeted data, if that would help to have.
My current working theory is that this is some kind of thundering-herd problem, where smearing the queries would be more effective than spacing out the avalanche. Today I tried making a second Timestream datasource, on the theory that it might change the contention problem. It did not.
Yeah, I agree that if it ends up being a timeout error we should give a more helpful message. Can you give me the query (with any confidential information removed) and the expressions you have set for the alert? It'll help me understand what's happening. Also, can you tell me what happens if you set the evaluation interval to be longer? It's possible that Grafana is cancelling the query after waiting the minute of the evaluation interval, which could result in a NoData without ever registering as an error on the Timestream side.
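For illustration, here is a minimal sketch of the kind of check being suggested here, assuming a backend built on grafana-plugin-sdk-go: if the query fails because Grafana's request context was cancelled (for example after the evaluation interval elapsed), report that explicitly rather than returning an empty response that alerting reads as NoData. The function and wiring below are hypothetical, not the plugin's actual code.

```go
package timestreamsketch

import (
	"context"
	"errors"
	"fmt"

	"github.com/grafana/grafana-plugin-sdk-go/backend"
	"github.com/grafana/grafana-plugin-sdk-go/data"
)

// responseFor wraps a Timestream query result for Grafana. If the query failed
// because the request context was cancelled or timed out (e.g. Grafana gave up
// after the alert evaluation interval), surface that as an explicit error
// instead of letting alerting interpret an empty response as NoData.
func responseFor(ctx context.Context, frames data.Frames, queryErr error) backend.DataResponse {
	if queryErr != nil {
		if ctxErr := ctx.Err(); errors.Is(ctxErr, context.Canceled) || errors.Is(ctxErr, context.DeadlineExceeded) {
			return backend.DataResponse{
				Error: fmt.Errorf("query cancelled before Timestream returned results: %w", queryErr),
			}
		}
		return backend.DataResponse{Error: queryErr}
	}
	return backend.DataResponse{Frames: frames}
}
```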
The logging I quoted in the original report showed query times of a few seconds. I can sometimes reproduce this behavior in the 'Edit alert' screen with the 'Preview' button. It gets no data. Refresh, data. Refresh, data. Refresh, no data.
I did not have that set on my exemplar bad alert. I've set it now and will let it bake for a few hours to see what happens.
It's been the long weekend and the alarm hasn't flapped once. I think we've found the root cause.
Great! It sounds like we have a workaround for now, but I'll move this to our backlog for making alert queries automatically wait. Bug Notes:
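For illustration, a minimal sketch of what "automatically wait for all queries" could look like at the AWS SDK level, assuming aws-sdk-go-v2's timestreamquery client. This is not the plugin's actual implementation: Timestream can return intermediate pages (sometimes with zero rows) plus a NextToken while a query is still running, so an evaluation that only looks at the first page could see NoData.

```go
package timestreamsketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/timestreamquery"
	"github.com/aws/aws-sdk-go-v2/service/timestreamquery/types"
)

// queryToCompletion keeps following NextToken until Timestream reports the
// result set is complete, accumulating every row along the way.
func queryToCompletion(ctx context.Context, client *timestreamquery.Client, sql string) ([]types.Row, error) {
	var rows []types.Row
	var next *string
	for {
		out, err := client.Query(ctx, &timestreamquery.QueryInput{
			QueryString: aws.String(sql),
			NextToken:   next,
		})
		if err != nil {
			return nil, err
		}
		rows = append(rows, out.Rows...)
		if out.NextToken == nil {
			return rows, nil // all pages received
		}
		next = out.NextToken
	}
}
```

Paging until NextToken is nil trades a little extra latency during alert evaluation for a complete result set, which appears to be what the workaround setting enables.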
Another thing we've noticed: that selector doesn't visually stay selected after you save the alarm and leave the page, but it is definitely still effective.
Thanks! I made a separate issue for that: #276
I've seen this particular stack trace a few times before, but wasn't able to reproduce it. It happens when previewing Timestream alert queries, but only sometimes. https://url/alerting/MHKT--ZIz/edit?returnTo=%2Falerting%2Fgrafana%2FMHKT--ZIz%2Fview%3ForgId%3D1
Closing this since we have a workaround, but we still plan to address #276.
What happened:
We updated to 2.9.1 last week, which was fine. Earlier this week we made some adjustments to our metrics ingestion which resulted in a larger volume entering the Timestream tables used in alerting. Alarms hitting the tables with the biggest increases started seeing sporadic NoData errors. Turning debug logging on didn't produce any smoking gun. The 18:15 evaluation triggered a NoData, while the 18:16 evaluation cleared it. This flips minute by minute, sometimes going as long as 6 minutes of success before another NoData.
The change in 9.5 to only fire NoData after the For period has elapsed would have kept this out of sight.
This feels like something internal to the Grafana process is getting overwhelmed.
What you expected to happen:
It should have behaved as it did before, running alert queries without issue.
How to reproduce it (as minimally and precisely as possible):
Screenshots
Title and series names are deliberately obscured, but this is charting SampleCount for a variety of Timestream tables. Pixel resolution at this scale is 5 minutes. The alert flapping started shortly after the increase in ingestion rate.
Anything else we need to know?:
Environment: