Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alert queries should automatically wait for all queries #275

Closed
sysadmin1139 opened this issue Feb 8, 2024 · 14 comments
Closed

Alert queries should automatically wait for all queries #275

sysadmin1139 opened this issue Feb 8, 2024 · 14 comments

Comments

@sysadmin1139
Copy link

sysadmin1139 commented Feb 8, 2024

What happened:

We updated to 2.9.1 last week, which was fine. Earlier this week we made some adjustments to our metrics ingestion which resulted in a larger volume entering Timestream tables used in alerting. Alarms hitting our most increased tables started seeing sporadic NoData errors. Turning debug logging on didn't give any gun-smoke.

logger=ngalert.scheduler rule_uid=_Nu3p9V4z org_id=1 version=10 attempt=0 now=2024-02-08T18:16:00Z t=2024-02-08T18:16:07.472538875Z level=debug msg="Alert rule evaluated" results="[{Instance: State:Normal Error:<nil> Results:map[] Values:map[] EvaluatedAt:2024-02-08 18:16:00 +0000 UTC EvaluationDuration:7.472534108s EvaluationString:}]" duration=1.023625401s
logger=ngalert.scheduler rule_uid=_Nu3p9V4z org_id=1 version=10 attempt=0 now=2024-02-08T18:15:00Z t=2024-02-08T18:15:12.870256927Z level=debug msg="Alert rule evaluated" results="[{Instance: State:NoData Error:<nil> Results:map[] Values:map[A0:{Var:A Labels: Value:<nil>}] EvaluatedAt:2024-02-08 18:15:00 +0000 UTC EvaluationDuration:12.870241593s EvaluationString:[ var='A0' metric='NoData' labels={} value=null ]}]" duration=5.247031669s

The 18:15 evaluation triggered a NoData, where the 18:16 evaluation cleared it. This flips minute by minute, sometimes going as long as 6 minutes of success before another NoData.

  • CloudWatch isn't showing elevations in UserErrors or SystemErrors for Queries.
  • Queries are taking longer because there is more data to chew through.
  • Turning on debug logging made the problem much worse.
  • Downgrading to 2.9.0 eased it a bit, but the problem is still very much there.

The change in 9.5 to only fire NoData after the For period has elapsed would have obscured this from visibility.

This feels like something internal to the grafana process is getting overwhelmed.

What you expected to happen:

It should have behaved as it formerly did, doing alert queries without issue.

How to reproduce it (as minimally and precisely as possible):

Screenshots
CloudWatch table ingestion
Title and series-names are deliberately obscured, but this is charting SampleCount for a variety of Timestream tables. Pixel resolution in this scale is 5 minutes. The alert flapping issue started shortly after the increase in ingestion rate.

Anything else we need to know?:

Environment:

  • Grafana version: 9.3.8
  • Plugin version: 2.9.1
  • OS Grafana is installed on: Ubuntu 20.04
  • User OS & Browser: N/A
  • Others:
    • Total timestream alerts: 196
    • Total alerts: 419
    • 2 CPU box, with usage staying well below 25%
@sysadmin1139
Copy link
Author

Full event-log for the failing 18:15 event.

logger=datasources rule_uid=_Nu3p9V4z org_id=1 t=2024-02-08T18:15:07.62330725Z level=debug msg="Querying for data source via SQL store" uid=S_Kqe5e7k orgId=1
logger=ngalert.state.manager rule_uid=_Nu3p9V4z org_id=1 t=2024-02-08T18:15:12.870653042Z level=debug msg="State manager processing evaluation results" resultCount=1
logger=ngalert.scheduler rule_uid=_Nu3p9V4z org_id=1 version=10 attempt=0 now=2024-02-08T18:15:00Z t=2024-02-08T18:15:12.870256927Z level=debug msg="Alert rule evaluated" results="[{Instance: State:NoData Error:<nil> Results:map[] Values:map[A0:{Var:A Labels: Value:<nil>}] EvaluatedAt:2024-02-08 18:15:00 +0000 UTC EvaluationDuration:12.870241593s EvaluationString:[ var='A0' metric='NoData' labels={} value=null ]}]" duration=5.247031669s
logger=expr rule_uid=_Nu3p9V4z org_id=1 datasourceType=grafana-timestream-datasource t=2024-02-08T18:15:12.870148906Z level=debug msg="expression datasource query (seriesSet)" query=B
logger=ngalert.state.manager rule_uid=_Nu3p9V4z org_id=1 t=2024-02-08T18:15:12.871205061Z level=debug msg="Saving alert states" count=1
logger=ngalert.state.manager rule_uid=_Nu3p9V4z org_id=1 t=2024-02-08T18:15:12.871138996Z level=debug msg="Setting alert state"
logger=ngalert.state.historian rule_uid=_Nu3p9V4z org_id=1 t=2024-02-08T18:15:12.874227359Z level=debug msg="Alert state changed creating annotation" newState="Normal (NoData)" oldState=Normal
logger=ngalert.state.historian rule_uid=_Nu3p9V4z org_id=1 t=2024-02-08T18:15:12.874466849Z level=error msg="Error saving alert annotation batch" error="json: error calling MarshalJSON for type *simplejson.Json: json: unsupported value: NaN"
logger=ngalert.state.historian rule_uid=_Nu3p9V4z org_id=1 t=2024-02-08T18:15:12.874597227Z level=debug msg="Done saving alert annotation batch"

@iwysiu
Copy link
Contributor

iwysiu commented Feb 14, 2024

Hi @sysadmin1139 , I've picked this up to take a look at it. What I suspect based on what you've said, is that we're hitting a timeout in either grafana or in timestream. I've looked at the changes between the 2.9.0 and 2.9.1 versions, and I'm not seeing anything that would affect the query runtime.
I'm going to look at it a bit more this week and see if there's anything we can do to improve this, but can you give me the alert rule you're running (with any sensitive information removed)? Also, relating to my timeout theory, does it error less if you make the period between evaluations longer?

@iwysiu iwysiu moved this from Incoming to Waiting in AWS Datasources Feb 14, 2024
@sysadmin1139
Copy link
Author

It turns out the 2.9.0 vs 2.9.1 thing was in error. I've since downgraded to 2.9.0 and the effect is still present, though a bit reduced. If there is a timeout in Timestream, it's not showing up in the CloudWatch error metrics. I also don't know how to tune debug output to get more targeted data, if that would help to have.

@sysadmin1139
Copy link
Author

My current working theory is this is some kind of thundering herd problem, where smearing would be more effective than spacing out the avalanche of queries.

Today I tried making a second timestream datasource, under the theory that might change the contention problem. It did not.

@iwysiu
Copy link
Contributor

iwysiu commented Feb 15, 2024

Yeah, I agree that if it ends up being a timeout error we should give a more helpful message.

Can you give me the query (with any confidential information removed) and expressions you have set for the alert? It'll help me understand what's happening with it.

Also, you tell me what happens if you try setting the evaluation interval to be longer? It's possible that Grafana is cancelling the query after it waits the minute of the evaluation interval, which could possibly result in a NoData without causing an error as far as Timestream is concerned.

@sysadmin1139
Copy link
Author

The logging I quoted in the original report showed query times in small number seconds. I can sometimes reproduce this behavior in the 'Edit alert' screen with the 'preview' button. It gets no data. Refresh, data. Refresh, data. Refresh, no data.

@iwysiu
Copy link
Contributor

iwysiu commented Feb 16, 2024

Do you have waitForAllQueries toggled to true for your alerting query? By default we return partial queries from timestream, but alerting requires that we wait to return the results. If that's the issue, we should update the plugin to wait for alert queries automatically, but the toggle should fix it for now.
Screenshot 2024-02-16 at 6 25 10 PM

@sysadmin1139
Copy link
Author

I did not have that set on my exemplar bad alert. I've set it and will let it bake a few hours and see what happens.

@sysadmin1139
Copy link
Author

It's been the long weekend and the alarm hasn't flapped once. I think we found root cause.

@iwysiu iwysiu changed the title Mysterious NoData after 2.9.1 and increase in query-sizes Alert queries should automatically wait for all queries Feb 20, 2024
@iwysiu
Copy link
Contributor

iwysiu commented Feb 20, 2024

Great! It sounds like we have a workaround for now, but I'll move this to our backlog for making alert queries automatically wait.

Bug Notes:
We should check the request headers for the alert header and if it's applied, we should set WaitForResult on the query to true (example of checking the request headers in cloudwatch)

@iwysiu iwysiu moved this from Waiting to Backlog in AWS Datasources Feb 20, 2024
@sysadmin1139
Copy link
Author

Another thing we've noticed: that selector doesn't visually stay selected after you save the alarm and leave, but definitely is effective.

@iwysiu
Copy link
Contributor

iwysiu commented Feb 20, 2024

Thanks! I made a separate issue for that #276

@sysadmin1139
Copy link
Author

I saw this particular stack trace a few times before, but wasn't able to reproduce it. This happens when Previewing timestream alert queries, but only sometimes.

https://url/alerting/MHKT--ZIz/edit?returnTo=%2Falerting%2Fgrafana%2FMHKT--ZIz%2Fview%3ForgId%3D1

TypeError: e.fields[0] is undefined

di@https://fqdn/public/build/AlertingRuleForm.d245068ee1365350e124.js:145:386
div
de@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:92534
vi@https://fqdn/public/build/AlertingRuleForm.d245068ee1365350e124.js:252:158
div
de@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:92534
fieldset
d@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:1406:1760
div
div
o@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:6346:172
No@https://fqdn/public/build/AlertingRuleForm.d245068ee1365350e124.js:292:2635
div
div
div
t@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:711589
p@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:424:3558
div
form
R@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:927698
$o@https://fqdn/public/build/AlertingRuleForm.d245068ee1365350e124.js:292:5853
Uo@https://fqdn/public/build/AlertingRuleForm.d245068ee1365350e124.js:313:67
div
37417/b.Contents@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:5810:2272
div
div
div
t@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:711589
p@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:424:3558
div
b@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:5810:1289
t@https://fqdn/public/build/9076.637f4fd32666ef6824c7.js:1:211
7328/Vo<@https://fqdn/public/build/AlertingRuleForm.d245068ee1365350e124.js:313:914
c@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:1207:301
u@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:1207:855
WithErrorBoundary
Suspense
c@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:1207:301
_c@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:8842:350
t@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:1060861
t@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:1062882
main
Cd@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:8803:6825
t@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:1057479
div
o@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:2006:4371
19175/t.KBarProvider@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:271053
l@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:6154:19566
c@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:1207:301
u@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:1207:855
l@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:1040511
Su@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:8855:115

@katebrenner
Copy link
Contributor

closing this since we have a workaround, but still plan to address #276

@katebrenner katebrenner closed this as not planned Won't fix, can't repro, duplicate, stale Mar 22, 2024
@github-project-automation github-project-automation bot moved this from Backlog to Done in AWS Datasources Mar 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

No branches or pull requests

3 participants