Investigate why success rate gets worse over time #92

lidel · 2023-04-11T15:46:36Z

Restarting bifrost-gateway on staging produces a very close success rate, but over time it erodes into a worse and worse state:

See for yourself:

overview@stg board
detailed bifrost-gw staging

Summary from the latter:

Some ideas/thoughts why:

An in-memory block cache perf regression is unlikely: the cache size is symbolic and only aims to limit roundtrips per request.
Staging runs with BLOCK_CACHE_SIZE=16k (Adjust size of in-memory block cache #47 (comment)), and the slowness happens well after the cache has filled up multiple times. The next graph also shows that the increase in CAR fetch duration happens on the Caboose side:
Saturn L1 pool health gets worse for some reason:
Saturn per-L1 CAR fetch durations increase while other durations stay the same:
HTTP 499s suggest clients are giving up before they get our response, which is consistent with things getting slower over time and more and more clients giving up waiting. This is not specific to Rhea: the old mirrored node that runs Kubo is also seeing more 499s over time, but it is less prominent:
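For anyone who wants to cross-check the 499 numbers from inside a Go service rather than from the proxy logs: 499 is the non-standard status nginx uses for "client closed request", and in Go the same event surfaces as a canceled request context. Below is a minimal sketch of counting such aborts in a middleware. The names `trackClientAborts`, `slowBackend`, `simulate` and the `/ipfs/example` path are made up for illustration and are not from bifrost-gateway.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// clientGone counts requests whose context was canceled before the
// handler finished, i.e. the client stopped waiting for us.
var clientGone int64

// trackClientAborts wraps next and bumps clientGone when the request
// context is already canceled by the time next returns.
func trackClientAborts(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		next.ServeHTTP(w, r)
		if errors.Is(r.Context().Err(), context.Canceled) {
			atomic.AddInt64(&clientGone, 1)
		}
	})
}

// slowBackend stands in for a slow upstream fetch (hypothetical): it
// answers after delay unless the client goes away first.
func slowBackend(delay time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(delay):
			fmt.Fprint(w, "ok")
		case <-r.Context().Done():
			// Client gave up; there is nobody left to write to.
		}
	})
}

// simulate sends one in-process request through the middleware and
// returns how much the abort counter grew.
func simulate(clientGivesUp bool) int64 {
	before := atomic.LoadInt64(&clientGone)
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if clientGivesUp {
		cancel() // emulate a client that disconnected early
	}
	req := httptest.NewRequest(http.MethodGet, "/ipfs/example", nil).WithContext(ctx)
	trackClientAborts(slowBackend(10*time.Millisecond)).ServeHTTP(httptest.NewRecorder(), req)
	return atomic.LoadInt64(&clientGone) - before
}

func main() {
	fmt.Println("aborted:", simulate(true), "completed:", simulate(false))
}
```

A counter like this, exported as a metric, would let the abort rate be compared against the fetch duration graphs without relying on the proxy's status codes.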

Any feedback / thoughts / hypotheses are welcome. 🙏


lidel · 2023-04-18T13:24:03Z

fyi the problem is still present -- after a reboot it performs well, then, after ~1h things get visibly worse (499s are gone for 1h, and then back, very weird):

lidel · 2023-04-24T11:15:58Z

Ok, got a bit more clarity. It works fine for a while, then erodes, and we see CPUs being capped when it happens:

Looks like a bug either in GRAPH_BACKEND=true or the new Caboose -- non-staging boxes run an older version with the graph backend disabled and have no CPU issues:
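When CPUs are capped like this, a CPU profile is usually the quickest way to see where the time goes. Here is a minimal sketch of wiring Go's standard `net/http/pprof` handlers onto a separate debug mux. The function names and the listen address in the comment are assumptions for illustration, not what bifrost-gateway ships.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// debugMux returns a mux that serves only the standard pprof endpoints,
// so it can be bound to a private address instead of the public gateway port.
func debugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, goroutine, etc.
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile) // CPU profile
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// statusFor does an in-process GET against the debug mux and returns
// the HTTP status code, which is handy for a quick smoke check.
func statusFor(path string) int {
	rec := httptest.NewRecorder()
	debugMux().ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println("index status:", statusFor("/debug/pprof/"))
	// In a real service one would serve it on a private address, e.g.:
	//   go http.ListenAndServe("127.0.0.1:6060", debugMux())
}
```

With something like this listening, `go tool pprof "http://127.0.0.1:6060/debug/pprof/profile?seconds=30"` would capture a 30 second CPU profile while the box is in the degraded state.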

aarshkshah1992 · 2023-04-24T11:28:46Z

@lidel The new Caboose wasn't there when we reported this issue two weeks back. So I'd be more suspicious of something going wrong with GRAPH_BACKEND=true

lidel · 2023-04-24T15:21:55Z

I've restarted staging with GRAPH_BACKEND=false. If we don't see any regression after ~4h, then we know it can be safely deployed to prod.

lidel · 2023-04-24T18:12:14Z

GRAPH_BACKEND=false looks solid, was stable on staging for long enough to confirm the problem is limited to GRAPH_BACKEND=true:

BigLep · 2023-04-25T16:44:36Z

Related PR: #102

lidel · 2023-04-27T15:19:40Z

Just to wrap this up, tldr

The fix removed CPU load on staging:

and we've had it running for over 12h without the usual drop in success rate after the ~2h mark:
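For readers wondering what "break graph api notifiers across a sync map" (the title of #102) can look like in general: the sketch below is not the code from that PR, only an illustration of the pattern under the assumption that each in-flight request registers a listener under some key, and incoming blocks are delivered by key lookup instead of being broadcast to every listener. All names here (`notifierRegistry`, `register`, `notify`, the `req-a` keys) are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// notifierRegistry keeps one listener channel per key in a sync.Map,
// so delivering a block costs one lookup instead of a scan over all listeners.
type notifierRegistry struct {
	byKey sync.Map // string -> chan string
}

// register creates a buffered listener channel for key and stores it.
func (r *notifierRegistry) register(key string) <-chan string {
	ch := make(chan string, 8)
	r.byKey.Store(key, ch)
	return ch
}

// unregister removes the listener for key, e.g. when its request ends.
func (r *notifierRegistry) unregister(key string) {
	r.byKey.Delete(key)
}

// notify delivers blk only to the listener registered under key and
// reports whether it was delivered.
func (r *notifierRegistry) notify(key, blk string) bool {
	v, ok := r.byKey.Load(key)
	if !ok {
		return false
	}
	select {
	case v.(chan string) <- blk:
		return true
	default:
		return false // listener buffer is full; drop instead of blocking
	}
}

func main() {
	var reg notifierRegistry
	a := reg.register("req-a")
	reg.register("req-b")

	delivered := reg.notify("req-a", "block-1")
	fmt.Println(delivered, <-a)

	reg.unregister("req-a")
	fmt.Println(reg.notify("req-a", "block-2"))
}
```

The point of the pattern is that the per-block work stays roughly constant as the number of concurrent requests grows, which is the kind of change that would show up as lower CPU load over time.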

lidel added this to bifrost-gateway Apr 11, 2023

lidel moved this to 🏗 In progress in bifrost-gateway Apr 11, 2023

BigLep mentioned this issue Apr 24, 2023

meta: GRAPH_BACKEND fixes and latency improvements #88

Closed

lidel mentioned this issue Apr 24, 2023

Expose /debug/pprof #99

Closed

lidel mentioned this issue Apr 25, 2023

fix: break graph api notifiers across a sync map #102

Merged

willscott closed this as completed in #102 Apr 26, 2023

github-project-automation bot moved this from 🏗 In progress to ✅ Done in bifrost-gateway Apr 26, 2023
