This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

Investigate why success rate gets worse over time #92

Closed
lidel opened this issue Apr 11, 2023 · 7 comments · Fixed by #102

Comments

@lidel
Collaborator

lidel commented Apr 11, 2023

Restarting bifrost-gateway on staging produces a very close success rate, but over time it erodes into a worse and worse state:

Inspect yourself:

Summary from the latter:

Screenshot 2023-04-11 at 16-49-25 bifrost-gw staging metrics - Project Rhea - Dashboards - Grafana

Some ideas/thoughts why:

  • in-memory block cache perf regression is unlikely: the cache size is symbolic and only aims to limit roundtrips per request.
    Staging runs with BLOCK_CACHE_SIZE=16k (Adjust size of in-memory block cache #47 (comment)), and the slowness happens well after the cache has been filled multiple times. The next graph also shows that the increase in CAR fetch duration happens on the Caboose side:

    Screenshot 2023-04-11 at 17-40-26 bifrost-gw staging metrics - Project Rhea - Dashboards - Grafana
    Screenshot 2023-04-11 at 17-43-42 bifrost-gw staging metrics - Project Rhea - Dashboards - Grafana

  • Saturn L1 pool health gets worse for some reason:

    Screenshot 2023-04-11 at 16-56-42 bifrost-gw staging metrics - Project Rhea - Dashboards - Grafana

  • Saturn per-L1 CAR fetch durations increase while other durations stay the same:

    2023-04-11_17-37

  • HTTP 499s suggest clients are giving up before they get our response, which is consistent with things getting slower over time and more and more clients giving up waiting. This is not specific to Rhea: the old mirrored node that runs Kubo is also seeing more 499s over time, but there it is less prominent:

    Screenshot 2023-04-11 at 19-08-23 bifrost-gw staging metrics - Project Rhea - Dashboards - Grafana

Any feedback / thoughts / hypothesis are welcome. 🙏

@lidel lidel moved this to 🏗 In progress in bifrost-gateway Apr 11, 2023
@lidel
Collaborator Author

lidel commented Apr 18, 2023

fyi the problem is still present -- after a reboot it performs well, then, after ~1h things get visibly worse (499s are gone for 1h, and then back, very weird):

2023-04-18_15-22

@lidel
Collaborator Author

lidel commented Apr 24, 2023

Ok, got a bit more clarity. It works fine for a while, then erodes, and we see CPUs being capped when it happens:

Screenshot 2023-04-24 at 12-47-48 bifrost-gw staging metrics - Project Rhea - Dashboards - Grafana

Looks like a bug either in GRAPH_BACKEND=true or the new Caboose -- non-staging boxes run an older version with the graph backend disabled and have no CPU issues:

Screenshot 2023-04-24 at 13-12-07 View panel - bifrost-gw staging metrics - Project Rhea - Dashboards - Grafana

2023-04-24_13-09

@aarshkshah1992
Collaborator

@lidel The new Caboose wasn't there when we reported this issue two weeks back. So I'd be more suspicious of something going wrong with GRAPH_BACKEND=true

@lidel
Collaborator Author

lidel commented Apr 24, 2023

I've restarted staging with GRAPH_BACKEND=false. If we don't see any regression after ~4h, then we know it can be safely deployed to prod.

@lidel
Collaborator Author

lidel commented Apr 24, 2023

GRAPH_BACKEND=false looks solid: it was stable on staging for long enough to confirm the problem is limited to GRAPH_BACKEND=true:

2023-04-24_20-08

@BigLep

BigLep commented Apr 25, 2023

Related PR: #102

@lidel
Collaborator Author

lidel commented Apr 27, 2023

Just to wrap this up, tldr

The fix removed CPU load on staging:

2023-04-27_17-23

and we've had it running for over 12h without the usual drop in success rate after the ~2h mark:

2023-04-26_17-19
