GC and persistence improvements #3787
base: master
Conversation
It doesn't change anything logically, but transfer GC needs the current block timestamp for its logic, and if we delay it with MPT GC it can fail to obtain it more often:

2025-01-13T16:15:18.311+0300 ERROR failed to get block timestamp transfer GC {"time": "1.022µs", "index": 20000}

It's not critical, this garbage can still be collected on the next run, but we'd better avoid it anyway. Refs. #3783. Signed-off-by: Roman Khimov <[email protected]>
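A minimal sketch of this ordering change, with hypothetical names (the real neo-go methods and GC logic differ):

```go
package main

import "fmt"

// Blockchain is a stand-in for the real type; the methods are stubs
// that only illustrate the ordering, not actual neo-go logic.
type Blockchain struct{}

func (bc *Blockchain) removeOldTransfers(old uint32) { fmt.Println("transfer GC up to", old) }
func (bc *Blockchain) collectMPTGarbage(old uint32)  { fmt.Println("MPT GC up to", old) }

// tryRunGC runs the cheap, timestamp-dependent transfer GC first, so
// that the potentially long MPT GC cannot delay its timestamp lookup.
func (bc *Blockchain) tryRunGC(old uint32) {
	bc.removeOldTransfers(old)
	bc.collectMPTGarbage(old)
}

func main() {
	(&Blockchain{}).tryRunGC(20000)
}
```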
The intention here is to reduce the amount of in-flight changes and prevent OOM. It doesn't matter what we're doing, persisting or collecting garbage; what matters is that we're behind the schedule of the regular persist cycle. Refs. #3783. Signed-off-by: Roman Khimov <[email protected]>
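A hedged sketch of the decision rule described here, assuming a fixed persist cadence and an illustrative `persist(block bool)` callback (not the actual neo-go API):

```go
package main

import (
	"fmt"
	"time"
)

const persistInterval = time.Second // assumed regular persist cadence

// persistOrDelay picks between background and blocking persist. The
// rule from the commit message: if we're behind the regular schedule
// for whatever reason (slow persist or a long GC run), persist
// synchronously so that in-flight changes stay bounded.
func persistOrDelay(lastDone time.Time, persist func(block bool)) {
	if time.Since(lastDone) > persistInterval {
		persist(true) // behind schedule: block until data is on disk
	} else {
		go persist(false) // on schedule: persist in the background
	}
}

func main() {
	persist := func(block bool) { fmt.Println("persisting, blocking:", block) }
	persistOrDelay(time.Now().Add(-2*time.Second), persist)
	time.Sleep(100 * time.Millisecond) // give a background persist time to run
}
```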
RemoveUntraceableBlocks without RemoveUntraceableHeaders is still a valid configuration, and removeOldTransfers() now only checks gcBlockTimes for timestamps, so they should always be added. This fixes

2025-01-13T23:28:57.340+0300 ERROR failed to get block timestamp transfer GC {"time": "1.162µs", "index": 20000}

for RemoveUntraceableBlocks-only configurations. Signed-off-by: Roman Khimov <[email protected]>
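A minimal sketch of the timestamp cache idea: record block times unconditionally at persist time so transfer GC never depends on headers still being present. The map-based `gcBlockTimes` here is illustrative only; the actual neo-go structure differs.

```go
package main

import "fmt"

// gcBlockTimes caches block timestamps for transfer GC, so lookups
// keep working even when blocks (and possibly headers) get pruned.
var gcBlockTimes = make(map[uint32]uint64)

// rememberBlockTime is called on persist for every block whose
// timestamp transfer GC may need later.
func rememberBlockTime(index uint32, msec uint64) {
	gcBlockTimes[index] = msec
}

// blockTimeForGC consults only the cache, never the (possibly pruned)
// header storage.
func blockTimeForGC(index uint32) (uint64, bool) {
	ts, ok := gcBlockTimes[index]
	return ts, ok
}

func main() {
	rememberBlockTime(20000, 1736800000000)
	if ts, ok := blockTimeForGC(20000); ok {
		fmt.Println("transfer GC timestamp:", ts)
	}
}
```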
In fact, it is very critical. Signed-off-by: Roman Khimov <[email protected]>
We're trying to delete less than 1% of the data in the default configuration for mainnet, so it looks like this:

2025-01-14T15:51:39.449+0300 INFO finished MPT garbage collection {"removed": 221115, "kept": 71236766, "time": "5m40.323822085s"}

Spending this much time for this low a gain every 10K blocks is far from optimal. Signed-off-by: Roman Khimov <[email protected]>
Force-pushed from d019d32 to c071932.
Codecov Report
Attention: Patch coverage is

@@            Coverage Diff             @@
##           master    #3787      +/-   ##
==========================================
- Coverage   83.09%   83.07%    -0.02%
==========================================
  Files         335      335
  Lines       46873    46894      +21
==========================================
+ Hits        38949    38959      +10
- Misses       6337     6346       +9
- Partials     1587     1589       +2
In fact, the last one does fix #3783: the default configuration can then be used without OOMs, but it's barely doing anything useful, just GCing over and over again:
If GC takes 5m every 10K blocks, then 4M blocks mean 400 runs and more than a day of GC time (400 × 5m ≈ 33h); this obviously doesn't make any sense.
For comparison, with LevelDB, GC works like this (standard 10K settings):
Instead of tick-tocking with sync/async and having an unpredictable data set, we can just check the real number of keys that can be processed by the underlying DB. It can't be perfect, but it still adds a hard limit to the amount of in-memory data. It's also adaptive: slower machines will keep less and faster machines will keep more.

This gives almost perfect 4s cycles for mainnet BoltDB with no tail cutting; it makes zero sense to process more blocks since we're clearly DB-bound:

2025-01-15T11:35:00.567+0300 INFO persisted to disk {"blocks": 1469, "keys": 40579, "headerHeight": 5438141, "blockHeight": 5438140, "velocity": 9912, "took": "4.378939648s"}
2025-01-15T11:35:04.699+0300 INFO persisted to disk {"blocks": 1060, "keys": 39976, "headerHeight": 5439201, "blockHeight": 5439200, "velocity": 9888, "took": "4.131985438s"}
2025-01-15T11:35:08.752+0300 INFO persisted to disk {"blocks": 1508, "keys": 39658, "headerHeight": 5440709, "blockHeight": 5440708, "velocity": 9877, "took": "4.052347569s"}
2025-01-15T11:35:12.807+0300 INFO persisted to disk {"blocks": 1645, "keys": 39565, "headerHeight": 5442354, "blockHeight": 5442353, "velocity": 9864, "took": "4.05547743s"}
2025-01-15T11:35:17.011+0300 INFO persisted to disk {"blocks": 1472, "keys": 39519, "headerHeight": 5443826, "blockHeight": 5443825, "velocity": 9817, "took": "4.203258142s"}
2025-01-15T11:35:21.089+0300 INFO persisted to disk {"blocks": 1345, "keys": 39529, "headerHeight": 5445171, "blockHeight": 5445170, "velocity": 9804, "took": "4.078297579s"}
2025-01-15T11:35:25.090+0300 INFO persisted to disk {"blocks": 1054, "keys": 39326, "headerHeight": 5446225, "blockHeight": 5446224, "velocity": 9806, "took": "4.000524899s"}
2025-01-15T11:35:30.372+0300 INFO persisted to disk {"blocks": 1239, "keys": 39349, "headerHeight": 5447464, "blockHeight": 5447463, "velocity": 9744, "took": "4.281444939s"}

2× can be considered, but this calculation isn't perfect for a low number of keys, so a somewhat bigger tolerance is preferable for now. Overall it's not degrading performance; my mainnet/bolt run was even 8% better with this.

Fixes #3249, we don't need any option this way. Fixes #3783 as well, it no longer OOMs in that scenario. It can still OOM in case of a big GarbageCollectionPeriod (like 400K), but this can't be solved easily. Signed-off-by: Roman Khimov <[email protected]>
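A minimal sketch of the velocity idea under stated assumptions: a fixed 4s target cycle (the sweet spot visible in the logs above) and exponential smoothing of the keys-per-second estimate. Names and smoothing factor are illustrative, not the actual neo-go implementation:

```go
package main

import (
	"fmt"
	"time"
)

const targetCycle = 4 * time.Second // assumed target persist cycle length

// velocityLimiter measures how many keys the underlying DB actually
// absorbs per second and caps the in-memory change set at roughly one
// cycle's worth. Slower machines get a lower cap, faster ones a higher.
type velocityLimiter struct {
	velocity float64 // keys per second, smoothed across cycles
}

// update feeds the result of a finished persist cycle into the estimate.
func (v *velocityLimiter) update(keys int, took time.Duration) {
	cur := float64(keys) / took.Seconds()
	if v.velocity == 0 {
		v.velocity = cur
		return
	}
	// Exponential smoothing keeps one odd cycle from swinging the cap.
	v.velocity = 0.75*v.velocity + 0.25*cur
}

// limit is the number of buffered keys above which block processing
// should pause until the next persist completes.
func (v *velocityLimiter) limit() int {
	if v.velocity == 0 {
		return 1 << 20 // no estimate yet: effectively unlimited
	}
	return int(v.velocity * targetCycle.Seconds())
}

func main() {
	var vl velocityLimiter
	vl.update(40579, 4378939648*time.Nanosecond) // numbers from the first log line above
	fmt.Println("velocity:", int(vl.velocity), "keys/s, cap:", vl.limit(), "keys")
}
```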
Force-pushed from c071932 to 9d9b40f.
Problem
The real problem is #3783, but it's not solved here.
Solution
Still, some improvements are made.