horizon: TestFillGaps and TestReingestDB are flaky #5035

2opremio · 2023-09-01T22:50:29Z

The TestFillGaps and TestReingestDB integration tests are flaky

Here is a failing instance: https://github.com/stellar/go/actions/runs/6039252190/job/16387478256

2opremio · 2023-09-01T23:11:07Z

Another instance: https://github.com/stellar/go/actions/runs/6054511730/job/16432024057?pr=5034

2opremio · 2023-09-06T14:15:08Z

Yet another instance https://github.com/stellar/go/actions/runs/6097252148/job/16544514110?pr=5040

tamirms · 2023-09-06T19:20:25Z

This appears to be caused by a bug in stellar-core. To reproduce:

start a standalone network upgraded to protocol 20
wait until the first or second checkpoint ledger when the history archive snapshots have been published
run the following command with this stellar.conf: stellar-core --conf stellar-core.conf --console catchup 39/38 --in-memory --metadata-output-stream dump

After running the catchup command stellar-core will crash and produce the following logs:

Warning: soroban-env-host-curr is running a pre-release version 0.0.17
2023-09-06T20:12:49.903 [default INFO] Config from /var/folders/5f/nz27jvzd7mngfqh_r7ys4dlc0000gn/T/TestReingestDB3045610017/002/captive-core-b332df7a/stellar-core.conf
2023-09-06T20:12:49.903 [default WARNING] Overriding MINIMUM_PERSISTENT_ENTRY_LIFETIME to 10
2023-09-06T20:12:49.903 [default WARNING] Overriding Soroban limits with TESTING_SOROBAN_HIGH_LIMIT_OVERRIDE
2023-09-06T20:12:49.903 [default INFO] 'RUN_STANDALONE' enabled in configuration file - node will not function properly with most networks
2023-09-06T20:12:49.903 [default INFO] 'ARTIFICIALLY_ACCELERATE_TIME_FOR_TESTING' enabled in configuration file - node will not function properly with most networks
2023-09-06T20:12:49.903 [default INFO] Generated QUORUM_SET: {
   "t" : 1,
   "v" : [ "local_core" ]
}

2023-09-06T20:12:49.904 [default WARNING] Quorum can be reached without validators with an archive
2023-09-06T20:12:49.904 [default WARNING] Adjusted MAX_PENDING_CONNECTIONS from 500 to 120 due to OS limits (the maximum number of file descriptors)
2023-09-06T20:12:49.908 GCBAK [Database INFO] Connecting to: sqlite3://:memory:
2023-09-06T20:12:49.909 GCBAK [Ledger INFO] Streaming metadata to 'dump'
2023-09-06T20:12:49.909 GCBAK [SCP INFO] LocalNode::LocalNode@GCBAK qSet: 3bdc92
2023-09-06T20:12:49.910 GCBAK [default INFO] * 
2023-09-06T20:12:49.910 GCBAK [default INFO] * The database has been initialized
2023-09-06T20:12:49.910 GCBAK [default INFO] * 
2023-09-06T20:12:49.910 GCBAK [Database INFO] Applying DB schema upgrade to version 14
2023-09-06T20:12:49.910 GCBAK [Database INFO] Applying DB schema upgrade to version 15
2023-09-06T20:12:49.910 GCBAK [Database INFO] Applying DB schema upgrade to version 16
2023-09-06T20:12:49.910 GCBAK [Database INFO] Applying DB schema upgrade to version 17
2023-09-06T20:12:49.910 GCBAK [Database INFO] Applying DB schema upgrade to version 18
2023-09-06T20:12:49.910 GCBAK [Database INFO] Applying DB schema upgrade to version 19
2023-09-06T20:12:49.910 GCBAK [Database INFO] Applying DB schema upgrade to version 20
2023-09-06T20:12:49.910 GCBAK [Database INFO] DB schema is in current version
2023-09-06T20:12:49.910 GCBAK [Ledger INFO] Established genesis ledger, closing
2023-09-06T20:12:49.910 GCBAK [Ledger INFO] Root account: GBZXN7PIRZGNMHGA7MUUUF4GWPY5AYPV6LY4UV2GL6VJGIQRXFDNMADI
2023-09-06T20:12:49.910 GCBAK [Ledger INFO] Root account seed: SC5O7VZUXDJ6JBDSZ74DSERXL7W3Y5LTOAMRF7RQRL3TAGAPS7LUVG3L
2023-09-06T20:12:49.911 GCBAK [default INFO] Connection effective settings:
2023-09-06T20:12:49.911 GCBAK [default INFO] TARGET_PEER_CONNECTIONS: 8
2023-09-06T20:12:49.911 GCBAK [default INFO] MAX_ADDITIONAL_PEER_CONNECTIONS: 64
2023-09-06T20:12:49.911 GCBAK [default INFO] MAX_PENDING_CONNECTIONS: 120
2023-09-06T20:12:49.911 GCBAK [default INFO] MAX_OUTBOUND_PENDING_CONNECTIONS: 14
2023-09-06T20:12:49.911 GCBAK [default INFO] MAX_INBOUND_PENDING_CONNECTIONS: 106
2023-09-06T20:12:49.911 GCBAK [History WARNING] Unsafe command: use --trusted-checkpoint-hashes or --trusted-hash to ensure catchup integrity. If you want to run untrusted catchup, use --force-untrusted-catchup.
2023-09-06T20:12:49.911 GCBAK [Ledger INFO] Starting up application
2023-09-06T20:12:49.911 GCBAK [Ledger INFO] Last closed ledger (LCL) hash is 76df1b9ff762a7a8af1f3c88549fe742e488e80fde561e34b1b6115bdac8d397
2023-09-06T20:12:49.911 GCBAK [Ledger INFO] LCL is genesis: [seq=1, hash=76df1b]
2023-09-06T20:12:49.912 GCBAK [Ledger INFO] Assumed bucket-state for LCL: [seq=1, hash=76df1b]
2023-09-06T20:12:49.912 GCBAK [Ledger INFO] Changing state LM_BOOTING_STATE -> LM_CATCHING_UP_STATE
2023-09-06T20:12:49.912 GCBAK [History INFO] Downloading history archive state: .well-known/stellar-history.json
2023-09-06T20:12:49.912 GCBAK [History INFO] Starting catchup with configuration:
  lastClosedLedger: 1
  toLedger: 39
  count: 38
2023-09-06T20:12:49.912 GCBAK [History INFO] Catching up to ledger 39: Downloading state file history/00/00/00/history-00000027.json for ledger 39
2023-09-06T20:12:49.913 GCBAK [History INFO] Downloading history archive state: history/00/00/00/history-00000027.json
2023-09-06T20:12:49.943 GCBAK [History INFO] Archive information: [name: h0, server: stellar-core 19.13.1.vnext (22b9bb38416e5ce058eb9c1fadfdd414a4661833), currentLedger: 95]
2023-09-06T20:12:49.945 GCBAK [History INFO] Catching up to ledger 39: downloading ledger files 1/5 (20%)
2023-09-06T20:12:49.945 GCBAK [History INFO] Catching up to ledger 39: downloading ledger files 5/5 (100%)
2023-09-06T20:12:49.974 GCBAK [History INFO] Catching up to ledger 39: Succeeded: batch-download-ledger-00000007-0000002f : 5/5 children completed
2023-09-06T20:12:49.974 GCBAK [History INFO] Verifying ledgers [2,40)
2023-09-06T20:12:49.974 GCBAK [History INFO] Skipping verification for ledger [seq=39, hash=42326a] against SCP hash
2023-09-06T20:12:49.974 GCBAK [History INFO] Catching up to ledger 39: verifying checkpoint 1/5 (20%)
2023-09-06T20:12:49.974 GCBAK [History INFO] Catching up to ledger 39: verifying checkpoint 2/5 (40%)
2023-09-06T20:12:49.974 GCBAK [History INFO] Catching up to ledger 39: verifying checkpoint 3/5 (60%)
2023-09-06T20:12:49.974 GCBAK [History INFO] Catching up to ledger 39: verifying checkpoint 4/5 (80%)
2023-09-06T20:12:49.974 GCBAK [History INFO] History chain [2,39] verified
2023-09-06T20:12:49.974 GCBAK [History INFO] Catching up to ledger 39: Succeeded: verify-ledger-chain
2023-09-06T20:12:49.974 GCBAK [History INFO] Catching up to ledger 39: Ready to run: herder-state-consistency-work
2023-09-06T20:12:49.974 GCBAK [History INFO] Herder state is set! tracking=39, closeTime=1694027508
2023-09-06T20:12:49.974 GCBAK [History INFO] Catching up to ledger 39: Succeeded: herder-state-consistency-work
2023-09-06T20:12:49.974 GCBAK [History INFO] Downloading, unzipping and applying transactions for checkpoint 7
2023-09-06T20:12:49.974 GCBAK [History INFO] Downloading, unzipping and applying transactions for checkpoint 15
2023-09-06T20:12:49.974 GCBAK [History INFO] Downloading, unzipping and applying transactions for checkpoint 23
2023-09-06T20:12:49.974 GCBAK [History INFO] Downloading, unzipping and applying transactions for checkpoint 31
2023-09-06T20:12:49.974 GCBAK [History INFO] Downloading, unzipping and applying transactions for checkpoint 39
2023-09-06T20:12:49.974 GCBAK [History INFO] Catching up to ledger 39: Download & apply checkpoints: num checkpoints left to apply:5 (0% done)
2023-09-06T20:12:49.995 GCBAK [Ledger INFO] Starting prepareLiabilities
2023-09-06T20:12:49.995 GCBAK [Ledger INFO] prepareLiabilities completed with 0 accounts modified, 0 trustlines modified, 0 offers adjusted, 0 offers adjusted to zero, and 0 offers erased
mVersion == 2 at ledger/LedgerCloseMetaFrame.cpp:208
backtrace unavailable
Abort trap: 6

This bug only seems to occur in protocol 20. When I used the same stellar-core binary but on protocol 19 I was not able to reproduce the crash.

I used a stellar-core binary built from https://github.com/stellar/stellar-core/tree/22b9bb38416e5ce058eb9c1fadfdd414a4661833

$ stellar-core version
Warning: soroban-env-host-curr is running a pre-release version 0.0.17
quarkslab-core-54-g22b9bb384-dirty
rust version: rustc 1.71.1 (eb26296b5 2023-08-03)
soroban-env-host: 
    curr:
        package version: 0.0.17
        git version: eb5a9ba053a7b64a8eff605db625525378f7bea0
        ledger protocol version: 20
        pre-release version: 57
        rs-stellar-xdr:
            package version: 0.0.17
            git version: 39904e09941046dab61e6e35fc89e31bf2dea1cd
            base XDR git version: 65afa63b7f52c898143ebbe9541ef91fcf290ade

2opremio · 2023-09-06T19:22:32Z

This has been happening for a while (for a few stellar core builds I would say)

graydon · 2023-09-06T20:59:43Z

Hmm. The LedgerCloseMetaFrame sets its version to 2 when built on a >= v20 protocol ledger. But ... that actually happens before upgrades are applied (in LedgerManagerImpl::closeLedger , which means that when we are passing through the upgrade ledger we build a LedgerCloseMetaFrame thinking we're closing a v19 header (which will set version to 1), but the header we actually have when we complete the close is a v20 header.

Probably started failing with Garand's change to enable eviction (which is asserting this condition) but it was wrong for months before. The test meta stream contains reasonable meta should probably have caught this, but it's possible the checking isn't strict enough to catch the exact transition ledger (i.e. it might admit the case that's occurring here where ledger X that transitions from v19->v20 is emitting a v19 LCM, and only X+1 emits a v20 LCM). Tightening this test should accompany any fix.

Relatively straightforward fix -- just carefully adjust the LCMF construction path in closeLedger to happen after upgrades and make sure doing that doesn't break anything else -- just another symptom of code we haven't got especially strong testing of yet.

EDIT: I misunderstood apply-vs-upgrade order. Apparently upgrades are supposed to happen after apply so the LCM should be v19 in that close. Unfortunately I also don't quite think it's easy to call the upgrade-to-v20 ledger a v19 ledger: we already have numerous codepaths that extract the post-upgrade version number and pass it elsewhere to gate new logic (eg. the bucketlist will treat the upgrade-to-vNN ledger as turning on all vNN-gated logic, has for years; if we change that to mean only activate that logic on the next ledger after the upgrade, replay will break).

graydon · 2023-09-06T23:36:22Z

From conversation, a summary:

For a ledger that upgrades from protocol vN to vN+1, we currently have two different assumptions in different parts of the ledger-close path:
- In closeLedger we mostly treat the ledger as being on vN, eg. during tx apply and LCM construction
- In the final stage, when we call ledgerClosed, we pass vN+1 because the upgrade completed and modified the ltx header, and we fish the protocol out of the ltx header
This is probably a historical mistake but it's far too late to change: we pass vN+1 to the bucketlist, for example, which bakes it into the bucket meta record for the newly formed bucket (and activates any vN+1-gated code).
The LCM we construct in closeLedger has lately (with the eviction code) started to be amended in ledgerClosed and this is the source of the bug: it was constructed thinking it was a vN LCM, but it gets further amended under the assumption that it's a vN+1 LCM.
There are a couple ways we can fix this:
1. Since the LCM amending only happens on-or-after v20, and there will be nothing much to do on the v19->v20 boundary ledger, we can just relax the assertion that's tripping here and let the LCM-amending methods work in v19 as well. Make them no-ops -- you can call them but they just won't access the v20-related fields of the LCM. Put a big comment in explaining why this is ok and hope there's no other weird case that we're letting through by weakening that condition.
2. Add a method to the LCM that fishes out the protocol version it was constructed with and use that as the condition in ledgerClosed (and its callees) to decide to call the LCM-amending methods, rather than the post-close ledger protocol number. Again, if we do this there had better be a big comment explaining it, and some tests checking, because it's very subtle!

mollykarcher added this to Platform Scrum Sep 6, 2023

github-project-automation bot moved this to Backlog in Platform Scrum Sep 6, 2023

mollykarcher added this to the Soroban Testnet Release milestone Sep 6, 2023

mollykarcher assigned tamirms Sep 6, 2023

mollykarcher moved this from Backlog to Current Sprint in Platform Scrum Sep 6, 2023

mollykarcher added the soroban-scrum label Sep 6, 2023

anupsdf mentioned this issue Sep 7, 2023

Core crash caused by LedgerCloseMeta during upgrade of protocol stellar/stellar-core#3909

Closed

tamirms mentioned this issue Sep 7, 2023

.github: Fix caching of integration test runs #5041

Merged

7 tasks

mollykarcher moved this from Current Sprint to Blocked in Platform Scrum Sep 12, 2023

mollykarcher moved this from Blocked to Done in Platform Scrum Sep 14, 2023

mollykarcher closed this as completed Sep 14, 2023

tamirms mentioned this issue Dec 20, 2023

Support manual closing N ledgers stellar/stellar-core#4040

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

horizon: TestFillGaps and TestReingestDB are flaky #5035

horizon: TestFillGaps and TestReingestDB are flaky #5035

2opremio commented Sep 1, 2023 •

edited

Loading

2opremio commented Sep 1, 2023

2opremio commented Sep 6, 2023

tamirms commented Sep 6, 2023

2opremio commented Sep 6, 2023

graydon commented Sep 6, 2023 •

edited

Loading

graydon commented Sep 6, 2023

horizon: TestFillGaps and TestReingestDB are flaky #5035

horizon: TestFillGaps and TestReingestDB are flaky #5035

Comments

2opremio commented Sep 1, 2023 • edited Loading

2opremio commented Sep 1, 2023

2opremio commented Sep 6, 2023

tamirms commented Sep 6, 2023

2opremio commented Sep 6, 2023

graydon commented Sep 6, 2023 • edited Loading

graydon commented Sep 6, 2023

2opremio commented Sep 1, 2023 •

edited

Loading

graydon commented Sep 6, 2023 •

edited

Loading