ingest/ledgerbackend: add trusted hash to captive core catchup #5431

sreuland · 2024-08-19T19:46:59Z

PR Checklist

PR Structure

This PR has reasonably narrow scope (if not, break it down into smaller PRs).
This PR avoids mixing refactoring changes with feature changes (split into two PRs
otherwise).
This PR's title starts with name of package that is most changed in the PR, ex.
services/friendbot, or all or doc if the changes are broad or impact many
packages.

Thoroughness

This PR adds tests for the most critical parts of the new functionality or fixes.
I've updated any docs (developer docs, .md
files, etc... affected by this change). Take a look in the docs folder for a given service,
like this one.

Release planning

I've reviewed the changes in this PR and if I consider them worthwhile for being mentioned on release notes then I have updated the relevant CHANGELOG.md within the component folder structure. For example, if I changed horizon, then I updated (services/horizon/CHANGELOG.md. I add a new line item describing the change and reference to this PR. If I don't update a CHANGELOG, I acknowledge this PR's change may not be mentioned in future release notes.
I've decided if this PR requires a new major/minor version according to
semver, or if it's mainly a patch change. The PR is targeted at the next
release branch if it's not a patch change.

What

The Captive Core backend now performs 'online' stellar-core run for bounded modes of tx-meta retrieval, which will be used for Horizon's db reingest range and ingest verify-range commands.

Using run enables core to build, validate, and emit trusted ledger hashes into the tx-meta stream for the requested ledger range. The bounded range commands will no longer do the 'offline' mode of running core catchup for generating tx-meta from just history archives, which does not guarantee verification of the ledger hashes to that of live network as mentioned in (#4538).

Due to the usage of run with LCL set to the from , there is now potential for longer run time reingest and verify-range execution durations due to core having to perform online replay upfront from network latest ledger back to from. The longer runtime duration will be proportional to the older age of the from ledger.

Why

Prevents consuming tx-meta from a potentially tampered history archives server.

Closes #4538

Known limitations

The execution duration will be longer for bounded ranges with relatively old 'from' ledger sequence compared to latest network age(ledger sequence).

Evaluated other paths for obtaining trusted hashes, by enhancing the usage of existing catchup first, what was discovered is that command usage with trusted hash input flags leads to complex scenarios for obtaining the hash(es) in deployments and out-of-band from the horizon command(reingest, verify-state) invocation:

--trusted-checkpoint-hashes - to use this with catchup, need to first run verify-checkpoints from core to build the file, this will take just as long as run with LCL=1 at first time. After that, need to find a persistent disk storage path to keep the file, which contradicts the reingest cli expectations, which allows (re)running from any host/path, and always creates a new storage path for parallel purposes, and avoiding conflicts in general with multiple captive instances on same host. This doesn't scale for the range of uses cases that reingest can be used and would impinge on it.
--trusted-hash - to use this with catchup, requires first getting the live network side trusted hash for the intended 'to' ledger first, this can only be done via run with LCL={to-1} and ingesting the 'to' ledger and then stopping core.
- The horizon db could have 'to' ledger previously ingested from prior, but that probably shouldn't be considered a trusted source as the db table is no different than a HA server.
- This approach is similar to that of current proposed in pr, the difference being the PR uses run with LCL={from-1} and collects the meta for range from the online 'internal catchup/replay' mode that core runs regardless, whereas this would generate the meta for range from offline catchup command.

…sts that use catchup

…for tx size limit test

…it loop

sreuland · 2024-08-20T23:57:23Z

services/horizon/internal/test/integration/integration.go

@@ -566,6 +567,13 @@ func (i *Test) waitForCore() {
 			if durationSince := time.Since(infoTime); durationSince < coreStartupPingInterval {
 				time.Sleep(coreStartupPingInterval - durationSince)
 			}
+			cmdline := []string{"-f", integrationYaml, "logs", "core"}


added this for visibility on core startup during test execution, as the docker compose container output is detached, this helped identify a core dump that was happening in the container as the horizon test logs just repeat 'cannot connect to localhost:11626'

…on tests

sreuland · 2024-08-26T20:45:42Z

ingest/ledgerbackend/captive_core_backend.go

+		last = &boundedTo
+	}
+
+	c.lastLedger = last


this is the key to being able to (re)use live runFrom with bounded range, trigger the captive core to be stopped once it emits ledger on pipe that matched up to the bounded to if exists, which would have been same as core catchup process terminating at point of same to ledger emitted.

…out of space

…f disk space

…g it takes on CI

sreuland · 2024-08-27T21:56:00Z

.github/workflows/horizon.yml

-          # Any range should do for basic testing, this range was chosen pretty early in history so that it only takes a few mins to run
-          docker run -e BRANCH=$(git rev-parse HEAD) -e FROM=10000063 -e TO=10000127 stellar/horizon-verify-range
+          # Use small default range of two most recent checkpoints back from latest archived checkpoint. 
+          docker run -e TESTNET=true -e BRANCH=$(git rev-parse HEAD) -e FROM=0 -e TO=0 stellar/horizon-verify-range


using latest checkpoint on testnet for verify-range, tried similar on pubnet, takes 45+ minutes, mostly on change processor ledger entry processing.

tamirms · 2024-08-28T08:28:20Z

@sreuland could you provide some data on how much slower db reingest range now that we use stellar-core run instead of stellar-core catchup?

…md flag parsing as it's needed now since captive core does run mode

sreuland · 2024-08-29T02:09:49Z

@sreuland could you provide some data on how much slower db reingest range now that we use stellar-core run instead of stellar-core catchup?

On mac/m1, with core v21.3.1, pubnet network:

Range	this pr	horizon v2.32.0	notes
53136575-53136639	53 minutes	27 minutes	recent 64 ledger range

i'm running older/larger ranges, will post results once complete.

tamirms · 2024-08-29T07:58:15Z

@sreuland I think we need to take a step back and consider alternative implementations.

The 26 minute overhead on a recent 64 ledger range is pretty significant. Assuming the overhead increases linearly as you go further back in time, it seems likely that this issue becomes worse when reingesting older ledger ranges. The overhead of using stellar-core run instead of stellar-core catchup is also compounded when doing parallel reingestion with several workers on a large ledger range.

You mentioned --trusted-checkpoint-hashes as one possible alternative implementation. Could you measure how long it takes to generate the trusted hashes file by running stellar-core verify-checkpoints ? Generating the file once and then using it during parallel reingestion of large ledger ranges seems appealing because the overhead to generate the file is just a one time cost compared to the overhead of using stellar-core run on every task in the parallel reingestion task queue.

Another idea to consider is that we can verify the ledger chain at the end of a running reingestion task. After ingesting a bounded range with captive core we can check the ledgers at the boundaries of that range in the horizon db.

For example if we reingest the bounded range [start, end] we would query the horizon db for ledger end+1 and, if it exists, we check that previous ledger hash of ledger end+1 matches the ledger hash of end. Similarly, we can query the horizon db for ledger start-1 and, if it exists, we check that the previous ledger hash of ledger start matches the ledger hash of start-1. If there is a mismatch in ledger hashes we can exit with an error.

Also, when ingesting an unbounded range with captive core we would need check that the start ledger's previous ledger hash matches whatever we have in the horizon db. Howevr, one of the downsides of this approach is that if there is a ledger mismatch we only find out at the very end.

It would be helpful to have a design document which outlines different possible implementations with their pros / cons to make it easier to analyze which is the best solution.

sreuland · 2024-08-29T19:07:37Z

It would be helpful to have a design document which outlines different possible implementations with their pros / cons to make it easier to analyze which is the best solution.

yes, good suggestion, I've started design doc to continue discussion on options.

sreuland · 2024-08-29T19:12:01Z

Also, when ingesting an unbounded range with captive core we would need check that the start ledger's previous ledger hash matches whatever we have in the horizon db. Howevr, one of the downsides of this approach is that if there is a ledger mismatch we only find out at the very end.

Ah, good call out, I think this could be a separate scope/ticket, created #5450, to add current network validation via ledger hash to horizon's db as part of forward 'live' ingestion mode? This ticket was focused on hash validation specific to reingestion of bounded offline.

sreuland · 2024-10-03T17:22:53Z

closing this pr as obsolete now in favor of a different design approach mentioned for repairing skip list usage in captive core binary instead, once this repair propagates, it will enable existing 'catchup' to emit anchored(to current network) trusted hashes within ledger metadata.

sreuland added 3 commits August 19, 2024 12:20

stellar#4538: obtain trusted hash for captive core catchup command

61a5f80

Merge remote-tracking branch 'upstream/master' into trusted_catchup

1a50ed7

stellar#4538: updated changelog notes

3b706e4

sreuland marked this pull request as draft August 20, 2024 00:08

sreuland added 2 commits August 19, 2024 23:26

stellar#4538: use LedgerBackend interface for NewCaptive, fix unit te…

2937c18

…sts that use catchup

Merge remote-tracking branch 'upstream/master' into trusted_catchup

116f3b1

sreuland marked this pull request as ready for review August 20, 2024 16:11

sreuland added 4 commits August 20, 2024 09:26

stellar#4538: fix govet warn

8da847f

stellar#4538: fixed TestTxSubLimitsBodySize to not depend on soroban …

f91fa55

…for tx size limit test

stellar#4538: added core container log output on integrationt test wa…

d049e73

…it loop

stellar#4538: pin stellar-core image in CI to last stable 21.2.1 image

805950a

sreuland commented Aug 20, 2024

View reviewed changes

stellar#4538: fix captive listen port conflicts on reingest integrati…

3b335b7

…on tests

sreuland marked this pull request as draft August 22, 2024 16:09

sreuland added 3 commits August 26, 2024 11:59

message

c1295eb

Merge remote-tracking branch 'upstream/master' into trusted_catchup

be4df19

stellar#4538: updated CHANGELOGs

e56e346

sreuland marked this pull request as ready for review August 26, 2024 20:15

stellar#4538: fixed verify-range ci test

2270bba

sreuland commented Aug 26, 2024

View reviewed changes

sreuland added 7 commits August 26, 2024 14:02

stellar#4538: fixed verify-range ci test, again

7315813

stellar#4538: fixed shell script syntax on verify range

d236308

stellar#4538: create free space on gh runner for verify-range due to …

32bb6b9

…out of space

stellar#4538: use next larger gh runner for verify-range due to out o…

0ee83dc

…f disk space

stellar#4538: try older range on pubnet for verify-range, see how lon…

659d0e0

…g it takes on CI

stellar#4538: use testnet for verify-range, shorter duration than pubnet

6e61306

Merge remote-tracking branch 'upstream/master' into trusted_catchup

3d23723

sreuland commented Aug 27, 2024

View reviewed changes

stellar#4538: enabled captive core full config option from reingest c…

d58b15a

…md flag parsing as it's needed now since captive core does run mode

stellar#4538: included new file for db command tests

41eeb8f

sreuland marked this pull request as draft August 30, 2024 16:22

This was referenced Aug 30, 2024

Validate ledger chain of prior history during ingestion catch up #5450

Open

Horizon: missing information passed to captive-core when configured to run "on disk" #4538

Open

sreuland closed this Oct 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ingest/ledgerbackend: add trusted hash to captive core catchup #5431

ingest/ledgerbackend: add trusted hash to captive core catchup #5431

sreuland commented Aug 19, 2024 •

edited

Loading

sreuland Aug 20, 2024

sreuland Aug 26, 2024

sreuland Aug 27, 2024

tamirms commented Aug 28, 2024

sreuland commented Aug 29, 2024

tamirms commented Aug 29, 2024

sreuland commented Aug 29, 2024

sreuland commented Aug 29, 2024 •

edited

Loading

sreuland commented Oct 3, 2024

ingest/ledgerbackend: add trusted hash to captive core catchup #5431

ingest/ledgerbackend: add trusted hash to captive core catchup #5431

Conversation

sreuland commented Aug 19, 2024 • edited Loading

PR Structure

Thoroughness

Release planning

What

Why

Known limitations

sreuland Aug 20, 2024

Choose a reason for hiding this comment

sreuland Aug 26, 2024

Choose a reason for hiding this comment

sreuland Aug 27, 2024

Choose a reason for hiding this comment

tamirms commented Aug 28, 2024

sreuland commented Aug 29, 2024

tamirms commented Aug 29, 2024

sreuland commented Aug 29, 2024

sreuland commented Aug 29, 2024 • edited Loading

sreuland commented Oct 3, 2024

sreuland commented Aug 19, 2024 •

edited

Loading

sreuland commented Aug 29, 2024 •

edited

Loading