ingest/ledgerbackend: add trusted hash to captive core catchup #5431

Closed
wants to merge 23 commits

Conversation

sreuland
Contributor

@sreuland commented Aug 19, 2024

PR Checklist

PR Structure

  • This PR has reasonably narrow scope (if not, break it down into smaller PRs).
  • This PR avoids mixing refactoring changes with feature changes (split into two PRs
    otherwise).
  • This PR's title starts with name of package that is most changed in the PR, ex.
    services/friendbot, or all or doc if the changes are broad or impact many
    packages.

Thoroughness

  • This PR adds tests for the most critical parts of the new functionality or fixes.
  • I've updated any docs (developer docs, .md
    files, etc... affected by this change). Take a look in the docs folder for a given service,
    like this one.

Release planning

  • I've reviewed the changes in this PR, and if I consider them worthwhile for mentioning in the release notes, then I have updated the relevant CHANGELOG.md within the component folder structure. For example, if I changed horizon, then I updated services/horizon/CHANGELOG.md, adding a new line item describing the change with a reference to this PR. If I don't update a CHANGELOG, I acknowledge this PR's change may not be mentioned in future release notes.
  • I've decided if this PR requires a new major/minor version according to
    semver, or if it's mainly a patch change. The PR is targeted at the next
    release branch if it's not a patch change.

What

The Captive Core backend now performs an 'online' stellar-core run for bounded modes of tx-meta retrieval, which is used by Horizon's db reingest range and ingest verify-range commands.

Using run enables core to build, validate, and emit trusted ledger hashes into the tx-meta stream for the requested ledger range. The bounded range commands will no longer use the 'offline' mode of running core catchup to generate tx-meta from just history archives, which does not guarantee that ledger hashes are verified against those of the live network, as mentioned in #4538.

Because run is now invoked with the LCL set to from, reingest and verify-range executions can take longer: core must first perform an online replay from the network's latest ledger back to from. The extra runtime is proportional to how old the from ledger is.
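
For reference, a minimal sketch (not part of this PR's diff) of how a caller drives the bounded-range path in the ledgerbackend package; the config below is simplified and omits fields (e.g. the captive core toml) that a real deployment needs:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/stellar/go/ingest/ledgerbackend"
)

func main() {
	ctx := context.Background()

	// Simplified config: a real deployment also supplies the captive core toml
	// and settings appropriate for the target network.
	backend, err := ledgerbackend.NewCaptive(ledgerbackend.CaptiveCoreConfig{
		BinaryPath:         "/usr/bin/stellar-core",
		NetworkPassphrase:  "Public Global Stellar Network ; September 2015",
		HistoryArchiveURLs: []string{"https://history.stellar.org/prd/core-live/core_live_001"},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer backend.Close()

	// With this PR, preparing a bounded range starts an online `stellar-core run`
	// (rather than `catchup`), so the emitted ledgers carry hashes validated
	// against the live network.
	from, to := uint32(10000063), uint32(10000127)
	if err := backend.PrepareRange(ctx, ledgerbackend.BoundedRange(from, to)); err != nil {
		log.Fatal(err)
	}

	for seq := from; seq <= to; seq++ {
		meta, err := backend.GetLedger(ctx, seq)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println("ledger", meta.LedgerSequence(), "hash", meta.LedgerHash().HexString())
	}
}
```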

Why

Prevents consuming tx-meta from a potentially tampered history archive server.

Closes #4538

Known limitations

Execution duration will be longer for bounded ranges whose 'from' ledger sequence is old relative to the network's latest ledger sequence.

Other paths for obtaining trusted hashes were evaluated first, by enhancing the usage of the existing catchup command. What was discovered is that using catchup with its trusted-hash input flags leads to complex scenarios for obtaining the hash(es) in deployments, out of band from the horizon command (reingest, verify-range) invocation:

  • --trusted-checkpoint-hashes - to use this with catchup, we first need to run core's verify-checkpoints command to build the hashes file, which the first time takes just as long as run with LCL=1. After that, we need a persistent disk storage path to keep the file, which contradicts the reingest cli expectations: it allows (re)running from any host/path and always creates a new storage path to support parallel runs and avoid conflicts between multiple captive instances on the same host. This doesn't scale across the range of use cases reingest supports and would impinge on them.

  • --trusted-hash - to use this with catchup requires first getting the live-network trusted hash for the intended 'to' ledger; this can only be done via run with LCL={to-1}, ingesting the 'to' ledger, and then stopping core.

    • The horizon db could already have the 'to' ledger ingested from a prior run, but that probably shouldn't be considered a trusted source, as the db table is no more trustworthy than a HA server.
    • This approach is similar to the one currently proposed in this PR; the difference is that the PR uses run with LCL={from-1} and collects the meta for the range from the online 'internal catchup/replay' mode that core runs regardless, whereas this would generate the meta for the range from the offline catchup command.

@sreuland marked this pull request as draft August 20, 2024 00:08
@sreuland marked this pull request as ready for review August 20, 2024 16:11
@@ -566,6 +567,13 @@ func (i *Test) waitForCore() {
if durationSince := time.Since(infoTime); durationSince < coreStartupPingInterval {
time.Sleep(coreStartupPingInterval - durationSince)
}
cmdline := []string{"-f", integrationYaml, "logs", "core"}
Contributor Author

Added this for visibility into core startup during test execution. Since the docker compose container output is detached, this helped identify a core dump happening in the container, whereas the horizon test logs just repeat 'cannot connect to localhost:11626'.
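
A rough, self-contained sketch of the idea (the compose file name is hypothetical; the integration test wires up its own command runner):

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Hypothetical compose file name; the integration test supplies its own.
	integrationYaml := "docker-compose.integration-tests.yml"
	cmdline := []string{"-f", integrationYaml, "logs", "core"}

	// Surface the detached container output so a crashing core (e.g. a core
	// dump) shows up next to the horizon test logs instead of only the
	// repeated "cannot connect to localhost:11626" messages.
	out, err := exec.Command("docker-compose", cmdline...).CombinedOutput()
	if err != nil {
		fmt.Println("could not fetch core logs:", err)
	}
	fmt.Printf("core logs:\n%s\n", out)
}
```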

@sreuland marked this pull request as draft August 22, 2024 16:09
@sreuland marked this pull request as ready for review August 26, 2024 20:15
last = &boundedTo
}

c.lastLedger = last
Contributor Author

This is the key to being able to (re)use the live runFrom path with a bounded range: it triggers captive core to be stopped once it emits the ledger on the pipe that matches the bounded to (if one exists), which is equivalent to a core catchup process terminating at the point that same to ledger is emitted.
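
For illustration, a minimal self-contained sketch of that stop condition (the type and method names below are simplified stand-ins, not the actual ledgerbackend internals):

```go
package main

import "fmt"

// captiveCore is a simplified stand-in for the real backend type; lastLedger
// mirrors the field this PR sets for bounded ranges.
type captiveCore struct {
	lastLedger *uint32
	stopped    bool
}

// handleLedger is an illustrative hook invoked for each ledger read from the
// meta pipe. Once the bounded `to` ledger has been emitted, core is stopped,
// mirroring how a `stellar-core catchup` process would terminate after
// emitting that same ledger.
func (c *captiveCore) handleLedger(seq uint32) {
	if !c.stopped && c.lastLedger != nil && seq >= *c.lastLedger {
		c.stopped = true
		fmt.Println("bounded range complete at ledger", seq, "- stopping captive core")
	}
}

func main() {
	to := uint32(10000127)
	c := &captiveCore{lastLedger: &to}
	for seq := uint32(10000125); seq <= to; seq++ {
		c.handleLedger(seq)
	}
}
```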

# Any range should do for basic testing, this range was chosen pretty early in history so that it only takes a few mins to run
docker run -e BRANCH=$(git rev-parse HEAD) -e FROM=10000063 -e TO=10000127 stellar/horizon-verify-range
# Use small default range of two most recent checkpoints back from latest archived checkpoint.
docker run -e TESTNET=true -e BRANCH=$(git rev-parse HEAD) -e FROM=0 -e TO=0 stellar/horizon-verify-range
Contributor Author

Using the latest checkpoint on testnet for verify-range; I tried something similar on pubnet, but it takes 45+ minutes, mostly in change-processor ledger entry processing.

@tamirms
Contributor

tamirms commented Aug 28, 2024

@sreuland could you provide some data on how much slower db reingest range is now that we use stellar-core run instead of stellar-core catchup?

…md flag parsing as it's needed now since captive core does run mode
@sreuland
Contributor Author

@sreuland could you provide some data on how much slower db reingest range is now that we use stellar-core run instead of stellar-core catchup?

On mac/m1, with core v21.3.1, pubnet network:

| Range | this PR | horizon v2.32.0 | notes |
| --- | --- | --- | --- |
| 53136575-53136639 | 53 minutes | 27 minutes | recent 64 ledger range |

I'm running older/larger ranges and will post results once complete.

@tamirms
Contributor

tamirms commented Aug 29, 2024

@sreuland I think we need to take a step back and consider alternative implementations.

The 26 minute overhead on a recent 64 ledger range is pretty significant. Assuming the overhead increases linearly as you go further back in time, it seems likely that this issue becomes worse when reingesting older ledger ranges. The overhead of using stellar-core run instead of stellar-core catchup is also compounded when doing parallel reingestion with several workers on a large ledger range.

You mentioned --trusted-checkpoint-hashes as one possible alternative implementation. Could you measure how long it takes to generate the trusted hashes file by running stellar-core verify-checkpoints? Generating the file once and then using it during parallel reingestion of large ledger ranges seems appealing, because the overhead of generating the file is a one-time cost compared to the overhead of using stellar-core run on every task in the parallel reingestion task queue.

Another idea to consider is that we can verify the ledger chain at the end of a reingestion task. After ingesting a bounded range with captive core, we can check the ledgers at the boundaries of that range in the horizon db.

For example, if we reingest the bounded range [start, end] we would query the horizon db for ledger end+1 and, if it exists, check that the previous ledger hash of ledger end+1 matches the ledger hash of end. Similarly, we can query the horizon db for ledger start-1 and, if it exists, check that the previous ledger hash of ledger start matches the ledger hash of start-1. If there is a mismatch in ledger hashes we can exit with an error.

Also, when ingesting an unbounded range with captive core, we would need to check that the start ledger's previous ledger hash matches whatever we have in the horizon db. However, one of the downsides of this approach is that if there is a ledger mismatch we only find out at the very end.
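
For illustration only, a self-contained Go sketch of that boundary check; the ledgerInfo type and the in-memory map are hypothetical stand-ins for horizon's history_ledgers table, which a real implementation would query instead:

```go
package main

import (
	"errors"
	"fmt"
)

// ledgerInfo is a hypothetical, minimal view of a ledger row in the horizon db.
type ledgerInfo struct {
	Hash         string
	PreviousHash string
}

// verifyRangeBoundaries sketches the check described above: after reingesting
// [start, end], compare hashes against neighbouring ledgers already present
// in the db, if any.
func verifyRangeBoundaries(db map[uint32]ledgerInfo, start, end uint32) error {
	// If end+1 was previously ingested, its previous-hash must equal end's hash.
	if next, ok := db[end+1]; ok && next.PreviousHash != db[end].Hash {
		return errors.New("ledger hash mismatch between end and end+1")
	}
	// If start-1 was previously ingested, start's previous-hash must equal its hash.
	if prev, ok := db[start-1]; ok && db[start].PreviousHash != prev.Hash {
		return errors.New("ledger hash mismatch between start-1 and start")
	}
	return nil
}

func main() {
	db := map[uint32]ledgerInfo{
		99:  {Hash: "aaa"},
		100: {Hash: "bbb", PreviousHash: "aaa"},
		101: {Hash: "ccc", PreviousHash: "bbb"},
		102: {Hash: "ddd", PreviousHash: "ccc"},
	}
	// Pretend we just reingested the bounded range [100, 101].
	fmt.Println(verifyRangeBoundaries(db, 100, 101)) // <nil>
}
```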

It would be helpful to have a design document which outlines different possible implementations with their pros / cons to make it easier to analyze which is the best solution.

@sreuland
Contributor Author

It would be helpful to have a design document which outlines different possible implementations with their pros / cons to make it easier to analyze which is the best solution.

Yes, good suggestion. I've started a design doc to continue the discussion on options.

@sreuland
Contributor Author

sreuland commented Aug 29, 2024

Also, when ingesting an unbounded range with captive core, we would need to check that the start ledger's previous ledger hash matches whatever we have in the horizon db. However, one of the downsides of this approach is that if there is a ledger mismatch we only find out at the very end.

Ah, good callout. I think this could be a separate scope/ticket; I created #5450 to add current-network validation via ledger hash to horizon's db as part of the forward 'live' ingestion mode. This ticket was focused on hash validation specific to reingestion of bounded (offline) ranges.

@sreuland
Contributor Author

sreuland commented Oct 3, 2024

Closing this PR as obsolete in favor of a different design approach: repairing skip-list usage in the captive core binary instead. Once that repair propagates, it will enable the existing 'catchup' to emit trusted hashes (anchored to the current network) within the ledger metadata.

@sreuland closed this Oct 3, 2024
Development

Successfully merging this pull request may close these issues.

Horizon: missing information passed to captive-core when configured to run "on disk"