Support manual closing N ledgers #4040

Open
leighmcculloch opened this issue Nov 17, 2023 · 9 comments

Comments

@leighmcculloch
Member

leighmcculloch commented Nov 17, 2023

Description

Explain in detail the additional functionality you would like to see in stellar-core.

Expand the capabilities of the manualclose HTTP endpoint so that it accepts a count parameter that defaults to 1 if not provided, preserving existing behavior. If a larger value is passed, the node closes that many ledgers, progressing the network by the given number of ledgers.
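A sketch of how the proposed parameter might be invoked from a test, assuming Go. The count query parameter is the feature being requested here and does not exist in stellar-core today; the plain manualclose command does exist, served on core's HTTP admin port (11626 by default) when MANUAL_CLOSE=true is set in the config:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Hypothetical: close 100 ledgers with a single request. Today the
	// `count` parameter does not exist and each request closes one ledger.
	resp, err := http.Get("http://localhost:11626/manualclose?count=100")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}
```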

Explain why this feature is important

Applications need to run integration tests against the network.

Describe the solution you'd like

See above under the section titled "Explain in detail the additional functionality you would like to see in stellar-core".

Describe alternatives you've considered

  1. Manually closing ledgers one at a time and waiting for each close, resulting in really long tests.

  2. Mocking stellar-core and not running integration tests for functionality where closing many ledgers is required.

  3. Using stellar-core's offline-close cli command to close many ledgers, but this requires stopping stellar-core, which is difficult to do in an integration test environment.

Additional context

Because the network is adding state archival, which only runs after a large number of ledgers have closed, it's becoming more critical that integration tests can control the progression of ledgers.

cc @tamirms

@MonsieurNicolas
Contributor

Can you describe the actual test setup in more detail?
I am not sure why there is a need for a live network to perform this kind of testing: shouldn't just using catchup (which does not depend on a live network, just static files) work, since Horizon is supposed to be able to ingest historical data?

Also, as far as the test environment goes, afaik integration tests spin up several nodes: one validator and one ingestion/transaction-submission node (configured with captive core).

Manual close right now only works with nodes configured with RUN_STANDALONE=true, and those nodes don't work networked.

@tamirms
Contributor

tamirms commented Nov 21, 2023

@MonsieurNicolas

can you describe better the actual test setup?

Horizon ingests asset balances, which are written to storage by the stellar asset contract. There is code in horizon that accounts for when these asset balances expire and when they get restored. The expiration/restoration case is currently difficult to test because the stellar asset contract configures the lifetime of asset balances to span a month's worth of ledgers:

https://github.com/stellar/rs-soroban-env/blob/main/soroban-env-host/src/builtin_contracts/stellar_asset_contract/balance.rs#L42-L47

If we had the capability to fast-forward stellar-core by an arbitrary number of ledgers then this would no longer be an issue for us.
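To make the scale concrete, a rough calculation of what a month's worth of ledgers means, assuming the network's ~5-second target close time (the exact extension constants live in balance.rs, linked above):

```go
package main

import "fmt"

func main() {
	// Back-of-the-envelope ledger count behind a one-month TTL, assuming
	// ~5 seconds per ledger close.
	const secondsPerLedger = 5
	ledgersPerDay := 24 * 60 * 60 / secondsPerLedger // 17,280
	fmt.Println("ledgers in 30 days:", 30*ledgersPerDay)
	// Output: ledgers in 30 days: 518400
}
```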

@MonsieurNicolas
Contributor

That's not what I was asking wrt the test setup: I was asking whether you have 1 core node or multiple ones. If it's just 1, shouldn't you be able to use command line catchup instead of a live node?

@tamirms
Contributor

tamirms commented Nov 28, 2023

@MonsieurNicolas in our tests we have a core validator node running in its own docker container, and horizon runs with a captive-core watcher node. I don't know if that counts as 1 node because, although there is only 1 validator, we do have the captive-core instance.

@MonsieurNicolas
Contributor

I see, so some non-trivial topology. Manual close won't work in that context, as it only works on a standalone node (manual closing would have to be triggered on all nodes to work properly).

For most end-to-end tests, is there any reason you need this setup and can't just rely on command line catchup (which I think is supported from Horizon)?
I would imagine that the only tests where you need to submit transactions + ingest them are smoke tests to ensure that tx submission and ingestion from a live node work as expected?

@tamirms
Contributor

tamirms commented Nov 29, 2023

We need to submit transactions in order to populate the ledger with the ledger entries required by the test. For example, if we're testing the claimable balances endpoint, we need to submit a transaction to create a claimable balance in the ledger.

I suppose we could submit a transaction directly to core without going through horizon, but going through horizon is easier in terms of debugging because we can immediately check whether the transaction succeeded or failed.

Also, in a lot of our tests, we submit a transaction, assert that the current state of a ledger entry matches our expectation, then submit another transaction to modify the ledger entry and assert that the ledger entry changed according to expectation. This workflow, where you test several changes to a ledger entry within a single test, requires live ingestion. The command line catchup mode of ingestion in Horizon only ingests immutable time series data (e.g. transaction history or effects); it does not update the state of mutable ledger entries.
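A minimal sketch of that workflow's shape, with hypothetical helpers standing in for the real horizon integration-test harness:

```go
package integration

import "testing"

// Hypothetical stand-ins for the horizon integration-test harness; the
// real helper APIs differ. Each submit helper closes a ledger that
// horizon then live-ingests.
type claimableBalance struct {
	ID      string
	Claimed bool
}

func submitCreateClaimableBalance(t *testing.T) string    { panic("illustrative stub") }
func submitClaimClaimableBalance(t *testing.T, id string) { panic("illustrative stub") }
func fetchClaimableBalance(t *testing.T, id string) claimableBalance {
	panic("illustrative stub")
}

// Submit -> assert -> modify -> assert: each assertion depends on horizon
// having live-ingested the previous ledger, which is why command line
// catchup cannot stand in for a live node here.
func TestClaimableBalanceLifecycle(t *testing.T) {
	id := submitCreateClaimableBalance(t)
	if fetchClaimableBalance(t, id).Claimed {
		t.Fatal("expected balance to be unclaimed after creation")
	}
	submitClaimClaimableBalance(t, id) // mutate the same ledger entry
	if !fetchClaimableBalance(t, id).Claimed {
		t.Fatal("expected balance to be claimed after the claim op")
	}
}
```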

@MonsieurNicolas
Contributor

Sorry I missed your response -- yeah, what you are describing is fine for legacy tests.

What I am proposing is to write new tests (like the ones that depend on closing a lot of ledgers) that are decoupled into:

  1. code that generates an interesting sequence of transactions (this can span an arbitrary number of ledgers, and can potentially take a long time) -> this code uses a horizon node on the developer's dev box to generate "catchup files" for the next step
  2. code that replays the catchup files from 1 and asserts that certain things indeed happened like you expect
  3. meta generator utilities that allow mocking core's behavior in other tests, so as to test actual ingestion logic (you may already do this if you fuzz ingestion somehow?) -- see the sketch at the end of this comment

This approach would be similar to what core does today by attaching various footprints to meta data (and other host state) so as to ensure that we catch issues as early as possible in core (as opposed to in more complex integration tests).

The benefit of 3 is that you get to write your own "testing protocol rules": Horizon/Soroban-RPC are supposed to be agnostic of protocol behavior (and only care about "meta"), and this allows you to test edge cases that are probably very hard to repro using core directly. Testing the actual protocol is not the best use of time for the platform team, as the core team is supposed to do this for you.
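A sketch of what point 3 could look like, feeding hand-built meta into a hypothetical ingestion entry point (horizon's real processors have a different shape, but they consume the same xdr.LedgerCloseMeta a live core would emit):

```go
package ingesttest

import (
	"testing"

	"github.com/stellar/go/xdr"
)

// processLedger is a hypothetical ingestion entry point, standing in for
// horizon's real processors.
func processLedger(meta xdr.LedgerCloseMeta) error {
	return nil // illustrative stub
}

// Drive ingestion with synthetic meta instead of a live core, so edge
// cases (here, a ledger containing no transactions) can be constructed
// directly in the test.
func TestIngestSyntheticMeta(t *testing.T) {
	meta := xdr.LedgerCloseMeta{
		V: 0,
		V0: &xdr.LedgerCloseMetaV0{
			LedgerHeader: xdr.LedgerHeaderHistoryEntry{
				Header: xdr.LedgerHeader{LedgerSeq: 123},
			},
		},
	}
	if err := processLedger(meta); err != nil {
		t.Fatal(err)
	}
}
```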

@tamirms
Contributor

tamirms commented Dec 20, 2023

@MonsieurNicolas for step (1) wouldn't it take a really long time to generate the tx meta if we wanted to close a month's worth of ledgers (the TTL for asset balances is 1 month in the stellar asset contract), or is there a stellar-core command to close lots of ledgers?

Regarding point (3), we do have unit tests on the horizon ingestion system which run ingestion on mocked up tx meta. That allows us to test corner cases as you alluded to.

However, I disagree with:

Testing the actual protocol is not the best use of time for the platform team as the core team is supposed to do this for you.

The horizon integration tests have found a decent number of bugs in core and horizon (where horizon misconfigures core). Testing horizon with a real core is particularly useful when we're integrating with core builds for the upcoming protocol release. Here are several examples where our integration tests with core have found legitimate bugs/issues:

https://stellarfoundation.slack.com/archives/C02B04RMK/p1621545167065100?thread_ts=1621274095.241800&cid=C02B04RMK

https://stellarfoundation.slack.com/archives/C0367MGFL/p1663965045480269

#3562

https://stellarfoundation.slack.com/archives/C02U19A2A/p1685143971176019?thread_ts=1685075476.184579&cid=C02U19A2A

https://stellarfoundation.slack.com/archives/C02B04RMK/p1691064220584809

stellar/go#5035 (comment)

https://stellarfoundation.slack.com/archives/C030Z9EHVQE/p1689772536526369?thread_ts=1689707858.145119&cid=C030Z9EHVQE

https://stellarfoundation.slack.com/archives/C02B04RMK/p1694593920928249?thread_ts=1694550900.383139&cid=C02B04RMK

@MonsieurNicolas
Contributor

Bumping this as I know you're about to meet people from core.

@MonsieurNicolas for step (1) wouldn't it take a really long time to generate the tx meta if we wanted to close a month's worth of ledgers (the TTL for asset balances is 1 month in the stellar asset contract), or is there a stellar-core command to close lots of ledgers ?

Manual close in a loop would do that (see the sketch below). The time it takes to run this is not super relevant as long as it does not take too long, since we could generate this when a new core build is produced, for example.
That being said, we may want to make it possible to build arbitrary ledger state even in that workflow, so as to create interesting "families" of history. As we're looking at improving the end-to-end experience for contract developers, investing in tooling to enable this might be useful.
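A sketch of that loop, assuming Go and the default admin port; this is what the proposed count parameter would collapse into a single request:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// closeLedgers fast-forwards a standalone core node (MANUAL_CLOSE=true)
// by invoking the existing manualclose command n times.
func closeLedgers(n int) error {
	for i := 0; i < n; i++ {
		resp, err := http.Get("http://localhost:11626/manualclose")
		if err != nil {
			return err
		}
		resp.Body.Close()
	}
	return nil
}

func main() {
	// ~30 days of ledgers at a 5-second close time (see the arithmetic
	// earlier in the thread).
	if err := closeLedgers(518400); err != nil {
		log.Fatal(err)
	}
	fmt.Println("done")
}
```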

Note that this is pretty much the pattern you will have to follow when validating an architecture built on top of a "data lake".

Testing the actual protocol is not the best use of time for the platform team as the core team is supposed to do this for you.

The horizon integration tests have found a decent amount of bugs in core and horizon (where horizon misconfigures core). Testing horizon with a real core is particularly useful when we're integrating with core builds for the upcoming protocol release. Here are several examples where our integration tests with core have found legitimate bugs / issues:

I think the cost paid by the platform team in this context is just not worth it:

  • it's the core team's responsibility to write an extensive test suite. If there are problems, they should fix them, and for that they need to know the gaps instead of platform building complex infra downstream. Some of the issues you linked to were caused by the simple fact that there is no "expected meta" test in core, so if something happens there, it is guaranteed to break downstream.
  • there should be clear invariants/rules on the meta that can be enforced at runtime as well (a bit like what is done on the core side when producing it, but potentially richer based on the state you have accumulated downstream), which then allows the platform team to focus on automated testing like fuzzing of ingestion.
  • a bunch of the issues you linked to would be caught by simple smoke tests, as certain things were completely broken (not necessarily core bugs, but things broken in platform after a core change). Like I said earlier, I support having those as long as they don't try to test too many variations already covered within core (where there is nothing to learn on the platform side) or by what "step 1" above does.

In any case, I think you should sit down with the core team and discuss how you can together improve coverage and agility.
