
Test framework for restartability #1610

Closed · wants to merge 21 commits from jb/restart-tests

Conversation

@jbearer (Member) commented Jun 17, 2024

Test framework

Set up some Rust automation for tests that spin up a sequencer network and restart various combinations of nodes, checking that we recover liveness. Instantiate the framework with several combinations of nodes as outlined in https://www.notion.so/espressosys/Persistence-catchup-and-restartability-cf4ddb79df2e41a993e60e3beaa28992.

There are many things left to test here, including:

  • Testing with a valid libp2p setup
  • Testing with only libp2p and no CDN
  • Checking integrity of the DA/query service during and after restart

These can be done in follow-up work.

I considered doing this with something more dynamic like Bash or Python scripting, leaning on our existing docker-compose or process-compose infrastructure to spin up a network. I avoided this for a few reasons:

  • process-compose is annoying to script and in particular has limited capabilities for shutting down and starting up processes
  • both docker-compose and process-compose make it hard to dynamically choose the network topology
  • once the basic test infrastructure is out of the way, Rust is far easier to work with for writing new checks and assertions. For example, checking for progress is much easier when we can plug directly into the HotShot event stream than when subscribing to a stream over HTTP and parsing responses with jq (see the sketch below)
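
To make that last point concrete, here is a minimal sketch of what a progress check can look like when we consume events directly in Rust. The `Event` type is a stand-in for the real HotShot event type, and `assert_progress` is illustrative, not the actual helper in `restart_tests.rs`:

```rust
use futures::stream::{Stream, StreamExt};

/// Stand-in for the real HotShot event type.
enum Event {
    Decide { view: u64 },
    Other,
}

/// Wait until the event stream yields `n` decide events, panicking if the
/// stream ends first. Doing this over a typed stream is much simpler than
/// polling an HTTP endpoint and parsing responses with jq.
async fn assert_progress(mut events: impl Stream<Item = Event> + Unpin, n: usize) {
    let mut decides = 0;
    while decides < n {
        match events.next().await {
            Some(Event::Decide { view }) => {
                decides += 1;
                println!("decided view {view} ({decides}/{n})");
            }
            Some(Event::Other) => continue,
            None => panic!("event stream ended before {n} decides"),
        }
    }
}
```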

Storage refactor

The tests where all nodes restart together exposed an issue: a race condition where DA nodes may decide on a block but shut down before updating their query storage. After the restart, they will have lost the corresponding block payload and VID info from the in-memory HotShot data structures, and they will never get them back.

This PR refactors the way query storage is updated: instead of having separate, unsynchronized event handling tasks for updating consensus storage (e.g. storing DA proposals) and updating query storage, we now have a single task that populates consensus storage, and query storage is populated from consensus storage. This means we no longer rely on in-memory payload storage at all to populate the query service. It also makes DA certs much more meaningful, since DA votes are conditioned on successfully storing the DA proposal in consensus storage. Finally, we can now remove the PayloadStore from HotShot.
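
As a rough illustration of the refactor (all types here are toy stand-ins, not the sequencer's actual storage API), the key property is that query storage is derived from consensus storage, so the handoff can be safely replayed after a crash:

```rust
#[derive(Clone, Debug)]
struct DecidedBlock {
    view: u64,
    payload: Vec<u8>,
}

#[derive(Default)]
struct ConsensusStorage {
    decided: Vec<DecidedBlock>,
    processed: usize, // how many decided blocks have reached query storage
}

#[derive(Default)]
struct QueryStorage {
    blocks: Vec<DecidedBlock>,
}

impl ConsensusStorage {
    /// The single event handling task appends decides here first; this
    /// storage is authoritative.
    fn append_decided(&mut self, block: DecidedBlock) {
        self.decided.push(block);
    }

    /// Blocks decided but not yet moved into query storage.
    fn unprocessed(&self) -> &[DecidedBlock] {
        &self.decided[self.processed..]
    }
}

/// Populate query storage from consensus storage. Idempotent across
/// restarts: it resumes from the last processed index, so a crash between
/// a decide and the query update loses nothing.
fn sync_query_storage(consensus: &mut ConsensusStorage, query: &mut QueryStorage) {
    for block in consensus.unprocessed().to_vec() {
        query.blocks.push(block);
    }
    consensus.processed = consensus.decided.len();
}

fn main() {
    let (mut consensus, mut query) = (ConsensusStorage::default(), QueryStorage::default());
    consensus.append_decided(DecidedBlock {
        view: 1,
        payload: b"block 1".to_vec(),
    });
    // A crash here would be fine: the decide is already durable upstream.
    sync_query_storage(&mut consensus, &mut query);
    assert_eq!(query.blocks.len(), 1);
}
```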

For the SQL backend, query service population is coupled with consensus storage garbage collection, so that we can delete old data and move it to archival storage in a single atomic transaction, as sketched below.
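
A toy model of that coupling (the `SqlTx` type and table names are made up for illustration; the real logic lives in `sequencer/src/persistence/sql.rs`). The point is that the archival insert and the consensus-storage delete are committed together, so neither happens without the other:

```rust
/// Stand-in for a SQL transaction that applies its statements atomically.
struct SqlTx(Vec<String>);

impl SqlTx {
    fn execute(&mut self, sql: &str) {
        self.0.push(sql.to_string());
    }

    /// Apply all queued statements atomically (stand-in for COMMIT).
    fn commit(self) -> Result<(), String> {
        println!("COMMIT: {:?}", self.0);
        Ok(())
    }
}

/// Garbage-collect consensus storage up to `decided_view`, archiving data
/// for the query service in the same transaction so a crash can never drop
/// data that has not yet been archived.
fn collect_garbage(decided_view: u64) -> Result<(), String> {
    let mut tx = SqlTx(Vec::new());
    tx.execute(&format!(
        "INSERT INTO archive SELECT * FROM da_proposal WHERE view <= {decided_view}"
    ));
    tx.execute(&format!(
        "DELETE FROM da_proposal WHERE view <= {decided_view}"
    ));
    tx.commit()
}
```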

Key places to review

  • New test suite: sequencer/src/restart_tests.rs
  • Storage refactor: primarily in types/src/v0/traits.rs, sequencer/src/persistence.rs, sequencer/src/persistence/sql.rs

@jbearer force-pushed the jb/restart-tests branch from 6fa3af4 to e327cdf on June 17, 2024
@jbearer force-pushed the jb/restart-tests branch 3 times, most recently from 49bae2a to 56b9231, on July 1, 2024
@jbearer force-pushed the jb/restart-tests branch from 56b9231 to 3d1959b on July 9, 2024
@jbearer force-pushed the jb/restart-tests branch from 3159de4 to ec123ad on July 17, 2024
@jbearer marked this pull request as ready for review on July 17, 2024
@jbearer force-pushed the jb/restart-tests branch 2 times, most recently from aa7807d to f9461dd, on July 19, 2024
@jbearer force-pushed the jb/restart-tests branch 2 times, most recently from 9df4f81 to 4c3df54, on July 29, 2024
@jbearer requested a review from imabdulbasit as a code owner on July 29, 2024
@jbearer force-pushed the jb/restart-tests branch from 8519138 to d525824 on July 30, 2024
@jbearer (Member, Author) commented Jul 30, 2024

Alright, I've found a potential deadlock that might explain the CI timeouts.

During initialization of a node, we (sometimes) call out to the API data source. This happens when there were unprocessed leaves from before the shutdown that we now try to process: the event consumer tries to take an exclusive lock to update the data source.

Unfortunately, certain API endpoints also lock the data source (for reading) and then block on initialization completing so they can read from the sequencer context.

Thus, the potential deadlock is:

  • Node starts API server
  • Another node sends a catchup request (for chain config, frontier, or fee account)
  • Request handler locks API state and then awaits future for context initialization
  • Context initialization gets to the point where it has to process old leaves and blocks on a write lock on API state

I believe a clean solution is to separate the back-processing of old persisted leaves from the loading of consensus state, and move it to the beginning of the asynchronous task that processes future leaves. Node initialization will then no longer block on the data source lock. It also makes consensus state loading read-only; the fact that it wasn't was a code smell anyway.
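
The cycle is easy to reproduce in miniature. The sketch below (using a plain tokio `RwLock`; the sequencer's real state types differ) hangs for exactly the reason described: the read guard is held across an await that can only complete once the writer runs. The fix keeps the write-lock work out of initialization, so nothing awaited by a lock holder ever depends on initialization finishing:

```rust
use std::{sync::Arc, time::Duration};
use tokio::sync::{oneshot, RwLock};

#[tokio::main]
async fn main() {
    let state = Arc::new(RwLock::new(0u64));
    let (init_done_tx, init_done_rx) = oneshot::channel::<()>();

    // "Request handler": takes a read lock, then awaits initialization.
    let handler = tokio::spawn({
        let state = state.clone();
        async move {
            let _guard = state.read().await; // read lock held across the await
            init_done_rx.await.ok(); // waits for initialization to finish
        }
    });

    // "Initialization": back-processes old leaves, which needs a write lock.
    let init = tokio::spawn({
        let state = state.clone();
        async move {
            let mut guard = state.write().await; // queued behind the read lock
            *guard += 1; // process old leaves
            drop(guard);
            init_done_tx.send(()).ok(); // never reached: we are deadlocked
        }
    });

    // Demonstrate the hang with a timeout instead of blocking forever.
    let both = async {
        let _ = tokio::join!(handler, init);
    };
    if tokio::time::timeout(Duration::from_secs(1), both).await.is_err() {
        println!("deadlocked, as described above");
    }
}
```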

@jbearer force-pushed the jb/restart-tests branch from d525824 to 7bda75e on July 30, 2024
@jbearer (Member, Author) commented Jul 30, 2024

Fix for the above deadlock in 7bda75e, and tests are now passing!

@tbro (Contributor) left a comment

There is a lot here, but I've done as thorough a review as I am able. I think the extensive test coverage should make up the difference. I'm approving. I've made a couple of minor comments, but I leave it up to you whether we should continue discussing them in another venue.

I considered doing this with something more dynamic like Bash or Python scripting

I'm glad you didn't do that.

@@ -55,4 +55,4 @@ jobs:
       cargo build --locked --bin diff-test --release
       cargo test --locked --release --workspace --all-features --no-run
       cargo test --locked --release --workspace --all-features --verbose -- --test-threads 1 --nocapture
-    timeout-minutes: 40
+    timeout-minutes: 90
A Contributor commented:
Do we absolutely need to run restartability tests on every PR? They add significant overhead, and it feels like something people making minor changes shouldn't have to worry about. Can we filter out the restartability tests here and run them in a separate workflow that runs on merge to main?

.await
.unwrap();
}

A Contributor commented:

I think doc comments in selective places can help us understand what some of the less straightforward commands are doing. Then we don't have to think so much when we get to the explanations of how it works in the inline comments.

Suggested change
/// Restart indicated number of nodes and check that remaining
/// nodes can still make progress.

As expected, the tests where we restart >f nodes do not pass yet,
and are ignored. The others pass locally.

Return errors instead of panicking during node initialization. This is
needed for the restart tests, where initialization can sometimes fail
after a restart because the OS has not yet deallocated the libp2p port.
This necessitates a retry loop, which means all error cases need to
return an error rather than panicking.
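
A minimal sketch of such a retry loop (assuming `anyhow` for errors; `start_node` and the one-second backoff are placeholders, not the PR's actual code):

```rust
use std::time::Duration;

struct Node;

/// Placeholder for the real initialization path, which can fail transiently,
/// e.g. when the OS has not yet released the node's libp2p port.
async fn start_node() -> anyhow::Result<Node> {
    Ok(Node)
}

/// Retry initialization a bounded number of times. This only works because
/// initialization now returns errors instead of panicking.
async fn start_with_retries(retries: usize) -> anyhow::Result<Node> {
    let mut last_err = None;
    for attempt in 1..=retries {
        match start_node().await {
            Ok(node) => return Ok(node),
            Err(err) => {
                eprintln!("initialization attempt {attempt} failed: {err}");
                last_err = Some(err);
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        }
    }
    Err(last_err.expect("retries must be at least 1"))
}
```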
jbearer and others added 10 commits August 15, 2024 16:51
Previously, the database used by the query API was populated by a
completely separate event handling task from the one updating consensus
storage. This could lead to a situation where consensus storage has
already been updated with a newly decided leaf, but API storage has not,
and then the node restarts, so that consensus thinks it is on a later
leaf, but the query API has never seen and will never see this leaf, and
thus cannot make it available: a DA failure.

With this change, the query database is now populated from the consensus
storage, so that consensus storage is authoritative, and the query database
is guaranteed to always eventually reflect the status of consensus storage.
The movement of data from consensus storage to query storage is tied in with
consensus garbage collection, so that we do not delete any data until we are
sure it has been recorded in the DA database, if appropriate.

This also obsoletes the in-memory payload storage in HotShot, since we are
now able to load full payloads from storage on each decide, if available.
* Don't panic in SQL persistence `collect_garbage` when no new leaves
  are decided
* Don't fail fs persistence `load_quorum_proposals` when the proposals
  directory does not exist
* Better logging for libp2p startup
…processed

Store last processed leaf view in Postgres rather than trying to
dead reckon.
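
For illustration, storing that offset might look like the sketch below; the table name, single-row upsert, and `SqlTx` stand-in are assumptions, not the PR's actual schema:

```rust
/// Stand-in for a SQL transaction.
struct SqlTx(Vec<String>);

impl SqlTx {
    fn execute(&mut self, sql: &str) {
        self.0.push(sql.to_string());
    }
}

/// Persist the view of the last leaf whose events were processed, rather
/// than dead-reckoning it. A single-row upsert keeps the update idempotent.
fn save_last_processed_view(tx: &mut SqlTx, view: u64) {
    tx.execute(&format!(
        "INSERT INTO last_processed (id, view) VALUES (0, {view}) \
         ON CONFLICT (id) DO UPDATE SET view = excluded.view"
    ));
}
```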
These tests require non-DA nodes to store merklized state. Disabling
until we have that functionality.
@Ayiga requested a review from tbro on August 19, 2024
@jbearer (Member, Author) commented Oct 4, 2024

Superseded by #1985

@jbearer closed this on Oct 4, 2024