Test framework for restartability #1610
Conversation
Alright, I've found a potential deadlock that might explain the CI timeouts. During node initialization, we (sometimes) call out to the API data source. This happens when there were unprocessed leaves from before the shutdown that we now try to process: the event consumer tries to take an exclusive lock to update the data source. Unfortunately, certain API endpoints also lock the data source (for reading) and then block on initialization completing so they can read from the sequencer context. Thus, the potential deadlock is: initialization waits for the exclusive data source lock, which is held for reading by an API request handler, which in turn is waiting for initialization to complete.
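For illustration only, here is a minimal, self-contained sketch of that cycle using tokio's RwLock and Notify as hypothetical stand-ins (not the sequencer's actual types or code paths): a handler task holds a read lock on the data source while waiting for initialization to finish, and initialization needs the write lock before it can signal completion.

```rust
// Minimal sketch of the lock cycle described above. All names here are
// illustrative stand-ins for the real initialization and API code paths.
// Assumes tokio = { version = "1", features = ["full"] }.
use std::{sync::Arc, time::Duration};
use tokio::sync::{Notify, RwLock};

#[tokio::main]
async fn main() {
    let data_source = Arc::new(RwLock::new(0u64)); // stands in for the API data source
    let init_done = Arc::new(Notify::new()); // stands in for "sequencer context is ready"

    // "API handler": holds a read lock on the data source while it waits for
    // initialization to complete so it can read from the sequencer context.
    let handler = {
        let data_source = data_source.clone();
        let init_done = init_done.clone();
        tokio::spawn(async move {
            let _read_guard = data_source.read().await;
            init_done.notified().await; // never fires: initialization is stuck below
        })
    };

    // "Node initialization": processes leaves left over from before the shutdown,
    // which requires an exclusive lock on the data source.
    let init = {
        let data_source = data_source.clone();
        tokio::spawn(async move {
            // Small delay so the handler reliably grabs its read lock first.
            tokio::time::sleep(Duration::from_millis(100)).await;
            let mut write_guard = data_source.write().await; // blocks on the read guard
            *write_guard += 1;
            init_done.notify_waiters();
        })
    };

    // Neither task can make progress; demonstrate the deadlock with a timeout.
    let both = async { let _ = tokio::join!(handler, init); };
    assert!(tokio::time::timeout(Duration::from_secs(1), both).await.is_err());
    println!("deadlock reproduced: initialization and the API handler wait on each other");
}
```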
I believe a clean solution to this is to separate the back-processing of old persisted leaves from the loading of consensus state, and move it instead to the beginning of the asynchronous task which processes future leaves. This means that node initialization will no longer block on the data source lock. This will also make consensus state loading read-only; the fact that it wasn't was a code smell anyway.
Fix for the above deadlock in 7bda75e, and tests are now passing!
There is a lot here, but I've done as thorough a review as I am able. I think the extensive test coverage should make up the difference. I'm approving. I've made a couple of minor comments, but I leave it up to you whether to continue discussing them in another venue.
I considered doing this with something more dynamic like Bash or Python scripting
I'm glad you didn't do that.
.github/workflows/test.yml (outdated)

@@ -55,4 +55,4 @@ jobs:
      cargo build --locked --bin diff-test --release
      cargo test --locked --release --workspace --all-features --no-run
      cargo test --locked --release --workspace --all-features --verbose -- --test-threads 1 --nocapture
-     timeout-minutes: 40
+     timeout-minutes: 90
Do we absolutely need to run restartability tests on every PR? It adds significant overhead, and it feels like something people making minor changes shouldn't have to worry about. Can we filter out the restartability tests here and run them in a separate workflow that runs on merge to main?
        .await
        .unwrap();
}
I think doc comments in select places can help us understand what some of the less straightforward commands are doing. Then we don't have to think so much when we get to the explanations of how it works in the inline comments.
Suggested doc comment:

/// Restart indicated number of nodes and check that remaining
/// nodes can still make progress.
Set up some Rust automation for tests that spin up a sequencer network and restart various combinations of nodes, checking that we recover liveness. Instantiate the framework with several combinations of nodes as outlined in https://www.notion.so/espressosys/Persistence-catchup-and-restartability-cf4ddb79df2e41a993e60e3beaa28992. As expected, the tests where we restart >f nodes do not pass yet, and are ignored. The others pass locally.

There are many things left to test here, including:
* Testing with a valid libp2p setup
* Testing with _only_ libp2p and no CDN
* Checking integrity of the DA/query service during and after restart

But this is a pretty good starting point.

I considered doing this with something more dynamic like Bash or Python scripting, leaning on our existing docker-compose or process-compose infrastructure to spin up a network. I avoided this for a few reasons:
* process-compose is annoying to script and in particular has limited capabilities for shutting down and starting up processes
* both docker-compose and process-compose make it hard to dynamically choose the network topology
* once the basic test infrastructure is out of the way, Rust is far easier to work with for writing new checks and assertions. For example, checking for progress is way easier when we can plug directly into the HotShot event stream, vs subscribing to some stream via HTTP and parsing responses with jq.
This is needed for the restart tests, where initialization can sometimes fail after a restart due to the libp2p port not being deallocated by the OS quickly enough. This necessitates a retry loop, which means all error cases need to return an error rather than panicking.
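As a rough sketch of the kind of retry loop this enables (init_node, NodeConfig, and Node are hypothetical stand-ins, and anyhow/tokio are assumed dependencies, not necessarily what the sequencer actually uses):

```rust
use std::time::Duration;
use tokio::time::sleep;

// Hypothetical stand-ins for the real initialization API.
struct NodeConfig;
struct Node;

async fn init_node(_cfg: &NodeConfig) -> anyhow::Result<Node> {
    // In the real code this would bind the libp2p port, open storage, etc.,
    // returning an error (rather than panicking) if, say, the port from the
    // previous run has not yet been released by the OS.
    Ok(Node)
}

/// Retry initialization until it succeeds, since transient failures are
/// expected immediately after a restart.
async fn init_with_retries(cfg: &NodeConfig) -> Node {
    loop {
        match init_node(cfg).await {
            Ok(node) => return node,
            Err(err) => {
                eprintln!("initialization failed, retrying: {err:#}");
                sleep(Duration::from_secs(1)).await;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let _node = init_with_retries(&NodeConfig).await;
}
```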
Previously, the database used by the query API was populated from a completely separate event handling task than the consensus storage. This could lead to a situation where consensus storage has already been updated with a newly decided leaf, but API storage has not, and then the node restarts, so that consensus thinks it is on a later leaf, but the query API has never seen this leaf and never will, and thus cannot make it available: a DA failure.

With this change, the query database is now populated from the consensus storage, so that consensus storage is authoritative and the query database is guaranteed to always eventually reflect the status of consensus storage. The movement of data from consensus storage to query storage is tied in with consensus garbage collection, so that we do not delete any data until we are sure it has been recorded in the DA database, if appropriate.

This also obsoletes the in-memory payload storage in HotShot, since we are now able to load full payloads from storage on each decide, if available.
…processed

Store last processed leaf view in Postgres rather than trying to dead reckon.
These tests require non-DA nodes to store merklized state. Disabling until we have that functionality.
Superseded by #1985
Test framework
Set up some Rust automation for tests that spin up a sequencer network and restart various combinations of nodes, checking that we recover liveness. Instantiate the framework with several combinations of nodes as outlined in https://www.notion.so/espressosys/Persistence-catchup-and-restartability-cf4ddb79df2e41a993e60e3beaa28992.
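For a sense of the shape such a test takes, here is a rough sketch; TestNetwork, its methods, and the node counts are hypothetical illustrations, not the actual API in sequencer/src/restart_tests.rs.

```rust
// Illustrative sketch only; the real framework lives in sequencer/src/restart_tests.rs.
struct TestNetwork {
    // Handles to in-process nodes, the CDN, etc. would live here.
}

impl TestNetwork {
    /// Spin up a network with the given number of DA and regular nodes.
    async fn start(_da_nodes: usize, _regular_nodes: usize) -> Self {
        Self {}
    }

    /// Kill the given combination of nodes and start them again from their
    /// persisted state.
    async fn restart(&mut self, _da_nodes: usize, _regular_nodes: usize) {}

    /// Subscribe to the HotShot event stream and wait until new views are decided.
    async fn check_progress(&self) {}
}

#[tokio::test]
async fn restart_one_regular_node() {
    let mut network = TestNetwork::start(2, 3).await;
    network.check_progress().await;

    // Restarting at most f nodes should not cost us liveness.
    network.restart(0, 1).await;
    network.check_progress().await;
}
```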
There are many things left to test here, including:
* Testing with a valid libp2p setup
* Testing with _only_ libp2p and no CDN
* Checking integrity of the DA/query service during and after restart

These can be done in follow-up work.
I considered doing this with something more dynamic like Bash or Python scripting, leaning on our existing docker-compose or process-compose infrastructure to spin up a network. I avoided this for a few reasons:
* process-compose is annoying to script and in particular has limited capabilities for shutting down and starting up processes
* both docker-compose and process-compose make it hard to dynamically choose the network topology
* once the basic test infrastructure is out of the way, Rust is far easier to work with for writing new checks and assertions. For example, checking for progress is way easier when we can plug directly into the HotShot event stream, vs subscribing to some stream via HTTP and parsing responses with jq.
Storage refactor
The tests where all nodes restart together exposed an issue. There was a race condition where DA nodes may decide on a block, but shut down before they have updated their query storage. After the restart, they will have lost the corresponding block payload and VID info from the in-memory HotShot data structures, and they will never get it back.
This PR refactors the way query storage is updated: instead of having separate, unsynchronized event handling tasks for updating consensus storage (e.g. storing DA proposals) and updating query storage, we now have just a single task for populating consensus storage, and query storage is populated from consensus storage. This means we no longer rely at all on in-memory payload storage in order to populate our query service. This makes DA certs much more meaningful, since DA votes are conditioned on successfully storing the DA proposal in consensus storage. We can also now remove the PayloadStore from HotShot.

For the SQL backend, query service population is coupled with consensus storage garbage collection, so that we can delete old data and collect it/move it to archival storage in an atomic transaction.
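As a rough illustration of that flow (trait and method names here are hypothetical, not the actual interfaces in types/src/v0/traits.rs or the SQL persistence layer):

```rust
// Hypothetical sketch of the data flow described above.
use anyhow::Result;

struct DecidedLeaf { /* leaf, payload, VID common, ... */ }

/// Consensus storage is authoritative: DA proposals, VID shares, and decided
/// leaves land here first (DA votes are only cast once this write succeeds).
trait ConsensusStorage {
    fn decided_data_up_to(&self, view: u64) -> Result<Vec<DecidedLeaf>>;
    fn garbage_collect(&mut self, view: u64) -> Result<()>;
}

/// Query (archival) storage is populated *from* consensus storage, never from
/// in-memory HotShot state, so a crash between "decide" and "archive" cannot
/// lose payloads.
trait QueryStorage {
    fn archive(&mut self, leaves: &[DecidedLeaf]) -> Result<()>;
}

/// Move decided data into the query database and only then delete it from
/// consensus storage; for the SQL backend both steps would share one transaction.
fn collect_garbage(
    consensus: &mut impl ConsensusStorage,
    query: &mut impl QueryStorage,
    decided_view: u64,
) -> Result<()> {
    let leaves = consensus.decided_data_up_to(decided_view)?;
    query.archive(&leaves)?;
    consensus.garbage_collect(decided_view)?;
    Ok(())
}
```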
Key places to review:
* sequencer/src/restart_tests.rs
* types/src/v0/traits.rs, sequencer/src/persistence.rs, sequencer/src/persistence/sql.rs