Test framework for restartability #1610
Conversation
Alright, I've found a potential deadlock that might explain the CI timeouts. During node initialization, we (sometimes) call out to the API data source. This happens when there were unprocessed leaves from before the shutdown that we now try to process: the event consumer tries to take an exclusive lock to update the data source. Unfortunately, certain API endpoints also lock the data source (for reading) and then block on initialization completing so they can read from the sequencer context. Thus, the potential deadlock is: initialization waits for the exclusive data source lock, which is held for reading by an API request handler, which in turn is waiting for initialization to complete.
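For illustration only, here is a minimal, self-contained sketch of that cycle using tokio's RwLock and Notify as hypothetical stand-ins (not the sequencer's actual types or code paths): a handler task holds a read lock on the data source while waiting for initialization to finish, and initialization needs the write lock before it can signal completion.

```rust
// Minimal sketch of the lock cycle described above. All names here are
// illustrative stand-ins for the real initialization and API code paths.
// Assumes tokio = { version = "1", features = ["full"] }.
use std::{sync::Arc, time::Duration};
use tokio::sync::{Notify, RwLock};

#[tokio::main]
async fn main() {
    let data_source = Arc::new(RwLock::new(0u64)); // stands in for the API data source
    let init_done = Arc::new(Notify::new()); // stands in for "sequencer context is ready"

    // "API handler": holds a read lock on the data source while it waits for
    // initialization to complete so it can read from the sequencer context.
    let handler = {
        let data_source = data_source.clone();
        let init_done = init_done.clone();
        tokio::spawn(async move {
            let _read_guard = data_source.read().await;
            init_done.notified().await; // never fires: initialization is stuck below
        })
    };

    // "Node initialization": processes leaves left over from before the shutdown,
    // which requires an exclusive lock on the data source.
    let init = {
        let data_source = data_source.clone();
        tokio::spawn(async move {
            // Small delay so the handler reliably grabs its read lock first.
            tokio::time::sleep(Duration::from_millis(100)).await;
            let mut write_guard = data_source.write().await; // blocks on the read guard
            *write_guard += 1;
            init_done.notify_waiters();
        })
    };

    // Neither task can make progress; demonstrate the deadlock with a timeout.
    let both = async { let _ = tokio::join!(handler, init); };
    assert!(tokio::time::timeout(Duration::from_secs(1), both).await.is_err());
    println!("deadlock reproduced: initialization and the API handler wait on each other");
}
```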
I believe a clean solution to this is to separate the back-processing of old persisted leaves from the loading of consensus state, and move it instead to the beginning of the asynchronous task which processes future leaves. This means that node initialization will no longer block on the data source lock. This will also make consensus state loading read-only; the fact that it wasn't was a code smell anyway.
Fix for the above deadlock in 7bda75e, and tests are now passing!
There is a lot here, but I've done as thorough a review as I am able. I think the extensive test coverage should make up the difference. I'm approving. I've made a couple of minor comments, but I leave it up to you whether to continue discussing them in another venue.
I considered doing this with something more dynamic like Bash or Python scripting
I'm glad you didn't do that.
.github/workflows/test.yml (outdated)

@@ -55,4 +55,4 @@ jobs:
      cargo build --locked --bin diff-test --release
      cargo test --locked --release --workspace --all-features --no-run
      cargo test --locked --release --workspace --all-features --verbose -- --test-threads 1 --nocapture
-     timeout-minutes: 40
+     timeout-minutes: 90
Do we absolutely need to run restartability tests on every PR? It adds significant overhead, and it feels like something people making minor changes shouldn't have to worry about. Can we filter out the restartability tests here and run them in a separate workflow that runs on merge to main?
        .await
        .unwrap();
}
I think doc comments in select places can help us understand what some of the less straightforward commands are doing. Then we don't have to think so much when we get to the explanations of how it works in the inline comments.
Suggested doc comment:

/// Restart indicated number of nodes and check that remaining
/// nodes can still make progress.
Set up some Rust automation for tests that spin up a sequencer network and restart various combinations of nodes, checking that we recover liveness. Instantiate the framework with several combinations of nodes as outlined in https://www.notion.so/espressosys/Persistence-catchup-and-restartability-cf4ddb79df2e41a993e60e3beaa28992. As expected, the tests where we restart >f nodes do not pass yet, and are ignored. The others pass locally.

There are many things left to test here, including:
* Testing with a valid libp2p setup
* Testing with _only_ libp2p and no CDN
* Checking integrity of the DA/query service during and after restart

But this is a pretty good starting point.

I considered doing this with something more dynamic like Bash or Python scripting, leaning on our existing docker-compose or process-compose infrastructure to spin up a network. I avoided this for a few reasons:
* process-compose is annoying to script and in particular has limited capabilities for shutting down and starting up processes
* both docker-compose and process-compose make it hard to dynamically choose the network topology
* once the basic test infrastructure is out of the way, Rust is far easier to work with for writing new checks and assertions. For example, checking for progress is way easier when we can plug directly into the HotShot event stream, vs subscribing to some stream via HTTP and parsing responses with jq.
This is needed for the restart tests, where initialization can sometimes fail after a restart due to the libp2p port not being deallocated by the OS quickly enough. This necessitates a retry loop, which means all error cases need to return an error rather than panicking.
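As a rough sketch of the kind of retry loop this enables (init_node, NodeConfig, and Node are hypothetical stand-ins, and anyhow/tokio are assumed dependencies, not necessarily what the sequencer actually uses):

```rust
use std::time::Duration;
use tokio::time::sleep;

// Hypothetical stand-ins for the real initialization API.
struct NodeConfig;
struct Node;

async fn init_node(_cfg: &NodeConfig) -> anyhow::Result<Node> {
    // In the real code this would bind the libp2p port, open storage, etc.,
    // returning an error (rather than panicking) if, say, the port from the
    // previous run has not yet been released by the OS.
    Ok(Node)
}

/// Retry initialization until it succeeds, since transient failures are
/// expected immediately after a restart.
async fn init_with_retries(cfg: &NodeConfig) -> Node {
    loop {
        match init_node(cfg).await {
            Ok(node) => return node,
            Err(err) => {
                eprintln!("initialization failed, retrying: {err:#}");
                sleep(Duration::from_secs(1)).await;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let _node = init_with_retries(&NodeConfig).await;
}
```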
Previously, the database used by the query API was populated from a completely separate event handling task than the consensus storage. This could lead to a situation where consensus storage has already been updated with a newly decided leaf, but API storage has not, and then the node restarts, so that consensus thinks it is on a later leaf, but the query API has never seen this leaf and never will, and thus cannot make it available: a DA failure.

With this change, the query database is now populated from the consensus storage, so that consensus storage is authoritative and the query database is guaranteed to always eventually reflect the status of consensus storage. The movement of data from consensus storage to query storage is tied in with consensus garbage collection, so that we do not delete any data until we are sure it has been recorded in the DA database, if appropriate.

This also obsoletes the in-memory payload storage in HotShot, since we are now able to load full payloads from storage on each decide, if available.
…processed

Store last processed leaf view in Postgres rather than trying to dead reckon.
These tests require non-DA nodes to store merklized state. Disabling until we have that functionality.
Superseded by #1985
Test framework
Set up some Rust automation for tests that spin up a sequencer network and restart various combinations of nodes, checking that we recover liveness. Instantiate the framework with several combinations of nodes as outlined in https://www.notion.so/espressosys/Persistence-catchup-and-restartability-cf4ddb79df2e41a993e60e3beaa28992.
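For a sense of the shape such a test takes, here is a rough sketch; TestNetwork, its methods, and the node counts are hypothetical illustrations, not the actual API in sequencer/src/restart_tests.rs.

```rust
// Illustrative sketch only; the real framework lives in sequencer/src/restart_tests.rs.
struct TestNetwork {
    // Handles to in-process nodes, the CDN, etc. would live here.
}

impl TestNetwork {
    /// Spin up a network with the given number of DA and regular nodes.
    async fn start(_da_nodes: usize, _regular_nodes: usize) -> Self {
        Self {}
    }

    /// Kill the given combination of nodes and start them again from their
    /// persisted state.
    async fn restart(&mut self, _da_nodes: usize, _regular_nodes: usize) {}

    /// Subscribe to the HotShot event stream and wait until new views are decided.
    async fn check_progress(&self) {}
}

#[tokio::test]
async fn restart_one_regular_node() {
    let mut network = TestNetwork::start(2, 3).await;
    network.check_progress().await;

    // Restarting at most f nodes should not cost us liveness.
    network.restart(0, 1).await;
    network.check_progress().await;
}
```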
There are many things left to test here, including:
* Testing with a valid libp2p setup
* Testing with _only_ libp2p and no CDN
* Checking integrity of the DA/query service during and after restart

These can be done in follow-up work.
I considered doing this with something more dynamic like Bash or Python scripting, leaning on our existing docker-compose or process-compose infrastructure to spin up a network. I avoided this for a few reasons:
* process-compose is annoying to script and in particular has limited capabilities for shutting down and starting up processes
* both docker-compose and process-compose make it hard to dynamically choose the network topology
* once the basic test infrastructure is out of the way, Rust is far easier to work with for writing new checks and assertions. For example, checking for progress is way easier when we can plug directly into the HotShot event stream, vs subscribing to some stream via HTTP and parsing responses with jq.
Storage refactor
The tests where all nodes restart together exposed an issue. There was a race condition where DA nodes may decide on a block, but shut down before they have updated their query storage. After the restart, they will have lost the corresponding block payload and VID info from the in-memory HotShot data structures, and they will never get it back.
This PR refactors the way query storage is updated: instead of having separate, unsynchronized event handling tasks for updating consensus storage (e.g. storing DA proposals) and updating query storage, we now have just a single task for populating consensus storage, and query storage is populated from consensus storage. This means we no longer rely at all on in-memory payload storage in order to populate our query service. This makes DA certs much more meaningful, since DA votes are conditioned on successfully storing the DA proposal in consensus storage. We can also now remove the PayloadStore from HotShot.

For the SQL backend, query service population is coupled with consensus storage garbage collection, so that we can delete old data and collect it/move it to archival storage in an atomic transaction.
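As a rough illustration of that flow (trait and method names here are hypothetical, not the actual interfaces in types/src/v0/traits.rs or the SQL persistence layer):

```rust
// Hypothetical sketch of the data flow described above.
use anyhow::Result;

struct DecidedLeaf { /* leaf, payload, VID common, ... */ }

/// Consensus storage is authoritative: DA proposals, VID shares, and decided
/// leaves land here first (DA votes are only cast once this write succeeds).
trait ConsensusStorage {
    fn decided_data_up_to(&self, view: u64) -> Result<Vec<DecidedLeaf>>;
    fn garbage_collect(&mut self, view: u64) -> Result<()>;
}

/// Query (archival) storage is populated *from* consensus storage, never from
/// in-memory HotShot state, so a crash between "decide" and "archive" cannot
/// lose payloads.
trait QueryStorage {
    fn archive(&mut self, leaves: &[DecidedLeaf]) -> Result<()>;
}

/// Move decided data into the query database and only then delete it from
/// consensus storage; for the SQL backend both steps would share one transaction.
fn collect_garbage(
    consensus: &mut impl ConsensusStorage,
    query: &mut impl QueryStorage,
    decided_view: u64,
) -> Result<()> {
    let leaves = consensus.decided_data_up_to(decided_view)?;
    query.archive(&leaves)?;
    consensus.garbage_collect(decided_view)?;
    Ok(())
}
```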
Key places to review:
* sequencer/src/restart_tests.rs
* types/src/v0/traits.rs, sequencer/src/persistence.rs, sequencer/src/persistence/sql.rs