Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix fee catchup after restart #2160

Merged
merged 23 commits into from
Oct 22, 2024
Merged

Fix fee catchup after restart #2160

merged 23 commits into from
Oct 22, 2024

Conversation

jbearer
Copy link
Member

@jbearer jbearer commented Oct 10, 2024

This PR reproduces and will fix an issue discovered in staging, in which, after a staggered restart of all nodes, the network has forgotten the fee state for all accounts except those which actively produced blocks during the restart. Since the state is needed for undecided blocks, it cannot directly be obtained from persisted merklized state, which is only available for decided blocks.

This PR:

  • Adds a restart-style test reproducing the issue
  • Adds a new catchup endpoint to fetch multiple accounts at once, for performance
  • Modifies SQL-based fee account catchup to reconstruct fee state by searching back to the last decided state and replaying headers
  • Caches results of successful reconstruction in HotShot's validated state map
  • Decouples storage from API layer by creating CatchupStorage trait, distinct from CatchupDataSource. This allows us to get more information when querying storage, such as the leaf corresponding to fetched account state, which in turn ensures we are always able to cache fetched state in the HotShot view map
  • Adds a background task to fetch missing quorum proposals that are parents of proposals we have. This ensures that every node will eventually have every proposal in a chain between valid proposals and the anchor, which is required for replay to work

Key places to review

This test reproduces an issue discovered in staging, in which, after
a staggered restart of all nodes, the network has forgotten the fee
state for all accounts except those which actively produced blocks
during the restart. Since the state is needed for undecided blocks,
it cannot directly be obtained from persisted merklized state, which
is only available for decided blocks.
@jbearer jbearer force-pushed the jb/state-catchup-2 branch from 79dc84c to ef41777 Compare October 10, 2024 21:43
@jbearer jbearer force-pushed the jb/state-catchup-2 branch from 79d17b9 to d56be55 Compare October 11, 2024 00:35
@jbearer jbearer requested a review from bfish713 October 11, 2024 16:02
Copy link
Contributor

@bfish713 bfish713 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the code which uses the new request_proposal fn provided by Hotshot and it looks correct to me

During experimentation I changed the restart tests from starting up
all nodes in parallel to one at a time. This broke the no-CDN tests
because, in some cases, you need multiple nodes to restart before
libp2p can become ready, but this blocks the initialization of the
first restarted node.

Switched it back to starting up in parallel and things are working
again.
@jbearer
Copy link
Member Author

jbearer commented Oct 11, 2024

@bfish713 can you also look at the state map update during catchup? (This basically acts as a cache for successful catchup requests)

jbearer added a commit that referenced this pull request Oct 14, 2024
This is a backport of #2160 to release-gambit. It is a separate PR
because it is not a clean cherry-pick. I recommend reviewing the `main`
PR first, and then you can just look at this one to convince yourself
it's doing the same thing.
@bfish713
Copy link
Contributor

@bfish713 can you also look at the state map update during catchup? (This basically acts as a cache for successful catchup requests)

Took a look and that all looks fine to me

jbearer and others added 2 commits October 17, 2024 15:25
The simplest, most robust change was to make the task a flat retry
loop, instead of having a nested retry loop for the fetch itself,
and to update the anchor view each iteration of the loop, so that
the loop exits if we reach a decide and the anchor view becomes
more recent than the view we are fetching.
* Enable frontier catchup from storage

* Prevent recursive state reconstruction

* Prefetch accounts used for paying fees during state reconstruction
@jbearer jbearer requested a review from bfish713 October 17, 2024 23:38
@jbearer jbearer marked this pull request as ready for review October 22, 2024 17:05
@jbearer jbearer merged commit 3563c90 into main Oct 22, 2024
13 checks passed
@jbearer jbearer deleted the jb/state-catchup-2 branch October 22, 2024 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants