Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix(node): altda failover to ethda should keep finalizing l2 chain #12845

Open
wants to merge 5 commits into
base: develop
Choose a base branch
from

Conversation

samlaf
Copy link
Contributor

@samlaf samlaf commented Nov 6, 2024

Description

Wenwei from Polymer raised this bug with us.

We are currently in the process of adding failover to altda (see this; batcher PR coming soon), and realized that when failover happens, the L2 finalized head completely stalls.

This PR consists of 2 commits:

  1. bb6be8d: test to show the behavior (fails on this commit)
  2. 672f820: fix damgr so that finalizedHead can keep advancing even after failover

Tests

First commit (see above) contains test that shows the buggy behavior. To run:

git checkout bb6be8d
go test -run ^TestAltDA_FinalizationAfterEthDAFailover$ github.com/ethereum-optimism/optimism/op-e2e/actions/altda

Fix

The problem with the damgr as it is is that

d.finalizedHead = d.state.lastPrunedCommitment

is always run no matter what. This means that finalization is always driven by altda commitments, but after failover, damgr stops seeing the commitments (because they arent altda commitments anymore). The fix in the second commit is to let the finalization be driven by L1 finalization when there are no altda commitments managed by the damgr.

NOTE: I don't fully understand the subtleties of the damgr and derivation pipeline interactions, so might be doing something wrong here... but seems sound to me, especially since it passes the test.

Additional Context
The test is a bit ugly currently (see TODO comment below). Someone with better understanding of op-node and event processing order would surely be able to help me clean this up. I use this patch for debugging event processing, in case this is of use to anyone:

diff --git a/op-e2e/actions/altda/altda_test.go b/op-e2e/actions/altda/altda_test.go
index 9d60f2fee..c6ef361d7 100644
--- a/op-e2e/actions/altda/altda_test.go
+++ b/op-e2e/actions/altda/altda_test.go
@@ -194,7 +194,7 @@ func (a *L2AltDA) ActNewL2Tx(t helpers.Testing) {
 //
 // 17 makes sense because challengeWindow=16 and we create 1 extra block before that,
 // and 204 L2blocks = 17 L1blocks * 12 L2blocks/L1block (L1blocktime=12s, L2blocktime=1s)
-func (a *L2AltDA) ActNewL2TxFinalized(t helpers.Testing) {
+func (a *L2AltDA) ActNewL2TxFinalized(t helpers.Testing, logEvents ...bool) {
 	// Include a new l2 batcher transaction, submitting an input commitment to the l1.
 	a.ActNewL2Tx(t)
 	// Create ChallengeWindow empty blocks so the above batcher blocks can finalize (can't be challenged anymore)
@@ -203,7 +203,20 @@ func (a *L2AltDA) ActNewL2TxFinalized(t helpers.Testing) {
 	// TODO: understand why we need to drain the pipeline before AND after actL1Finalized
 	a.sequencer.ActL2PipelineFull(t)
 	a.ActL1Finalized(t)
-	a.sequencer.ActL2PipelineFull(t)
+	if logEvents == nil {
+		a.sequencer.ActL2PipelineFull(t)
+	} else {
+		// Log all events until the end of the pipeline
+		count := 0
+		a.sequencer.ActL2EventsUntil(t, func(ev event.Event) bool {
+			count++
+			a.log.Info("new event", "event", ev, "count", count)
+			if count == 100 {
+				return true
+			}
+			return false
+		}, 100, false)
+	}
 
 	// Uncomment the below code to observe the behavior described in the TODO above
 	// syncStatus := a.sequencer.SyncStatus()
@@ -680,5 +693,12 @@ func TestAltDA_FinalizationAfterEthDAFailover(gt *testing.T) {
 		ssAfter := harness.sequencer.SyncStatus()
 		// Even after failover, the finalized head should continue advancing normally
 		require.Equal(t, ssBefore.FinalizedL2.Number+diffL2Blocks, ssAfter.FinalizedL2.Number)
+
+		harness.log.Info("Sync status before",
+			"unsafeL1", ssBefore.HeadL1.Number, "safeL1", ssBefore.SafeL1.Number, "finalizedL1", ssBefore.FinalizedL1.Number,
+			"unsafeL2", ssBefore.UnsafeL2.Number, "safeL2", ssBefore.SafeL2.Number, "finalizedL2", ssBefore.FinalizedL2.Number)
+		harness.log.Info("Sync status after",
+			"unsafeL1", ssAfter.HeadL1.Number, "safeL1", ssAfter.SafeL1.Number, "finalizedL1", ssAfter.FinalizedL1.Number,
+			"unsafeL2", ssAfter.UnsafeL2.Number, "safeL2", ssAfter.SafeL2.Number, "finalizedL2", ssAfter.FinalizedL2.Number)
 	}
 }

… after failover to ethda

Currently it does not, as shown by the test TestAltDA_FinalizationAfterEthDAFailover failing
Weiwei from Polymer found this bug. He proposed a solution. This is an alternative solution which seems simpler, but not 100% of its soundness.
Comment on lines 674 to 679
if i == 0 {
// TODO: figure out why this is needed
// I think it's because the L1 driven finalizedHead is set to L1FinalizedHead-ChallengeWindow (see damgr.go updateFinalizedFromL1),
// so it trails behind by an extra challenge_window when we switch over to ethDA.
harness.ActNewL2TxFinalized(t)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: this felt weird to me, but maybe its expected and fine?

Copy link

@alfredo-stonk alfredo-stonk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

// If a commitment was pruned, set the finalized head to that commitment's inclusion block
// When no commitments are left to be pruned (one example is if we have failed over to ethda)
// then updateFinalizedFromL1 becomes the main driver of the finalized head.
// Note that updateFinalizedFromL1 is only called when len(d.state.commitments) == 0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ... only called when d.state.NoCommitments() is true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 646 to 647

func TestAltDA_FinalizationAfterEthDAFailover(gt *testing.T) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, this test case covers ethDA -> altDA. Can we also have a similar case ethDA -> altDA -> ethDA, simulating a temp altDA failure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done: ce0267b
I don't fully understand the finalization behavior. Look at the weird if i==0 cases. Would appreciate help figuring out whether this behavior is really what we want, and if so, how to best explain it in the comment/assert logic.

// then updateFinalizedFromL1 becomes the main driver of the finalized head.
// Note that updateFinalizedFromL1 is only called when len(d.state.commitments) == 0
var zero eth.L1BlockRef
if lastPrunedCommIncBlock != zero {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use zero value directly, to be consistent with other zero value cases in the repo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len(d.state.commitments) == 0 => d.state.NoCommitments() is true.
@samlaf samlaf requested review from a team as code owners November 18, 2024 10:25
@samlaf samlaf requested a review from geoknee November 18, 2024 10:25
@samlaf samlaf changed the title Fix: altda failover to ethda should keep finalizing l2 chain Fix(node): altda failover to ethda should keep finalizing l2 chain Nov 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants