Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: EigenDA failover #46

Merged
merged 8 commits into from
Nov 20, 2024
Merged

Conversation

epociask
Copy link
Collaborator

@epociask epociask commented Nov 2, 2024

TLDR: Added failover from ETH DA to EigenDA in the event of detected 503 or service unavailable errors.

arbnode/batch_poster.go Show resolved Hide resolved
@@ -125,6 +125,7 @@ type BatchPoster struct {
batchReverted atomic.Bool // indicates whether data poster batch was reverted
nextRevertCheckBlock int64 // the last parent block scanned for reverting batches
postedFirstBatch bool // indicates if batch poster has posted the first batch
failoverETHDA bool // indicates if batch poster should failover to ETHDA
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add more context here, like "when a failover signal is received from proxy". Or "indicates whether batch posted is allowed to failover" or something

@@ -148,6 +149,7 @@ type BatchPosterDangerousConfig struct {
type BatchPosterConfig struct {
Enable bool `koanf:"enable"`
DisableDapFallbackStoreDataOnChain bool `koanf:"disable-dap-fallback-store-data-on-chain" reload:"hot"`
EnableEigenDAFailover bool `koanf:"eigenda-failover-to-anytrust" reload:"hot"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh they use koanf, cool! Been meaning to give it a try. Do you like it?

Also check the lint issue, the flag name and tag are different.

@@ -1241,7 +1253,7 @@ func (b *BatchPoster) maybePostSequencerBatch(ctx context.Context) (bool, error)
}

var useEigenDA bool
if b.eigenDAWriter != nil {
if b.eigenDAWriter != nil && !b.failoverETHDA {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment explaining the logic here. Why are you setting useEigenDA to true only when we don't failover?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we still want to use eigenda until we failover?

arbnode/node.go Show resolved Hide resolved
arbnode/node.go Outdated
return nil, errors.New("eigenDA and anytrust cannot both be enabled without EnableEigenDAFailover=true in batch poster config")
}

if config.EigenDA.Enable { // anytrust is enabled as an EigenDA failover
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can I know that anytrust is enabled here? if just seems to check that eigenda is enabled

Comment on lines 1485 to 1486
// to update an internal field and retrigger the poster's event loop. Since this update should be down across batch posters on a distributed cluster, we
// must also
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: down -> done

also comment unfinished

arbnode/batch_poster.go Show resolved Hide resolved
Comment on lines +69 to +71
defer func() {
cancel()
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
defer func() {
cancel()
}()
defer cancel()

@samlaf samlaf marked this pull request as draft November 4, 2024 12:20
@epociask epociask marked this pull request as ready for review November 7, 2024 20:31
@epociask epociask requested a review from samlaf November 9, 2024 04:18
@samlaf samlaf merged commit 48843f8 into eigenda-v3.2.1 Nov 20, 2024
8 checks passed
@samlaf
Copy link
Contributor

samlaf commented Nov 20, 2024

@epociask my baaaddddd..... :( :(
Was doing work on a bunch of proxy PRs and merging them, and merged this one as well by mistake.

Copy link
Contributor

@samlaf samlaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some nits

batchReverted atomic.Bool // indicates whether data poster batch was reverted
nextRevertCheckBlock int64 // the last parent block scanned for reverting batches
postedFirstBatch bool // indicates if batch poster has posted the first batch
eigenDAFailoverToETHDA bool // indicates if batch poster should failover to ETHDA
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eigenDAFailoverToETHDA sounds like a config variable and not a state variable. Consider renaming to failedOverToETHDA or something more "stateful". Right now it sounds like a config flag that just failover is enabled.

Comment on lines +1479 to +1486
if err != nil && errors.Is(err, eigenda.SvcUnavailableErr) && b.config().EnableEigenDAFailover && b.dapWriter != nil { // Failover to anytrust commitee if enabled
log.Error("EigenDA service is unavailable, failing over to any trust mode")
b.building.useEigenDA = false
failOver = true
}

if err != nil && errors.Is(err, eigenda.SvcUnavailableErr) && b.config().EnableEigenDAFailover && b.dapWriter == nil { // Failover to ETH DA if enabled
// when failing over to ETHDA (i.e 4844, calldata), we may need to re-encode the batch. To do this in compliance with the existing code, it's easiest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe refactor to have an outer if with the 3 same conditions and then an internal if/else only for the dapWriter. Took me a while to parse the difference, and to realized that dapWriter was probably a writer to anytrust.


if err != nil && errors.Is(err, eigenda.SvcUnavailableErr) && b.config().EnableEigenDAFailover && b.dapWriter == nil { // Failover to ETH DA if enabled
// when failing over to ETHDA (i.e 4844, calldata), we may need to re-encode the batch. To do this in compliance with the existing code, it's easiest
// to update an internal field and retrigger the poster's event loop. Since the batch poster can be distributed across mulitple nodes, there could be
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nodes sounds a bit weird to me in this context. Is this standard arbitrum terminology? If not I would consider renaming to "multiple batch_posters". Node reminds me of too many different things like eth node, like k8s node, etc, but typically it means "a computer/physical server", and not a process (k8s pod)

Comment on lines +1488 to +1489
// degraded temporary performance as each batch poster will re-encode the batch on another event loop tick using the coordination lock which could worst case
// could require every batcher instance to fail dispersal to EigenDA.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: could worst case could

if b.dapWriter != nil {
var eigenDaBlobInfo *eigenda.EigenDABlobInfo
eigenDADispersed := false
failOver := false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this failOver variable different from eigenDAFailoverToETHDA?

log.Error("EigenDA service is unavailable and anytrust is disabled, failing over to ETH DA")
b.eigenDAFailoverToETHDA = true

// // if the batch's size exceeds the native DA max size limit, we must re-encode the batch to accomodate the AnyTrust, calldata, and 4844 size limits
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// //

b.eigenDAFailoverToETHDA = true

// // if the batch's size exceeds the native DA max size limit, we must re-encode the batch to accomodate the AnyTrust, calldata, and 4844 size limits
if (len(sequencerMsg) > b.config().MaxSize && !b.building.use4844) || (len(sequencerMsg) > b.config().Max4844BatchSize && b.building.use4844) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: consider changing order to make it cleaner:

if (b.building.use4844 && len(sequencerMsg) > b.config().Max4844BatchSize) || (!b.building.use4844 && len(sequencerMsg) > b.config().MaxSize)


batchPosterDASuccessCounter.Inc(1)
batchPosterDALastSuccessfulActionGauge.Update(time.Now().Unix())
} else if failOver {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use eigenDAFailoverToETHDA instead? Or are those variables different?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants