Add a test for L1 reorgs #250
Conversation
Ok. Take 2. The Anvil scripting didn't work out for a whole bunch of reasons, the latest one being that initialization of the zkEVM node depends on event logs emitted during contract creation, which means it depends not only on the state of the L1 but also its history. That doesn't work with Anvil snapshots, which capture only state, not history. At least we got a good test for the sequencer L1 client out of that adventure 🤷

It seems there is no way to simulate a reorg that is sufficiently realistic for the zkEVM node, except to actually do a reorg. Amazingly, there is no good way to do this locally with any of the usual Ethereum dev tools. The best we can do is start a local Geth PoW network, disconnect one node, then later reconnect it so that it switches over to the longest chain. Luckily, I found an open source tool for automating this, https://github.com/0xsequence/reorgme. The tool is a bit old and I needed to make some changes for our use case, so this PR currently uses the fork https://github.com/EspressoSystems/reorgme.

This test now reproduces the foreign key error, not every single time, but pretty reliably. Unfortunately it doesn't necessarily fail when the error reproduces, because the problem only affects the preconfirmations node's ability to sync L1 state, while all of the observable state of that node (the blocks it is executing) comes from HotShot, not the L1. But the problem is clearly visible in the logs, so this should be good enough at least to determine whether or not it's been fixed.
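For orientation, the rough shape of the setup is sketched below. Only reorgme.join() appears in the actual diff; ReorgMe::start and fork are hypothetical placeholder names for however the test drives the reorgme fork.

// Hypothetical sketch of the reorg setup; only `reorgme.join()` is taken from
// the actual diff, the other names are placeholders.

// Start the local Geth PoW network and split off one node so the two sides
// build diverging chains.
let reorgme = ReorgMe::start().await;
reorgme.fork().await;

// ... run the zkEVM node and submit transactions on the branch that will
// later be orphaned ...

// Reconnect the isolated node. It switches over to the longest chain,
// producing a genuine reorg, and `join` returns once the reorg has been
// observed on the main L1 RPC node.
reorgme.join();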
With EspressoSystems/zkevm-node#87, this test now consistently passes without concerning error logs when run locally. Of course, it won't work in CI until the zkevm-node PR is merged.
I get an error when running:
Looks like "nodePackages.yarn" was missing. Here is the diff:
Attaching the logs with the failed tests.
From the error it looks like the Docker network was duplicated. Is it possible you already had something running on your machine, or a leaked network from a previous run? I think you can use
The new reorg test takes quite a long time. If this works, I will reorganize the test suite to run it in a scheduled job or something. It's too long and not really necessary to run on every PR, since it's not a great regression test anyway (you need to look at the logs to see if the reorg problem occurred). But I want to get it passing in CI at least once before scaling it back.
It passed! I think all that remains is to move this into a slow-tests workflow to speed up the PR requirements.
reorgme.join();

// Wait a bit for the nodes to recover.
sleep(Duration::from_secs(10)).await;
Is there some way to check that the nodes have recovered without sleeping? Like some transaction effect we know for sure will have been reversed by the reorg?
Ah, that comment is a bit misleading. The actual intention of delaying here is to ensure that the nodes have seen the reorg, so that the next transaction we submit is observed after the nodes have observed the reorg. The success of that transaction is then what tells us the nodes have recovered.

But looking back at this, I think this was more necessary in my earlier attempts to simulate reorgs, where things were messier and not as deterministic. Now, reorgme.join() already waits until it has actually seen the reorg happen on the main L1 RPC node, so this delay might not be necessary at all. I'll see if I can delete it and still get the behavior I want in the logs.
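With the sleep gone, the remaining synchronization would look roughly like the sketch below; submit_transfer, wait_for_receipt, and l2_client are hypothetical stand-ins for the actual test helpers.

// `join` only returns once the reorg has been observed on the main L1 RPC
// node, so no extra sleep is needed before the recovery check.
reorgme.join();

// Hypothetical recovery check: a transaction submitted after the reorg must
// still be sequenced, which is what tells us the nodes have recovered.
let tx_hash = submit_transfer(&l2_client).await;
wait_for_receipt(&l2_client, tx_hash).await;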
Yep, removing this seems to work
.unwrap()
.unwrap();
tracing::info!("current verified batch is {}", event.num_batch);
if event.num_batch >= l2_height {
Should this time out?
Yeah, but I'd rather handle it at the CI job config level (there is currently a 1 hour timeout for the entire slow-tests workflow).
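For reference only, since the CI-level timeout is the chosen approach here: an in-test bound around this kind of polling loop could look roughly like the sketch below, where get_latest_verified_batch_event is a hypothetical stand-in for however the test actually fetches the event.

use std::time::Duration;
use tokio::time::{sleep, timeout};

// Illustrative only: bound the wait for batch verification inside the test
// instead of relying on the CI job timeout. `get_latest_verified_batch_event`
// is a hypothetical helper.
timeout(Duration::from_secs(600), async {
    loop {
        let event = get_latest_verified_batch_event().await.unwrap().unwrap();
        tracing::info!("current verified batch is {}", event.num_batch);
        if event.num_batch >= l2_height {
            break;
        }
        sleep(Duration::from_secs(5)).await;
    }
})
.await
.expect("timed out waiting for batches to be verified");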
Good stuff, just a couple of questions
Synchronization in this part of the test is handled by reorgme.
This test successfully reproduces the exact failure in the preconfirmations node we saw in production:
Now to try and fix it
Closes #246