
Add a test for L1 reorgs #250

Merged: 17 commits, Nov 6, 2023
Conversation

@jbearer (Member) commented Oct 6, 2023

This test successfully reproduces the exact failure in the preconfirmations node we saw in production:

test-reorg-zkevm-1-preconfirmations-node-1 | 2023-10-06T14:48:29.385Z ERROR synchronizer/synchronizer.go:862 error storing the verifiedB in processTrustedVerifyBatches. BlockNumber: 42, error: %!w(*pgconn.PgError=&{ERROR 23503 insert or update on table "verified_batch" violates foreign key constraint "verified_batch_batch_num_fkey" Key (batch_num)=(1) is not present in table "virtual_batch".  0 0   state verified_batch   verified_batch_batch_num_fkey ri_triggers.c 2608 ri_ReportViolation})%!(EXTRA string=

Now to try and fix it

Closes #246

@jbearer (Member, Author) commented Oct 12, 2023

OK, take 2. The Anvil scripting didn't work out for a whole bunch of reasons, the latest being that initialization of the zkEVM node depends on event logs emitted during contract creation, which means it depends not only on the state of the L1 but also on its history. That doesn't work with Anvil, whose snapshots capture only state, not history. At least we got a good test for the sequencer L1 client out of that adventure 🤷

It seems that there is no way to simulate a reorg that is sufficiently realistic for the zkEVM node, except to actually do a reorg. Amazingly, there is not a good way to do this locally with any of the usual Ethereum dev tools. The best we can do is start a local Geth PoW network, disconnect one node, then later reconnect it so that it switches over to the longest chain. Luckily, I found an open source tool for automating this, https://github.com/0xsequence/reorgme. The tool is a bit old and I needed to make some changes for our use case, so this is currently using the fork https://github.com/EspressoSystems/reorgme.

This test now reproduces the foreign key error, not every single time, but pretty reliably. Unfortunately, the test doesn't necessarily fail when the error reproduces: the problem only affects the preconfirmations node's ability to sync L1 state, while all of that node's observable state (the blocks it executes) comes from HotShot, not the L1. But the problem is clearly visible in the logs, so this should be good enough at least to determine whether or not it's been fixed.

@jbearer (Member, Author) commented Oct 12, 2023

With EspressoSystems/zkevm-node#87, this test now consistently passes without concerning error logs when run locally. Of course, it won't work in CI until the zkevm-node PR is merged.

@jbearer jbearer marked this pull request as ready for review October 12, 2023 15:45
@philippecamacho commented Oct 13, 2023

I get an error when running pnpm i:

pnpm i
Packages are hard linked from the content-addressable store to the virtual store.
  Content-addressable store is at: /home/leloup/.local/share/pnpm/store/v3
  Virtual store is at:             node_modules/.pnpm
../../../.local/share/pnpm/store/v3/tmp/_tmp_134803_13f1204fe81617ce0cb898101b7f7fbf [[email protected]]: Running yarn-install script...
 WARN  deprecated @types/[email protected]: This is a stub types definition. winston provides its own type definitions, so you do not need this installed.
Packages: +199
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ERR_PNPM_PREPARE_PACKAGE  Failed to prepare git-hosted package fetched from "https://codeload.github.com/EspressoSystems/reorgme/tar.gz/6f1f079f8c522229e948fe950890b24270226153": [email protected] yarn-install: `yarn install`
spawn ENOENT
Progress: resolved 199, reused 197, downloaded 0, added 0

Looks like "nodePackages.yarn" was missing in flake.nix.

Here is the diff:

diff --git a/flake.nix b/flake.nix
index 2fd26b6..1c8007c 100644
--- a/flake.nix
+++ b/flake.nix
@@ -120,6 +120,7 @@
                 jq
                 nodejs
                 nodePackages.pnpm
+                nodePackages.yarn
 
                 # Figures
                 graphviz

@philippecamacho commented:

Attaching the logs with the failed tests:
logs.txt

@jbearer (Member, Author) commented Oct 24, 2023

From the error it looks like the Docker network was duplicated. Is it possible you already had something running on your machine, or a leaked network from a previous run? I think you can use docker network ls to see all the networks that exist and docker network rm <id> to remove one that shouldn't be.

The new reorg test takes quite a long time. If this works, I will reorganize the test suite to run this in a scheduled job or something. It's too long and not really necessary to run on every PR, since it's not a great regression test anyway (you need to look at the logs to see if the reorg problem occurred). But I want to get it passing in CI at least once before scaling it back.
@jbearer (Member, Author) commented Nov 1, 2023

It passed! I think all that remains is to move this into a slow-tests workflow to speed up the PR requirements.

reorgme.join();

// Wait a bit for the nodes to recover.
sleep(Duration::from_secs(10)).await;
A Contributor commented:

Is there some way to check that the nodes have recovered without sleeping? Like some transaction effect we know for sure will have been reversed by the reorg?

jbearer (Author) replied:

Ah that comment is a bit misleading. The actual intention of delaying here is to ensure that the nodes have seen the reorg, and so that the next transaction we submit is observed after the nodes have observed the reorg. Then the success of that transaction is what tells us the nodes have recovered.

But looking back at this, I think it was more necessary in my earlier attempts to simulate reorgs, where things were messier and less deterministic. Now, reorgme.join() already waits until it has actually seen the reorg happen on the main L1 RPC node, so this delay might not be necessary at all. I'll see if I can delete it and still get the behavior I want in the logs.

jbearer (Author) replied:

Yep, removing this seems to work.

.unwrap()
.unwrap();
tracing::info!("current verified batch is {}", event.num_batch);
if event.num_batch >= l2_height {
A Contributor commented:

Should this time out?

jbearer (Author) replied:

Yeah, but I'd rather handle that at the CI job config level (there is currently a 1-hour timeout for the entire slow-tests workflow).

@nomaxg (Contributor) left a review:

Good stuff, just a couple of questions

Synchronization in this part of the test is handled by reorgme
@jbearer jbearer enabled auto-merge November 6, 2023 14:57
@jbearer jbearer merged commit 6710665 into main Nov 6, 2023
8 checks passed
@jbearer jbearer deleted the test/reorg branch November 6, 2023 15:12
Successfully merging this pull request may close these issues.

Preconfirmations node does not handle L1 reorgs correctly