Add a test for L1 reorgs #250
Conversation
Ok. Take 2. The Anvil scripting didn't work out for a whole bunch of reasons, the latest one being that initialization of the zkEVM node depends on event logs emitted during contract creation, which means it depends not only on the state of the L1 but also its history. That doesn't work with Anvil snapshots, which capture only state, not history. At least we got a good test for the sequencer L1 client out of that adventure 🤷

It seems there is no way to simulate a reorg that is sufficiently realistic for the zkEVM node, except to actually do a reorg. Amazingly, there is no good way to do this locally with any of the usual Ethereum dev tools. The best we can do is start a local Geth PoW network, disconnect one node, then later reconnect it so that it switches over to the longest chain. Luckily, I found an open source tool for automating this, https://github.com/0xsequence/reorgme. The tool is a bit old and I needed to make some changes for our use case, so this PR currently uses the fork https://github.com/EspressoSystems/reorgme.

This test now reproduces the foreign key error, not every single time, but pretty reliably. Unfortunately it doesn't necessarily fail when the error reproduces, because the problem only affects the preconfirmations node's ability to sync L1 state, while all of the observable state of that node (the blocks it is executing) comes from HotShot, not the L1. But the problem is clearly visible in the logs, so this should be good enough at least to determine whether or not it's been fixed.
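For orientation, the rough shape of the setup is sketched below. Only reorgme.join() appears in the actual diff; ReorgMe::start and fork are hypothetical placeholder names for however the test drives the reorgme fork.

// Hypothetical sketch of the reorg setup; only `reorgme.join()` is taken from
// the actual diff, the other names are placeholders.

// Start the local Geth PoW network and split off one node so the two sides
// build diverging chains.
let reorgme = ReorgMe::start().await;
reorgme.fork().await;

// ... run the zkEVM node and submit transactions on the branch that will
// later be orphaned ...

// Reconnect the isolated node. It switches over to the longest chain,
// producing a genuine reorg, and `join` returns once the reorg has been
// observed on the main L1 RPC node.
reorgme.join();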
With EspressoSystems/zkevm-node#87, this test now consistently passes without concerning error logs when run locally. Of course, it won't work in CI until the zkevm-node PR is merged.
I get an error when running:
Looks like "nodePackages.yarn" was missing. Here is the diff:
Attaching the logs with the failed tests.
From the error it looks like the Docker network was duplicated. Is it possible you already had something running on your machine, or a leaked network from a previous run? I think you can use
The new reorg test takes quite a long time. If this works, I will reorganize the test suite to run it in a scheduled job or something. It's too long and not really necessary to run on every PR, since it's not a great regression test anyway (you need to look at the logs to see if the reorg problem occurred). But I want to get it passing in CI at least once before scaling it back.
It passed! I think all that remains is to move this into a slow-tests workflow to speed up the PR requirements.
reorgme.join();

// Wait a bit for the nodes to recover.
sleep(Duration::from_secs(10)).await;
Is there some way to check that the nodes have recovered without sleeping? Like some transaction effect we know for sure will have been reversed by the reorg?
Ah, that comment is a bit misleading. The actual intention of delaying here is to ensure that the nodes have seen the reorg, so that the next transaction we submit is observed after the nodes have observed the reorg. The success of that transaction is then what tells us the nodes have recovered.

But looking back at this, I think this was more necessary in my earlier attempts to simulate reorgs, where things were messier and not as deterministic. Now, reorgme.join() already waits until it has actually seen the reorg happen on the main L1 RPC node, so this delay might not be necessary at all. I'll see if I can delete it and still get the behavior I want in the logs.
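With the sleep gone, the remaining synchronization would look roughly like the sketch below; submit_transfer, wait_for_receipt, and l2_client are hypothetical stand-ins for the actual test helpers.

// `join` only returns once the reorg has been observed on the main L1 RPC
// node, so no extra sleep is needed before the recovery check.
reorgme.join();

// Hypothetical recovery check: a transaction submitted after the reorg must
// still be sequenced, which is what tells us the nodes have recovered.
let tx_hash = submit_transfer(&l2_client).await;
wait_for_receipt(&l2_client, tx_hash).await;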
Yep, removing this seems to work
.unwrap()
.unwrap();
tracing::info!("current verified batch is {}", event.num_batch);
if event.num_batch >= l2_height {
Should this time out?
Yeah, but I'd rather handle it at the CI job config level (there is currently a 1 hour timeout for the entire slow-tests workflow).
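For reference only, since the CI-level timeout is the chosen approach here: an in-test bound around this kind of polling loop could look roughly like the sketch below, where get_latest_verified_batch_event is a hypothetical stand-in for however the test actually fetches the event.

use std::time::Duration;
use tokio::time::{sleep, timeout};

// Illustrative only: bound the wait for batch verification inside the test
// instead of relying on the CI job timeout. `get_latest_verified_batch_event`
// is a hypothetical helper.
timeout(Duration::from_secs(600), async {
    loop {
        let event = get_latest_verified_batch_event().await.unwrap().unwrap();
        tracing::info!("current verified batch is {}", event.num_batch);
        if event.num_batch >= l2_height {
            break;
        }
        sleep(Duration::from_secs(5)).await;
    }
})
.await
.expect("timed out waiting for batches to be verified");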
Good stuff, just a couple of questions
Synchronization in this part of the test is handled by reorgme.
This test successfully reproduces the exact failure in the preconfirmations node we saw in production:
Now to try and fix it
Closes #246