# Purpose

The OP Stack is growing in many ways, and testing needs to grow with it.

This design doc aims to express what the challenges are,
and what our bigger vision is for improving testing to support that growth.

This is a shared document between the Protocol and Platforms teams.

*Note: this doc is still actively changing; this is just a draft opened by Proto.*

# Summary

In summary, the stack is changing in the following ways:
- Growth of chains: more test deployments / monitoring
- Growth of features: more edge-cases to validate
- Growth of history: more syncing and regression tests
- Growth of clients: more spec conformance checks
- Growth of activity: more benchmarks and uptime checks
- Growth as platform: more need for testing to be extensible
- Interoperability: new cross-L2 test requirements

Testing is both an Infra and a Software problem.
Arguably this mirrors the Platforms-Protocol team split, but the boundary between them can be fuzzy.

With both the Platforms and Protocol teams we can align on what improvements we need,
and implement them collaboratively.

# Problem Statement + Context

As outlined above, there are different areas where the stack is expanding,
and where testing needs to expand with it.

What we currently have may be sufficient for a while,
but the pressure of a "test ceiling" also stalls development.
E.g. more `op-e2e` bloat -> more `op-e2e` flakes -> less confidence in `op-e2e` -> fewer features / changes.

*Some pressure* may be good: complexity has to stop somewhere.
But ideally this comes as a design choice, not as pain/delay in development.

The sections below review what we have today, what challenges we face,
and what ideas we have, for each of the growth domains.

### Growth of chains

With a greater number of chains, we will need more test deployments,
and more monitoring.

Today:
- Superchain-registry [validation checks](https://github.com/ethereum-optimism/superchain-registry/tree/main/validation)
- Test nodes on each network, with alerts
- [pessimism](https://github.com/base-org/pessimism) monitoring (deprecated)
- [monitorism](https://github.com/ethereum-optimism/monitorism)

Challenges:
- Test nodes can be unstable: a lack of peers can result in missed blocks.
- Test nodes can elevate costs: not everything has to run on the same tier of hardware as sequencers.
  There is an economies-of-scale question to investigate here.
- Monitorism may need to be deployed to more chains.

### Growth of features

"Mo' features, mo' problems." Or really, more edge-cases to validate.
We need the test-infrastructure to efficiently express how things can go wrong,
and what we expect to happen in each case.
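
As a purely hypothetical illustration of the expressiveness we are after (none of these helper names exist today), a test could state the fault it injects and the outcome it expects, with everything else defaulted:

```go
// Hypothetical sketch, not an existing API: the test names the fault and the
// expected recovery, and the framework supplies a default system around it.
func TestBatcherGapRecovery(t *testing.T) {
	sys := testsys.NewDefault(t) // hypothetical setup helper

	sys.Batcher().Stop(t)       // inject the fault: no batch submissions for a while
	sys.L1().BuildBlocks(t, 10) // let L1 progress towards the sequencing-window limit

	sys.Batcher().Start(t)            // recover
	sys.L2().AwaitSafeHeadProgress(t) // expectation: the safe head catches up again
}
```
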
Today:
- [`op-e2e` system tests](https://github.com/ethereum-optimism/optimism/tree/develop/op-e2e/system)
- [`op-e2e` action tests](https://github.com/ethereum-optimism/optimism/tree/develop/op-e2e/actions)
- [smart contracts tests](https://github.com/ethereum-optimism/optimism/tree/develop/packages/contracts-bedrock/test)
- [op-chain-ops upgrade checks](https://github.com/ethereum-optimism/optimism/tree/develop/op-chain-ops/cmd)
- [superchain-ops](https://github.com/ethereum-optimism/superchain-ops/) checks
- Interop tests (see [interop section](#interoperability))

Challenges:
- System tests are prone to flakes: running N systems in a resource-constrained environment,
  with concurrent work / timers / etc., tends to miss assertions due to processing stalls.
- `op-e2e` tests are not portable to alternative client implementations.
  We need ways of generalizing test assertions, and of decoupling the node setup from the test itself.
- `op-e2e` system tests are quite monolithic. Adding/removing nodes is high-friction.
  Some tests run services that do not influence the test at all.
- `op-chain-ops` and `superchain-ops` checks are not automated in scheduled CI runs.
  Upgrade checks should run more regularly than as one-offs against devnet/testnet.
- Limited resource sharing: many of the feature tests spawn their own copy of the system,
  rather than being able to run against an already-running system.

In general, the main idea here is that we need a
way to express tests in terms of system requirements, decoupled from system setup:
we can then unify upgrade-checks and tests, and run them against shared chain resources,
so that we do not overwhelm CI.

### Growth of history

With more chain history, there is more syncing to do, and more regression tests to maintain.

Today:
- Sync tests (internal nodes scheduled to perform a resync).
- Anecdotal syncs (internal/external sync feedback).
- Scheduled Fault-Proof test-runs against testnet.

Challenges:
- Sync tests lack ownership, are difficult to set up, and their results do not feed into follow-up work well.
- Anecdotal syncs lack context: when something stalls or errors, it is difficult to determine why.
- Feature changes that affect sync do not feed into sync test-runs easily.

Some more visible automation, and a log of results
(perhaps with Grafana dashboard links, filtered to the affected node), would be very useful here.
Perhaps a Discord bot that owns these sync test-runs and posts their progress/results could solve these challenges.
There has been some success with posting the fault-proof test-runs to Slack, but starting special runs is still too difficult.

### Growth of clients

With more client implementations, there are more (duplicate) spec conformance checks to run.

Today:
- `op-e2e` action tests [exported for Kona](https://github.com/ethereum-optimism/optimism/blob/develop/op-e2e/actions/proofs/helpers/kona.go)
- `op-e2e` external-geth shim (removed in [12216](https://github.com/ethereum-optimism/optimism/pull/12216))
- [optimism kurtosis](https://github.com/ethpandaops/optimism-package/) to spin up clients in a network

Challenges:
- The Go tests generally don't export to alternative implementations like Rust.
- Kurtosis is an "infra" solution to what is also a "software" problem:
  after spinning up the network, we still need to run tests against it.

This might get more approachable with more separation between node setup and testing,
as proposed in the feature-growth section.
Ideally, we then run common tests against different combinations of the client implementations.
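
A hypothetical sketch of what that could look like, assuming a setup helper that is decoupled from the test body (`newSystem` and `runDerivationChecks` are illustrative names, not an existing API):

```go
// Hypothetical sketch: one shared conformance test, parametrized over
// client combinations.
var elClients = []string{"op-geth", "op-reth"}
var clClients = []string{"op-node", "kona"}

func TestDerivationConformance(t *testing.T) {
	for _, el := range elClients {
		for _, cl := range clClients {
			t.Run(el+"/"+cl, func(t *testing.T) {
				sys := newSystem(t, el, cl) // setup decoupled from the test body
				runDerivationChecks(t, sys) // shared, client-agnostic assertions
			})
		}
	}
}
```
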
### Growth of activity

With more vertical scaling comes more concern around performance and stability.
We need more benchmarks and uptime checks to understand where we can improve.

Today:
- [op-ufm (user facing monitoring)](https://github.com/ethereum-optimism/infra/tree/main/op-ufm)
- [`replayor`](https://github.com/danyalprout/replayor)
- Internal infra dashboards / alerts
- One-off chain activity analysis

Challenges:
- `op-ufm` should run against more chains, and be more visible to engineers.
  E.g. if transaction inclusion speed is not good, we need to do something about it.
- `replayor` needs automation, ideally on top of a shadow-fork of a real network,
  such that we can analyze performance without spending real ETH or harming a production network in any way.
- We need to monitor node performance better.
  `pprof` / Go-resource dashboards should improve and be more central to our work.

Automating flight-recording in `op-service` would be great.
See https://go.dev/blog/execution-traces-2024 for information about flight-recording.
E.g. on an RPC call, or on particular pre-programmed conditions,
a buffer of performance data from the last N seconds can be dumped and uploaded to some service.
When slow blocks occur, or engineers are interested in a performance snapshot of a real network, this could be great.
Also see the [op-geth flight-recording design-doc (internal)](https://github.com/ethereum-optimism/design-docs-private/blob/main/op-geth-flight-recording.md).

There is also https://pyroscope.io/ for nice continuous monitoring,
although it is built on top of older Golang profiling hooks, and seems to provide less fine-grained information.

### Growth as platform

The more the stack becomes a "platform" that others fork and build new things on top of,
the more thought we should give to the public interface and extensibility of our testing.

Today: no testing platform.

Challenges:
External OP Stack forks end up having to fork the testing infra as well, and wire in their customizations.
A lot of tests may work by default, but some features may not.
E.g. 4844 blob tests may need to be disabled for an alt-DA network test suite.

There may be ways we can better categorize tests, make test setup more composable,
and improve parametrization of contract versioning and chain configurations.

### Interoperability

Interoperability is a net-new domain, and requires deeper work than adjustments to handle growth.
Specifically, it requires cross-chain testing.
This adds multi-L2 deployments, multi-L2 test setup, and multi-L2 tests to the testing scope.

Today:
- interop-devnet [docker-compose](https://github.com/ethereum-optimism/optimism/tree/develop/interop-devnet)
- interop op-e2e system-test variant: [SuperSystem](https://github.com/ethereum-optimism/optimism/blob/develop/op-e2e/interop/supersystem.go) tests
- interop op-e2e action-test variant: [InteropSetup](https://github.com/ethereum-optimism/optimism/blob/develop/op-e2e/actions/interop/interop.go) tests

Challenges:
- Setting up multiple L2s that all attach to the same L1,
  and managing unique per-chain keys, deployments, configs, and resources, can be difficult.
  [`op-deployer`](https://github.com/ethereum-optimism/optimism/tree/develop/op-deployer) and
  [`interopgen`](https://github.com/ethereum-optimism/optimism/tree/develop/op-chain-ops/interopgen)
  are a good start, but things can be improved:
  ideally by unifying more of the setup between Kurtosis and op-e2e tests, so there is less setup code to maintain.
- The existing `op-e2e` single-chain and interop variants are diverging.
  We need to unify the test framework more, so that there is no inconsistency between testing the two kinds.
- Understanding what is going on is already difficult in an existing test; with interop it becomes even more challenging:
  we have 5+ services running per chain, and then N chains, and we need to understand the interactions between them.

We need to make it more seamless to interact with multiple chains from within the same test.
Part of this is the setup challenge;
part of it is how we express these multi-L2 interactions without making things incredibly verbose.
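
To make the desired ergonomics concrete, here is a sketch of the kind of multi-L2 test we would like to be able to write (a hypothetical API, loosely in the spirit of the `SuperSystem` helpers; these method names do not exist today):

```go
// Hypothetical sketch of a seamless multi-L2 test: one L1 and two L2s,
// with keys, deployments, and configs managed by the framework.
func TestCrossChainMessage(t *testing.T) {
	sys := supersys.New(t, supersys.WithL2Count(2)) // hypothetical setup helper

	a, b := sys.L2(0), sys.L2(1)
	msg := a.SendMessage(t, "alice", b.ChainID(), []byte("hello"))

	sys.AwaitCrossSafe(t, msg)       // wait for cross-chain safety
	b.RequireMessageExecuted(t, msg) // assert delivery on the destination chain
}
```
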
# Proposed Solution

## Towards a solution

Based on the challenges outlined above, I believe we need:
- Automation to spin up test chains
- Flexibility with node types
- Separation of setup and test
- A more expressive language to define tests in
- Improved insights

### Automation to spin up test chains

The automation story is largely a Platforms/infra project,
with `op-deployer` and Kurtosis as the main solutions to set up configs and nodes quickly.

### Flexibility with node types

Flexibility requires some standardization:
each execution-engine talks the same RPC,
but rollup-nodes, peripherals, alternative proof systems, etc. still have unique API interfaces.

Defining some interfaces (especially RPC namespaces) as "standard" can go a long way toward portable testing.

### Separation of setup and test

One pattern we are trying to work towards in `op-e2e` is to hide
the test-setup and node-access behind Go interfaces as much as possible.
The more it is separated by interface, the more flexible the implementation of the setup is,
and the more reusable the test.

One example of this was the Go testing that also ran against the local docker-compose devnet,
which was unfortunately removed in [PR 12216](https://github.com/ethereum-optimism/optimism/pull/12216).

### More expressive language to define tests in

A test should be simple, and easy to maintain by anyone.

At the same time, a test should be able to navigate into an edge-case,
to cover the more subtle state-transition problems.

A lot of the time, a test is complicated because of its setup,
and because of the lack of common idioms for accessing the nodes and state that assertions are made over.

Besides separating the test from the setup, we need to review how we express the tests themselves.
Part of this is the language itself; part of it is the DSL (domain-specific language) we build on top.

#### Golang

Golang can be great:
- A lot of the existing tests and test-utilities can be ported over quickly.
- There is a Kurtosis Go SDK we can integrate with.
- Golang is the simplest common denominator for test-writing (assuming no introduction of Python just for testing).

While there are many existing test utilities and abstractions, the Go test stack can still improve.
Notably, it can be challenging to fork a test into two independent tests, or to parametrize tests in general.

With these existing test utils, some of it already looks like a DSL.
E.g. the action-tests have a specific way of making one thing happen at a time,
in a bigger orchestration of actors.
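
For illustration, an action-test flow looks roughly like this (simplified from the real `op-e2e/actions` helpers; details may differ from the current code):

```go
// Each Act* step advances exactly one actor by one deterministic step,
// so the test controls the exact interleaving of events.
sequencer.ActL2StartBlock(t)
sequencer.ActL2EndBlock(t) // seal an L2 block
batcher.ActSubmitAll(t)    // submit batch data to L1
miner.ActL1StartBlock(12)(t)
miner.ActL1IncludeTx(batcherAddr)(t)
miner.ActL1EndBlock(t)         // include the batch tx in an L1 block
sequencer.ActL1HeadSignal(t)   // tell the sequencer about the new L1 head
sequencer.ActL2PipelineFull(t) // derive until the safe head catches up
```
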
This DSL-like functionality needs to improve in the system-tests,
where awaiting events and other asynchronous work can still feel verbose and fragile.

#### Rust

There may be an argument for testing in Rust.
But with all the existing infra tools written in Go,
this may be more challenging for the current Platforms team to maintain.

One option might be to automate Kurtosis,
and then define a Rust test-suite runner with the [Rust kurtosis SDK](https://crates.io/crates/kurtosis-sdk)
to interface with the Kurtosis deployment,
for those test-cases we prefer to write in Rust instead of Go or Solidity.

There is no DSL yet; this would be quite new ground for a lot of the non-Rust engineering.

#### Solidity

However, both Go and Rust fall short on contract interactions: creating bindings for every interaction can be tiresome.
And Solidity-in-Rust macros might still be too much context-switching for a test writer.

A spike to explore what testing in a Solidity-first environment could look like may be worth doing.
The [`whatif.sol`](https://gist.github.com/protolambda/1c149b54ec54b57610eca6661f687170) gist is an early draft of what this could look like.

Solidity test scripts, expressing invariants and such,
can also be a great common language between the OP Stack specs and the tests,
similar to Python in the [Ethereum L1 Consensus specs](https://github.com/ethereum/consensus-specs/blob/dev/specs/phase0/beacon-chain.md).
Defining invariants in Solidity, as part of the spec, which we then pull into tests,
could make protocol development more test-driven.

For a DSL, we do have Forge patterns:
switching to a fork, and announcing the next call as a broadcast,
are common patterns in Foundry tests that could work well here, and might only need minimal changes.
Custom DSL / cheatcodes can be supported if we run the tests in the Go script environment,
where we can plug in our own cheatcodes.
These cheatcodes can potentially just be a thin wrapper around the Go test framework.

### Improved insights

After everything is set up and tests are running, we still need to improve our insights.
How do things fail? How does a live network perform after triggering a non-fatal edge-case?

Generally this means improving monitoring, instrumentation, etc.

Some ideas to improve:
- Light-weight RPC proxies everywhere. We can capture the JSON-RPC exchanges between all the services,
  flag weird timing, and review RPC logs after problems.
  In a way, we can untangle the communication between 20+ services, just like logging,
  by tracing and labeling what happens.
- Make flight-recording a standard practice for every service (see the sketch after this list).
  Being able to look at what a node is/was doing is very helpful.
  For tests that take more resources to re-run, having more data after a test-failure is important.
- Make op-node stream its events, so we can assert things based on what is happening inside the node,
  similar to [assertoor](https://github.com/ethpandaops/assertoor/) in L1.
- For longer-running tests, post progress and aggregate results in a channel on the R&D Discord.
  And maybe allow channel participants to interact with the test: e.g. node restarts, API queries, pausing the testing, restarting the test, etc.
- Differential tests: the better we can diff results against other results
  (previous runs, or those of alternative clients), the more confidence we gain in compatibility between different implementations.
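
To anchor the flight-recording idea, here is a minimal sketch following the execution-traces blog post linked earlier; it assumes the experimental `golang.org/x/exp/trace` API, which may still change:

```go
package main

import (
	"bytes"
	"log"
	"net/http"

	"golang.org/x/exp/trace"
)

func main() {
	// Keep a rolling in-memory window of recent execution-trace data.
	fr := trace.NewFlightRecorder()
	if err := fr.Start(); err != nil {
		log.Fatal(err)
	}

	// Dump the last few seconds of trace data on demand; in op-service this
	// could be triggered by an RPC call or a pre-programmed condition (e.g. a
	// slow block), and uploaded to a shared store instead of served over HTTP.
	http.HandleFunc("/debug/flightrecorder", func(w http.ResponseWriter, r *http.Request) {
		var buf bytes.Buffer
		if _, err := fr.WriteTo(&buf); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		_, _ = w.Write(buf.Bytes())
	})
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```
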
## Concrete solution

*This document is a work in progress; the above ideas are still being iterated on.*

## Resource Usage

# Alternatives Considered

# Risks & Uncertainties