protocol/platforms: test infra draft #165

Draft: wants to merge 1 commit into main
358 changes: 358 additions & 0 deletions protocol/test-infra.md
# Purpose

<!-- This section is also sometimes called “Motivations” or “Goals”. -->

<!-- It is fine to remove this section from the final document,
but understanding the purpose of the doc when writing is very helpful. -->

The OP-Stack is growing in many ways, and testing needs to grow with it.

This design doc aims to express what the challenges are,
and what our bigger vision is to improve testing to support the growth.

This is a shared document between Protocol and Platforms teams.

*Note: this doc is still actively changing; this is just a draft opened by Proto.*

# Summary

<!-- Most (if not all) documents should have a summary.
While the length will likely be proportional to the length of the full document,
the summary should be as succinct as possible. -->

In summary, the stack is changing in the following ways:
- Growth of chains: more test deployments / monitoring
- Growth of features: more edge-cases to validate
- Growth of history: more syncing and regression tests
- Growth of clients: more spec conformance checks
- Growth of activity: more benchmarks and uptime checks
- Growth as platform: more need for testing to be extensible
- Interoperability: new cross-L2 test requirements

Testing is both an Infra and a Software problem.
Arguably this maps to the Platforms-Protocol team split, but the boundary between them can be fuzzy.

With both platforms and protocol teams we can align on what improvements we need,
and implement them collaboratively.

# Problem Statement + Context

<!-- Describe the specific problem that the document is seeking to address as well
as information needed to understand the problem and design space.
If more information is needed on the costs of the problem,
this is a good place to put that information. -->

As outlined above, there are different areas where the stack is expanding,
and where testing needs to expand with it.

What we currently have may be sufficient for a while,
but the pressure of a "test ceiling" does also stall development.
E.g. more `op-e2e` bloat -> more `op-e2e` flakes -> less confidence in `op-e2e` -> fewer features / changes.

*Some pressure* may be good: complexity has to stop somewhere.
But ideally this comes as a design-choice, not a pain/delay in development.

The sections below review what we have today, what challenges we have,
and what ideas we have, for each of the growth domains.

### Growth of chains

With a greater number of chains, we will need more test deployments,
and more monitoring.

Comment on lines +60 to +61:

How generic/specific are the tests and monitoring here?
In the abstract I'm imagining that all the chains share some common fundamental properties that need to be continuously validated, but maybe they also have individual properties that need ad-hoc testing?

So for the former I could imagine that we need to invest in making those validation devices as generic as possible, while the latter would presumably require something different (higher-level interpretation of the chain configuration to derive the appropriate validators?)
(disclaimer, I barely know what I'm talking about here, so I might be way off :))

Reply (Contributor Author):

It's validating the same thing for every chain. Per chain there might be some minor differences, like different bridge contract addresses, but easy to replicate.

For this type of continuous testing of more chains I am mostly worried about reducing cost (hosting more nodes + minimizing baby-sitting costs of those nodes).

It might get more chain-specific if we expand into the user-facing-monitoring side. E.g. if we want to check uptime / agreement of known public services per chain. E.g. automated checks that all major RPC providers are on the same chain.


Today:
- Superchain-registry [validation checks](https://github.com/ethereum-optimism/superchain-registry/tree/main/validation)
- Test nodes on each network, with alerts
- [pessimism](https://github.com/base-org/pessimism) monitoring (deprecated)
- [monitorism](https://github.com/ethereum-optimism/monitorism)

Challenges:
- Test nodes can be unstable: lack of peers can result in missed blocks.

Comment:

Could you zoom-in on this? At some level it reads a bit like a failure mode of the chain itself is affecting its validation. Which I guess is somewhat ok? some tests could always be inconclusive, if their pre-requisites are not met (and then having inconclusive tests is itself a validation failure regardless of what they were trying to do)

Reply (Contributor Author):

Especially for smaller chains, it can be difficult at times to find peers and stay connected to them. The chain itself is healthy (sequencer online, working batch submitter), but the happy-path of block-distribution via P2P can be interrupted.

For those types of happy-path interruptions, it would be nice to track, and maybe improve the robustness of the stack. E.g. by using additional alternative peer-discovery systems, or changing gossip parameters.

I would say it's less of a priority than core spec-conformance testing, but still an area we can improve.

- Test nodes can elevate costs: not everything has to run on the same tier of hardware as sequencers.
There is an economies-of-scale opportunity to investigate.
- Monitorism may need to be deployed to more chains.


### Growth of features

"Mo' features, mo' problems." Or really, more edge-cases to validate.
We need the test-infrastructure to efficiently express how things can go wrong,
and what we expect to happen in each case.

Today:
- [`op-e2e` system tests](https://github.com/ethereum-optimism/optimism/tree/develop/op-e2e/system)
- [`op-e2e` action tests](https://github.com/ethereum-optimism/optimism/tree/develop/op-e2e/actions)
- [smart contracts tests](https://github.com/ethereum-optimism/optimism/tree/develop/packages/contracts-bedrock/test)
- [op-chain-ops upgrade checks](https://github.com/ethereum-optimism/optimism/tree/develop/op-chain-ops/cmd)
- [superchain-ops](https://github.com/ethereum-optimism/superchain-ops/) checks
- Interop tests (see [interop section](#interoperability))

Challenges:
- System tests are prone to flakes: running N systems,
in a resource-constrained environment, with concurrent work / timers / etc., tends to miss assertions due to processing stalls.

Comment:

Is it a case of the assertions being stronger than what the system actually guarantees (typically: eventual correctness, encoded as a real-time constraint) ?
Or is it more that the test environment is actually invalid with respect to the properties of the system running in it (typically: IOPs not compliant with requirements, therefore timeout) ?

Reply (Contributor Author):

The former, the happy-path is being interrupted by resource-constrained CI machines. But generally we should be able to assert that happy-path, and not fallback to worst-case in every test.

E.g. if the chain started early, but blocks started being produced late, then the sequencer might not take time to include transactions, and the test runs but fails to confirm a tx within the expected amount of time. The bug here might first appear to be the tx-inclusion itself, but really it was the resource constraints that even prevented it from being ready to include a tx.

Reply:

I see. I mean, there's a bit of a philosophical consideration here, in the sense that asserting something that's not strictly speaking true is bound to cause false positives at scale.

I'm wondering if an ok middleground might look like:

- specifying what the happy path actually demands (more or less "relevant platform metrics being in the green")
- coupling test execution with platform monitoring, categorizing test "failures" as inconclusive when they happen to coincide with red monitoring
- rescheduling inconclusive tests according to some policy
- in parallel, playing with resource allocation, scheduling, ... to improve the odds of monitoring remaining green

The argument here would be that we then:

- improve (presumably) our flaking situation
- keep control of the requirements (therefore avoiding what I usually call "real-time inflation", where tests develop ties to "ideal" situations over time)
- gain the ability to deliberately push the system into the red and see how it copes

- `op-e2e` tests are not portable to alternative client implementations.

Comment (Contributor):

Big +1 on this. The tests themselves should be decoupled from the clients they run against, so we can build matrix-style tests against multiple clients.

Reply (Contributor):

But is op-e2e the right framework for that? The vast majority of op-e2e tests aren't testing the EL. Many are testing op-node derivation, many are testing op-proposer, op-batcher, op-challenger, dispute games, etc., none of which have alternate implementations.

We do need a multi client testsuite, but I'm not sure having to pull everything into op-e2e is the right answer. We don't want op-reth, op-nethermind, op-besu, op-whatever-I-just-created to be a dependency of the monorepo for example. Asterisc pulls the monorepo in so it can run e2e tests in their repo and it's not a particularly clean solution - it works but we break it by making changes to op-e2e pretty regularly.

I'm a really big fan of being able to export a test suite like the L1 reference tests do. Individual clients will still have their own e2e tests in addition for things that aren't consensus critical but the reference tests define the shared set of tests for consensus compatibility and are designed to be implementation and test framework agnostic and easy to pull a release into whatever repo needs to run them.

We need ways of generalizing test assertions, and of decoupling the node setup from the test itself.
- `op-e2e` system tests are quite monolithic. Adding/removing nodes is high friction.
Some tests run services that do not influence the test.

Comment:

Does that translate to those services not handling anything, or them running and processing stuff, but that stuff being ultimately irrelevant?
I guess I'm wondering what kind of analysis/monitoring would ideally detect those superfluous bits.

Reply (Contributor Author):

E.g. a test that checks the batch-submitter is functional might currently run a full suite of services that includes the op-proposer, which does not affect the batch-submitting work itself.

Reply:

ok, so a strategy could be:

- identify which components the test interacts with / has assertions for
- compute the smallest runnable/viable system that spans those components
- compare with the system actually being deployed, and expose the delta

Does that sound right?

- `op-chain-ops` and `superchain-ops` are not automated against scheduled CI runs.
Upgrade checks should run more regularly than just one-off in devnet/testnet.
- Limited resource sharing: a lot of the feature-tests spawn a copy of the system,
rather than being able to run against an already running system.

Comment (Contributor):

+1 on this - re-using the L1 (for example) would help with a lot of the flakes we're seeing now.

Comment on lines +99 to +100:

I guess that's a natural way to side-step the problem of guaranteeing isolation at the test level. Are there natural boundaries that would make resources-sharing more viable than in the general case?
Off the top of my head I'm imagining things like dynamically created sandboxes that we can monitor for any unexpected "external" operations. But again, I have no idea what I'm talking about :)

Reply (Contributor Author):

> Are there natural boundaries that would make resources-sharing more viable than in the general case?

Using unique keys/addresses for user-accounts that are part of a test would help. Otherwise we might see conflicting transactions (same nonce value, only one can be confirmed).

If a system key is involved, then we have to synchronize that part of the test globally, otherwise you get tx conflicts.

And tx-throughput may mess with the basefee, which can have adverse effects between tests.

If a test can declare what is needed, then I think some test-orchestrator can synchronize / schedule as necessary. We should be careful not to make the resource-sharing too complicated though, or else we may run into packing problems and such, or new kinds of flakes (e.g. where the test outcome depends on running another test in parallel or not).

Reply:

Right, so if we're able to prove that the test pre-requisites are contained in non-overlapping sandboxes then we should be able to reuse a test target if one is available. And otherwise we'd fallback to testing in isolation.
There's probably a need for some live-monitoring of some fundamentally shared constructs, like the basefee you mentioned, in order to validate that otherwise independent tests don't see their assumptions invalidated.


In general, the main idea here is that we need a
way to express tests with system requirements, decoupled from the system setup:
then we can unify upgrade-checks and tests, and run them against shared chain resources,
so as not to overwhelm CI resources.

Comment on lines +102 to +105:

Generally speaking I'd probably phrase this as having the ability to run a test against an arbitrary system. Shouldn't matter if it's a long-running one, a fresh stack, a developer-provided environment or whatever. Decoupling entirely the test logic from the system it applies to seems like a good idea in general

Reply (Contributor Author):

+1. Although some tests might depend on more rare / custom chain parameters, e.g. to shorten the sequencing window, to not wait the standard 12 hours for it to kick in. We need to find some way to bring those tests + the right setup together, without overcomplicating the test-setup (making it too dynamic again becomes a kind of packing problem, where some tests can share some resources, but not always).

Reply:

Right, I should have mentioned that I always assume that tests should be able to say "I can't run under those conditions", based on some specification of pre-conditions. Those pre-conditions should definitely cover things like these dependencies.
If a test cannot run with what it's provided, then it should say so and bail (well, in reality forcing it to run anyway is a nice way to detect potential over-specification of the pre-conditions, but that's not a core concern)



### Growth of history

With more chain history, there is more syncing to do, and more regression tests to maintain.

Comment:

That seems concerning in the abstract, given that history is presumably ever-growing. Is there a possible world where we can essentially ignore a chain history prefix to cap the size the history "window" we'd have to consider?

Reply (Contributor Author):

There's probably a project in there, outside of current scope, to deprecate legacy state-transition logic.
E.g. Holocene simplified the chain-derivation logic a lot, but we're still maintaining the pre-Holocene logic.

After some point for very old blocks we could say that we hardcode what is canonical, and omit the logic to derive them from raw inputs, to be able to drop the old code / complexity.

And this is related to the archive-node problem, and a bunch of ideas in L1 (things like EIP-4444, and the Era-file format). If we can freeze the historical data in some way, and drop the corresponding code, then we can get rid of legacy sync edge-cases.


Today:
- Sync tests (internal nodes scheduled to perform a resync).
- Anecdotal syncs (internal/external sync feedback).
- Scheduled Fault-Proof test-runs against testnet.

Challenges:
- Sync tests lack ownership, are difficult to set up, and their results do not feed into follow-up work well.
- Anecdotal syncs lack context: when something stalls or errors it is difficult to determine why.
- Feature changes that affect sync do not feed into sync-testruns easily.

Some more visible automation, and a log of results
(perhaps with Grafana dashboard links, filtered to the affected node), would be very useful here.
Perhaps a Discord bot that owns these sync test-runs and posts the progress/results could solve these challenges.
There has been some success with posting the fault-proof test-runs to Slack, but starting special runs is still too difficult.

### Growth of clients

With more client implementations, there are more (duplicate) spec conformance checks to run.

Today:
- `op-e2e` action tests [exported for Kona](https://github.com/ethereum-optimism/optimism/blob/develop/op-e2e/actions/proofs/helpers/kona.go)
- `op-e2e` external-geth shim (removed in [12216](https://github.com/ethereum-optimism/optimism/pull/12216))
- [optimism kurtosis](https://github.com/ethpandaops/optimism-package/) to spin up clients in a network

Challenges:
- The Go tests generally don't export to alternative implementations like Rust

Comment:

I'm not sure I understand why the test language matters that much. Are we testing the clients at a level that sits below the user interface?

Reply (Contributor Author):

A lot of the testing is below the regular user-interface, generally asserting more technical chain state-changes.

E.g. accurate progression of chain safety, or handling of malformed batch data. Things most users, with a transaction-RPC, are never bothered by directly. But still important for the stability and safety of the chain.

We need some way to express these more technical properties, how we assert them etc., without a lot of boilerplate where we repeat things like creating an RPC client, bindings for specific RPC methods, handling of retries, inspecting of error codes, etc. Helper functions, or even a more complete DSL, can do a lot here.

The language choice can also still be important: writing contract-interaction tests in Go is a lot more painful than in solidity, because every interaction in Go would need things like special contract bindings, extra transaction construction and confirmation handling. Things you don't even have to think about when writing it like a solidity test.

Reply (Contributor):

+1 to a good testing library/DSL. That would be my top pick for what's missing to be honest (competing with missing a dedicated cross-client reference test suite for consensus code). I think a good DSL could dramatically simplify go->contract interactions as well. Solidity will always work best for particularly complex things but for just calling contracts we should be able to make it easy.

- Kurtosis is an "infra" solution to what is also a "software" problem:
after spinning up the network, we still need to run tests against it.

This might get more approachable with more separation between node setup and testing,
as proposed in the feature-growth section.
Ideally we then run common tests against different combinations of the client implementations.

### Growth of activity

With more vertical scaling comes more concern around performance and stability.
We need more benchmarks and uptime checks to understand where we can improve.

Today:
- [op-ufm (user facing monitoring)](https://github.com/ethereum-optimism/infra/tree/main/op-ufm)
- [`replayor`](https://github.com/danyalprout/replayor)
- Internal infra dashboards / alerts
- One-off chain activity analysis

Challenges:
- `op-ufm` should run against more chains, and be more visible to engineers.
E.g. if transaction inclusion speed is not good, we need to do something about it.
- `replayor` needs automation, ideally on top of a shadow-fork of a real network,
such that we can analyze performance without spending real ETH or harming a production network in any way.
- We need to monitor node performance better.
`pprof` / Go-resource dashboards should improve and be more central to our work.

Automating flight-recording in `op-service` would be great.
See https://go.dev/blog/execution-traces-2024 for information about flight-recording.
E.g. on RPC call or on particular pre-programmed conditions,
a buffer of performance data from the last N seconds can be dumped and uploaded to some service.
When slow blocks occur, or engineers are interested in a performance snapshot of a real network, this could be great.
Also see [op-geth flight-recording design-doc (internal)](https://github.com/ethereum-optimism/design-docs-private/blob/main/op-geth-flight-recording.md).
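
As a rough sketch of what this could look like (assuming the `golang.org/x/exp/trace` flight-recorder API described in the blog post; the endpoint, trigger condition, and output destination are placeholders):

```go
package main

import (
	"bytes"
	"log"
	"net/http"
	"os"
	"sync"
	"time"

	"golang.org/x/exp/trace"
)

func main() {
	// Keep a rolling in-memory window of recent execution-trace data.
	fr := trace.NewFlightRecorder()
	fr.Start()

	var once sync.Once
	http.HandleFunc("/block", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		handleBlock(w, r) // placeholder for the real work, e.g. block building

		// Pre-programmed condition: the request took suspiciously long.
		if time.Since(start) > 300*time.Millisecond {
			once.Do(func() {
				var buf bytes.Buffer
				if _, err := fr.WriteTo(&buf); err != nil {
					log.Printf("flight-record snapshot failed: %v", err)
					return
				}
				// op-service could instead upload this to some shared store.
				if err := os.WriteFile("trace.out", buf.Bytes(), 0o644); err != nil {
					log.Printf("writing snapshot failed: %v", err)
				}
			})
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}

func handleBlock(w http.ResponseWriter, r *http.Request) {
	// Simulated work; a real service would handle the RPC call here.
	time.Sleep(10 * time.Millisecond)
	w.WriteHeader(http.StatusOK)
}
```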

There is also https://pyroscope.io/ for continuous profiling,
although it is built on top of older Golang profiling hooks, and seems to provide less fine-grained information.

### Growth as platform

The more the stack becomes a "platform" that others fork and build new things on top of,
the more thought we should give to the public interface and extensibility of our testing.

Today: no testing platform.

Challenges:
External OP Stack forks end up having to fork the testing infra as well, and wire in their customizations.
A lot of tests may work by default, but some features may not.
E.g. 4844 blob tests may need to be disabled for an alt-DA network test suite.

There may be ways we can better categorize tests, make test setup more composable,
and improve parametrization of contract versioning and chain configurations.

Comment on lines +186 to +187:

Am I reading correctly that you're proposing we make sure hard-dependencies on some stack (ours in this case) don't creep into the test logic?
If so I'm wondering if we can enumerate the objects that potentially create such a hard-dependency, and detect their usage-patterns in the tests to ensure they're correct

Reply (Contributor Author):

Yes, ideally we don't opinionate our own test setup / framework so much that it becomes too hard for forks of the OP-Stack to build on top of the testing.

Alt-DA variants of the OP-Stack and Alt-proof-system versions are probably the most common things those forks want to be able to test, without refactoring the core test system. A more composable test setup can go a long way here I think.

This might not be an immediate priority, but we should gather some feedback on how different OP Stack forks approach testing of their modifications, and if there are things we can do to simplify that testing.


### Interoperability

Interoperability is a net-new domain, and requires some deeper work than adjustments to handle growth.
Specifically, this requires cross-chain testing.
This testing adds multi-L2 deployments, multi-L2 test setup, and multi-L2 tests to the testing scope.

Today:

Comment (Contributor):

I'd love to find a way to reduce the amount of duplication here so that we can use a shared platform.

- interop-devnet [docker-compose](https://github.com/ethereum-optimism/optimism/tree/develop/interop-devnet)
- interop op-e2e system variant: [SuperSystem](https://github.com/ethereum-optimism/optimism/blob/develop/op-e2e/interop/supersystem.go) tests
- interop op-e2e action tests variant: [InteropSetup](https://github.com/ethereum-optimism/optimism/blob/develop/op-e2e/actions/interop/interop.go) tests

Challenges:
- Setting up multiple L2s that all attach to the same L1,
and managing unique per-chain keys, deployments, configs, resources can be difficult.
[`op-deployer`](https://github.com/ethereum-optimism/optimism/tree/develop/op-deployer),
[`interopgen`](https://github.com/ethereum-optimism/optimism/tree/develop/op-chain-ops/interopgen)
are a good start, but things can be improved,
ideally by unifying more of the setup between Kurtosis and op-e2e tests, so there is less setup code to maintain.
- Existing `op-e2e` single-chain and interop variants are diverging.
We need to unify the test framework more, so that there is no inconsistency between the two kinds of testing.
- Understanding the system is already difficult in a single-chain test; with interop it becomes even more challenging:
we have 5+ services running per chain, and then N chains, and we need to understand the interactions between them.

We need to make it more seamless to interact with multiple chains from within the same test.
Part of this is the setup challenge,
part of it is how we express these multi-L2 things without making things incredibly verbose.
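
To make the verbosity concern concrete, here is a hedged sketch of what a multi-L2 test could ideally read like; every type and helper below is hypothetical, invented for illustration only:

```go
package interop

import (
	"testing"
	"time"
)

// The types below sketch a hypothetical multi-L2 test DSL; none of them exist today.

// Message identifies a cross-chain message emitted on one chain.
type Message struct{ Nonce uint64 }

// User is a funded test account on a specific chain.
type User interface {
	// SendMessage emits a cross-chain message destined for chain dst.
	SendMessage(t *testing.T, dst uint64, payload []byte) Message
}

// Chain is a handle to one L2 in the multi-chain system.
type Chain interface {
	ChainID() uint64
	NewFundedUser(t *testing.T) User
	// AwaitRelay waits until msg is relayed / executable on this chain.
	AwaitRelay(t *testing.T, msg Message, timeout time.Duration)
}

// MultiChainSystem is n L2s sharing one L1, however they were deployed.
type MultiChainSystem interface {
	Chain(i int) Chain
}

// NewMultiChainSystem would stand up (or attach to) the requested topology.
func NewMultiChainSystem(t *testing.T, numL2s int) MultiChainSystem {
	t.Helper()
	t.Skip("illustration only: no backend wired up")
	return nil
}

// The test body is what we want day-to-day multi-L2 tests to look like:
// short, declarative, and free of per-chain setup boilerplate.
func TestCrossChainMessage(t *testing.T) {
	sys := NewMultiChainSystem(t, 2)
	a, b := sys.Chain(0), sys.Chain(1)

	alice := a.NewFundedUser(t)
	msg := alice.SendMessage(t, b.ChainID(), []byte("hello"))
	b.AwaitRelay(t, msg, 2*time.Minute)
}
```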

# Proposed Solution

<!-- A high level overview of the proposed solution.
When there are multiple alternatives there should be an explanation
of why one solution was picked over other solutions.
As a rule of thumb, including code snippets (except for defining an external API)
is likely too low level. -->

## Towards a solution

Based on the challenges outlined above, I believe we need:
- Automation to spin up test chains
- Flexibility with node types
- Separation of setup and test
- More expressive language to define tests in
- Improved insights

### Automation to spin up test chains

The automation story is largely a platforms/infra project,
with `op-deployer` and Kurtosis as the main solutions to set up configs and nodes quickly.

### Flexibility with node types

Flexibility requires some standardization:
each execution-engine talks the same RPC,
but rollup-nodes, peripherals, alternative proof systems, etc. still have unique API interfaces.

Defining some interfaces (especially RPC namespaces) as "standard" can go a long way for portable testing.
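
For illustration, a hedged sketch (the interfaces and method sets below are hypothetical, not an existing API) of what a "standard" rollup-node surface could look like as a Go interface that portable tests program against:

```go
package testapi

import "context"

// SyncStatus is a minimal, hypothetical view of a rollup node's sync state.
type SyncStatus struct {
	UnsafeL2    uint64 // latest unsafe L2 block number
	SafeL2      uint64 // latest safe L2 block number
	FinalizedL2 uint64 // latest finalized L2 block number
}

// RollupNode is a hypothetical "standard" surface that portable tests would
// program against, regardless of which client implementation backs it.
// Each implementation would be wired up through its own RPC client.
type RollupNode interface {
	SyncStatus(ctx context.Context) (SyncStatus, error)
	StartSequencer(ctx context.Context) error
	StopSequencer(ctx context.Context) error
}

// ExecutionEngine only needs the standard Ethereum JSON-RPC surface,
// so any EL client already satisfies it.
type ExecutionEngine interface {
	ChainID(ctx context.Context) (uint64, error)
	BlockNumber(ctx context.Context) (uint64, error)
}
```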

### Separation of setup and test

One pattern we are trying to work towards in `op-e2e` is to hide
the test-setup and node-access with Go interfaces as much as possible.
The more it is separated by interface, the more flexible the implementation of the setup is,
and the more reusable the test.

One example of this is the Go testing that also ran against the local docker-compose devnet,
which was unfortunately removed in [PR 12216](https://github.com/ethereum-optimism/optimism/pull/12216).
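
A minimal sketch of the direction, with all names hypothetical: the test only depends on an interface describing a running system, and each backend (in-process op-e2e, Kurtosis, an external devnet) provides its own implementation behind a single constructor:

```go
package e2e

import (
	"context"
	"testing"
	"time"
)

// L2Chain is a minimal, hypothetical handle to a running L2; a real interface
// would also expose the rollup node, batcher, proposer, etc.
type L2Chain interface {
	BlockNumber(ctx context.Context) (uint64, error)
}

// System is a hypothetical handle to an already-running stack, whether it was
// spawned in-process, via Kurtosis, or attached to as an external devnet.
type System interface {
	L2(chainID uint64) L2Chain
}

// NewSystemFromEnv would choose the backend from the environment: in-process
// by default, or an external deployment if configured. This is the only place
// where backend-specific setup code would live.
func NewSystemFromEnv(t *testing.T) System {
	t.Helper()
	t.Skip("illustration only: no backend wired up")
	return nil
}

// The test itself is backend-agnostic: it only states what it needs and asserts.
func TestChainProgresses(t *testing.T) {
	sys := NewSystemFromEnv(t)
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	chain := sys.L2(901) // 901: placeholder devnet chain ID
	start, err := chain.BlockNumber(ctx)
	if err != nil {
		t.Fatal(err)
	}
	for {
		n, err := chain.BlockNumber(ctx)
		if err != nil {
			t.Fatal(err)
		}
		if n > start {
			return // the chain advanced; we never cared how the nodes were deployed
		}
		select {
		case <-ctx.Done():
			t.Fatal("L2 chain did not progress in time")
		case <-time.After(2 * time.Second):
		}
	}
}
```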

### More expressive language to define tests in

A test should be simple and easy for anyone to maintain.

At the same time, a test should be able to navigate into an edge-case,
to cover the more subtle state-transition problems.

A lot of the time, a test is complicated by its setup,
and by the lack of common idioms to access the nodes and state that assertions are made over.

Besides separating from the setup, we need to review how we express the tests themselves.
Part of this is the language itself, part of it is the DSL (domain specific language) we build on top.

#### Golang

Golang can be great:
- A lot of the existing tests and test-utilities can be ported over quickly.
- There is a Kurtosis Go SDK we can integrate with.
- Golang is the simplest common denominator for test-writing (assuming no introduction of Python just for testing).

While there are many existing test utilities and abstractions, the Go test stack can still improve.
Notably it can be challenging to fork a test into two independent tests, or parametrize in general.

With these existing test utils, some of it already looks like a DSL.
E.g. the action-tests have a specific way of making one thing happen at a time, in a bigger orchestration of actors.

This DSL-like functionality needs to improve in the system-tests,
where awaiting events and more asynchronous work can still feel verbose and fragile.
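
For example (hypothetical helper, not existing code), much of that verbosity is hand-rolled polling; a single generic await primitive already removes a lot of it:

```go
package e2eutil

import (
	"context"
	"fmt"
	"time"
)

// Await polls check until it reports true, an error occurs, or the timeout
// expires. Hypothetical helper: most system-test assertions are "eventually X"
// and should not each hand-roll their own polling loop and timeout plumbing.
func Await(ctx context.Context, timeout, interval time.Duration, desc string, check func(ctx context.Context) (bool, error)) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	for {
		ok, err := check(ctx)
		if err != nil {
			return fmt.Errorf("awaiting %s: %w", desc, err)
		}
		if ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out awaiting %s: %w", desc, ctx.Err())
		case <-time.After(interval):
		}
	}
}
```

A system test would then read like `Await(ctx, 2*time.Minute, 2*time.Second, "safe head to advance", checkSafeHead)` rather than repeating the loop, timeout, and error-wrapping each time.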

#### Rust

Comment (Contributor):

Strongly in favor of writing tests in Go. I think adopting Rust here will make things much slower, and force us to rewrite all of our existing tooling.


There may be an argument for testing in Rust.
But with all existing infra tools written in Go,
this may be more challenging for the current platforms team to maintain.

One option might be to automate Kurtosis,
and then define a Rust test-suite runner with the [Rust kurtosis SDK](https://crates.io/crates/kurtosis-sdk)
to interface with the Kurtosis deployment,
for those test-cases where we prefer to write tests in Rust instead of Go or Solidity.

There is no DSL yet; this would be quite new ground for a lot of the non-Rust engineering.

#### Solidity

However, both Go and Rust fall short on contract interactions: creating bindings for every interaction can be tiresome.
And Solidity macros in Rust might still be too much context-switching for a test writer.

A spike to explore testing in a Solidity-first environment could be worth doing.
The [`whatif.sol`](https://gist.github.com/protolambda/1c149b54ec54b57610eca6661f687170) gist is an early draft of what this could look like.

Solidity test scripts, expressing invariants and such,
can also be a great common language between the OP Stack specs, and the tests.
Similar to Python in the [Ethereum L1 Consensus specs](https://github.com/ethereum/consensus-specs/blob/dev/specs/phase0/beacon-chain.md).
Defining invariants in solidity, as part of the spec, which we then pull into tests,
could make protocol development more test-driven.

For DSL, we do have Forge patterns:
switching to a fork, and announcing the next call as a broadcast,
are common patterns in foundry-tests that could work well, and might only need minimal changes.
Custom DSL / cheatcodes can be supported if we run the tests in the Go script environment,
where we can plug in our own cheatcodes.
These cheatcodes can potentially just be a thin wrapper around the Go test framework.
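
As a very rough sketch of the "thin wrapper" idea (entirely hypothetical: the selectors, argument encoding, and dispatch mechanism are invented for illustration and not tied to any existing cheatcode implementation):

```go
package soltest

import (
	"encoding/binary"
	"fmt"
)

// Cheatcodes is the Go-side surface a Solidity test script would reach through
// a reserved cheatcode address. In practice these functions would wrap the
// existing Go test framework (spawn nodes, advance time, read rollup state, ...).
type Cheatcodes struct {
	Warp      func(timestamp uint64) error
	StartNode func(name string) error
}

// HandleCall dispatches a call made to the reserved cheatcode address.
// The selectors and argument encoding are simplified placeholders; a real
// implementation would ABI-decode the calldata.
func (c *Cheatcodes) HandleCall(selector uint32, args []byte) ([]byte, error) {
	switch selector {
	case 0x01: // warp(uint64) -- hypothetical selector
		if len(args) < 8 {
			return nil, fmt.Errorf("warp: short args")
		}
		return nil, c.Warp(binary.BigEndian.Uint64(args[:8]))
	case 0x02: // startNode(string) -- hypothetical selector
		return nil, c.StartNode(string(args))
	default:
		return nil, fmt.Errorf("unknown cheatcode selector %#x", selector)
	}
}
```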

### Improved insights

After everything is set up and tests are running, we still need to improve our insights.
How do things fail? How does a live network perform after triggering a non-fatal edge-case?

Generally this means improving monitoring, instrumentation, etc.

Some ideas to improve:
- Light-weight RPC proxies everywhere. We can capture the JSON-RPC exchanges between all the services,
flag weird timing, and review RPC logs after problems.
Much like logging, this lets us untangle the communication between 20+ services
by tracing and labeling what happens. (A minimal proxy sketch follows after this list.)
- Make flight-recording a standard practice for every service.
Being able to look at what a node is/was doing is very helpful.
For tests that take more resources to re-run, having more data after test-failure is important.
- Making op-node stream its events, so we can assert things based on what is happening inside of the node,
similar to [assertoor](https://github.com/ethpandaops/assertoor/) in L1.
- For more long-running tests, post progress and aggregate results in a channel on the R&D discord.
And maybe allow channel participants to interact with the test; e.g. node restarts, API queries, pause the testing, restart the test, etc.
- Differential tests: the better we can diff results against other results
(previous runs, or those of alternative clients), the more confidence we can get in compatibility between different implementations.
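
Below is a minimal sketch of the RPC-proxy idea from the list above, assuming plain HTTP JSON-RPC; the ports, latency threshold, and logging format are placeholders:

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

// rpcLogger wraps a reverse proxy and logs every JSON-RPC method with its
// latency, so cross-service traffic can be traced and slow calls flagged.
func rpcLogger(upstream *url.URL) http.Handler {
	proxy := httputil.NewSingleHostReverseProxy(upstream)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		r.Body = io.NopCloser(bytes.NewReader(body))

		var req struct {
			Method string `json:"method"`
		}
		_ = json.Unmarshal(body, &req) // batch requests would need more handling

		start := time.Now()
		proxy.ServeHTTP(w, r)
		elapsed := time.Since(start)

		// Flag weird timing; thresholds/labels would come from config in practice.
		if elapsed > 500*time.Millisecond {
			log.Printf("SLOW rpc=%s took=%s upstream=%s", req.Method, elapsed, upstream.Host)
		} else {
			log.Printf("rpc=%s took=%s", req.Method, elapsed)
		}
	})
}

func main() {
	upstream, err := url.Parse("http://localhost:8545") // placeholder node endpoint
	if err != nil {
		log.Fatal(err)
	}
	log.Fatal(http.ListenAndServe(":8552", rpcLogger(upstream)))
}
```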

## Concrete solution

*This document is a work in progress; the above ideas are still being iterated on.*

## Resource Usage

<!-- What is the resource usage of the proposed solution?
Does it consume a large amount of computational resources or time? -->

# Alternatives Considered

<!-- List out a short summary of each possible solution that was considered.
Comparing the effort of each solution -->

# Risks & Uncertainties

<!-- An overview of what could go wrong.
Also any open questions that need more work to resolve. -->