protocol/platforms: test infra draft #165

Draft · wants to merge 1 commit into base: main
Conversation

protolambda (Contributor):

Description

Based on protocol-team discussion at the onsite, and some older notes, this design-doc puts together the context of where we are with testing, what the challenges are, and where things can improve.

I'd like this to be a cross-team design-doc, where platforms / protocol chime in. That way we can refine the problem-context, iterate on the proposed solution more, and prioritize some more concrete changes.

Comment on lines +60 to +61
With a greater number of chains, we will need more test deployments,
and more monitoring.
Reviewer:

How generic/specific are the tests and monitoring here?
In the abstract I'm imagining that all the chains share some common fundamental properties that need to be continuously validated, but maybe they also have individual properties that need ad-hoc testing?

So for the former I could imagine that we need to invest in making those validation devices as generic as possible, while the latter would presumably require something different (higher-level interpretation of the chain configuration to derive the appropriate validators?)
(disclaimer, I barely know what I'm talking about here, so I might be way off :))

protolambda (author):

It's validating the same thing for every chain. Per chain there might be some minor differences, like different bridge contract addresses, but those are easy to replicate.

For this type of continuous testing of more chains I am mostly worried about reducing cost (hosting more nodes + minimizing baby-sitting costs of those nodes).

It might get more chain-specific if we expand into the user-facing monitoring side, e.g. checking uptime / agreement of known public services per chain, such as automated checks that all major RPC providers are on the same chain.
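To make that concrete, here is a minimal sketch of such a cross-provider agreement check (only the JSON-RPC method is standard; the endpoints and height are placeholders, and this is not existing monitorism functionality):

```go
// Hypothetical cross-provider agreement check: query the same block height on
// several public RPC endpoints for one chain and verify they report the same
// block hash.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// blockHash fetches the hash of the block at the given hex height via
// eth_getBlockByNumber.
func blockHash(endpoint, hexHeight string) (string, error) {
	body, _ := json.Marshal(map[string]any{
		"jsonrpc": "2.0", "id": 1,
		"method": "eth_getBlockByNumber",
		"params": []any{hexHeight, false},
	})
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var out struct {
		Result struct {
			Hash string `json:"hash"`
		} `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Result.Hash, nil
}

func main() {
	// Placeholder endpoints: real monitoring would load the per-chain provider list.
	providers := []string{"https://rpc-a.example.com", "https://rpc-b.example.com"}
	const height = "0x10" // pick a height safely below every provider's head

	var reference string
	for _, p := range providers {
		h, err := blockHash(p, height)
		if err != nil {
			fmt.Printf("provider %s: error: %v\n", p, err) // uptime alert
			continue
		}
		if reference == "" {
			reference = h
		} else if h != reference {
			fmt.Printf("disagreement at %s: %s reports %s, others report %s\n", height, p, h, reference)
		}
	}
}
```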

- [monitorism](https://github.com/ethereum-optimism/monitorism)

Challenges:
- Test nodes can be unstable: lack of peers can result in missed blocks.
Reviewer:

Could you zoom in on this? At some level it reads a bit like a failure mode of the chain itself is affecting its validation, which I guess is somewhat OK? Some tests could always be inconclusive if their pre-requisites are not met (and then having inconclusive tests is itself a validation failure, regardless of what they were trying to do).

protolambda (author):

Especially for smaller chains, it can be difficult at times to find peers and stay connected to them. The chain itself is healthy (sequencer online, working batch submitter), but the happy-path of block-distribution via P2P can be interrupted.

Those types of happy-path interruptions would be nice to track, and maybe we can improve the robustness of the stack, e.g. by using additional alternative peer-discovery systems, or changing gossip parameters.

I would say it's less of a priority than core spec-conformance testing, but still an area we can improve.
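As a rough sketch of tracking those interruptions, a probe like the one below could tell "head stalled while peerless" (a P2P/discovery issue) apart from the chain itself being unhealthy. Only the JSON-RPC methods are standard; the endpoint and poll interval are placeholders:

```go
// Hypothetical happy-path probe for a test node: track peer count and head
// progression together.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// rpcUint calls a zero-argument JSON-RPC method that returns a hex quantity.
func rpcUint(endpoint, method string) (uint64, error) {
	body, _ := json.Marshal(map[string]any{"jsonrpc": "2.0", "id": 1, "method": method, "params": []any{}})
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var out struct {
		Result string `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimPrefix(out.Result, "0x"), 16, 64)
}

func main() {
	const node = "http://localhost:8545" // placeholder test-node RPC endpoint
	var lastHead uint64
	for range time.Tick(30 * time.Second) { // placeholder poll interval
		peers, _ := rpcUint(node, "net_peerCount")
		head, _ := rpcUint(node, "eth_blockNumber")
		if head == lastHead && peers == 0 {
			fmt.Println("head stalled while peerless: likely P2P/discovery, not the sequencer")
		}
		lastHead = head
	}
}
```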


Challenges:
- System tests are prone to flakes: running N systems,
in a resource-constrained environment, with concurrent work / timers / etc., tends to miss assertions due to processing stalls.
Reviewer:

Is it a case of the assertions being stronger than what the system actually guarantees (typically: eventual correctness, encoded as a real-time constraint)?
Or is it more that the test environment is actually invalid with respect to the properties of the system running in it (typically: IOPS not compliant with requirements, therefore timeout)?

protolambda (author):

The former: the happy path is being interrupted by resource-constrained CI machines. But generally we should be able to assert that happy path, and not fall back to the worst case in every test.

E.g. if the chain started early, but blocks started being produced late, then the sequencer might not have time to include transactions, and the test runs but fails to confirm a tx within the expected amount of time. The bug here might first appear to be the tx-inclusion itself, but really it was the resource constraints that prevented the chain from even being ready to include a tx.
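One way to make that distinction explicit in a test helper: start the inclusion deadline only once the chain is demonstrably producing blocks. The `Chain` interface below is hypothetical, just to sketch the shape:

```go
// waitForInclusion sketch: a slow CI start should read as "inconclusive", not
// as a tx-inclusion bug.
package testutil

import (
	"context"
	"fmt"
	"time"
)

// Chain is a hypothetical minimal view of the L2 under test.
type Chain interface {
	HeadNumber(ctx context.Context) (uint64, error)
	HasTransaction(ctx context.Context, txHash [32]byte) (bool, error)
}

// WaitForInclusion first waits until the head advances (the chain is live),
// then applies the inclusion deadline from that point onward.
func WaitForInclusion(ctx context.Context, c Chain, txHash [32]byte, deadline time.Duration) error {
	start, err := c.HeadNumber(ctx)
	if err != nil {
		return err
	}
	// Phase 1: wait for liveness, with a generous bound for slow CI machines.
	liveCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
	defer cancel()
	for {
		n, err := c.HeadNumber(liveCtx)
		if err == nil && n > start {
			break
		}
		select {
		case <-liveCtx.Done():
			return fmt.Errorf("chain never started producing blocks: inconclusive, not a tx bug")
		case <-time.After(time.Second):
		}
	}
	// Phase 2: the actual assertion, timed from the moment the chain was live.
	inclCtx, cancel2 := context.WithTimeout(ctx, deadline)
	defer cancel2()
	for {
		ok, err := c.HasTransaction(inclCtx, txHash)
		if err == nil && ok {
			return nil
		}
		select {
		case <-inclCtx.Done():
			return fmt.Errorf("tx %x not included within %s of chain liveness", txHash, deadline)
		case <-time.After(time.Second):
		}
	}
}
```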

Reviewer:

I see. I mean, there's a bit of a philosophical consideration here, in the sense that asserting something that's not strictly speaking true is bound to cause false positives at scale.

I'm wondering if an ok middleground might look like:

  • specifying what the happy path actually demands (more or less "relevant platform metrics being in the green")
  • coupling test execution with platform monitoring, categorizing test "failures" as inconclusive when they happen to coincide with red monitoring
  • rescheduling inconclusive tests according to some policy
  • in parallel play with resources allocation, scheduling, ... to improve the odds of monitoring remaining green

The argument here would be that we then:

  • improve (presumably) our flaking situation
  • keep control of the requirements (therefore avoiding what I usually call "real-time inflation" where tests develop ties to "ideal" situations over time)
  • gain the ability to deliberately push the system into the red and see how it copes
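A minimal sketch of that classification step, assuming a hypothetical `monitoringGreen` metrics query (nothing here exists in op-e2e today):

```go
// Sketch of classifying a failed assertion as "inconclusive" when platform
// monitoring was red during the run.
package testutil

import "testing"

// monitoringGreen is a placeholder for "relevant platform metrics are in the
// green" (CPU headroom, block production cadence, etc.) over the test's window.
func monitoringGreen(t *testing.T) bool {
	// Hypothetical: query the metrics backend for the test's time window.
	return true
}

// RequireOrInconclusive fails the test only if monitoring was green; otherwise
// the failure is recorded as inconclusive and the test is skipped so a
// scheduler can re-run it later under better conditions.
func RequireOrInconclusive(t *testing.T, ok bool, msg string) {
	t.Helper()
	if ok {
		return
	}
	if monitoringGreen(t) {
		t.Fatalf("assertion failed under green monitoring: %s", msg)
	}
	t.Skipf("inconclusive (monitoring red during run): %s", msg)
}
```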

- `op-e2e` tests are not portable to alternative client implementations.
We need ways of generalizing test assertions, and decouple the nodes setup from the test itself.
- `op-e2e` system tests are quite monolithic. Adding/removing nodes is high friction.
Some tests run services that do not influence the test.
Reviewer:

Does that translate to those services not handling anything, or them running and processing stuff, but that stuff being ultimately irrelevant?
I guess I'm wondering what kind of analysis/monitoring would ideally detect those superfluous bits.

protolambda (author):

E.g. a test that checks the batch-submitter is functional might currently run a full suite of services, including the op-proposer, which does not affect the batch-submitting work itself.

Reviewer:

OK, so a strategy could be:

  • identify which components the test interacts with / has assertions for
  • compute the smallest runnable/viable system that spans those components
  • compare with the system actually being deployed, and expose the delta

Does that sound right?
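As a sketch of the "expose the delta" step, with made-up component names:

```go
// Each test declares the components it actually exercises; we compare that
// with the full system the current harness would deploy.
package testutil

// MinimalSystem returns the deployed services that the test never touches,
// i.e. candidates for removal from that test's setup.
func MinimalSystem(required, deployed []string) (superfluous []string) {
	needed := make(map[string]bool, len(required))
	for _, c := range required {
		needed[c] = true
	}
	for _, c := range deployed {
		if !needed[c] {
			superfluous = append(superfluous, c)
		}
	}
	return superfluous
}

// Example: a batcher test that only needs L1, the sequencer and the batcher.
//   MinimalSystem(
//       []string{"l1", "op-geth", "op-node", "op-batcher"},
//       []string{"l1", "op-geth", "op-node", "op-batcher", "op-proposer", "op-challenger"},
//   ) // -> ["op-proposer", "op-challenger"]
```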

Comment on lines +99 to +100
- Limited resource sharing: a lot of the feature-tests spawn a copy of the system,
rather than being able to run against an already running system.
Reviewer:

I guess that's a natural way to side-step the problem of guaranteeing isolation at the test level. Are there natural boundaries that would make resource-sharing more viable than in the general case?
Off the top of my head I'm imagining things like dynamically created sandboxes that we can monitor for any unexpected "external" operations. But again, I have no idea what I'm talking about :)

protolambda (author):

> Are there natural boundaries that would make resource-sharing more viable than in the general case?

Using unique keys/addresses for user-accounts that are part of a test would help. Otherwise we might see conflicting transactions (same nonce value, only one can be confirmed).

If a system key is involved, then we have to synchronize that part of the test globally, otherwise you get tx conflicts.

And tx-throughput may mess with the basefee, which can have adverse effects between tests.

If a test can declare what is needed, then I think some test-orchestrator can synchronize / schedule as necessary. We should be careful not to make the resource-sharing too complicated though, or else we may run into packing problems and such, or new kinds of flakes (e.g. where the test outcome depends on running another test in parallel or not).
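A rough sketch of what such a declaration + orchestrator could look like (all types here are hypothetical):

```go
// Tests declare what they need from a shared system, so an orchestrator can
// decide what may run in parallel and what must be serialized.
package testutil

import (
	"sort"
	"sync"
)

// Needs describes what a test touches on a shared system.
type Needs struct {
	FreshUserAccounts int      // unique keys per test: safe to share, no nonce conflicts
	SystemKeys        []string // e.g. "batcher", "proposer": must be held exclusively
	HeavyTxLoad       bool     // may disturb the basefee for concurrently running tests
}

// Orchestrator serializes access to exclusive resources; everything else can
// run in parallel against the shared system.
type Orchestrator struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func NewOrchestrator() *Orchestrator {
	return &Orchestrator{locks: map[string]*sync.Mutex{}}
}

// Acquire locks the test's exclusive resources (in a stable order, to avoid
// deadlocks between tests) and returns a release function.
func (o *Orchestrator) Acquire(n Needs) (release func()) {
	keys := append([]string(nil), n.SystemKeys...)
	sort.Strings(keys)
	var held []*sync.Mutex
	for _, key := range keys {
		o.mu.Lock()
		l, ok := o.locks[key]
		if !ok {
			l = &sync.Mutex{}
			o.locks[key] = l
		}
		o.mu.Unlock()
		l.Lock()
		held = append(held, l)
	}
	return func() {
		for _, l := range held {
			l.Unlock()
		}
	}
}
```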

Reviewer:

Right, so if we're able to prove that the test pre-requisites are contained in non-overlapping sandboxes, then we should be able to reuse a test target if one is available, and otherwise fall back to testing in isolation.
There's probably a need for some live monitoring of fundamentally shared constructs, like the basefee you mentioned, in order to validate that otherwise independent tests don't see their assumptions invalidated.

Comment on lines +102 to +105
In general the main idea here is that we need a
way to express tests with system requirements, decoupled from system setup:
we can unify upgrade-checks / tests, and run them against shared chain resources,
to not overwhelm CI resources.
Reviewer:

Generally speaking I'd probably phrase this as having the ability to run a test against an arbitrary system. It shouldn't matter if it's a long-running one, a fresh stack, a developer-provided environment, or whatever. Entirely decoupling the test logic from the system it applies to seems like a good idea in general.

protolambda (author):

+1. Although some tests might depend on rarer / custom chain parameters, e.g. a shortened sequencing window, to not have to wait the standard 12 hours for it to kick in. We need to find some way to bring those tests + the right setup together, without overcomplicating the test setup (making it too dynamic again becomes a kind of packing problem, where some tests can share some resources, but not always).

Reviewer:

Right, I should have mentioned that I always assume tests should be able to say "I can't run under those conditions", based on some specification of pre-conditions. Those pre-conditions should definitely cover dependencies like these.
If a test cannot run with what it's provided, then it should say so and bail (well, in reality forcing it to run anyway is a nice way to detect potential over-specification of the pre-conditions, but that's not a core concern).
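A small sketch of a test declaring such pre-conditions and bailing via `t.Skip` when they aren't met (the `SystemConfig` fields are illustrative):

```go
// A test states its pre-conditions and skips when the provided system does not
// meet them, instead of failing for reasons unrelated to what it asserts.
package testutil

import (
	"testing"
	"time"
)

// SystemConfig is a stand-in for whatever configuration the harness exposes.
type SystemConfig struct {
	SequencingWindow time.Duration
	SupportsAltDA    bool
}

// RequireConfig skips the test when a pre-condition does not hold.
func RequireConfig(t *testing.T, cfg SystemConfig, check func(SystemConfig) bool, why string) {
	t.Helper()
	if !check(cfg) {
		t.Skipf("pre-condition not met on this system: %s", why)
	}
}

// Usage in a sequencing-window test:
//   RequireConfig(t, cfg, func(c SystemConfig) bool {
//       return c.SequencingWindow <= 10*time.Minute
//   }, "needs a shortened sequencing window; waiting out the standard 12h is impractical")
```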


### Growth of history

With more chain history, there is more syncing to do, and more regression tests to maintain.
Reviewer:

That seems concerning in the abstract, given that history is presumably ever-growing. Is there a possible world where we can essentially ignore a chain-history prefix, to cap the size of the history "window" we'd have to consider?

protolambda (author):

There's probably a project in there, outside of current scope, to deprecate legacy state-transition logic.
E.g. Holocene simplified the chain-derivation logic a lot, but we're still maintaining the pre-Holocene logic.

At some point, for very old blocks, we could say that we hardcode what is canonical and omit the logic to derive them from raw inputs, to be able to drop the old code / complexity.

And this is related to the archive-node problem, and a bunch of ideas in L1 (things like EIP-4444, and the Era-file format). If we can freeze the historical data in some way, and drop the corresponding code, then we can get rid of legacy sync edge-cases.
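A sketch of what "hardcoding what is canonical" could look like, with placeholder numbers and hashes:

```go
// "Freezing" old history: below a cutoff, verify blocks against pinned
// canonical hashes instead of re-deriving them from raw inputs with legacy code.
package history

import "fmt"

const legacyCutoff = 10_000_000 // placeholder cutoff height

// pinned maps selected historical block numbers to their canonical hashes.
var pinned = map[uint64]string{
	5_000_000: "0x...", // placeholder checkpoint
}

// VerifyHistorical checks an old block against the pinned set instead of
// running deprecated derivation logic.
func VerifyHistorical(number uint64, hash string) error {
	if number >= legacyCutoff {
		return fmt.Errorf("block %d is past the cutoff: verify via live derivation", number)
	}
	want, ok := pinned[number]
	if !ok {
		return nil // not a checkpoint; covered by being an ancestor of a pinned block
	}
	if hash != want {
		return fmt.Errorf("block %d hash %s does not match pinned canonical hash %s", number, hash, want)
	}
	return nil
}
```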

- [optimism kurtosis](https://github.com/ethpandaops/optimism-package/) to spin up clients in a network

Challenges:
- The Go tests generally don't export to alternative implementations like Rust
Reviewer:

I'm not sure I understand why the test language matters that much. Are we testing the clients at a level that sits below the user interface?

protolambda (author):

A lot of the testing is below the regular user-interface, generally asserting more technical chain state-changes.

E.g. accurate progression of chain safety, or handling of malformed batch data. Things most users, with a transaction-RPC, are never bothered by directly. But still important for the stability and safety of the chain.

We need some way to express these more technical properties, how we assert them etc., without a lot of boilerplate where we repeat things like creating an RPC client, bindings for specific RPC methods, handling of retries, inspection of error codes, etc. Helper functions, or even a more complete DSL, can do a lot here.

The language choice can also still be important: writing contract-interaction tests in Go is a lot more painful than in Solidity, because every interaction in Go would need things like special contract bindings, extra transaction construction and confirmation handling. Things you don't even have to think about when writing it as a Solidity test.
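For illustration, the kind of helper such a DSL could provide, hiding retries and receipt handling behind one call (the interfaces are invented, not an existing op-e2e or go-ethereum API):

```go
// A DSL-style helper that wraps client setup, retries and receipt handling.
package dsl

import (
	"context"
	"fmt"
	"time"
)

// Receipt is a simplified transaction result.
type Receipt struct {
	Success bool
	TxHash  [32]byte
}

// Backend is whatever transport the harness provides (node RPC, in-process client, ...).
type Backend interface {
	Call(ctx context.Context, to [20]byte, calldata []byte) ([]byte, error)
	Send(ctx context.Context, to [20]byte, calldata []byte) (Receipt, error)
}

// Contract lets a test write to a contract in one line, with retries baked in.
type Contract struct {
	Backend Backend
	Addr    [20]byte
}

// Write sends calldata, retries transient errors, and checks the receipt.
func (c Contract) Write(ctx context.Context, calldata []byte) (Receipt, error) {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ { // simple bounded retry policy
		rcpt, err := c.Backend.Send(ctx, c.Addr, calldata)
		if err == nil {
			if !rcpt.Success {
				return rcpt, fmt.Errorf("tx %x reverted", rcpt.TxHash)
			}
			return rcpt, nil
		}
		lastErr = err
		time.Sleep(time.Second)
	}
	return Receipt{}, fmt.Errorf("write failed after retries: %w", lastErr)
}
```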

Contributor:

+1 to a good testing library/DSL. That would be my top pick for what's missing, to be honest (competing with the missing dedicated cross-client reference test suite for consensus code). I think a good DSL could dramatically simplify Go -> contract interactions as well. Solidity will always work best for particularly complex things, but for just calling contracts we should be able to make it easy.

Comment on lines +186 to +187
There may be ways we can better categorize tests, make test setup more composable,
and improve parametrization of contract versioning and chain configurations.
Reviewer:

Am I reading correctly that you're proposing we make sure hard dependencies on some stack (ours in this case) don't creep into the test logic?
If so, I'm wondering if we can enumerate the objects that potentially create such a hard dependency, and detect their usage patterns in the tests to ensure they're correct.

protolambda (author):

Yes, ideally we don't make our own test setup / framework so opinionated that it becomes too hard for forks of the OP-Stack to build on top of the testing.

Alt-DA variants of the OP-Stack and Alt-proof-system versions are probably the most common things those forks want to be able to test, without refactoring the core test system. A more composable test setup can go a long way here I think.

This might not be an immediate priority, but we should gather some feedback on how different OP Stack forks approach testing of their modifications, and if there are things we can do to simplify that testing.
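As a sketch, the fork-specific concerns could sit behind interfaces in the setup, so forks swap implementations rather than refactor the harness (all names illustrative):

```go
// Fork-specific concerns (Alt-DA, alternative proof systems) kept behind
// interfaces in the test setup.
package setup

import "context"

// DataAvailability abstracts where batch data is posted to and read back from.
type DataAvailability interface {
	Post(ctx context.Context, frame []byte) (ref []byte, err error)
	Fetch(ctx context.Context, ref []byte) ([]byte, error)
}

// ProofSystem abstracts how output claims get challenged/resolved.
type ProofSystem interface {
	Dispute(ctx context.Context, claim []byte) error
}

// SystemOpts is what a fork would tweak without touching the test logic.
type SystemOpts struct {
	DA    DataAvailability // default: calldata/blobs; forks plug in their Alt-DA client
	Proof ProofSystem      // default: fault proofs; forks plug in other proof systems
}
```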

Some tests run services that do not influence the test.
- `op-chain-ops` and `superchain-ops` are not automated against scheduled CI runs.
Upgrade checks should run more regularly than just one-off in devnet/testnet.
- Limited resource sharing: a lot of the feature-tests spawn a copy of the system,
Contributor:

+1 on this - re-using the L1 (for example) would help with a lot of the flakes we're seeing now.

Challenges:
- System tests are prone to flakes: running N systems,
in a resource-constrained environment, with concurrent work / timers / etc., tends to miss assertions due to processing stalls.
- `op-e2e` tests are not portable to alternative client implementations.
Contributor:

Big +1 on this. The tests themselves should be decoupled from the clients they run against, so we can build matrix-style tests against multiple clients.

Contributor:

But is op-e2e the right framework for that? The vast majority of op-e2e tests aren't testing the EL. Many are testing op-node derivation, many are testing op-proposer, op-batcher, op-challenger, dispute games, etc., none of which have alternate implementations.

We do need a multi-client test suite, but I'm not sure having to pull everything into op-e2e is the right answer. We don't want op-reth, op-nethermind, op-besu, op-whatever-I-just-created to be a dependency of the monorepo, for example. Asterisc pulls the monorepo in so it can run e2e tests in their repo, and it's not a particularly clean solution - it works, but we break it by making changes to op-e2e pretty regularly.

I'm a really big fan of being able to export a test suite like the L1 reference tests do. Individual clients will still have their own e2e tests in addition, for things that aren't consensus-critical, but the reference tests define the shared set of tests for consensus compatibility. They are designed to be implementation- and test-framework-agnostic, and easy to pull a release of into whatever repo needs to run them.
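A sketch of the matrix-style, client-agnostic direction (the `ClientUnderTest` interface and its methods are placeholders, not an existing API):

```go
// Matrix-style runner over a client-agnostic interface, in the spirit of
// exportable reference tests.
package reftests

import (
	"context"
	"testing"
)

// ClientUnderTest is the minimal surface a consensus reference test needs,
// independent of which implementation (Go, Rust, ...) sits behind it.
type ClientUnderTest interface {
	Name() string
	// ProcessPayload feeds one encoded test input (e.g. a batch or payload) to the client.
	ProcessPayload(ctx context.Context, input []byte) error
	// SafeHead returns the client's current safe head number, for assertions.
	SafeHead(ctx context.Context) (uint64, error)
}

// RunMatrix executes the same test body against every registered client.
func RunMatrix(t *testing.T, clients []ClientUnderTest, body func(t *testing.T, c ClientUnderTest)) {
	for _, c := range clients {
		c := c
		t.Run(c.Name(), func(t *testing.T) {
			body(t, c)
		})
	}
}
```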

Specifically, this requires cross-chain testing.
This testing adds multi-L2 deployments, multi-L2 test setup, and multi-L2 tests to the testing scope.

Today:
Contributor:

I'd love to find a way to reduce the amount of duplication here so that we can use a shared platform.

This DSL-like functionality needs to improve in the system-tests,
where awaiting events and more asynchronous work can still feel verbose and fragile.

#### Rust
Contributor:

Strongly in favor of writing tests in Go. I think adopting Rust here will make things much slower, and force us to rewrite all of our existing tooling.

@ajsutton (Contributor) left a comment:

I'd say pretty much everything in this is a great idea, but the scope is so big that it's really hard to reason about, and we're at risk of trying to define some all-encompassing solution, when in my experience you usually need multiple different, partially overlapping solutions to solve all of these problems.

So I think there's value in identifying the full range of problems and opportunities we see, so it's great to have this doc, but I'd then strongly suggest we identify what the one or maybe two highest-impact things are and build the simplest thing that would solve them, ideally by building on and adapting what we already have. While there's a risk that we build something that needs a lot of changes to solve the next problem we take on, the sooner we can ship something that solves our biggest pain point, the sooner we start getting value from it, and then, because it made our life easier, we have more capacity to do the next thing. Plus we'll have almost certainly learnt something that at least slightly changes what we'd want to build to solve the next problem anyway.
