protocol/platforms: test infra draft #165

Draft · wants to merge 1 commit into base: main
Conversation

protolambda (Contributor):

Description

Based on protocol-team discussion at the onsite, and some older notes, this design-doc puts together the context of where we are with testing, what the challenges are, and where things can improve.

I'd like this to be a cross-team design-doc, where platforms / protocol chime in. That way we can refine the problem-context, iterate on the proposed solution more, and prioritize some more concrete changes.

Comment on lines +60 to +61
With a greater number of chains, we will need more test deployments,
and more monitoring.
Reviewer:

How generic/specific are the tests and monitoring here?
In the abstract I'm imagining that all the chains share some common fundamental properties that need to be continuously validated, but maybe they also have individual properties that need ad-hoc testing?

So for the former I could imagine that we need to invest in making those validation devices as generic as possible, while the latter would presumably require something different (higher-level interpretation of the chain configuration to derive the appropriate validators?)
(disclaimer, I barely know what I'm talking about here, so I might be way off :))

protolambda (author):

It's validating the same thing for every chain. Per chain there might be some minor differences, like different bridge contract addresses, but those are easy to replicate.

For this type of continuous testing of more chains I am mostly worried about reducing cost (hosting more nodes + minimizing baby-sitting costs of those nodes).

It might get more chain-specific if we expand into the user-facing monitoring side, e.g. checking uptime / agreement of known public services per chain, such as automated checks that all major RPC providers are on the same chain.
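To make that concrete, here is a minimal sketch of such a cross-provider agreement check (only the JSON-RPC method is standard; the endpoints and height are placeholders, and this is not existing monitorism functionality):

```go
// Hypothetical cross-provider agreement check: query the same block height on
// several public RPC endpoints for one chain and verify they report the same
// block hash.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// blockHash fetches the hash of the block at the given hex height via
// eth_getBlockByNumber.
func blockHash(endpoint, hexHeight string) (string, error) {
	body, _ := json.Marshal(map[string]any{
		"jsonrpc": "2.0", "id": 1,
		"method": "eth_getBlockByNumber",
		"params": []any{hexHeight, false},
	})
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var out struct {
		Result struct {
			Hash string `json:"hash"`
		} `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return "", err
	}
	return out.Result.Hash, nil
}

func main() {
	// Placeholder endpoints: real monitoring would load the per-chain provider list.
	providers := []string{"https://rpc-a.example.com", "https://rpc-b.example.com"}
	const height = "0x10" // pick a height safely below every provider's head

	var reference string
	for _, p := range providers {
		h, err := blockHash(p, height)
		if err != nil {
			fmt.Printf("provider %s: error: %v\n", p, err) // uptime alert
			continue
		}
		if reference == "" {
			reference = h
		} else if h != reference {
			fmt.Printf("disagreement at %s: %s reports %s, others report %s\n", height, p, h, reference)
		}
	}
}
```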

- [monitorism](https://github.com/ethereum-optimism/monitorism)

Challenges:
- Test nodes can be unstable: lack of peers can result in missed blocks.
Reviewer:

Could you zoom in on this? At some level it reads a bit like a failure mode of the chain itself is affecting its validation, which I guess is somewhat OK? Some tests could always be inconclusive if their pre-requisites are not met (and then having inconclusive tests is itself a validation failure, regardless of what they were trying to do).

protolambda (author):

Especially for smaller chains, it can be difficult at times to find peers and stay connected to them. The chain itself is healthy (sequencer online, working batch submitter), but the happy-path of block-distribution via P2P can be interrupted.

Those types of happy-path interruptions would be nice to track, and maybe we can improve the robustness of the stack, e.g. by using additional alternative peer-discovery systems, or changing gossip parameters.

I would say it's less of a priority than core spec-conformance testing, but still an area we can improve.
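As a rough sketch of tracking those interruptions, a probe like the one below could tell "head stalled while peerless" (a P2P/discovery issue) apart from the chain itself being unhealthy. Only the JSON-RPC methods are standard; the endpoint and poll interval are placeholders:

```go
// Hypothetical happy-path probe for a test node: track peer count and head
// progression together.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// rpcUint calls a zero-argument JSON-RPC method that returns a hex quantity.
func rpcUint(endpoint, method string) (uint64, error) {
	body, _ := json.Marshal(map[string]any{"jsonrpc": "2.0", "id": 1, "method": method, "params": []any{}})
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var out struct {
		Result string `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimPrefix(out.Result, "0x"), 16, 64)
}

func main() {
	const node = "http://localhost:8545" // placeholder test-node RPC endpoint
	var lastHead uint64
	for range time.Tick(30 * time.Second) { // placeholder poll interval
		peers, _ := rpcUint(node, "net_peerCount")
		head, _ := rpcUint(node, "eth_blockNumber")
		if head == lastHead && peers == 0 {
			fmt.Println("head stalled while peerless: likely P2P/discovery, not the sequencer")
		}
		lastHead = head
	}
}
```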


Challenges:
- System tests are prone to flakes: running N systems,
in a resource-constrained environment, with concurrent work / timers / etc., tends to miss assertions due to processing stalls.
Reviewer:

Is it a case of the assertions being stronger than what the system actually guarantees (typically: eventual correctness, encoded as a real-time constraint)?
Or is it more that the test environment is actually invalid with respect to the properties of the system running in it (typically: IOPS not compliant with requirements, therefore timeout)?

protolambda (author):

The former: the happy path is being interrupted by resource-constrained CI machines. But generally we should be able to assert that happy path, and not fall back to the worst case in every test.

E.g. if the chain started early, but blocks started being produced late, then the sequencer might not have time to include transactions, and the test runs but fails to confirm a tx within the expected amount of time. The bug here might first appear to be the tx-inclusion itself, but really it was the resource constraints that prevented the chain from even being ready to include a tx.
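One way to make that distinction explicit in a test helper: start the inclusion deadline only once the chain is demonstrably producing blocks. The `Chain` interface below is hypothetical, just to sketch the shape:

```go
// waitForInclusion sketch: a slow CI start should read as "inconclusive", not
// as a tx-inclusion bug.
package testutil

import (
	"context"
	"fmt"
	"time"
)

// Chain is a hypothetical minimal view of the L2 under test.
type Chain interface {
	HeadNumber(ctx context.Context) (uint64, error)
	HasTransaction(ctx context.Context, txHash [32]byte) (bool, error)
}

// WaitForInclusion first waits until the head advances (the chain is live),
// then applies the inclusion deadline from that point onward.
func WaitForInclusion(ctx context.Context, c Chain, txHash [32]byte, deadline time.Duration) error {
	start, err := c.HeadNumber(ctx)
	if err != nil {
		return err
	}
	// Phase 1: wait for liveness, with a generous bound for slow CI machines.
	liveCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
	defer cancel()
	for {
		n, err := c.HeadNumber(liveCtx)
		if err == nil && n > start {
			break
		}
		select {
		case <-liveCtx.Done():
			return fmt.Errorf("chain never started producing blocks: inconclusive, not a tx bug")
		case <-time.After(time.Second):
		}
	}
	// Phase 2: the actual assertion, timed from the moment the chain was live.
	inclCtx, cancel2 := context.WithTimeout(ctx, deadline)
	defer cancel2()
	for {
		ok, err := c.HasTransaction(inclCtx, txHash)
		if err == nil && ok {
			return nil
		}
		select {
		case <-inclCtx.Done():
			return fmt.Errorf("tx %x not included within %s of chain liveness", txHash, deadline)
		case <-time.After(time.Second):
		}
	}
}
```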

Reviewer:

I see. I mean, there's a bit of a philosophical consideration here, in the sense that asserting something that's not strictly speaking true is bound to cause false positives at scale.

I'm wondering if an ok middleground might look like:

  • specifying what the happy path actually demands (more or less "relevant platform metrics being in the green")
  • coupling test execution with platform monitoring, categorizing test "failures" as inconclusive when they happen to coincide with red monitoring
  • rescheduling inconclusive tests according to some policy
  • in parallel play with resources allocation, scheduling, ... to improve the odds of monitoring remaining green

The argument here would be that we then:

  • improve (presumably) our flaking situation
  • keep control of the requirements (therefore avoiding what I usually call "real-time inflation" where tests develop ties to "ideal" situations over time)
  • gain the ability to deliberately push the system into the red and see how it copes
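A minimal sketch of that classification step, assuming a hypothetical `monitoringGreen` metrics query (nothing here exists in op-e2e today):

```go
// Sketch of classifying a failed assertion as "inconclusive" when platform
// monitoring was red during the run.
package testutil

import "testing"

// monitoringGreen is a placeholder for "relevant platform metrics are in the
// green" (CPU headroom, block production cadence, etc.) over the test's window.
func monitoringGreen(t *testing.T) bool {
	// Hypothetical: query the metrics backend for the test's time window.
	return true
}

// RequireOrInconclusive fails the test only if monitoring was green; otherwise
// the failure is recorded as inconclusive and the test is skipped so a
// scheduler can re-run it later under better conditions.
func RequireOrInconclusive(t *testing.T, ok bool, msg string) {
	t.Helper()
	if ok {
		return
	}
	if monitoringGreen(t) {
		t.Fatalf("assertion failed under green monitoring: %s", msg)
	}
	t.Skipf("inconclusive (monitoring red during run): %s", msg)
}
```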

- `op-e2e` tests are not portable to alternative client implementations.
We need ways of generalizing test assertions, and decouple the nodes setup from the test itself.
- `op-e2e` system tests are quite monolithic. Adding/removing nodes is high friction.
Some tests run services that do not influence the test.
Reviewer:

Does that translate to those services not handling anything, or them running and processing stuff, but that stuff being ultimately irrelevant?
I guess I'm wondering what kind of analysis/monitoring would ideally detect those superfluous bits.

protolambda (author):

E.g. a test that checks the batch-submitter is functional might currently run a full suite of services, including the op-proposer, which does not affect the batch-submitting work itself.

Reviewer:

OK, so a strategy could be:

  • identify which components the test interacts with / has assertions for
  • compute the smallest runnable/viable system that spans those components
  • compare with the system actually being deployed, and expose the delta

Does that sound right?
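As a sketch of the "expose the delta" step, with made-up component names:

```go
// Each test declares the components it actually exercises; we compare that
// with the full system the current harness would deploy.
package testutil

// MinimalSystem returns the deployed services that the test never touches,
// i.e. candidates for removal from that test's setup.
func MinimalSystem(required, deployed []string) (superfluous []string) {
	needed := make(map[string]bool, len(required))
	for _, c := range required {
		needed[c] = true
	}
	for _, c := range deployed {
		if !needed[c] {
			superfluous = append(superfluous, c)
		}
	}
	return superfluous
}

// Example: a batcher test that only needs L1, the sequencer and the batcher.
//   MinimalSystem(
//       []string{"l1", "op-geth", "op-node", "op-batcher"},
//       []string{"l1", "op-geth", "op-node", "op-batcher", "op-proposer", "op-challenger"},
//   ) // -> ["op-proposer", "op-challenger"]
```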

Comment on lines +99 to +100
- Limited resource sharing: a lot of the feature-tests spawn a copy of the system,
rather than being able to run against an already running system.
Reviewer:

I guess that's a natural way to side-step the problem of guaranteeing isolation at the test level. Are there natural boundaries that would make resource-sharing more viable than in the general case?
Off the top of my head I'm imagining things like dynamically created sandboxes that we can monitor for any unexpected "external" operations. But again, I have no idea what I'm talking about :)

protolambda (author):

> Are there natural boundaries that would make resource-sharing more viable than in the general case?

Using unique keys/addresses for user-accounts that are part of a test would help. Otherwise we might see conflicting transactions (same nonce value, only one can be confirmed).

If a system key is involved, then we have to synchronize that part of the test globally, otherwise you get tx conflicts.

And tx-throughput may mess with the basefee, which can have adverse effects between tests.

If a test can declare what is needed, then I think some test-orchestrator can synchronize / schedule as necessary. We should be careful not to make the resource-sharing too complicated though, or else we may run into packing problems and such, or new kinds of flakes (e.g. where the test outcome depends on running another test in parallel or not).
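A rough sketch of what such a declaration + orchestrator could look like (all types here are hypothetical):

```go
// Tests declare what they need from a shared system, so an orchestrator can
// decide what may run in parallel and what must be serialized.
package testutil

import (
	"sort"
	"sync"
)

// Needs describes what a test touches on a shared system.
type Needs struct {
	FreshUserAccounts int      // unique keys per test: safe to share, no nonce conflicts
	SystemKeys        []string // e.g. "batcher", "proposer": must be held exclusively
	HeavyTxLoad       bool     // may disturb the basefee for concurrently running tests
}

// Orchestrator serializes access to exclusive resources; everything else can
// run in parallel against the shared system.
type Orchestrator struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func NewOrchestrator() *Orchestrator {
	return &Orchestrator{locks: map[string]*sync.Mutex{}}
}

// Acquire locks the test's exclusive resources (in a stable order, to avoid
// deadlocks between tests) and returns a release function.
func (o *Orchestrator) Acquire(n Needs) (release func()) {
	keys := append([]string(nil), n.SystemKeys...)
	sort.Strings(keys)
	var held []*sync.Mutex
	for _, key := range keys {
		o.mu.Lock()
		l, ok := o.locks[key]
		if !ok {
			l = &sync.Mutex{}
			o.locks[key] = l
		}
		o.mu.Unlock()
		l.Lock()
		held = append(held, l)
	}
	return func() {
		for _, l := range held {
			l.Unlock()
		}
	}
}
```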

Reviewer:

Right, so if we're able to prove that the test pre-requisites are contained in non-overlapping sandboxes, then we should be able to reuse a test target if one is available, and otherwise fall back to testing in isolation.
There's probably a need for some live monitoring of fundamentally shared constructs, like the basefee you mentioned, in order to validate that otherwise independent tests don't see their assumptions invalidated.

Comment on lines +102 to +105
In general the main idea here is that we need a
way to express tests with system requirements, decoupled from system setup:
we can unify upgrade-checks / tests, and run them against shared chain resources,
to not overwhelm CI resources.
Reviewer:

Generally speaking I'd probably phrase this as having the ability to run a test against an arbitrary system. It shouldn't matter if it's a long-running one, a fresh stack, a developer-provided environment, or whatever. Entirely decoupling the test logic from the system it applies to seems like a good idea in general.

protolambda (author):

+1. Although some tests might depend on rarer / custom chain parameters, e.g. a shortened sequencing window, to not have to wait the standard 12 hours for it to kick in. We need to find some way to bring those tests + the right setup together, without overcomplicating the test setup (making it too dynamic again becomes a kind of packing problem, where some tests can share some resources, but not always).

Reviewer:

Right, I should have mentioned that I always assume tests should be able to say "I can't run under those conditions", based on some specification of pre-conditions. Those pre-conditions should definitely cover dependencies like these.
If a test cannot run with what it's provided, then it should say so and bail (well, in reality forcing it to run anyway is a nice way to detect potential over-specification of the pre-conditions, but that's not a core concern).
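A small sketch of a test declaring such pre-conditions and bailing via `t.Skip` when they aren't met (the `SystemConfig` fields are illustrative):

```go
// A test states its pre-conditions and skips when the provided system does not
// meet them, instead of failing for reasons unrelated to what it asserts.
package testutil

import (
	"testing"
	"time"
)

// SystemConfig is a stand-in for whatever configuration the harness exposes.
type SystemConfig struct {
	SequencingWindow time.Duration
	SupportsAltDA    bool
}

// RequireConfig skips the test when a pre-condition does not hold.
func RequireConfig(t *testing.T, cfg SystemConfig, check func(SystemConfig) bool, why string) {
	t.Helper()
	if !check(cfg) {
		t.Skipf("pre-condition not met on this system: %s", why)
	}
}

// Usage in a sequencing-window test:
//   RequireConfig(t, cfg, func(c SystemConfig) bool {
//       return c.SequencingWindow <= 10*time.Minute
//   }, "needs a shortened sequencing window; waiting out the standard 12h is impractical")
```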


### Growth of history

With more chain history, there is more syncing to do, and more regression tests to maintain.
Reviewer:

That seems concerning in the abstract, given that history is presumably ever-growing. Is there a possible world where we can essentially ignore a chain-history prefix, to cap the size of the history "window" we'd have to consider?

protolambda (author):

There's probably a project in there, outside of current scope, to deprecate legacy state-transition logic.
E.g. Holocene simplified the chain-derivation logic a lot, but we're still maintaining the pre-Holocene logic.

At some point, for very old blocks, we could say that we hardcode what is canonical and omit the logic to derive them from raw inputs, to be able to drop the old code / complexity.

And this is related to the archive-node problem, and a bunch of ideas in L1 (things like EIP-4444, and the Era-file format). If we can freeze the historical data in some way, and drop the corresponding code, then we can get rid of legacy sync edge-cases.
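A sketch of what "hardcoding what is canonical" could look like, with placeholder numbers and hashes:

```go
// "Freezing" old history: below a cutoff, verify blocks against pinned
// canonical hashes instead of re-deriving them from raw inputs with legacy code.
package history

import "fmt"

const legacyCutoff = 10_000_000 // placeholder cutoff height

// pinned maps selected historical block numbers to their canonical hashes.
var pinned = map[uint64]string{
	5_000_000: "0x...", // placeholder checkpoint
}

// VerifyHistorical checks an old block against the pinned set instead of
// running deprecated derivation logic.
func VerifyHistorical(number uint64, hash string) error {
	if number >= legacyCutoff {
		return fmt.Errorf("block %d is past the cutoff: verify via live derivation", number)
	}
	want, ok := pinned[number]
	if !ok {
		return nil // not a checkpoint; covered by being an ancestor of a pinned block
	}
	if hash != want {
		return fmt.Errorf("block %d hash %s does not match pinned canonical hash %s", number, hash, want)
	}
	return nil
}
```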

- [optimism kurtosis](https://github.com/ethpandaops/optimism-package/) to spin up clients in a network

Challenges:
- The Go tests generally don't export to alternative implementations like Rust
Reviewer:

I'm not sure I understand why the test language matters that much. Are we testing the clients at a level that sits below the user interface?

protolambda (author):

A lot of the testing is below the regular user-interface, generally asserting more technical chain state-changes.

E.g. accurate progression of chain safety, or handling of malformed batch data. Things most users, with a transaction-RPC, are never bothered by directly. But still important for the stability and safety of the chain.

We need some way to express these more technical properties, how we assert them etc., without a lot of boilerplate where we repeat things like creating an RPC client, bindings for specific RPC methods, handling of retries, inspection of error codes, etc. Helper functions, or even a more complete DSL, can do a lot here.

The language choice can also still be important: writing contract-interaction tests in Go is a lot more painful than in Solidity, because every interaction in Go would need things like special contract bindings, extra transaction construction and confirmation handling. Things you don't even have to think about when writing it as a Solidity test.
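For illustration, the kind of helper such a DSL could provide, hiding retries and receipt handling behind one call (the interfaces are invented, not an existing op-e2e or go-ethereum API):

```go
// A DSL-style helper that wraps client setup, retries and receipt handling.
package dsl

import (
	"context"
	"fmt"
	"time"
)

// Receipt is a simplified transaction result.
type Receipt struct {
	Success bool
	TxHash  [32]byte
}

// Backend is whatever transport the harness provides (node RPC, in-process client, ...).
type Backend interface {
	Call(ctx context.Context, to [20]byte, calldata []byte) ([]byte, error)
	Send(ctx context.Context, to [20]byte, calldata []byte) (Receipt, error)
}

// Contract lets a test write to a contract in one line, with retries baked in.
type Contract struct {
	Backend Backend
	Addr    [20]byte
}

// Write sends calldata, retries transient errors, and checks the receipt.
func (c Contract) Write(ctx context.Context, calldata []byte) (Receipt, error) {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ { // simple bounded retry policy
		rcpt, err := c.Backend.Send(ctx, c.Addr, calldata)
		if err == nil {
			if !rcpt.Success {
				return rcpt, fmt.Errorf("tx %x reverted", rcpt.TxHash)
			}
			return rcpt, nil
		}
		lastErr = err
		time.Sleep(time.Second)
	}
	return Receipt{}, fmt.Errorf("write failed after retries: %w", lastErr)
}
```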

Contributor:

+1 to a good testing library/DSL. That would be my top pick for what's missing, to be honest (competing with the missing dedicated cross-client reference test suite for consensus code). I think a good DSL could dramatically simplify Go -> contract interactions as well. Solidity will always work best for particularly complex things, but for just calling contracts we should be able to make it easy.

Comment on lines +186 to +187
There may be ways we can better categorize tests, make test setup more composable,
and improve parametrization of contract versioning and chain configurations.
Reviewer:

Am I reading correctly that you're proposing we make sure hard dependencies on some stack (ours in this case) don't creep into the test logic?
If so, I'm wondering if we can enumerate the objects that potentially create such a hard dependency, and detect their usage patterns in the tests to ensure they're correct.

protolambda (author):

Yes, ideally we don't make our own test setup / framework so opinionated that it becomes too hard for forks of the OP-Stack to build on top of the testing.

Alt-DA variants of the OP-Stack and Alt-proof-system versions are probably the most common things those forks want to be able to test, without refactoring the core test system. A more composable test setup can go a long way here I think.

This might not be an immediate priority, but we should gather some feedback on how different OP Stack forks approach testing of their modifications, and if there are things we can do to simplify that testing.
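As a sketch, the fork-specific concerns could sit behind interfaces in the setup, so forks swap implementations rather than refactor the harness (all names illustrative):

```go
// Fork-specific concerns (Alt-DA, alternative proof systems) kept behind
// interfaces in the test setup.
package setup

import "context"

// DataAvailability abstracts where batch data is posted to and read back from.
type DataAvailability interface {
	Post(ctx context.Context, frame []byte) (ref []byte, err error)
	Fetch(ctx context.Context, ref []byte) ([]byte, error)
}

// ProofSystem abstracts how output claims get challenged/resolved.
type ProofSystem interface {
	Dispute(ctx context.Context, claim []byte) error
}

// SystemOpts is what a fork would tweak without touching the test logic.
type SystemOpts struct {
	DA    DataAvailability // default: calldata/blobs; forks plug in their Alt-DA client
	Proof ProofSystem      // default: fault proofs; forks plug in other proof systems
}
```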

Some tests run services that do not influence the test.
- `op-chain-ops` and `superchain-ops` are not automated against scheduled CI runs.
Upgrade checks should run more regularly than just one-off in devnet/testnet.
- Limited resource sharing: a lot of the feature-tests spawn a copy of the system,
Contributor:

+1 on this - re-using the L1 (for example) would help with a lot of the flakes we're seeing now.

Challenges:
- System tests are prone to flakes: running N systems,
in a resource-constrained environment, with concurrent work / timers / etc., tends to miss assertions due to processing stalls.
- `op-e2e` tests are not portable to alternative client implementations.
Contributor:

Big +1 on this. The tests themselves should be decoupled from the clients they run against, so we can build matrix-style tests against multiple clients.

Contributor:

But is op-e2e the right framework for that? The vast majority of op-e2e tests aren't testing the EL. Many are testing op-node derivation, many are testing op-proposer, op-batcher, op-challenger, dispute games, etc., none of which have alternate implementations.

We do need a multi-client test suite, but I'm not sure having to pull everything into op-e2e is the right answer. We don't want op-reth, op-nethermind, op-besu, op-whatever-I-just-created to be a dependency of the monorepo, for example. Asterisc pulls the monorepo in so it can run e2e tests in their repo, and it's not a particularly clean solution - it works, but we break it by making changes to op-e2e pretty regularly.

I'm a really big fan of being able to export a test suite like the L1 reference tests do. Individual clients will still have their own e2e tests in addition, for things that aren't consensus-critical, but the reference tests define the shared set of tests for consensus compatibility. They are designed to be implementation- and test-framework-agnostic, and easy to pull a release of into whatever repo needs to run them.
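A sketch of the matrix-style, client-agnostic direction (the `ClientUnderTest` interface and its methods are placeholders, not an existing API):

```go
// Matrix-style runner over a client-agnostic interface, in the spirit of
// exportable reference tests.
package reftests

import (
	"context"
	"testing"
)

// ClientUnderTest is the minimal surface a consensus reference test needs,
// independent of which implementation (Go, Rust, ...) sits behind it.
type ClientUnderTest interface {
	Name() string
	// ProcessPayload feeds one encoded test input (e.g. a batch or payload) to the client.
	ProcessPayload(ctx context.Context, input []byte) error
	// SafeHead returns the client's current safe head number, for assertions.
	SafeHead(ctx context.Context) (uint64, error)
}

// RunMatrix executes the same test body against every registered client.
func RunMatrix(t *testing.T, clients []ClientUnderTest, body func(t *testing.T, c ClientUnderTest)) {
	for _, c := range clients {
		c := c
		t.Run(c.Name(), func(t *testing.T) {
			body(t, c)
		})
	}
}
```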

Specifically, this requires cross-chain testing.
This testing adds multi-L2 deployments, multi-L2 test setup, and multi-L2 tests to the testing scope.

Today:
Contributor:

I'd love to find a way to reduce the amount of duplication here so that we can use a shared platform.

This DSL-like functionality needs to improve in the system-tests,
where awaiting events and more asynchronous work can still feel verbose and fragile.

#### Rust
Contributor:

Strongly in favor of writing tests in Go. I think adopting Rust here will make things much slower, and force us to rewrite all of our existing tooling.

@ajsutton (Contributor) left a comment:

I'd say pretty much everything in this is a great idea, but the scope is so big that it's really hard to reason about, and we're at risk of trying to define some all-encompassing solution, when in my experience you usually need multiple different, partially overlapping solutions to solve all of these problems.

So I think there's value in identifying the full range of problems and opportunities we see, so it's great to have this doc, but I'd then strongly suggest we identify what the one or maybe two highest-impact things are and build the simplest thing that would solve them, ideally by building on and adapting what we already have. While there's a risk that we build something that needs a lot of changes to solve the next problem we take on, the sooner we can ship something that solves our biggest pain point, the sooner we start getting value from it, and then, because it made our life easier, we have more capacity to do the next thing. Plus we'll have almost certainly learnt something that at least slightly changes what we'd want to build to solve the next problem anyway.
