Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests(pd): automated migration testing #4337

Closed
wants to merge 2 commits into from
Closed

Conversation

conorsch
Copy link
Contributor

@conorsch conorsch commented May 6, 2024

Describe your changes

Adds intra-CI standalone testing of migration behavior. This can be run locally on developer workstations, and also in CI. The job is currently taking >20m, and requires a lot of disk space, but it's worth it for the assurance.

Building on the smoke-test rewrite to use process-compose, let's script the migration process, so that we can test
current HEAD of the monorepo against a prior tagged version, and validate that necessary migrations are in place.

One possible approach is to fetch prebuilt binaries from uploaded artifacts on Github. That's fine for pd, but doesn't work for running the smoke tests, due to client/server incompatibility. Therefore we'll clone the entire repo in a git-ignored subdir, and build the old binaries there. Heavy, but reliable.

Adds a new rust crate, strictly for running the migration-test suite of integration tests, which is very similar in nature to the already-existing network-integration tests AKA smoke tests. Copy/pastes a lot of code from the smokes, which we can always factor out into reusable utils, but not bothering with that right now.

Issue ticket number and link

Refs #4323.

Checklist before requesting a review

  • If this code contains consensus-breaking changes, I have added the "consensus-breaking" label. Otherwise, I declare my belief that there are not consensus-breaking changes, for the following reason:

    Testing-only, no changes to application logic.

Review, running locally

Check out this branch, and run just migration-test. It'll take a while to build! The test as written duplicates the local repo, so plan on ~100GB of disk space utilization.

@hdevalence
Copy link
Member

Confoundingly, I'm getting a proto incompatible error, about missing AuctionParams.

I think this was an actual bug #4338. We shouldn't assume the migrations actually work until we do the testing. See also #4340

@cratelyn cratelyn added this to the Sprint 6 milestone May 8, 2024
@cratelyn cratelyn added C-enhancement Category: an enhancement to the codebase A-CI/CD Relates to continuous integration & deployment of Penumbra A-testing Area: Relates to testing of Penumbra A-upgrades Area: Relates to chain upgrades labels May 8, 2024
@conorsch conorsch changed the title feat(tests): automated migration testing tests(pd): automated migration testing May 29, 2024
@conorsch conorsch requested a review from cratelyn May 29, 2024 20:08
@conorsch conorsch marked this pull request as ready for review May 29, 2024 20:44
@conorsch
Copy link
Contributor Author

Marking as ready for review, so I can stop rebasing it. 😅 The immediate consequence of merging this PR is that CI runs will take >20m again, but I'm happy to dial that back and rely on the cron schedule. For now, though, having it run on every PR will give us confidence about the state of migrations as we push toward #4497.

Copy link
Contributor

@cratelyn cratelyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm going to somewhat cautiously hit the "approve" button here. thanks so much for your work to give us automated coverage of migrations. that's tremendously valuable work.

we should have automated migration testing, and i want that sooner than later, but.. i'm a little nervous about how these are shaped. they feel somewhat heavy on boilerplate yaml, and there's a lot of shell glue surrounding the actual test.

i think a lot of the logic here could also be (and is, in the case of delegating) covered by mock consensus tests, but the broader vision of having integration tests that run a real CometBFT node feel topically relevant for migrations.

i've left a smattering of comments below. i'm happy to see this merge before those things are addressed, being acutely aware that this life easier while managing upgrades ❤️ ...but maybe file a tracking ticket(s) for following up on these points.

Comment on lines +14 to +17
# By default, don't enable migration-tests: require explicit opt-in via
# `--features migration-test`.
default = []
migration-test = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice 👍

Comment on lines +1 to +3
fn main() {
println!("Hello, world!");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn main() {
println!("Hello, world!");
}
//! This crate is used to group migration tests, this binary does not do anything.
fn main() {
println!("Hello, world!");
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this empty file was a little confusing for a moment, so a comment might help signal that this main.rs is only here as a stub.

@@ -0,0 +1,35 @@
[package]
name = "migration-test"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name = "migration-test"
name = "penumbra-migration-test"

a nit, but this should probably have the same penumbra- prefix as other crates in the workspace.

Comment on lines +17 to +22
# By default, build pd from the workspace. Support overriding via a deeper git-worktree,
# so that an older version of pd can be built and run. This helps when running older
# networks locally, to debug migrations.
vars:
WORKING_DIR: .
# WORKING_DIR: deployments/worktrees/v0.73.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i appreciate these thoughtful breadcrumbs for future tinkering. nice 🪙

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a bit difficult to understand what this is testing. my understanding is that this is running the smoke tests on the v0.76.0 release? i don't think i am connecting the dots to how this is exercising migration logic.

it'd be helpful to (a) name these files more commucatively (possibly via putting them in a folder together), and (b) putting a comment at the top of the file, in place of the boilerplate there right now. this isn't a configuration for running migration tests, it is a migration test, aiui.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...as i've read this more, i see these are phases. i'll mark this as resolved, but i did find coming to this understanding to be difficult. 😓


/// TOML for an "upgrade-plan" governance proposal.
// Intentionally avoiding importing this type to adhere to strict
// CLI interfacts for the pcli binaries.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// CLI interfacts for the pcli binaries.
// CLI interfacts for the pcli binaries.

i think a sentence got garbled here!

height: u64,
}

#[cfg_attr(not(feature = "migration-test"), ignore)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#[cfg_attr(not(feature = "migration-test"), ignore)]
#[cfg(feature = "migration-test")]

ditto, another conditional compilation attribute we can simplify

deployments/scripts/migration-test Show resolved Hide resolved
Comment on lines +19 to +35
[dependencies]
anyhow = {workspace = true}
directories = {workspace = true}
once_cell = {workspace = true}
penumbra-keys = {workspace = true, default-features = false}
serde = {workspace = true, features = ["derive"]}
serde_json = {workspace = true}
serde_with = {workspace = true, features = ["hex"]}
toml = {workspace = true, features = ["preserve_order"]}
tracing = {workspace = true}
tracing-subscriber = {workspace = true, features = ["env-filter", "ansi"]}

[dev-dependencies]
assert_cmd = {workspace = true}
predicates = "2.1"
regex = {workspace = true}
tempfile = {workspace = true}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[dependencies]
anyhow = {workspace = true}
directories = {workspace = true}
once_cell = {workspace = true}
penumbra-keys = {workspace = true, default-features = false}
serde = {workspace = true, features = ["derive"]}
serde_json = {workspace = true}
serde_with = {workspace = true, features = ["hex"]}
toml = {workspace = true, features = ["preserve_order"]}
tracing = {workspace = true}
tracing-subscriber = {workspace = true, features = ["env-filter", "ansi"]}
[dev-dependencies]
assert_cmd = {workspace = true}
predicates = "2.1"
regex = {workspace = true}
tempfile = {workspace = true}
[dev-dependencies]
anyhow = { workspace = true }
assert_cmd = { workspace = true }
directories = { workspace = true }
once_cell = { workspace = true }
penumbra-keys = { workspace = true, default-features = false }
predicates = "2.1"
regex = { workspace = true }
serde = { workspace = true, features = ["derive"] }
serde_json = { workspace = true }
serde_with = { workspace = true, features = ["hex"] }
tempfile = { workspace = true }
toml = { workspace = true, features = ["preserve_order"] }
tracing = { workspace = true }
tracing-subscriber = { workspace = true, features = ["env-filter", "ansi"] }

these should all be dev-dependencies, since this crate only contains tests.

otherwise, cargo build --package migration-test will mean building all of these crates, despite the main.rs being empty and the --features migration-test not being provided

Adds intra-CI standalone testing of migration behavior.
This can be run locally on developer workstations,
and also in CI. The job is currently taking >20m, and requires
a lot of disk space, but it's worth it for the assurance.

Building on the smoke-test rewrite to use process-compose,
let's script the migration process, so that we can test
current HEAD of the monorepo against a prior tagged version,
and validate that necessary migrations are in place.

One possible approach is to fetch prebuilt binaries from uploaded
artifacts on Github. That's fine for `pd`, but doesn't work for
running the smoke tests, due to client/server incompatibility.
Therefore we'll clone the entire repo in a git-ignored subdir,
and build the old binaries there. Heavy, but reliable.

Adds a new rust crate, strictly for running the migration-test
suite of integration tests, which is very similar in nature to the
already-existing network-integration tests AKA smoke tests.
Copy/pastes a lot of code from the smokes, which we can always
factor out into reusable utils, but not bothering with that right now.

Refs #4323.
@conorsch conorsch marked this pull request as draft June 3, 2024 15:39
@conorsch
Copy link
Contributor Author

conorsch commented Jun 3, 2024

Downgrading to draft until I address the review comments. Overall, the shape of this approach is rather heavy. Will continue to discuss testing options with @cratelyn, so that we maximize what's possible with mock-consensus, but still retain integration tests for migration behavior. Notably, this PR does not catch bugs in restarting a multi-validator setup post-upgrade, e.g. #4508.

@cratelyn cratelyn modified the milestones: Sprint 7, Sprint 8 Jun 4, 2024
@conorsch
Copy link
Contributor Author

This draft work successfully caught several early-stage issues with migrations, but only of the simpler variety: when migrations fail to apply to halted state, for instance. More complex situations such as the network failing to come back up in multi-validator setups due to app logic problems were not caught by these changes, and instead required a more complicated setup that leveraged an online multi-validator setup.

Given the CI changes in #4678, we should think hard about what an adequate testing environment looks like for this environment. For the near future, I recommend we stick to creating devnets and performing candidate migrations against them, ensuring that they behave well when restarted across the upgrade boundary. Longer-term, there's detail in #4323 and also #4265 for how to approach this problem comprehensively.

@conorsch conorsch closed this Jun 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-CI/CD Relates to continuous integration & deployment of Penumbra A-testing Area: Relates to testing of Penumbra A-upgrades Area: Relates to chain upgrades C-enhancement Category: an enhancement to the codebase
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants