Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(dex): migration for dex trait proto encodings #4341

Merged
merged 2 commits into from
May 8, 2024

Conversation

hdevalence
Copy link
Member

@hdevalence hdevalence commented May 7, 2024

Describe your changes

Fixes a chain halt due to changes introduced in #4188.

Previously, aggregate liquidity lookups in non-verifiable storage (i.e. "how much liquidity is available for trading pair (A, B)?") were stored as raw big-endian encoded values. In #4188 the DEX state accessors were refactored and part of that was adjusting the accessors for aggregate liquidity lookups to use nonverifiable_put/nonverifiable_get instead of the _raw methods they were using previously.

Testing

Manually testing the values were changed appropriately is actually quite difficult currently as pcli doesn't support querying values from non-verifiable storage, and returned values need to be parsed correctly into the Amount data type.

  • Create a devnet on v0.73.1
  • Watch for changes to the aggregate liquidity prefix: cargo run --release --bin pcli -- query watch --nv-key-regex 'dex/ab/.*'
  • Submit some liquidity positions. Note down the aggregate liquidity data
  • Perform migration to this branch: delegate, propose, vote, halt, migrate, restart network
  • The chain will halt if the migration doesn't properly change the encoding of the values. The first test is simply to ensure that the chain can run post-migration.
  • A further test is to verify that the aggregate liquidity values remain the same post-migration. Query again and compare

Issue ticket number and link

Closes #4340

@conorsch
Copy link
Contributor

conorsch commented May 7, 2024

We merged #4339, so this can be rebased on latest main.

@conorsch
Copy link
Contributor

conorsch commented May 7, 2024

Lots of churn on main, including a fix for #4344. Let's keep this rebased. Locally I'll be cross-checking via #4332. Looks like we're within striking distance on a passing migration test end-to-end.

@hdevalence
Copy link
Member Author

To clarify my intent with this PR wasn't that it should become a fix, we should throw this code away — I just wanted to share context while writing up the problem for others to pick up.

@conorsch
Copy link
Contributor

conorsch commented May 7, 2024

Fair, we could just as easily open a new PR for the fix commit. But since it's already here, I think "taking over" this PR and using it to close is good enough. Up to @zbuc: if the debugging commit is no longer useful, we can snip it out pronto.

@zbuc zbuc force-pushed the dex-upgrade-halt-scratch-work branch from fe370f2 to ba81c45 Compare May 7, 2024 21:13
@zbuc
Copy link
Member

zbuc commented May 7, 2024

Fair, we could just as easily open a new PR for the fix commit. But since it's already here, I think "taking over" this PR and using it to close is good enough. Up to @zbuc: if the debugging commit is no longer useful, we can snip it out pronto.

I removed the debug trace statements, I say using this PR and squashing the commits is fine.

@zbuc zbuc marked this pull request as ready for review May 7, 2024 21:14
@hdevalence
Copy link
Member Author

sounds good, just wanted to clarify intent

@conorsch conorsch changed the title Dex upgrade halt scratch work fix(dex): migration for dex trait proto encodings May 7, 2024
zbuc and others added 2 commits May 7, 2024 15:30
I've seen several lfs-related CI failures today. We're landing PRs
pretty rapidly, so I'm just adjusting the CI jobs and not debugging
further right now.
Copy link
Contributor

@conorsch conorsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After following the test plan, post-migration, I no longer see a chain halt, nor do I see any error messages about invalid tags. Post-migration, I was able to submit and claim several swaps. LGTM.

}

Ok(())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this migration fn idempotent? We've been telling folks our migrations are idempotent, but not sure that's the case for this one. Zooming out, the changes in #4339 definitely aren't idempotent, so not blocking on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. No, it is not idempotent. I wasn't aware that was a requirement for migrations; I will follow-up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this not idempotent? if you run it multiple times you will get the same result, no? why not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it because the encoding changes? that's completely fine imo

Copy link
Member

@zbuc zbuc May 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this not idempotent? if you run it multiple times you will get the same result, no? why not?

Looking more closely, I am not sure at the moment. If we attempt to decode protobuf encoded Amount as a BE-encoded Amount it could fail silently and the Amount would change. However, this requires the protobuf encoding to be the same length as the big-endian encoding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly not completely sure this is important but we can try to first decode it as a proto amount and return early if it works, and proceed if it correctly returns an error. This way it is resistant to accidental double runs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some edge cases:

        let a: Amount = 4951760157141521099596496895u128.into();

        println!("amount 1: {:?}", a);

        let proto_a_bytes: Vec<u8> = penumbra_proto::core::num::v1::Amount::from(a).encode_to_vec();

        let amount = Amount::from_be_bytes(proto_a_bytes.as_slice().try_into()?);

        println!("amount 2: {:?}", amount);
amount 1: 4951760157141521099596496895

amount 2: 11963051962064242856133983240072462207

so it could theoretically still munge some values and be non-idempotent even if we took that approach if everything aligned and the stored Amounts were in a specific range and order.

But I'm also not convinced it's worth trying to fix... migration idempotency seems better to solve at a higher level than within individual functions modifying data if it's a design goal, as there are likely other similar cases that will come up in migration code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah there's some Discord discussion about this already: https://discord.com/channels/824484045370818580/1052302055680790568/1228444060113698938

Right now, migrations are not idempotent. This may require publishing the post-upgrade root hash, however that doesn't account for migrations affecting non-verifiable storage.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not tell people migrations are idempotent.

The right way to handle idempotency, later, is probably to use the ABCI application version and compare against a version number in the state.

@@ -117,6 +148,9 @@ pub async fn migrate(
// Write auction parameters
write_auction_parameters(&mut delta).await?;

// Rewrite base liquidity indices as proto-encoded
rewrite_base_liquidity_indices(&mut delta).await?;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we circle back to addressing whether or not we want to enforce that all migrations are idempotent, we can sprinkle in some anyhow::Context usage on these calls. The tracing spans are already quite helpful in debugging what failed in a migration, but more context is never a bad thing IMO.

@conorsch conorsch merged commit fec0144 into main May 8, 2024
13 checks passed
@conorsch conorsch deleted the dex-upgrade-halt-scratch-work branch May 8, 2024 00:42
@hdevalence
Copy link
Member Author

I also just tested this on the exact same state that caused the failure previously and confirmed that it worked fine. Great work!

@cratelyn cratelyn added the A-dex Area: Relates to the dex label May 8, 2024
@cratelyn cratelyn added this to the Sprint 6 milestone May 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-dex Area: Relates to the dex
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Chain halt during swap execution on post-upgraded devnet
5 participants