replay: extend last fec set check for 32+ retransmitter signed shreds #2101

AshwinSekar · 2024-07-11T21:35:33Z

Problem

When the feature flag is turned on, we need to verify that the last fec set uses the new shred format.

Because the last fec set is not known when receiving shreds, and the new shred format is only used for the last set, we cannot perform this check until replay has completed.

Summary of Changes

Extend the check that compares the Merkle roots of the shreds in the last FEC set so that it additionally verifies that all of those shreds are of the retransmitter-signed variant.
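A minimal sketch of the extended check, under stated assumptions: `ShredInfo`, `LastFecSetCheck`, and the constant of 32 data shreds are simplified stand-ins introduced here for illustration, not the actual blockstore types or API. It only shows the shape of the logic: look at the last 32 data shreds of the slot, require a single shared Merkle root, and separately record whether every one of them is retransmitter signed.

```rust
/// Hypothetical, simplified view of a data shred (not the real `Shred` type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ShredInfo {
    pub merkle_root: [u8; 32],
    pub retransmitter_signed: bool,
}

/// Assumed minimum number of data shreds in the final FEC set.
pub const DATA_SHREDS_PER_FEC_BLOCK: usize = 32;

/// Hypothetical result of the check.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct LastFecSetCheck {
    /// `Some(root)` only if the last 32 data shreds share one Merkle root.
    pub merkle_root: Option<[u8; 32]>,
    /// True only if all of the last 32 data shreds are retransmitter signed.
    pub is_retransmitter_signed: bool,
}

/// `data_shreds` is assumed to hold the slot's data shreds in index order.
pub fn check_last_fec_set(data_shreds: &[ShredInfo]) -> LastFecSetCheck {
    // A slot with fewer than 32 data shreds cannot have a full final FEC set.
    if data_shreds.len() < DATA_SHREDS_PER_FEC_BLOCK {
        return LastFecSetCheck {
            merkle_root: None,
            is_retransmitter_signed: false,
        };
    }
    let tail = &data_shreds[data_shreds.len() - DATA_SHREDS_PER_FEC_BLOCK..];
    let root = tail[0].merkle_root;
    // The original part of the check: every shred must carry the same root.
    let same_root = tail.iter().all(|shred| shred.merkle_root == root);
    LastFecSetCheck {
        merkle_root: same_root.then_some(root),
        // The extension: every shred must also be the retransmitter-signed variant.
        is_retransmitter_signed: tail.iter().all(|shred| shred.retransmitter_signed),
    }
}
```

Keeping the two outcomes as separate fields lets the caller decide, per feature flag, which failure should actually reject the block.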

ledger/src/blockstore.rs

behzadnouri · 2024-07-16T13:20:30Z

ledger/src/blockstore.rs

+                "incomplete_final_fec_set",
+                ("slot", slot, i64),
+                ("hash", hash.to_string(), String)
+            );
+        }
+        if !self.is_retransmitter_signed {
+            datapoint_warn!(
+                "invalid_retransmitter_signature_final_fec_set",
+                ("slot", slot, i64),
+                ("hash", hash.to_string(), String)


Apparently these hash strings are quite expensive for the metrics server because they do not compress well.
And until chained Merkle shreds are rolled out, every node is going to submit this metric for every slot, which is too much.

Sounds good, I've gated this with should_chain_merkle_shreds so that we don't wastefully submit this metric prematurely.

But then the metric cannot be used to evaluate whether we are ready to activate the feature, because the metric itself is behind the feature.

I think for now there is no point in adding the metric, because it will spam the metrics server until chained Merkle shreds are rolled out.
Putting it behind the feature gate is not helpful either, because then we don't know whether we can activate the feature.

So I think the best option is to remove it entirely for now and wait until chained Merkle shreds are fully rolled out. Then we can add this metric to confirm the cluster is ready to activate the duplicate shreds features.

I thought the should_chain_merkle_shreds function is the one that controls the chained Merkle rollout?
As in, you will gradually update this function to target different cluster types and increasing slot ranges until it is always true.
By gating this metric behind should_chain_merkle_shreds, we can verify that it behaves correctly on chained Merkle shred slots only. Once should_chain_merkle_shreds is always true and the metric looks correct, we can turn on feature_set::vote_only_retransmitter_signed_fec_sets.

I suppose there would be some metrics spam as the cluster upgrades to a version that changes should_chain_merkle_shreds as there would be disagreement on whether this slot should use the new shred format.
What do you think about using feature_set::verify_retransmitter_signature::id() to gate the metric instead? I assume we will activate that feature flag first (after chained Merkle roots are fully rolled out) but before turning on this one.
If not, I can do as you suggest: remove the metric entirely and add it back before we intend to activate the feature.

> I suppose there would be some metrics spam as the cluster upgrades to a version that changes should_chain_merkle_shreds as there would be disagreement on whether this slot should use the new shred format.

Yeah, that is the issue with using should_chain_merkle_shreds.

> What do you think about using feature_set::verify_retransmitter_signature::id() to gate the metric instead?

I think the code is already overcrowded with feature gate branches, and I would lean toward keeping it simple and not adding more branches.

ledger/src/blockstore.rs

ledger/src/shred.rs

core/src/replay_stage.rs

ledger/src/blockstore.rs

AshwinSekar · 2024-07-16T19:28:58Z

ledger/src/shred.rs

@@ -1267,6 +1283,10 @@ pub fn verify_test_data_shred(
    }
 }

+pub fn should_chain_merkle_shreds(_slot: Slot, cluster_type: ClusterType) -> bool {


Moved from standard_broadcast_run in order to resolve dependencies.

Just a note to move this back if the metric gate on should_chain_merkle_shreds is removed.
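For readers following the gating discussion, here is a rough sketch of how a rollout function with this signature could be combined with the metric gate. Only the signature comes from the diff above. The `ClusterType` enum is a local stand-in, the per-cluster values are purely illustrative and not the actual rollout state, and `should_report_metric` is a hypothetical helper that does not exist in the PR.

```rust
/// Local stand-in for the SDK's cluster type enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ClusterType {
    Testnet,
    MainnetBeta,
    Devnet,
    Development,
}

pub type Slot = u64;

/// Rollout switch: widened over time (more clusters, more slots) until it
/// always returns true. The values below are illustrative assumptions.
pub fn should_chain_merkle_shreds(_slot: Slot, cluster_type: ClusterType) -> bool {
    match cluster_type {
        ClusterType::Development => true,
        ClusterType::Devnet => true,
        ClusterType::Testnet => false,
        ClusterType::MainnetBeta => false,
    }
}

/// Hypothetical helper: report the warning only for slots where the new
/// shred format is expected, so that legacy slots do not flood the metrics
/// server with one datapoint per slot.
pub fn should_report_metric(
    is_retransmitter_signed: bool,
    slot: Slot,
    cluster_type: ClusterType,
) -> bool {
    !is_retransmitter_signed && should_chain_merkle_shreds(slot, cluster_type)
}
```

The weakness raised in the review still applies to this sketch: while nodes run versions that disagree on what `should_chain_merkle_shreds` returns, the gate and the shreds actually produced can be out of sync, which would produce spurious reports.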

core/src/replay_stage.rs

behzadnouri · 2024-07-17T13:24:22Z


core/src/replay_stage.rs

behzadnouri · 2024-07-17T13:36:06Z

ledger/src/blockstore.rs

+        // These metrics are expensive to send because hash does not compress well.
+        // Only send these metrics when we are sure the appropriate shred format is being sent
+        if !self.is_retransmitter_signed && shred::should_chain_merkle_shreds(slot, cluster_type) {
+            datapoint_warn!(
+                "invalid_retransmitter_signature_final_fec_set",
+                ("slot", slot, i64),
+                ("bank_hash", bank_hash.to_string(), String)
+            );
+        }


As in the other comment, maybe we should remove this metric entirely for now and add it back once chained merkle shreds are fully rolled out.

ledger/src/blockstore.rs

AshwinSekar · 2024-07-17T15:48:20Z

Do we already observe this in the metrics?
Wondering how much load this is adding to the metrics server.

Just to confirm, this is not currently spamming metrics server because last FEC set is already 32+ data shreds on mainnet, right?

The replay check that introduced this metric (#1410) was never approved for backport to 1.18, so it is not observed on mainnet apart from master canaries.

On testnet, however, we do observe it quite frequently.

It seems that something is wrong with the check. I will hold off on this change, investigate the previous check, and get back to you.

AshwinSekar · 2024-07-17T19:06:31Z

False alarm: I checked the block producers for the blocks that have been failing the check on testnet today, and they are all running Firedancer:

3Rdnk6ZSGbPbwGiSEDkTs12oK5VcX8YWuQPaYtqcnYW6
5qzgRKGnpZs27aqR3tkZp2HJ8ZXoxTBjYoayqVG1QKbs
ARhax8ZwKr4xTkRB4HcU5ncWtGoVXCbCa2x98QhTpihw
F1REirtRdV1famCks1tLZ8ZpxKWxPzT8Ky7d19cA1K1N
Hwvxg5dacM8SbpwPDiZyxNNLpAid9v7BQZDef1Fhug2P
jfireruobS6UzNLqixjraT4URA73fSj4cWYankBsQpV
puMPKindCwo2b279wFUTHZbR3CncfEWX5ySk2ibqQ3U

It seems that Firedancer (FD) is not padding the last FEC set yet, so this is to be expected.

Also, I realized that there is no reason to send the bank hash in metrics: this check happens before the bank is frozen, so the hash is not yet populated 🤦

EDIT:
There are also a couple of producers on 2.0.0, with 0.02% of stake, that are not padding. Perhaps they are running mods:

4qrtRWHiJm4YCTsGmdWThngntekKZXrF9nGHJCEoRTrC
8trHyY8YjVqiZ6APHYpmsmCYkVbWF5sMRSpKku6Yuhbj

behzadnouri

lgtm, aside from minor comments.

behzadnouri · 2024-07-18T14:54:14Z

core/src/replay_stage.rs

+                        &bank.feature_set,
+                    ) {
+                        Ok(block_id) => block_id,
+                        Err(result_err) => {


Instead of match, it might be more readable to use inspect_err:
https://doc.rust-lang.org/std/result/enum.Result.html#method.inspect_err
followed by unwrap_or_default.

It doesn't work because we can't continue from the inspect_err closure.
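To illustrate the point about `continue`, here is a small self-contained sketch. The function names and the `Result<u64, String>` item type are made up for the example and are unrelated to the replay code. A `match` arm sits directly in the loop body, so it can skip the iteration. The closure passed to `inspect_err` can only observe the error, so the chain still has to produce a value (here the default).

```rust
/// Uses `match`: the `Err` arm can `continue` because it is part of the
/// loop body, so failed items are skipped entirely.
pub fn collect_skipping_errors(results: Vec<Result<u64, String>>) -> (Vec<u64>, Vec<String>) {
    let mut values = Vec::new();
    let mut errors = Vec::new();
    for result in results {
        let value = match result {
            Ok(value) => value,
            Err(err) => {
                errors.push(err);
                continue; // skip the rest of this iteration
            }
        };
        values.push(value);
    }
    (values, errors)
}

/// Uses `inspect_err` + `unwrap_or_default`: the closure cannot affect the
/// loop's control flow, so a failed item still yields a (default) value.
pub fn collect_with_defaults(results: Vec<Result<u64, String>>) -> Vec<u64> {
    let mut values = Vec::new();
    for result in results {
        let value = result
            .inspect_err(|err| eprintln!("check failed: {err}"))
            .unwrap_or_default();
        values.push(value);
    }
    values
}
```

Writing `continue` inside the `inspect_err` closure would be rejected by the compiler, since a closure is its own function body and is not inside the loop.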

behzadnouri · 2024-07-18T14:57:57Z

ledger/src/blockstore.rs

+        }
+        // These metrics are expensive to send because hash does not compress well.
+        // Only send these metrics when we are sure the appropriate shred format is being sent
+        if !results.is_retransmitter_signed && shred::should_chain_merkle_shreds(slot, cluster_type)


As mentioned in the other comment, gating on should_chain_merkle_shreds has cluster upgrade issues.

behzadnouri · 2024-07-18T15:00:36Z

ledger/src/blockstore.rs

@@ -170,6 +174,31 @@ impl<T> AsRef<T> for WorkingEntry<T> {
    }
 }

+#[derive(Clone, Copy, PartialEq, Eq, Debug)]
+pub struct LastFECSetCheckResults {
+    block_id: Option<Hash>,


Is there a place where this block_id is defined?
Would it make sense to call this last_fec_set_merkle_root, or something similar, to be more expressive?

agave/sdk/program/src/vote/state/mod.rs

Lines 235 to 238 in a2eaf6d

    /// the unique identifier for the chain up to and
    /// including this block. Does not require replaying
    /// in order to compute.
    pub block_id: Hash,

This is where it is defined. I'll keep it as last_fec_set_merkle_root here and we can start calling it block_id in the replay pipeline.
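As a rough sketch of how the renamed result struct might be consumed by replay: everything below other than the struct name and its `last_fec_set_merkle_root` field is an assumption made for illustration. `Hash` is a plain byte array, `Features` is a stand-in for the real feature set, the `vote_only_full_fec_sets` flag and the method name are hypothetical, and the error names simply mirror the metric names quoted earlier in the thread.

```rust
/// Stand-in for the real hash type.
pub type Hash = [u8; 32];

/// Hypothetical error type mirroring the two failure cases discussed above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LastFecSetError {
    IncompleteFinalFecSet,
    InvalidRetransmitterSignatureFinalFecSet,
}

/// Stand-in for the feature set: one bool per (assumed) feature flag.
#[derive(Clone, Copy, Debug, Default)]
pub struct Features {
    pub vote_only_full_fec_sets: bool,
    pub vote_only_retransmitter_signed_fec_sets: bool,
}

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct LastFECSetCheckResults {
    pub last_fec_set_merkle_root: Option<Hash>,
    pub is_retransmitter_signed: bool,
}

impl LastFECSetCheckResults {
    /// Returns the Merkle root of the last FEC set (the future block_id),
    /// or an error if an active feature makes the observed state invalid.
    pub fn merkle_root_if_valid(
        &self,
        features: &Features,
    ) -> Result<Option<Hash>, LastFecSetError> {
        if features.vote_only_full_fec_sets && self.last_fec_set_merkle_root.is_none() {
            return Err(LastFecSetError::IncompleteFinalFecSet);
        }
        if features.vote_only_retransmitter_signed_fec_sets && !self.is_retransmitter_signed {
            return Err(LastFecSetError::InvalidRetransmitterSignatureFinalFecSet);
        }
        Ok(self.last_fec_set_merkle_root)
    }
}
```

With both flags off, the method passes the root through unchanged, which matches the intent that nothing is enforced until the features are activated.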

behzadnouri · 2024-07-18T15:03:43Z



mergify · 2024-07-18T16:42:13Z

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

…#2101)
* replay: extend last fec set check for 32+ retransmitter signed shreds
* pr feedback: use separate feature flag
* pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy
* pr feedback: update doc comment fail -> error
* pr feedback: hash -> bank_hash for report metrics
* refactor metrics inside blockstore fn, return block_id for future use
* pr feedback: gate metrics reporting
* pr feedback: do not distinguish impossible combos, simplify check code
* pr feedback: remove report_metrics helper fn
* pr feedback: remove metric
* pr feedback: block_id -> last_fec_set_merkle_root
(cherry picked from commit 93edb65)
# Conflicts:
# ledger/src/shred.rs
# sdk/src/feature_set.rs

…#2101)
* replay: extend last fec set check for 32+ retransmitter signed shreds
* pr feedback: use separate feature flag
* pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy
* pr feedback: update doc comment fail -> error
* pr feedback: hash -> bank_hash for report metrics
* refactor metrics inside blockstore fn, return block_id for future use
* pr feedback: gate metrics reporting
* pr feedback: do not distinguish impossible combos, simplify check code
* pr feedback: remove report_metrics helper fn
* pr feedback: remove metric
* pr feedback: block_id -> last_fec_set_merkle_root

…shreds (backport of #2101) (#2192)
replay: extend last fec set check for 32+ retransmitter signed shreds (#2101)
* replay: extend last fec set check for 32+ retransmitter signed shreds
* pr feedback: use separate feature flag
* pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy
* pr feedback: update doc comment fail -> error
* pr feedback: hash -> bank_hash for report metrics
* refactor metrics inside blockstore fn, return block_id for future use
* pr feedback: gate metrics reporting
* pr feedback: do not distinguish impossible combos, simplify check code
* pr feedback: remove report_metrics helper fn
* pr feedback: remove metric
* pr feedback: block_id -> last_fec_set_merkle_root
Co-authored-by: Ashwin Sekar

AshwinSekar force-pushed the replay-retransmitter branch from 9b3728d to ef2185c July 15, 2024 17:31

AshwinSekar commented Jul 15, 2024

ledger/src/blockstore.rs (outdated)

AshwinSekar commented Jul 15, 2024

ledger/src/blockstore.rs (outdated)

AshwinSekar marked this pull request as ready for review July 15, 2024 18:37

AshwinSekar requested review from behzadnouri and bw-solana July 15, 2024 18:37

behzadnouri reviewed Jul 16, 2024

bw-solana reviewed Jul 16, 2024

core/src/replay_stage.rs (outdated)

bw-solana reviewed Jul 16, 2024

core/src/replay_stage.rs

bw-solana reviewed Jul 16, 2024

ledger/src/blockstore.rs (outdated)

bw-solana reviewed Jul 16, 2024

ledger/src/blockstore.rs

AshwinSekar force-pushed the replay-retransmitter branch from ef2185c to 509cdb9 July 16, 2024 18:40

AshwinSekar commented Jul 16, 2024

behzadnouri reviewed Jul 17, 2024

AshwinSekar marked this pull request as draft July 17, 2024 15:49

AshwinSekar force-pushed the replay-retransmitter branch from 7daf78c to acac298 July 17, 2024 19:54

AshwinSekar marked this pull request as ready for review July 17, 2024 20:06

behzadnouri previously approved these changes Jul 18, 2024

AshwinSekar added 10 commits July 18, 2024 15:50

8070c3a replay: extend last fec set check for 32+ retransmitter signed shreds
498a131 pr feedback: use separate feature flag
3c2a3a5 pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy
9a3445a pr feedback: update doc comment fail -> error
f0935ae pr feedback: hash -> bank_hash for report metrics
2c11c74 refactor metrics inside blockstore fn, return block_id for future use
4ecfbba pr feedback: gate metrics reporting
9916e60 pr feedback: do not distinguish impossible combos, simplify check code
b27d78e pr feedback: remove report_metrics helper fn
c88824c pr feedback: remove metric
df32e7b pr feedback: block_id -> last_fec_set_merkle_root

AshwinSekar dismissed behzadnouri’s stale review via df32e7b July 18, 2024 16:00

AshwinSekar force-pushed the replay-retransmitter branch from acac298 to df32e7b July 18, 2024 16:00

AshwinSekar added the v2.0 Backport to v2.0 branch label Jul 18, 2024

behzadnouri approved these changes Jul 18, 2024

AshwinSekar merged commit 93edb65 into anza-xyz:master Jul 18, 2024
52 checks passed

AshwinSekar deleted the replay-retransmitter branch July 18, 2024 20:41

mergify bot mentioned this pull request Jul 18, 2024

v2.0: replay: extend last fec set check for 32+ retransmitter signed shreds (backport of #2101) #2192

Merged

AshwinSekar added the feature-gate Pull Request adds or modifies a runtime feature gate label Jul 22, 2024

AshwinSekar mentioned this pull request Jul 22, 2024

Feature Gate: Vote only on retransmitter signed FEC sets #2237

Open

AshwinSekar mentioned this pull request Aug 2, 2024

rolls out chained Merkle shreds to ~5% of testnet #2389

Merged
