Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replay: extend last fec set check for 32+ retransmitter signed shreds #2101

Merged
merged 11 commits into from
Jul 18, 2024

Conversation

AshwinSekar
Copy link

Problem

When the feature flag is turned on, we need to verify that the last fec set uses the new shred format.

Because the last fec set is not known when receiving shreds, and the new shred format is only used for the last set, we cannot perform this check until replay has completed.

Summary of Changes

Extend the check which compares the merkle roots of the last fec set to additionally check that they are all of the retransmitter shred variant.

ledger/src/blockstore.rs Outdated Show resolved Hide resolved
ledger/src/blockstore.rs Outdated Show resolved Hide resolved
@AshwinSekar AshwinSekar marked this pull request as ready for review July 15, 2024 18:37
Comment on lines 185 to 194
"incomplete_final_fec_set",
("slot", slot, i64),
("hash", hash.to_string(), String)
);
}
if !self.is_retransmitter_signed {
datapoint_warn!(
"invalid_retransmitter_signature_final_fec_set",
("slot", slot, i64),
("hash", hash.to_string(), String)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently these hash things are pretty expensive for metrics server because they do not compress.
And until chained merkle shreds are rolled out every node is going to submit this metric for every slot which is too much.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good i've gated this with should_chain_merkle_shreds so that we don't wastefully submit this metric prematurely.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then the metric cannot be used to evaluate if we are ready to activate the feature or not because the metric itself is behind the feature.

I think for now there is no point in adding the metric because it will spam metrics server until chained Merkle shreds are rolled out.
Putting it behind the feature gate is not helpful either because then we don't know if we can activate the feature or not.

So I think the best is to remove it entirely for now, wait until chained Merkle shreds are fully rolled out, then we can add this metric to confirm the cluster is ready to activate duplicate shreds features.

Copy link
Author

@AshwinSekar AshwinSekar Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought the should_chain_merkle_shreds function is the one that controls the chained merkle rollout?
As in, you will slowly update this function to target different cluster types and increasing slot ranges until it is always true.
By gating this metric behind should_chain_merkle_shreds we can verify if it is performing correctly on chained merkle shred slots only, and once should_chain_merkle_shreds is always true and the metric looks correct we can turn on feature_set::vote_only_retransmitter_signed_fec_sets

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose there would be some metrics spam as the cluster upgrades to a version that changes should_chain_merkle_shreds as there would be disagreement on whether this slot should use the new shred format.
What do you think about using feature_set::verify_retransmitter_signature::id() to gate the metric instead? since I assume we will activate that feature flag first (after chained merkle roots are fully rolled out) but before turning on this one.
If not, I can do as you suggest and remove the metric entirely and add it back before we intend to activate the feature.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose there would be some metrics spam as the cluster upgrades to a version that changes should_chain_merkle_shreds as there would be disagreement on whether this slot should use the new shred format.

yeah, that is the issue with using should_chain_merkle_shreds.

What do you think about using feature_set::verify_retransmitter_signature::id() to gate the metric instead?

I think the code is over crowded with feature gate branches already, and I would lean to keep it simple and not to add more branches.

ledger/src/blockstore.rs Outdated Show resolved Hide resolved
ledger/src/blockstore.rs Outdated Show resolved Hide resolved
ledger/src/shred.rs Outdated Show resolved Hide resolved
ledger/src/shred.rs Outdated Show resolved Hide resolved
core/src/replay_stage.rs Outdated Show resolved Hide resolved
ledger/src/blockstore.rs Outdated Show resolved Hide resolved
@@ -1267,6 +1283,10 @@ pub fn verify_test_data_shred(
}
}

pub fn should_chain_merkle_shreds(_slot: Slot, cluster_type: ClusterType) -> bool {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved from standard_broadcast_run in order to resolve dependencies.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a note to move this back if the metric gate on should_chain_merkle_shreds is removed.

core/src/replay_stage.rs Show resolved Hide resolved
Comment on lines 185 to 194
"incomplete_final_fec_set",
("slot", slot, i64),
("hash", hash.to_string(), String)
);
}
if !self.is_retransmitter_signed {
datapoint_warn!(
"invalid_retransmitter_signature_final_fec_set",
("slot", slot, i64),
("hash", hash.to_string(), String)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then the metric cannot be used to evaluate if we are ready to activate the feature or not because the metric itself is behind the feature.

I think for now there is no point in adding the metric because it will spam metrics server until chained Merkle shreds are rolled out.
Putting it behind the feature gate is not helpful either because then we don't know if we can activate the feature or not.

So I think the best is to remove it entirely for now, wait until chained Merkle shreds are fully rolled out, then we can add this metric to confirm the cluster is ready to activate duplicate shreds features.

core/src/replay_stage.rs Show resolved Hide resolved
Comment on lines 192 to 200
// These metrics are expensive to send because hash does not compress well.
// Only send these metrics when we are sure the appropriate shred format is being sent
if !self.is_retransmitter_signed && shred::should_chain_merkle_shreds(slot, cluster_type) {
datapoint_warn!(
"invalid_retransmitter_signature_final_fec_set",
("slot", slot, i64),
("bank_hash", bank_hash.to_string(), String)
);
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in the other comment, maybe we should remove this metric entirely for now and add it back once chained merkle shreds are fully rolled out.

ledger/src/blockstore.rs Outdated Show resolved Hide resolved
ledger/src/blockstore.rs Outdated Show resolved Hide resolved
ledger/src/blockstore.rs Outdated Show resolved Hide resolved
@AshwinSekar
Copy link
Author

AshwinSekar commented Jul 17, 2024

Do we already observe this in the metrics?
Wondering how much load this is adding to the metrics server.

Just to confirm, this is not currently spamming metrics server because last FEC set is already 32+ data shreds on mainnet, right?

The replay check that introduced this metric:
#1410
was never approved for backport to 1.18 so it is not observed on mainnet apart from master canaries.

On testnet however we do observe it quite frequently:
image

It seems that something is wrong with the check. I will hold off on this change and investigate the previous check and get back to you.

@AshwinSekar AshwinSekar marked this pull request as draft July 17, 2024 15:49
@AshwinSekar
Copy link
Author

AshwinSekar commented Jul 17, 2024

False alarm, I checked the block producer for blocks that have been failing the check on testnet today and they are all running firedancer:

3Rdnk6ZSGbPbwGiSEDkTs12oK5VcX8YWuQPaYtqcnYW6
5qzgRKGnpZs27aqR3tkZp2HJ8ZXoxTBjYoayqVG1QKbs
ARhax8ZwKr4xTkRB4HcU5ncWtGoVXCbCa2x98QhTpihw
F1REirtRdV1famCks1tLZ8ZpxKWxPzT8Ky7d19cA1K1N
Hwvxg5dacM8SbpwPDiZyxNNLpAid9v7BQZDef1Fhug2P
jfireruobS6UzNLqixjraT4URA73fSj4cWYankBsQpV
puMPKindCwo2b279wFUTHZbR3CncfEWX5ySk2ibqQ3U

It seems that FD is not padding the last fec set yet, so this is to be expected.

Also I realized that there is no reason to send the bank hash in metrics, as this check is happening before the bank is frozen so the hash is not yet populated 🤦

EDIT:
There are a couple of producers with 0.02% of stake that are not padding as well on 2.0.0. Perhaps they are running mods:

4qrtRWHiJm4YCTsGmdWThngntekKZXrF9nGHJCEoRTrC
8trHyY8YjVqiZ6APHYpmsmCYkVbWF5sMRSpKku6Yuhbj

@AshwinSekar AshwinSekar marked this pull request as ready for review July 17, 2024 20:06
behzadnouri
behzadnouri previously approved these changes Jul 18, 2024
Copy link

@behzadnouri behzadnouri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, aside from minor comments.

&bank.feature_set,
) {
Ok(block_id) => block_id,
Err(result_err) => {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of match, it might be more readable to use inspect_err:
https://doc.rust-lang.org/std/result/enum.Result.html#method.inspect_err
followed by unwrap_or_default.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't work because we can't continue from the inspect_err closure.

}
// These metrics are expensive to send because hash does not compress well.
// Only send these metrics when we are sure the appropriate shred format is being sent
if !results.is_retransmitter_signed && shred::should_chain_merkle_shreds(slot, cluster_type)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the other comment, gating on should_chain_merkle_shreds has cluster upgrade issues.

@@ -170,6 +174,31 @@ impl<T> AsRef<T> for WorkingEntry<T> {
}
}

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct LastFECSetCheckResults {
block_id: Option<Hash>,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a place this block_id is defined ?
Would it make sense to call this last_fec_set_merkle_root or something similar to be more expressive?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// the unique identifier for the chain up to and
/// including this block. Does not require replaying
/// in order to compute.
pub block_id: Hash,

This is where it is defined. I'll keep it as last_fec_set_merkle_root here and we can start calling it block_id in the replay pipeline.

@@ -1267,6 +1283,10 @@ pub fn verify_test_data_shred(
}
}

pub fn should_chain_merkle_shreds(_slot: Slot, cluster_type: ClusterType) -> bool {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a note to move this back if the metric gate on should_chain_merkle_shreds is removed.

Copy link

mergify bot commented Jul 18, 2024

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

@AshwinSekar AshwinSekar merged commit 93edb65 into anza-xyz:master Jul 18, 2024
52 checks passed
@AshwinSekar AshwinSekar deleted the replay-retransmitter branch July 18, 2024 20:41
mergify bot pushed a commit that referenced this pull request Jul 18, 2024
…#2101)

* replay: extend last fec set check for 32+ retransmitter signed shreds

* pr feedback: use separate feature flag

* pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy

* pr feedback: update doc comment fail -> error

* pr feedback: hash -> bank_hash for report metrics

* refactor metrics inside blockstore fn, return block_id for future use

* pr feedback: gate metrics reporting

* pr feedback: do not distinguish impossible combos, simplify check code

* pr feedback: remove report_metrics helper fn

* pr feedback: remove metric

* pr feedback: block_id -> last_fec_set_merkle_root

(cherry picked from commit 93edb65)

# Conflicts:
#	ledger/src/shred.rs
#	sdk/src/feature_set.rs
@AshwinSekar AshwinSekar added the feature-gate Pull Request adds or modifies a runtime feature gate label Jul 22, 2024
behzadnouri pushed a commit that referenced this pull request Aug 3, 2024
…#2101)

* replay: extend last fec set check for 32+ retransmitter signed shreds

* pr feedback: use separate feature flag

* pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy

* pr feedback: update doc comment fail -> error

* pr feedback: hash -> bank_hash for report metrics

* refactor metrics inside blockstore fn, return block_id for future use

* pr feedback: gate metrics reporting

* pr feedback: do not distinguish impossible combos, simplify check code

* pr feedback: remove report_metrics helper fn

* pr feedback: remove metric

* pr feedback: block_id -> last_fec_set_merkle_root
AshwinSekar added a commit that referenced this pull request Aug 16, 2024
…#2101)

* replay: extend last fec set check for 32+ retransmitter signed shreds

* pr feedback: use separate feature flag

* pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy

* pr feedback: update doc comment fail -> error

* pr feedback: hash -> bank_hash for report metrics

* refactor metrics inside blockstore fn, return block_id for future use

* pr feedback: gate metrics reporting

* pr feedback: do not distinguish impossible combos, simplify check code

* pr feedback: remove report_metrics helper fn

* pr feedback: remove metric

* pr feedback: block_id -> last_fec_set_merkle_root
AshwinSekar added a commit that referenced this pull request Aug 17, 2024
…#2101)

* replay: extend last fec set check for 32+ retransmitter signed shreds

* pr feedback: use separate feature flag

* pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy

* pr feedback: update doc comment fail -> error

* pr feedback: hash -> bank_hash for report metrics

* refactor metrics inside blockstore fn, return block_id for future use

* pr feedback: gate metrics reporting

* pr feedback: do not distinguish impossible combos, simplify check code

* pr feedback: remove report_metrics helper fn

* pr feedback: remove metric

* pr feedback: block_id -> last_fec_set_merkle_root
AshwinSekar added a commit that referenced this pull request Aug 23, 2024
…#2101)

* replay: extend last fec set check for 32+ retransmitter signed shreds

* pr feedback: use separate feature flag

* pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy

* pr feedback: update doc comment fail -> error

* pr feedback: hash -> bank_hash for report metrics

* refactor metrics inside blockstore fn, return block_id for future use

* pr feedback: gate metrics reporting

* pr feedback: do not distinguish impossible combos, simplify check code

* pr feedback: remove report_metrics helper fn

* pr feedback: remove metric

* pr feedback: block_id -> last_fec_set_merkle_root
AshwinSekar added a commit that referenced this pull request Aug 28, 2024
…shreds (backport of #2101) (#2192)

replay: extend last fec set check for 32+ retransmitter signed shreds (#2101)

* replay: extend last fec set check for 32+ retransmitter signed shreds

* pr feedback: use separate feature flag

* pr feedback: is_retransmitter_signed -> is_retransmitter_signed_variant, false for legacy

* pr feedback: update doc comment fail -> error

* pr feedback: hash -> bank_hash for report metrics

* refactor metrics inside blockstore fn, return block_id for future use

* pr feedback: gate metrics reporting

* pr feedback: do not distinguish impossible combos, simplify check code

* pr feedback: remove report_metrics helper fn

* pr feedback: remove metric

* pr feedback: block_id -> last_fec_set_merkle_root

Co-authored-by: Ashwin Sekar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-gate Pull Request adds or modifies a runtime feature gate v2.0 Backport to v2.0 branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants