Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stable2409] Backport #6696 #6840

Merged
merged 2 commits into from
Dec 11, 2024
Merged

Conversation

paritytech-cmd-bot-polkadot-sdk[bot]

Backport #6696 into stable2409 from alexggh.

See the documentation on how to use this bot.

…my (#6696)

After finality started lagging on kusama around `2025-11-25 15:55:40`
nodes started being overloaded with messages and some restarted with
```
Subsystem approval-distribution-subsystem appears unresponsive when sending a message of type polkadot_node_subsystem_types::messages::ApprovalDistributionMessage. origin=polkadot_service::relay_chain_selection::SelectRelayChainInner<sc_client_db::Backend<sp_runtime::generic::block::Block<sp_runtime::generic::header::Header<u32, sp_runtime::traits::BlakeTwo256>, sp_runtime::OpaqueExtrinsic>>, polkadot_overseer::Handle>
```

I think this happened because our aggression in the current form is way
too spammy and create problems in situation where we already constructed
blocks with a load of candidates to check which what happened around
`#25933682` before and after. However aggression, does help in the
nightmare scenario where the network is segmented and sparsely
connected, so I tend to think we shouldn't completely remove it.

The current configuration is:
```
l1_threshold: Some(16),
l2_threshold: Some(28),
resend_unfinalized_period: Some(8),
```
The way aggression works right now :
1. After L1 is triggered all nodes send all messages they created to all
the other nodes and all messages they would have they already send
according to the topology.
2. Because of resend_unfinalized_period for each block all messages at
step 1) are sent every 8 blocks, so for example let's say we have blocks
1 to 24 unfinalized, then at block 25, all messages for block 1, 9 will
be resent, and consequently at block 26, all messages for block 2, 10
will be resent, this becomes worse as more blocks are created if backing
backpressure did not kick in yet. In total this logic makes that each
node receive 3 * total_number_of messages_per_block
3. L2 aggression is way too spammy, when L2 aggression is enabled all
nodes sends all messages of a block on GridXY, that means that all
messages are received and sent by node at least 2*sqrt(num_validators),
so on kusama would be 66 * NUM_MESSAGES_AT_FIRST_UNFINALIZED_BLOCK, so
even with a reasonable number of messages like 10K, which you can have
if you escalated because of no shows, you end-up sending and receiving
~660k messages at once, I think that's what makes the
approval-distribution to appear unresponsive on some nodes.
4. Duplicate messages are received by the nodes which turn, mark the
node as banned, which may create more no-shows.

## Proposed improvements:
1. Make L2 trigger way later 28 blocks, instead of 64, this should
literally the last resort, until then we should try to let the
approval-voting escalation mechanism to do its things and cover the
no-shows.
2. On L1 aggression don't send messages for blocks too far from the
first_unfinalized there is no point in sending the messages for block
20, if block 1 is still unfinalized.
3. On L1 aggression, send messages then back-off for 3 *
resend_unfinalized_period to give time for everyone to clear up their
queues.
4. If aggression is enabled accept duplicate messages from validators
and don't punish them by reducting their reputation which, which may
create no-shows.

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Andrei Sandu <[email protected]>
(cherry picked from commit 85dd228)
@github-actions github-actions bot added the A3-backport Pull request is already reviewed well in another branch. label Dec 11, 2024
Copy link

This pull request is amending an existing release. Please proceed with extreme caution,
as to not impact downstream teams that rely on the stability of it. Some things to consider:

  • Backports are only for 'patch' or 'minor' changes. No 'major' or other breaking change.
  • Should be a legit fix for some bug, not adding tons of new features.
  • Must either be already audited or not need an audit.
Emergency Bypass

If you really need to bypass this check: add validate: false to each crate
in the Prdoc where a breaking change is introduced. This will release a new major
version of that crate and all its reverse dependencies and basically break the release.

@paritytech-cicd-pr
Copy link

The CI pipeline was cancelled due to failure one of the required jobs.
Job name: test-linux-stable 3/3
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7899542

@alexggh alexggh requested review from AndreiEres, sandreim and EgorPopelyaev and removed request for AndreiEres December 11, 2024 09:44
@EgorPopelyaev EgorPopelyaev merged commit 6925af4 into stable2409 Dec 11, 2024
158 of 191 checks passed
@EgorPopelyaev EgorPopelyaev deleted the backport-6696-to-stable2409 branch December 11, 2024 12:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A3-backport Pull request is already reviewed well in another branch.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants