litep2p: Update network backend to v0.7.0 #5609

Merged: 19 commits, Sep 10, 2024

@lexnv lexnv commented Sep 5, 2024

This release introduces several new features, improvements, and fixes to the litep2p library. Key updates include enhanced error handling, configurable connection limits, and a new API for managing public addresses.

For a detailed set of changes, see the litep2p changelog (https://github.com/paritytech/litep2p/blob/master/CHANGELOG.md#070---2024-09-05).

This PR makes use of:

  • connection limits to optimize network throughput
  • better errors that are propagated to substrate metrics
  • public addresses API to report healthy addresses to the Identify protocol

Warp sync time improvement

Measuring warp sync time is somewhat inaccurate, since the network is not deterministic and we might end up using faster peers (peers with more resources to handle our requests). However, I no longer saw warp sync times of 16 minutes; instead, they have roughly stabilized between 8 and 10 minutes.

For measuring warp-sync time, I've used sub-triage-logs (https://github.com/lexnv/sub-triage-logs/?tab=readme-ov-file#warp-time).

Litep2p

Phase | Time
 -|-
Warp  | 426.999999919s
State | 99.000000555s
Total | 526.000000474s

Libp2p

Phase | Time
 -|-
Warp  | 731.999999837s
State | 71.000000882s
Total | 803.000000719s
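
To put the two tables side by side, the figures can be reduced to a relative change per phase. The sketch below is only arithmetic on the numbers reported above, rounded to whole seconds; `relative_change` is a name made up for this example, and one run per backend is a data point rather than a benchmark.

```rust
/// Relative change from `before` to `after`, in percent.
/// Negative values mean `after` took less time.
fn relative_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

fn main() {
    // (phase, libp2p seconds, litep2p seconds), rounded from the tables above.
    let phases = [
        ("Warp", 732.0, 427.0),
        ("State", 71.0, 99.0),
        ("Total", 803.0, 526.0),
    ];

    for (phase, libp2p, litep2p) in phases {
        let change = relative_change(libp2p, litep2p);
        println!("{}: {:+.1}% with litep2p", phase, change);
    }
}
```

For these inputs the total time drops by roughly a third, while the state phase was slower in this particular run.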

Closes: #4986

Low peer count

After exposing the litep2p::public_addresses interface, we can report confirmed external addresses to litep2p. This should mitigate or at least improve #4925. I will keep the issue open to confirm this.

Improved metrics

We are one step closer to exposing metrics similar to libp2p's: #4681.

cc @paritytech/networking

Next Steps

  • Use public address interface to confirm addresses to identify protocol

@lexnv lexnv self-assigned this Sep 5, 2024
@paritytech-cicd-pr

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: test-linux-stable 3/3
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/7276240

@dmitry-markin dmitry-markin requested a review from bkchr September 6, 2024 07:42

@bkchr bkchr left a comment

Looks good, but some things should be improved.

@@ -557,6 +554,12 @@ impl<B: BlockT + 'static, H: ExHashT> NetworkBackend<B, H> for Litep2pNetworkBac
            .with_libp2p_ping(ping_config)
            .with_libp2p_identify(identify_config)
            .with_libp2p_kademlia(kademlia_config)
            .with_connection_limits(
                // By default litep2p accepts only two connections per peer.

Member

Why do we need multiple connections per peer?

Contributor Author

Litep2p accepts at most 2 connections with the remote peer to handle cases when both peers dial each other simultaneously. For libp2p, we configure this manually to 2 connections for similar reasons:

/// The maximum allowed number of established connections per peer.
///
/// Typically, and by design of the network behaviours in this crate,
/// there is a single established connection per peer. However, to
/// avoid unnecessary and nondeterministic connection closure in
/// case of (possibly repeated) simultaneous dialing attempts between
/// two peers, the per-peer connection limit is not set to 1 but 2.
const MAX_CONNECTIONS_PER_PEER: usize = 2;
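
As a rough illustration of why the limit is 2 rather than 1, the sketch below models a per-peer connection counter. `ConnectionLimiter`, `try_accept` and `on_closed` are hypothetical names for this example only; they are not the litep2p or libp2p API, and peers are identified by plain strings for simplicity.

```rust
use std::collections::HashMap;

/// Same value as the libp2p constant quoted above.
const MAX_CONNECTIONS_PER_PEER: usize = 2;

/// Hypothetical per-peer connection counter (not a litep2p or libp2p type).
#[derive(Default)]
struct ConnectionLimiter {
    established: HashMap<String, usize>,
}

impl ConnectionLimiter {
    /// Returns `true` if the connection is accepted and counted,
    /// `false` if the peer is already at the limit.
    fn try_accept(&mut self, peer: &str) -> bool {
        let count = self.established.entry(peer.to_string()).or_insert(0);
        if *count >= MAX_CONNECTIONS_PER_PEER {
            return false;
        }
        *count += 1;
        true
    }

    /// Records that one connection to `peer` was closed.
    fn on_closed(&mut self, peer: &str) {
        let remaining = match self.established.get_mut(peer) {
            Some(count) => {
                *count = count.saturating_sub(1);
                *count
            }
            None => return,
        };
        if remaining == 0 {
            self.established.remove(peer);
        }
    }
}

fn main() {
    let mut limiter = ConnectionLimiter::default();
    // Simultaneous dial: our outbound and the peer's inbound connection both succeed.
    assert!(limiter.try_accept("peer-a"));
    assert!(limiter.try_accept("peer-a"));
    // A third connection to the same peer is rejected.
    assert!(!limiter.try_accept("peer-a"));
    // After one of them closes, there is room again.
    limiter.on_closed("peer-a");
    assert!(limiter.try_accept("peer-a"));
}
```

With a limit of 1, one of the two simultaneous connections would have to be refused, and which one would depend on timing.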

RejectReason::ConnectionClosed => "connection-closed".to_string(),
RejectReason::SubstreamClosed => "substream-closed".to_string(),
RejectReason::SubstreamOpenError(substream_error) => {
format!("substream-open-error: {:?}", substream_error)

Member

Do we really want this as a metric name?

Contributor Author

Yep, that's a good point. I have removed the String allocation entirely and provided shorter str names, thanks 🙏

Some((RequestFailure::NotConnected, "not-connected".to_string())),
RequestResponseError::Rejected(reason) => {
let reason = match reason {
RejectReason::ConnectionClosed => "connection-closed".to_string(),

Member

Also better to use a Cow<'static, str> here. (Fewer allocations)
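
A minimal sketch of what the Cow<'static, str> suggestion could look like: fixed labels borrow a string literal, and only the dynamic case allocates. The `RejectReason` enum here is a simplified stand-in written for this example (its `SubstreamOpenError` payload is a plain `String`), not the actual litep2p type, and `reason_label` is a hypothetical helper.

```rust
use std::borrow::Cow;

/// Simplified stand-in for the reject reasons in the diff above.
enum RejectReason {
    ConnectionClosed,
    SubstreamClosed,
    SubstreamOpenError(String),
}

/// Maps a reason to a label; only the last arm allocates.
fn reason_label(reason: &RejectReason) -> Cow<'static, str> {
    match reason {
        RejectReason::ConnectionClosed => Cow::Borrowed("connection-closed"),
        RejectReason::SubstreamClosed => Cow::Borrowed("substream-closed"),
        RejectReason::SubstreamOpenError(err) => {
            Cow::Owned(format!("substream-open-error: {:?}", err))
        }
    }
}

fn main() {
    let reasons = [
        RejectReason::ConnectionClosed,
        RejectReason::SubstreamClosed,
        RejectReason::SubstreamOpenError("timeout".to_string()),
    ];
    for reason in &reasons {
        let label = reason_label(reason);
        let kind = match &label {
            Cow::Borrowed(_) => "borrowed",
            Cow::Owned(_) => "owned",
        };
        println!("{} ({})", label, kind);
    }
}
```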

Contributor Author

I have removed the allocations entirely, since we don't really need to expose each variant of format!("substream-open-error: {:?}", substream_error).
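
For illustration, an allocation-free version could return &'static str and collapse all substream-open errors into a single label, which also keeps the set of metric label values small and fixed. The enums and the label strings below are assumptions modelled on the earlier diff, not the code that was merged; `SubstreamError` and its variants are hypothetical.

```rust
/// Hypothetical inner error, only here to show that its details are dropped.
enum SubstreamError {
    ConnectionClosed,
    NegotiationFailed,
}

/// Simplified stand-in for the reject reasons in the diff above.
enum RejectReason {
    ConnectionClosed,
    SubstreamClosed,
    SubstreamOpenError(SubstreamError),
}

/// Maps a reason to a fixed label without allocating.
fn reason_label(reason: &RejectReason) -> &'static str {
    match reason {
        RejectReason::ConnectionClosed => "connection-closed",
        RejectReason::SubstreamClosed => "substream-closed",
        // The inner error is deliberately ignored: one label for all variants.
        RejectReason::SubstreamOpenError(_) => "substream-open-error",
    }
}

fn main() {
    let reasons = [
        RejectReason::ConnectionClosed,
        RejectReason::SubstreamClosed,
        RejectReason::SubstreamOpenError(SubstreamError::ConnectionClosed),
        RejectReason::SubstreamOpenError(SubstreamError::NegotiationFailed),
    ];
    for reason in &reasons {
        println!("{}", reason_label(reason));
    }
}
```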

@lexnv lexnv added T0-node This PR/Issue is related to the topic “node”. I5-enhancement An additional feature request. D0-easy Can be fixed primarily by duplicating and adapting code by an intermediate coder. labels Sep 6, 2024

lexnv commented Sep 9, 2024

I've changed two things since the last review:

  • print only newly discovered addresses
  • disable hickory proto (formerly trust-dns-proto) logging
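
One way to print only newly discovered addresses is to remember what has already been seen: `HashSet::insert` returns `true` only the first time a value is inserted. The sketch below assumes addresses are plain strings and uses a hypothetical `newly_discovered` helper; it is not the code from this PR, and the addresses are made up.

```rust
use std::collections::HashSet;

/// Returns the addresses from `discovered` that were not seen before,
/// recording them in `seen` as a side effect.
fn newly_discovered<'a>(seen: &mut HashSet<String>, discovered: &[&'a str]) -> Vec<&'a str> {
    discovered
        .iter()
        .copied()
        .filter(|addr| seen.insert(addr.to_string()))
        .collect()
}

fn main() {
    let mut seen = HashSet::new();

    let first = newly_discovered(&mut seen, &["/ip4/10.0.0.1/tcp/30333", "/ip4/10.0.0.2/tcp/30333"]);
    println!("new: {:?}", first);

    // Only the address not reported earlier is returned this time.
    let second = newly_discovered(&mut seen, &["/ip4/10.0.0.2/tcp/30333", "/ip4/10.0.0.3/tcp/30333"]);
    println!("new: {:?}", second);
}
```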

We got around 4.7k warnings from hickory:

WARN tokio-runtime-worker hickory_proto::xfer::dns_exchange: failed to associate send_message response to the sender

The fix is similar to paritytech/substrate#12253 (disable logging for this crate).
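
Disabling logging for one crate usually comes down to a per-target directive along the lines of hickory_proto=off. The toy model below shows how such a directive can silence one target while leaving the rest at the default level. It is a simplified stand-in written for this example, not the sc-tracing or tracing-subscriber implementation, and `Level`, `Filter` and `enabled` are hypothetical names.

```rust
/// Log levels ordered from least to most verbose.
#[derive(Clone, Copy, PartialEq, PartialOrd)]
enum Level {
    Off,
    Error,
    Warn,
    Info,
}

/// Hypothetical filter: a default level plus per-target overrides.
struct Filter {
    default: Level,
    directives: Vec<(String, Level)>,
}

impl Filter {
    /// A message is enabled if its level is at or below the maximum level
    /// of the longest directive whose target prefix matches.
    fn enabled(&self, target: &str, level: Level) -> bool {
        let max = self
            .directives
            .iter()
            .filter(|(prefix, _)| {
                target == prefix.as_str() || target.starts_with(&format!("{}::", prefix))
            })
            .max_by_key(|(prefix, _)| prefix.len())
            .map(|(_, lvl)| *lvl)
            .unwrap_or(self.default);
        level != Level::Off && level <= max
    }
}

fn main() {
    let filter = Filter {
        default: Level::Info,
        directives: vec![("hickory_proto".to_string(), Level::Off)],
    };

    // The noisy warning is suppressed...
    assert!(!filter.enabled("hickory_proto::xfer::dns_exchange", Level::Warn));
    assert!(!filter.enabled("hickory_proto::xfer::dns_exchange", Level::Error));
    // ...while other targets still log at the default level.
    assert!(filter.enabled("litep2p::ipfs::identify", Level::Warn));
}
```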

Local node triaging (litep2p)

Count | Level | Triage report
 -|-|-
232 | warn | 🥩 ran out of peers to request justif #.* from num_cache=.* num_live=.* err=.*
4 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( Peer disconnected with inflight after backoffs. Banned, disconnecting. )
2 | warn | ❌ Error while dialing .: .
1 | warn | 💔 Error importing block .: . ( Parent block of 0xd7a9…f573 has no associated weight )
1 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( Same block request multiple times. Banned, disconnecting. )

Other warnings:

    "2024-09-06 15:39:22.641  WARN tokio-runtime-worker litep2p::ipfs::identify: inbound identify substream opened for peer who doesn't exist peer=PeerId(\"12D3KooWRHaoLvJuJptSUgsc1bzXsKToRUR6qS2KW1MVgnJqLpKx\") protocol=/ipfs/id/1.0.0",
  
   "2024-09-07 20:01:07.952  WARN tokio-runtime-worker hickory_proto::xfer::dns_exchange: failed to associate send_message response to the sender",

    "2024-09-07 21:43:54.157  WARN tokio-runtime-worker litep2p::transport-manager: unknown connection opened as secondary connection, discarding peer=PeerId(\"12D3KooWGTnNXimfyieaZAeyRDvZLQpFF7Nr9a8bS3oN4yMPQExZ\") connection_id=ConnectionId(2347697) address=\"/ip4/212.224.112.221/tcp/49054/ws/p2p/12D3KooWGTnNXimfyieaZAeyRDvZLQpFF7Nr9a8bS3oN4yMPQExZ\" dial_record=AddressRecord { score: 100, address: \"/ip4/212.224.112.221/tcp/30333/p2p/12D3KooWGTnNXimfyieaZAeyRDvZLQpFF7Nr9a8bS3oN4yMPQExZ\", connection_id: Some(ConnectionId(2347695)) }",

This has resurfaced the "litep2p::transport-manager: unknown connection opened as secondary connection" warning from paritytech/litep2p#172. I have created a new issue for it: paritytech/litep2p#242

Local node triaging (libp2p)

Count | Level | Triage report
 -|-|-
683 | warn | Notification block pinning limit reached. Unpinning block with hash = .*
11 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( Not requested block data. Banned, disconnecting. )
4 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( Unsupported protocol. Banned, disconnecting. )
2 | warn | Can't listen on .* because: .*
1 | warn | Re-finalized block #.* (.) in the canonical chain, current best finalized is #.
1 | warn | Report .: . to .. Reason: .. Banned, disconnecting. ( Same block request multiple times. Banned, disconnecting. )
1 | warn | ❌ Error while dialing .: .
1 | warn | 💔 Error importing block .: . ( Parent block of 0xd7a9…f573 has no associated weight )

Other warnings:

   - "2024-09-06 14:13:20.673  WARN tokio-runtime-worker sc_network::service: 💔 The bootnode you want to connect to at `/dns/ksm14.rotko.net/tcp/33224/p2p/12D3KooWAa5THTw8HPfnhEei23HdL8P9McBXdozG2oTtMMksjZkK` provided a different peer ID `12D3KooWDTWSFqWQNqHdrAc2srGsqzK7GMw9RAjFTfUjcka5FEJN` than the one you expect `12D3KooWAa5THTw8HPfnhEei23HdL8P9McBXdozG2oTtMMksjZkK`.    ",
   - "2024-09-06 22:58:25.843 ERROR tokio-runtime-worker sc_utils::mpsc: The number of unprocessed messages in channel `mpsc-notification-to-protocol-2-beefy` exceeded 100000.",

Versi network testing

  • versi-net was started with 100 validators on Friday, then scaled down over the weekend to 20 validators
  • sub-triage-logs needs access to the loki instance (requested and tracked by devops)

Manual triaging (until sub-triage-logs gains access to loki):

WARN tokio-runtime-worker babe: 👶 Epoch(s) skipped: from 33226 to 33241    

2024-09-07 06:16:30.004  WARN tokio-runtime-worker parachain::dispute-coordinator: error=Runtime(RuntimeRequest(NotSupported { runtime_api_name: "candidate_events" }))
		
2024-09-07 06:16:30.084  WARN tokio-runtime-worker parachain::runtime-api: cannot query the runtime API version: Api called for an unknown Block: Header was not found in the database: 0x5aaa2a515394a2f9da57ab3ea792808f93822dcdcac76fd9de173776bd9d31ca api="candidate_events"

2024-09-07 06:16:59.828  WARN tokio-runtime-worker parachain::runtime-api: cannot query the runtime API version: Api called for an unknown Block: Header was not found in the database: 0x5aaa2a515394a2f9da57ab3ea792808f93822dcdcac76fd9de173776bd9d31ca api="candidate_events"

Warnings appeared after versi-net was scaled down from 100 to 20 validators on Saturday, at roughly Sat Sep 7 06:09:42, and continued for around 1h.
This was the first time we introduced scaling in our versi-net testing. I will keep an eye on this and check how libp2p behaves in comparison.

Grafana logs

@lexnv lexnv added this pull request to the merge queue Sep 10, 2024
Merged via the queue into master with commit 12eeb5d Sep 10, 2024
202 of 203 checks passed
@lexnv lexnv deleted the lexnv/litep2p-0.7.0 branch September 10, 2024 18:11
@Polkadot-Forum

This pull request has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/litep2p-network-backend-updates/9973/1

mordamax pushed a commit to paritytech-stg/polkadot-sdk that referenced this pull request Sep 11, 2024
fairax pushed a commit to UniqueNetwork/polkadot-sdk that referenced this pull request Oct 30, 2024
lexnv added a commit that referenced this pull request Nov 15, 2024
Successfully merging this pull request may close these issues.

network/litep2p: Slower sync time compared to libp2p