tcp/websocket/quic: Fix cancel memory leak #272

Merged: 11 commits, Oct 30, 2024

@lexnv lexnv commented Oct 22, 2024

This fixes a memory leak in the TCP and WebSocket transports.

The leak happens in the following scenario:

  • T0: transport manager: dials K (parallelism factor = 8) addresses on TCP and WebSocket on ConnectionId=1
  • T1: TCP: establishes a connection with the peer ConnectionId=1
  • T2: WebSocket: establishes a connection with the peer ConnectionId=1
  • T3: transport manager: receives TCP establishment event and cancels WebSocket dials

The issue happens when T2 finishes before T3.
In this situation, the WebSocket transport no longer has a future corresponding to ConnectionId=1.
The cancel method simply inserts ConnectionId=1 into a hashset. The hashset therefore grows over time, with no way to clean up stale connection IDs.
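
The T0..T3 race above can be reproduced with a small std-only model. This is a simplified sketch, not the actual litep2p code: `LeakyTransport`, its fields, and `stale_entries` are hypothetical names, and a plain set of IDs stands in for the in-flight futures.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the connection identifier.
type ConnectionId = usize;

/// Simplified model of the pre-fix state (illustrative names only).
#[derive(Default)]
struct LeakyTransport {
    /// IDs of dials that are still in flight.
    pending: HashSet<ConnectionId>,
    /// IDs recorded as canceled, consulted when a dial completes.
    canceled: HashSet<ConnectionId>,
}

impl LeakyTransport {
    fn dial(&mut self, id: ConnectionId) {
        self.pending.insert(id);
    }

    /// The dial finishes; the canceled marker is cleared only if the dial was still pending.
    fn complete(&mut self, id: ConnectionId) {
        if self.pending.remove(&id) {
            self.canceled.remove(&id);
        }
    }

    /// Pre-fix behavior as described above: record the ID unconditionally.
    fn cancel(&mut self, id: ConnectionId) {
        self.canceled.insert(id);
    }
}

/// Runs `rounds` dials and returns how many canceled IDs are left behind.
fn stale_entries(rounds: usize, cancel_first: bool) -> usize {
    let mut transport = LeakyTransport::default();
    for id in 0..rounds {
        transport.dial(id);
        if cancel_first {
            // Cancel arrives while the dial is in flight: marker is cleaned up.
            transport.cancel(id);
            transport.complete(id);
        } else {
            // T2 before T3: the dial is already gone when cancel arrives.
            transport.complete(id);
            transport.cancel(id);
        }
    }
    transport.canceled.len()
}
```

In this model, every cancel that arrives after its dial has completed leaves one entry in `canceled` that nothing ever removes.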

The fix relies on the changes added in #255:

  • cancel_futures maps a connection ID to an abort handle
  • the cancel_futures is guaranteed to contain a connection ID that corresponds to an unfinished pending_raw_connections future
  • the cancel method just aborts the in-flight future, if it exists
  • the cancel_futures state is cleaned up when polling pending_raw_connections
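
The four points above can be sketched as follows. This is a std-only illustration under stated assumptions, not the code in this PR: `AbortFlag` is a hypothetical stand-in for a real abort handle, `PendingDial` stands in for a pending future, and `mark_ready` exists only to drive the model.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the connection identifier.
type ConnectionId = usize;

/// Std-only stand-in for an abort handle shared with the in-flight dial.
#[derive(Clone, Default)]
struct AbortFlag(Arc<AtomicBool>);

impl AbortFlag {
    fn abort(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn is_aborted(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Stand-in for one entry of `pending_raw_connections`.
struct PendingDial {
    id: ConnectionId,
    abort: AbortFlag,
    ready: bool,
}

#[derive(Default)]
struct Transport {
    pending_raw_connections: Vec<PendingDial>,
    /// Maps a connection ID to the abort handle of its unfinished dial.
    cancel_futures: HashMap<ConnectionId, AbortFlag>,
}

impl Transport {
    fn dial(&mut self, id: ConnectionId) {
        let abort = AbortFlag::default();
        self.cancel_futures.insert(id, abort.clone());
        self.pending_raw_connections.push(PendingDial { id, abort, ready: false });
    }

    /// Model helper: pretend the dial's I/O has finished.
    fn mark_ready(&mut self, id: ConnectionId) {
        for dial in self.pending_raw_connections.iter_mut() {
            if dial.id == id {
                dial.ready = true;
            }
        }
    }

    /// Aborts the in-flight dial if it exists; otherwise does nothing and stores nothing.
    fn cancel(&mut self, id: ConnectionId) {
        if let Some(flag) = self.cancel_futures.get(&id) {
            flag.abort();
        }
    }

    /// Drains finished or aborted dials and cleans up `cancel_futures` in the same pass.
    fn poll(&mut self) -> Vec<ConnectionId> {
        let mut established = Vec::new();
        let mut still_pending = Vec::new();
        for dial in std::mem::take(&mut self.pending_raw_connections) {
            if dial.abort.is_aborted() {
                self.cancel_futures.remove(&dial.id);
            } else if dial.ready {
                self.cancel_futures.remove(&dial.id);
                established.push(dial.id);
            } else {
                still_pending.push(dial);
            }
        }
        self.pending_raw_connections = still_pending;
        established
    }
}
```

Because `cancel` never inserts anything, a cancel that arrives after the dial has finished is a no-op, and `cancel_futures` only ever holds IDs of dials that are still pending.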

Testing Done

I used a custom-patched version of litep2p to log the number of pending dials. After a few hours, the pending dials for both TCP and WebSocket connections stabilized at just a few. (Same as #271).

2024-10-22 17:37:56.252  INFO tokio-runtime-worker litep2p::tcp: status pending_dials=1 pending_inbound_connections=0 pending_connections=1 pending_raw_connections=0 opened_raw=0 cancel_futures=0 pending_open=0
2024-10-22 17:38:26.252  INFO tokio-runtime-worker litep2p::tcp: status pending_dials=0 pending_inbound_connections=0 pending_connections=0 pending_raw_connections=1 opened_raw=0 cancel_futures=1 pending_open=0
2024-10-22 17:38:56.253  INFO tokio-runtime-worker litep2p::tcp: status pending_dials=0 pending_inbound_connections=0 pending_connections=0 pending_raw_connections=0 opened_raw=0 cancel_futures=0 pending_open=0
2024-10-22 17:39:26.253  INFO tokio-runtime-worker litep2p::tcp: status pending_dials=0 pending_inbound_connections=0 pending_connections=0 pending_raw_connections=0 opened_raw=0 cancel_futures=0 pending_open=0
2024-10-22 17:39:56.252  INFO tokio-runtime-worker litep2p::tcp: status pending_dials=0 pending_inbound_connections=0 pending_connections=0 pending_raw_connections=0 opened_raw=0 cancel_futures=0 pending_open=0
2024-10-22 17:40:26.252  INFO tokio-runtime-worker litep2p::tcp: status pending_dials=0 pending_inbound_connections=0 pending_connections=0 pending_raw_connections=1 opened_raw=0 cancel_futures=1 pending_open=0
2024-10-22 17:40:56.252  INFO tokio-runtime-worker litep2p::tcp: status pending_dials=0 pending_inbound_connections=0 pending_connections=0 pending_raw_connections=0 opened_raw=0 cancel_futures=0 pending_open=0
2024-10-22 17:41:26.252  INFO tokio-runtime-worker litep2p::tcp: status pending_dials=0 pending_inbound_connections=0 pending_connections=0 pending_raw_connections=0 opened_raw=0 cancel_futures=0 pending_open=0
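
In the status lines above, cancel_futures matches pending_raw_connections in every sample. A small helper along these lines could check that automatically on a log file. This is only a sketch: `parse_status` and `cancel_state_consistent` are hypothetical names, and the `key=value` layout is assumed from the lines shown here, which come from a custom-patched build.

```rust
use std::collections::HashMap;

/// Collects `key=value` counters from a status line; other tokens are skipped.
fn parse_status(line: &str) -> HashMap<&str, u64> {
    line.split_whitespace()
        .filter_map(|token| {
            let (key, value) = token.split_once('=')?;
            Some((key, value.parse::<u64>().ok()?))
        })
        .collect()
}

/// True when both counters are present and equal.
fn cancel_state_consistent(line: &str) -> bool {
    let status = parse_status(line);
    match (status.get("cancel_futures"), status.get("pending_raw_connections")) {
        (Some(cancel), Some(raw)) => cancel == raw,
        _ => false,
    }
}
```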

Builds on: #255

Closes: #270


@dmitry-markin dmitry-markin left a comment

Looks good! Should we address QUIC as well?

src/transport/tcp/mod.rs (outdated, resolved)
src/transport/websocket/mod.rs (outdated, resolved)
lexnv and others added 2 commits October 30, 2024 11:21
Base automatically changed from lexnv/cancel-connections to master October 30, 2024 09:26
@lexnv lexnv changed the title tcp/websocket: Fix cancel memory leak tcp/websocket/quic: Fix cancel memory leak Oct 30, 2024
Signed-off-by: Alexandru Vasile <[email protected]>

lexnv commented Oct 30, 2024

I have fixed QUIC as well, since it wasn't too much of a hassle 🙏

@lexnv lexnv merged commit c0fef8d into master Oct 30, 2024
8 checks passed
@lexnv lexnv deleted the lexnv/fix-cancel-leak branch October 30, 2024 10:29
lexnv added a commit that referenced this pull request Nov 4, 2024
## [0.8.0] - 2024-11-01

This release adds support for content provider advertisement and
discovery to Kademlia protocol implementation (see libp2p
[spec](https://github.com/libp2p/specs/blob/master/kad-dht/README.md#content-provider-advertisement-and-discovery)).
Additionally, the release includes several improvements and memory leak
fixes to enhance the stability and performance of the litep2p library.

### Added

- kad: Providers part 8: unit, e2e, and `libp2p` conformance tests
([#258](#258))
- kad: Providers part 7: better types and public API, public addresses &
known providers ([#246](#246))
- kad: Providers part 6: stop providing
([#245](#245))
- kad: Providers part 5: `GET_PROVIDERS` query
([#236](#236))
- kad: Providers part 4: refresh local providers
([#235](#235))
- kad: Providers part 3: publish provider records (start providing)
([#234](#234))

### Changed

- transport_service: Improve connection stability by downgrading
connections on substream inactivity
([#260](#260))
- transport: Abort canceled dial attempts for TCP, WebSocket and Quic
([#255](#255))
- kad/executor: Add timeout for writting frames
([#277](#277))
- kad: Avoid cloning the `KademliaMessage` and use reference for
`RoutingTable::closest`
([#233](#233))
- peer_state: Robust state machine transitions
([#251](#251))
- address_store: Improve address tracking and add eviction algorithm
([#250](#250))
- kad: Remove unused serde cfg
([#262](#262))
- req-resp: Refactor to move functionality to dedicated methods
([#244](#244))
- transport_service: Improve logs and move code from tokio::select macro
([#254](#254))

### Fixed

- tcp/websocket/quic: Fix cancel memory leak
([#272](#272))
- transport: Fix pending dials memory leak
([#271](#271))
- ping: Fix memory leak of unremoved `pending_opens`
([#274](#274))
- identify: Fix memory leak of unused `pending_opens`
([#273](#273))
- kad: Fix not retrieving local records
([#221](#221))

---------

Signed-off-by: Alexandru Vasile <[email protected]>
Co-authored-by: Dmitry Markin <[email protected]>
github-merge-queue bot pushed a commit to paritytech/polkadot-sdk that referenced this pull request Nov 5, 2024
This PR updates litep2p to the latest release.

- `KademliaEvent::PutRecordSucess` is renamed to fix word typo
- `KademliaEvent::GetProvidersSuccess` and
`KademliaEvent::IncomingProvider` are needed for bootnodes on DHT work
and will be utilized later


### Added

- kad: Providers part 8: unit, e2e, and `libp2p` conformance tests
([#258](paritytech/litep2p#258))
- kad: Providers part 7: better types and public API, public addresses &
known providers ([#246](paritytech/litep2p#246))
- kad: Providers part 6: stop providing
([#245](paritytech/litep2p#245))
- kad: Providers part 5: `GET_PROVIDERS` query
([#236](paritytech/litep2p#236))
- kad: Providers part 4: refresh local providers
([#235](paritytech/litep2p#235))
- kad: Providers part 3: publish provider records (start providing)
([#234](paritytech/litep2p#234))

### Changed

- transport_service: Improve connection stability by downgrading
connections on substream inactivity
([#260](paritytech/litep2p#260))
- transport: Abort canceled dial attempts for TCP, WebSocket and Quic
([#255](paritytech/litep2p#255))
- kad/executor: Add timeout for writting frames
([#277](paritytech/litep2p#277))
- kad: Avoid cloning the `KademliaMessage` and use reference for
`RoutingTable::closest`
([#233](paritytech/litep2p#233))
- peer_state: Robust state machine transitions
([#251](paritytech/litep2p#251))
- address_store: Improve address tracking and add eviction algorithm
([#250](paritytech/litep2p#250))
- kad: Remove unused serde cfg
([#262](paritytech/litep2p#262))
- req-resp: Refactor to move functionality to dedicated methods
([#244](paritytech/litep2p#244))
- transport_service: Improve logs and move code from tokio::select macro
([#254](paritytech/litep2p#254))

### Fixed

- tcp/websocket/quic: Fix cancel memory leak
([#272](paritytech/litep2p#272))
- transport: Fix pending dials memory leak
([#271](paritytech/litep2p#271))
- ping: Fix memory leak of unremoved `pending_opens`
([#274](paritytech/litep2p#274))
- identify: Fix memory leak of unused `pending_opens`
([#273](paritytech/litep2p#273))
- kad: Fix not retrieving local records
([#221](paritytech/litep2p#221))

See release changelog for more details:
https://github.com/paritytech/litep2p/releases/tag/v0.8.0

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <[email protected]>
Co-authored-by: Dmitry Markin <[email protected]>
lexnv added a commit to paritytech/polkadot-sdk that referenced this pull request Nov 15, 2024
Labels: bug

Linked issue: TCP/WebSocket: Fix memory leak of canceled state and canceled futures