Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core/muxing: Redesign StreamMuxer trait #2648

Closed

Conversation

thomaseizinger
Copy link
Contributor

@thomaseizinger thomaseizinger commented May 15, 2022

Description

This PR re-designs the StreamMuxer trait to:

  1. Use &mut self for all methods
  2. Follow the style of "mutate state and return result via poll", same as we do for many other parts of rust-libp2p

Splitting out separate PRs

Additional tasks

  • Retain Substream associated type on libp2p-core level
  • Fix mplex performance regression (use channels of Bytes/[u8; X]?)
  • Make StreamMuxer use Pin<&mut self>
  • Include "first close, then flush" logic within SubstreamBox
  • Release yamux once Replace ConnectionError in public API with io::Error rust-yamux#135 is merged
  • Update PR with latest yamux release
  • Drop b6c91d1 from PR now that cec1176 is landed and the bug is fixed.
  • Write a more detailed changelog entry for libp2p-yamux.

Links to any relevant issues

Open Questions

  • Is there a better name for AsyncReadWriteBox?

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • A changelog entry has been made in the appropriate crates

Copy link
Contributor Author

@thomaseizinger thomaseizinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Things found in self-review.

muxers/yamux/src/lib.rs Outdated Show resolved Hide resolved
muxers/yamux/src/lib.rs Outdated Show resolved Hide resolved
swarm/src/connection.rs Outdated Show resolved Hide resolved
@mxinden
Copy link
Member

mxinden commented May 15, 2022

Great. I meant to touch this code part since some time. I will take a look. Thanks @thomaseizinger for proposing the patch set.

@thomaseizinger

This comment was marked as resolved.

@thomaseizinger

This comment was marked as resolved.

@thomaseizinger

This comment was marked as resolved.

@thomaseizinger
Copy link
Contributor Author

As part of fixing the benchmarks, I ran them against master and this branch and unfortunately, there seems to be some severe performance regressions in mplex:

Gnuplot not found, using plotters backend
tcp/256                 time:   [5.3159 ms 5.3836 ms 5.4592 ms]
                        thrpt:  [183.18 MiB/s 185.75 MiB/s 188.11 MiB/s]
                 change:
                        time:   [-0.2956% +2.5125% +5.8465%] (p = 0.10 > 0.05)
                        thrpt:  [-5.5236% -2.4509% +0.2965%]
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe
tcp/512                 time:   [3.4658 ms 3.5983 ms 3.7190 ms]
                        thrpt:  [268.89 MiB/s 277.91 MiB/s 288.53 MiB/s]
                 change:
                        time:   [-13.413% -5.8873% +2.6998%] (p = 0.18 > 0.05)
                        thrpt:  [-2.6289% +6.2556% +15.491%]
                        No change in performance detected.
Found 15 outliers among 100 measurements (15.00%)
  13 (13.00%) low severe
  2 (2.00%) high mild
tcp/1024                time:   [3.2309 ms 3.3600 ms 3.4761 ms]
                        thrpt:  [287.68 MiB/s 297.62 MiB/s 309.51 MiB/s]
                 change:
                        time:   [+6.6190% +15.824% +26.712%] (p = 0.00 < 0.05)
                        thrpt:  [-21.081% -13.662% -6.2081%]
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) low severe
  1 (1.00%) high mild
tcp/8192                time:   [2.8161 ms 2.8594 ms 2.9051 ms]
                        thrpt:  [344.22 MiB/s 349.73 MiB/s 355.09 MiB/s]
                 change:
                        time:   [+44.152% +57.258% +71.712%] (p = 0.00 < 0.05)
                        thrpt:  [-41.763% -36.410% -30.629%]
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
tcp/16384               time:   [2.7120 ms 2.7387 ms 2.7652 ms]
                        thrpt:  [361.64 MiB/s 365.13 MiB/s 368.73 MiB/s]
                 change:
                        time:   [+30.950% +42.147% +54.770%] (p = 0.00 < 0.05)
                        thrpt:  [-35.388% -29.650% -23.635%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
tcp/65536               time:   [2.2610 ms 2.4366 ms 2.6072 ms]
                        thrpt:  [383.56 MiB/s 410.41 MiB/s 442.29 MiB/s]
                 change:
                        time:   [+140.64% +162.39% +182.64%] (p = 0.00 < 0.05)
                        thrpt:  [-64.619% -61.888% -58.445%]
                        Performance has regressed.
tcp/262144              time:   [3.0009 ms 3.0434 ms 3.0865 ms]
                        thrpt:  [324.00 MiB/s 328.58 MiB/s 333.24 MiB/s]
                 change:
                        time:   [+99.851% +115.74% +132.65%] (p = 0.00 < 0.05)
                        thrpt:  [-57.017% -53.647% -49.963%]
                        Performance has regressed.
tcp/1048576             time:   [3.8665 ms 3.8915 ms 3.9158 ms]
                        thrpt:  [255.37 MiB/s 256.97 MiB/s 258.63 MiB/s]
                 change:
                        time:   [+42.686% +53.377% +65.578%] (p = 0.00 < 0.05)
                        thrpt:  [-39.606% -34.801% -29.916%]
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild

memory/256              time:   [3.9661 ms 4.0088 ms 4.0507 ms]
                        thrpt:  [246.87 MiB/s 249.45 MiB/s 252.14 MiB/s]
                 change:
                        time:   [+54.112% +61.755% +68.979%] (p = 0.00 < 0.05)
                        thrpt:  [-40.821% -38.178% -35.112%]
                        Performance has regressed.
memory/512              time:   [2.4825 ms 2.5069 ms 2.5319 ms]
                        thrpt:  [394.96 MiB/s 398.89 MiB/s 402.82 MiB/s]
                 change:
                        time:   [+51.312% +58.870% +66.709%] (p = 0.00 < 0.05)
                        thrpt:  [-40.015% -37.056% -33.911%]
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) low mild
  9 (9.00%) high mild
Benchmarking memory/1024: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.7s, enable flat sampling, or reduc
memory/1024             time:   [1.6728 ms 1.7484 ms 1.8086 ms]
                        thrpt:  [552.90 MiB/s 571.95 MiB/s 597.79 MiB/s]
                 change:
                        time:   [+23.997% +34.583% +46.069%] (p = 0.00 < 0.05)
                        thrpt:  [-31.539% -25.697% -19.353%]
                        Performance has regressed.
Benchmarking memory/8192: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.2s, enable flat sampling, or reduce sample count to 60.
memory/8192             time:   [1.3502 ms 1.3676 ms 1.3871 ms]
                        thrpt:  [720.93 MiB/s 731.22 MiB/s 740.66 MiB/s]
                 change:
                        time:   [+34.785% +44.974% +56.196%] (p = 0.00 < 0.05)
                        thrpt:  [-35.978% -31.022% -25.808%]
                        Performance has regressed.
memory/16384            time:   [791.93 us 908.92 us 1.0127 ms]
                        thrpt:  [987.44 MiB/s 1.0744 GiB/s 1.2331 GiB/s]
                 change:
                        time:   [+11.338% +22.345% +33.895%] (p = 0.00 < 0.05)
                        thrpt:  [-25.315% -18.264% -10.183%]
                        Performance has regressed.
Benchmarking memory/65536: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.7s, enable flat sampling, or reduce sample count to 60.
memory/65536            time:   [1.1447 ms 1.1558 ms 1.1671 ms]
                        thrpt:  [856.80 MiB/s 865.22 MiB/s 873.62 MiB/s]
                 change:
                        time:   [+68.560% +82.823% +98.392%] (p = 0.00 < 0.05)
                        thrpt:  [-49.595% -45.302% -40.674%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
Benchmarking memory/262144: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.7s, enable flat sampling, or reduce sample count to 60.
memory/262144           time:   [1.0463 ms 1.1525 ms 1.2377 ms]
                        thrpt:  [807.98 MiB/s 867.68 MiB/s 955.77 MiB/s]
                 change:
                        time:   [+13.879% +30.134% +47.883%] (p = 0.00 < 0.05)
                        thrpt:  [-32.379% -23.156% -12.188%]
                        Performance has regressed.
memory/1048576          time:   [2.1756 ms 2.1942 ms 2.2129 ms]
                        thrpt:  [451.90 MiB/s 455.75 MiB/s 459.65 MiB/s]
                 change:
                        time:   [+72.594% +79.643% +86.679%] (p = 0.00 < 0.05)
                        thrpt:  [-46.432% -44.334% -42.061%]
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

@thomaseizinger

This comment was marked as resolved.

@thomaseizinger thomaseizinger marked this pull request as draft May 17, 2022 12:13
Copy link
Member

@mxinden mxinden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some very cool simplifications in here.

Any chance this pull request can be split into multiple autonomous ones to ease the review process?

core/src/muxing.rs Outdated Show resolved Hide resolved
core/src/muxing.rs Show resolved Hide resolved
swarm/src/connection/substream.rs Show resolved Hide resolved
muxers/yamux/src/lib.rs Show resolved Hide resolved
muxers/mplex/src/lib.rs Outdated Show resolved Hide resolved
@thomaseizinger

This comment was marked as resolved.

@thomaseizinger

This comment was marked as outdated.

@thomaseizinger

This comment was marked as outdated.

@thomaseizinger thomaseizinger force-pushed the refactor-multiplexing branch 6 times, most recently from 8bc4dfb to cc75a1a Compare May 19, 2022 15:43
@thomaseizinger thomaseizinger marked this pull request as ready for review May 19, 2022 15:43
@thomaseizinger

This comment was marked as resolved.

@mxinden

This comment was marked as resolved.

let mut buf = Vec::new();
outbound.read_to_end(&mut buf).await.unwrap();
let mut buf = vec![0u8; 11];
outbound.read_exact(&mut buf).await.unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that not defeat the purpose of the unit test?

Above:

// Tests that `AsyncWrite::close` implies flush.

Not using read_to_end thus no longer tests the closing and flushing, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you might be right. I wasn't too happy about these changes but extracted them as a separate commit so we can discuss it easier.

I suspect there might be a bug somewhere but I couldn't spot it so far.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed it! The important commit is cec1176.

This means I can drop this commit from the PR history but that will require a force-push. I've noted it in the PR description as a task for the final clean-up once this PR is closer to being merged :)

@thomaseizinger

This comment was marked as resolved.

@mxinden
Copy link
Member

mxinden commented Jun 8, 2022

I would guess that https://github.com/flamegraph-rs/flamegraph proves useful here. One can compile the benchmarks to a binary and then run them via cargo flamegraph.

@thomaseizinger
Copy link
Contributor Author

I would guess that https://github.com/flamegraph-rs/flamegraph proves useful here. One can compile the benchmarks to a binary and then run them via cargo flamegraph.

Great idea. I am familiar with flamegraph-rs, just didn't think of using it.

thomaseizinger added a commit to thomaseizinger/rust-libp2p that referenced this pull request Jun 14, 2022
This is analogous to other sub-modules like `transport` within
`libp2p-core` and thus contributes to consistency within the code
base.

Additionally, it will reduce the diff of libp2p#2648.
We are already enforcing that the associated type must convert to
`io::Error`. We might as well just make all functions return an
`io::Error` directly.
Visually separating this makes it easier to read.
The current `StreamMuxer` API is based around a poll-based approach
where each substream needs to be passed by value to read/write/close
it.

The muxer itself however is not available in the end-user abstractions
like `ConnectionHandler`. To circumvent this, the `StreamMuxerBox`
type exists which allows a `StreamMuxer` to be cloned. Together with
`SubstreamRef`, this allows each substream to own a reference to the
muxer it was created with and pass itself to the muxer within
implementations of `AsyncRead` and `AsyncWrite`. These implementations
are convenient as they allow an end-user to simply read and write to
the substream without holding a reference to anything else.

We can achieve the same goal by changing the `StreamMuxerBox` abstraction
to use a `SubstreamBox`. Users of `libp2p-core` can continue to
depend on the `Substream` associated type of `StreamMuxer`.
@thomaseizinger
Copy link
Contributor Author

thomaseizinger commented Jun 15, 2022

EDIT: I managed to extract the crucial bit into a dedicated PR: #2706 Best to benchmark that in isolation.

OLD COMMENT BELOW:


Interestingly enough, I can no longer reproduce the performance problems. Would you mind trying @elenaf9 / @mxinden? What I did was:

  • Install cargo-criterion
  • Run cargo criterion within the muxers/mplex directory for master branch
  • Run cargo criterion within the muxers/mplex directory for refactoring-multiplexing branch (Easy to checkout via gh pr checkout 2648)

Funnily, I can now actually see a performance improvement of 5% for some of the TCP benchmarks and even 10% for the memory benchmarks.

tcp/16384 and memory/16384 are are an outlier with ~5% regression.

thomaseizinger added a commit to thomaseizinger/rust-libp2p that referenced this pull request Jun 15, 2022
This aligns the public API of the `libp2p-mplex` module with the one
from `libp2p-yamux`. This change has two benefits:

1. For standalone users of `libp2p-mplex`, the substreams itself are
now useful, similar to `libp2p-yamux` and don't necessarily need to
be polled via the `StreamMuxer`. The `StreamMuxer` only forwards to
the `Async{Read,Write}` implementations.

2. This will reduce the diff of libp2p#2648 because we can chunk the one
giant commit into smaller atomic ones.
@mxinden
Copy link
Member

mxinden commented Jun 19, 2022

@elenaf9 just double checking, none of these changes make QUIC in rust-libp2p harder to implement, right? I would expect the opposite, namely for all of this to simplify #2289.

mxinden pushed a commit that referenced this pull request Jun 20, 2022
…2706)

This aligns the public API of the `libp2p-mplex` module with the one
from `libp2p-yamux`. This change has two benefits:

1. For standalone users of `libp2p-mplex`, the substreams itself are
now useful, similar to `libp2p-yamux` and don't necessarily need to
be polled via the `StreamMuxer`. The `StreamMuxer` only forwards to
the `Async{Read,Write}` implementations.

2. This will reduce the diff of #2648 because we can chunk the one
giant commit into smaller atomic ones.
@mxinden
Copy link
Member

mxinden commented Jun 21, 2022

Interestingly enough, I can no longer reproduce the performance problems. Would you mind trying @elenaf9 / @mxinden? What I did was:

Sorry, late reply here. Happy to do it. Would I be running against this branch or against master given that #2706 is already merged?

@thomaseizinger
Copy link
Contributor Author

Interestingly enough, I can no longer reproduce the performance problems. Would you mind trying @elenaf9 / @mxinden? What I did was:

Sorry, late reply here. Happy to do it. Would I be running against this branch or against master given that #2706 is already merged?

If you want, you can benchmark #2706 against master before it was merged :)

@mxinden
Copy link
Member

mxinden commented Jun 22, 2022

@melekes given that #2622 now also implements the StreamMuxer trait, I want to make sure you are aware of Thomas refactorings around the trait.

Let us know in case there is something we should be considering (with regards to WebRTC).

I would expect the overall change set to simplify the WebRTC muxing implementation.

@thomaseizinger
Copy link
Contributor Author

thomaseizinger commented Jun 22, 2022

I am closing this PR in favor of a ticket where we can track all of this: #2722

@mxinden
Copy link
Member

mxinden commented Jun 22, 2022

Interestingly enough, I can no longer reproduce the performance problems. Would you mind trying @elenaf9 / @mxinden? What I did was:

Sorry, late reply here. Happy to do it. Would I be running against this branch or against master given that #2706 is already merged?

If you want, you can benchmark #2706 against master before it was merged :)

@thomaseizinger: Ran on a Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz:

critcmp 3c120ef9 ea487aeb
group             3c120ef9                                ea487aeb
-----             --------                                --------
memory/1024       1.00   518.4±42.66µs  1929.1 MB/sec     1.01   525.2±48.46µs  1904.1 MB/sec
memory/1048576    1.00   492.1±21.08µs  2032.2 MB/sec     1.03   508.9±44.19µs  1965.0 MB/sec
memory/16384      1.00   288.6±39.33µs     3.4 GB/sec     1.20   346.2±58.03µs     2.8 GB/sec
memory/256        1.00  1319.6±74.77µs   757.8 MB/sec     1.01  1328.2±60.90µs   752.9 MB/sec
memory/262144     1.00   284.4±39.42µs     3.4 GB/sec     1.00   284.5±38.24µs     3.4 GB/sec
memory/512        1.00   804.1±47.89µs  1243.7 MB/sec     1.00   803.6±47.80µs  1244.5 MB/sec
memory/65536      1.00   264.0±37.24µs     3.7 GB/sec     1.01   266.7±37.35µs     3.7 GB/sec
memory/8192       1.00   317.7±47.46µs     3.1 GB/sec     1.01   321.2±51.34µs     3.0 GB/sec
tcp/1024          1.02   810.9±57.88µs  1233.2 MB/sec     1.00   798.6±85.29µs  1252.2 MB/sec
tcp/1048576       1.00   885.9±56.03µs  1128.7 MB/sec     1.00   885.9±91.76µs  1128.8 MB/sec
tcp/16384         1.01   668.6±93.15µs  1495.7 MB/sec     1.00   659.9±98.35µs  1515.5 MB/sec
tcp/256           1.32  1927.9±258.41µs   518.7 MB/sec    1.00  1456.4±176.39µs   686.6 MB/sec
tcp/262144        1.03   679.6±96.01µs  1471.4 MB/sec     1.00   657.3±78.15µs  1521.4 MB/sec
tcp/512           1.12  1152.5±164.17µs   867.7 MB/sec    1.00  1027.3±134.77µs   973.4 MB/sec
tcp/65536         1.01   626.5±83.42µs  1596.1 MB/sec     1.00   619.0±78.21µs  1615.5 MB/sec
tcp/8192          1.04  686.9±117.84µs  1455.8 MB/sec     1.00  663.1±107.80µs  1508.1 MB/sec

where 3c120ef is the base and ea487ae includes the change. I don't see any of this as a sign of a regression.

@elenaf9
Copy link
Contributor

elenaf9 commented Jun 25, 2022

@elenaf9 just double checking, none of these changes make QUIC in rust-libp2p harder to implement, right? I would expect the opposite, namely for all of this to simplify #2289.

I did not get the chance to review the QuicMuxer in depth yet. I have been focused on the transport thus far.
But the QuicMuxer will definitely need some refactoring after #2707, see #2289 (comment). Because we now need to hand out substreams instead of just Ids, there is some additional logic needed. I can not say right now if it overall simplifies things or not.

But either way I think the changes of this PR are the right direction.

@thomaseizinger
Copy link
Contributor Author

@elenaf9 just double checking, none of these changes make QUIC in rust-libp2p harder to implement, right? I would expect the opposite, namely for all of this to simplify #2289.

I did not get the chance to review the QuicMuxer in depth yet. I have been focused on the transport thus far.
But the QuicMuxer will definitely need some refactoring after #2707, see #2289 (comment). Because we now need to hand out substreams instead of just Ids, there is some additional logic needed. I can not say right now if it overall simplifies things or not.

But either way I think the changes of this PR are the right direction.

I think worst case, you can follow an approach similar to the mplex muxer!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants