
Benchmarks and improve performance of ServletIo#requestBody #171

Open
wants to merge 10 commits into series/0.23

Conversation

@TimWSpence commented Jan 24, 2023

Background

We merged some dependency updates and noticed a spike in CPU usage and in GC frequency/duration. We narrowed it down to http4s-servlet:0.23.12 (0.23.13 is also affected), and with some git bisection and local snapshot builds we pinned it down to #50.

Changes in this PR

This PR adds a benchmark comparing the performance of requestBody and reader, and makes some suggested improvements to requestBody off the back of it. The benchmarks show a very significant reduction in throughput and a large increase in allocations when using requestBody. The most significant finding was that the requestBody benchmark was allocating an enormous number of int[] (35-40% of allocations) because we spawn three additional fibers per chunk: one in the parallel Dispatcher and two because the use of Stream#concurrently means that Pull.eval(q.take) is an interruptible eval.
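To make the shape concrete, here is a minimal, self-contained sketch of the general pattern (not the actual http4s-servlet code; the simulated producer stands in for the servlet ReadListener):

import cats.effect.{IO, IOApp}
import cats.effect.std.{Dispatcher, Queue}
import fs2.{Chunk, Stream}

object ConcurrentlyShape extends IOApp.Simple {
  def run: IO[Unit] =
    Dispatcher.parallel[IO].use { dispatcher =>
      Queue.bounded[IO, Option[Chunk[Byte]]](4).flatMap { q =>
        // stand-in for the ReadListener callback submitting chunks from servlet threads
        val producer = IO.delay {
          dispatcher.unsafeRunAndForget(q.offer(Some(Chunk.array("hello".getBytes))))
          dispatcher.unsafeRunAndForget(q.offer(None)) // end-of-stream marker
        }
        val body: Stream[IO, Byte] =
          Stream
            .fromQueueNoneTerminatedChunk(q)       // each pull is a q.take
            .concurrently(Stream.eval(producer))   // introduces an interruptible scope
        body.compile.count.flatMap(n => IO.println(s"read $n bytes"))
      }
    }
}

Because concurrently wraps the foreground stream in an interruptible scope, every q.take is evaluated interruptibly, which is roughly where the extra per-chunk fibers come from.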

Consequently, I made the following changes:

  1. Use a sequential Dispatcher to eliminate one fiber allocation. (I was actually confused about how the current code works, as there are no ordering guarantees for q.submit(chunk).unsafeRunAndForget when running on the parallel Dispatcher.) I'm only forcing sequential in the benchmarks so far, but this very much supports Create a sequential dispatcher for each request and response #170.
  2. Replace concurrently with flatMap so that Pull.eval(q.take) is no longer interruptible, eliminating the other fiber allocations (see the sketch after this list).
  3. Remove the boxing on the queue. I didn't see a huge difference from this, but given that the benchmarks suggest the improved requestBody still gives ~50% less throughput than reader and allocates about 10x more, every little helps.
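The second change amounts to sequencing the listener registration and the queue drain directly instead of running them concurrently. A minimal sketch, assuming a hypothetical registerListener effect that wires up the ReadListener (the real code differs in detail):

import cats.effect.IO
import cats.effect.std.Queue
import fs2.{Chunk, Stream}

object FlatMapShape {
  // registerListener is a hypothetical effect standing in for ReadListener setup.
  // Sequencing with flatMap keeps q.take outside any interruptible scope,
  // so no extra fibers are forked per chunk.
  def drained(q: Queue[IO, Option[Chunk[Byte]]], registerListener: IO[Unit]): Stream[IO, Byte] =
    Stream.eval(registerListener).flatMap { _ =>
      Stream.fromQueueNoneTerminatedChunk(q)
    }
}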

Benchmark results

Current version

Benchmark                                                         (iters)  (size)   Mode  Cnt         Score        Error   Units
ServletIoBenchmarks.reader                                           1000  100000  thrpt   20       206.176 ±      2.882   ops/s
ServletIoBenchmarks.reader:·gc.alloc.rate                            1000  100000  thrpt   20       514.322 ±      7.181  MB/sec
ServletIoBenchmarks.reader:·gc.alloc.rate.norm                       1000  100000  thrpt   20   2748150.063 ±      5.020    B/op
ServletIoBenchmarks.reader:·gc.churn.G1_Eden_Space                   1000  100000  thrpt   20       511.571 ±      8.988  MB/sec
ServletIoBenchmarks.reader:·gc.churn.G1_Eden_Space.norm              1000  100000  thrpt   20   2733572.797 ±  37047.601    B/op
ServletIoBenchmarks.reader:·gc.churn.G1_Survivor_Space               1000  100000  thrpt   20         0.032 ±      0.010  MB/sec
ServletIoBenchmarks.reader:·gc.churn.G1_Survivor_Space.norm          1000  100000  thrpt   20       172.966 ±     52.647    B/op
ServletIoBenchmarks.reader:·gc.count                                 1000  100000  thrpt   20       668.000               counts
ServletIoBenchmarks.reader:·gc.time                                  1000  100000  thrpt   20       361.000                   ms

ServletIoBenchmarks.requestBody                                      1000  100000  thrpt   20        39.742 ±      0.779   ops/s
ServletIoBenchmarks.requestBody:·gc.alloc.rate                       1000  100000  thrpt   20      2371.828 ±     46.455  MB/sec
ServletIoBenchmarks.requestBody:·gc.alloc.rate.norm                  1000  100000  thrpt   20  65741530.766 ±   2074.704    B/op
ServletIoBenchmarks.requestBody:·gc.churn.G1_Eden_Space              1000  100000  thrpt   20      2396.249 ±     47.821  MB/sec
ServletIoBenchmarks.requestBody:·gc.churn.G1_Eden_Space.norm         1000  100000  thrpt   20  66418490.614 ± 273169.904    B/op
ServletIoBenchmarks.requestBody:·gc.churn.G1_Survivor_Space          1000  100000  thrpt   20         0.029 ±      0.012  MB/sec
ServletIoBenchmarks.requestBody:·gc.churn.G1_Survivor_Space.norm     1000  100000  thrpt   20       806.306 ±    327.324    B/op
ServletIoBenchmarks.requestBody:·gc.count                            1000  100000  thrpt   20      1886.000               counts
ServletIoBenchmarks.requestBody:·gc.time                             1000  100000  thrpt   20       926.000                   ms

This PR

Benchmark                                                         (iters)  (size)   Mode  Cnt         Score       Error   Units
ServletIoBenchmarks.reader                                           1000  100000  thrpt   20       188.385 ±     3.473   ops/s
ServletIoBenchmarks.reader:·gc.alloc.rate                            1000  100000  thrpt   20       470.069 ±     8.604  MB/sec
ServletIoBenchmarks.reader:·gc.alloc.rate.norm                       1000  100000  thrpt   20   2749302.659 ±  1382.197    B/op
ServletIoBenchmarks.reader:·gc.churn.G1_Eden_Space                   1000  100000  thrpt   20       468.061 ±    10.620  MB/sec
ServletIoBenchmarks.reader:·gc.churn.G1_Eden_Space.norm              1000  100000  thrpt   20   2737395.458 ± 25786.814    B/op
ServletIoBenchmarks.reader:·gc.churn.G1_Survivor_Space               1000  100000  thrpt   20         0.026 ±     0.009  MB/sec
ServletIoBenchmarks.reader:·gc.churn.G1_Survivor_Space.norm          1000  100000  thrpt   20       150.846 ±    54.054    B/op
ServletIoBenchmarks.reader:·gc.count                                 1000  100000  thrpt   20       611.000              counts
ServletIoBenchmarks.reader:·gc.time                                  1000  100000  thrpt   20       357.000                  ms

ServletIoBenchmarks.requestBody                                      1000  100000  thrpt   20       104.962 ±     0.326   ops/s
ServletIoBenchmarks.requestBody:·gc.alloc.rate                       1000  100000  thrpt   20      1942.809 ±     6.386  MB/sec
ServletIoBenchmarks.requestBody:·gc.alloc.rate.norm                  1000  100000  thrpt   20  20396363.510 ±  8978.371    B/op
ServletIoBenchmarks.requestBody:·gc.churn.G1_Eden_Space              1000  100000  thrpt   20      1960.285 ±    11.354  MB/sec
ServletIoBenchmarks.requestBody:·gc.churn.G1_Eden_Space.norm         1000  100000  thrpt   20  20579794.985 ± 91489.964    B/op
ServletIoBenchmarks.requestBody:·gc.churn.G1_Survivor_Space          1000  100000  thrpt   20         0.022 ±     0.007  MB/sec
ServletIoBenchmarks.requestBody:·gc.churn.G1_Survivor_Space.norm     1000  100000  thrpt   20       234.842 ±    70.506    B/op
ServletIoBenchmarks.requestBody:·gc.count                            1000  100000  thrpt   20      1545.000              counts
ServletIoBenchmarks.requestBody:·gc.time                             1000  100000  thrpt   20       739.000                  ms

[Edited by @rossabaker to make the benchmarks monospace]

Concurrently introduces an fs2 interruptible scope which means that each
pull from the queue forks a fiber. The GC impact of this is significant.
@rossabaker (Member) left a comment

This looks like major progress, though I'm awfully disappointed that it still falls so far short. We may need to revert the whole queue concept.

The unsafeRunAndForget does seem dubious here with the parallel dispatcher. I'm now concerned we may have a critical correctness bug in the current version.
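As a standalone illustration of the ordering concern (not http4s code; the queue of Ints is purely illustrative): a parallel Dispatcher runs each submitted effect on its own fiber, so two unsafeRunAndForget calls may complete in either order, whereas Dispatcher.sequential preserves submission order.

import cats.effect.{IO, IOApp}
import cats.effect.std.{Dispatcher, Queue}

object OrderingSketch extends IOApp.Simple {
  def run: IO[Unit] =
    Dispatcher.parallel[IO].use { dispatcher =>
      Queue.unbounded[IO, Int].flatMap { q =>
        IO.delay {
          dispatcher.unsafeRunAndForget(q.offer(1))
          dispatcher.unsafeRunAndForget(q.offer(2)) // may be enqueued before 1
        } >> q.take.flatMap(first => IO.println(s"first element: $first"))
      }
    }
}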

@rossabaker (Member) commented Jan 25, 2023

This should be a unit test, but I think we've got a bigger problem than performance with that parallel dispatcher: 4c9eb64. Specifically, I'm seeing it reading an empty body!

[Edit: no, sequential is failing that way, too. I think I need to look at this with fresh eyes tomorrow.]

@TimWSpence (Author) commented Jan 25, 2023

> This looks like major progress, though I'm awfully disappointed that it still falls so far short. We may need to revert the whole queue concept.

Yeah, it's a shame because the new code is much easier to understand. I wonder if part of the problem is that we're effectively double-queuing each chunk, since there's a queue inside the sequential dispatcher as well.

> The unsafeRunAndForget does seem dubious here with the parallel dispatcher. I'm now concerned we may have a critical correctness bug in the current version.

Yeah, it seems like it shouldn't work to me. But there may be a critical correctness bug in my understanding 😅

Also thanks very much for fixing the spacing on the benchmarks! 😂

@rossabaker (Member) commented
I went ahead and made it a per-request dispatcher. Quick-and-dirty as of 9210893:

[info] Benchmark                        (iters)  (size)   Mode  Cnt    Score    Error  Units
[info] ServletIoBenchmarks.reader          1000  100000  thrpt    3  828.621 ± 15.904  ops/s
[info] ServletIoBenchmarks.requestBody     1000  100000  thrpt    3   56.251 ± 10.994  ops/s
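The per-request shape is roughly the sketch below (a rough approximation, not the actual commit; the requestBody parameter and its signature are hypothetical stand-ins for the body-reading logic):

import cats.effect.IO
import cats.effect.std.Dispatcher
import fs2.Stream

object PerRequestDispatcher {
  // Acquire a sequential dispatcher scoped to this one request and release it
  // when the body stream finalizes.
  def perRequestBody(requestBody: Dispatcher[IO] => Stream[IO, Byte]): Stream[IO, Byte] =
    Stream.resource(Dispatcher.sequential[IO]).flatMap(requestBody)
}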

Here's where requestBody is spending its time:

[info]   3.1%  28.4% sun.misc.Unsafe.unpark
[info]   2.2%  20.3% java.lang.Object.hashCode
[info]   1.5%  13.1% sun.misc.Unsafe.park
[info]   0.9%   8.4% java.io.InputStream.read
[info]   0.8%   6.9% cats.effect.IOPlatform.<init>
[info]   0.8%   6.8% cats.effect.IOFiber.runLoop
[info]   0.2%   2.2% cats.effect.IOFiber.succeeded
[info]   0.2%   1.4% cats.effect.IO.<init>
[info]   0.1%   1.2% java.util.Random.nextBytes
[info]   0.1%   1.1% cats.effect.kernel.Resource.loop$1
[info]   1.1%  10.1% <other>

This is reader:

[info]   1.8%  19.8% org.http4s.servlet.NonBlockingServletIo.org$http4s$servlet$NonBlockingServletIo$$read$1
[info]   1.7%  19.4% java.util.Random.nextBytes
[info]   1.4%  15.5% cats.effect.IOFiber.runLoop
[info]   1.0%  10.6% cats.effect.IOPlatform.<init>
[info]   0.4%   4.3% cats.effect.IOFiber.succeeded
[info]   0.3%   2.8% cats.effect.IO$$anon$4.delay
[info]   0.2%   2.7% cats.effect.IO.flatMap
[info]   0.2%   2.0% cats.effect.kernel.Resource$$Lambda$19/1133484341.<init>
[info]   0.2%   1.9% fs2.Pull$.fs2$Pull$$viewL$1
[info]   0.1%   1.6% cats.effect.IO$.pure
[info]   1.8%  19.5% <other>

@rossabaker (Member) commented Jan 25, 2023

Note that the above stack profile is on Java 8. We get a different picture on Java 17. We should target 8 with the library for maximal support, but optimize for 17.

requestBody stack:

[info]   7.7%  44.0% <stack is empty, everything is filtered?>
[info]   2.6%  14.7% cats.effect.IOFiber.runLoop
[info]   1.4%   8.2% jdk.internal.misc.Unsafe.unpark
[info]   0.9%   5.1% java.util.Random.nextBytes
[info]   0.7%   4.1% cats.effect.unsafe.ScalQueue.poll
[info]   0.7%   4.1% cats.effect.IOFiber.succeeded
[info]   0.3%   1.5% cats.effect.kernel.Resource.loop$1
[info]   0.2%   1.2% cats.effect.std.Dispatcher.unsafeRunAndForget
[info]   0.2%   1.2% cats.effect.IO.<init>
[info]   0.2%   1.0% cats.Monad.$anonfun$map$1
[info]   2.6%  14.8% <other>

and

[info] ServletIoBenchmarks.reader              1000  100000  thrpt    3  353.705 ± 580.082  ops/s
[info] ServletIoBenchmarks.requestBody         1000  100000  thrpt    3   77.763 ±   2.734  ops/s

Fascinating that reader got worse, and I don't know WTF that error margin is about.

@@ -86,6 +86,20 @@ class ServletIoBenchmarks {
readListener.onAllDataRead()
result
}

override def read(buf: Array[Byte]) = {
@TimWSpence (Author) left a comment

🤦 I should have realized that the default is one byte at a time via Read
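For context: java.io.InputStream's default read(Array[Byte], Int, Int) loops over the single-byte read(), so a benchmark stream needs to override the bulk read to copy a whole slice at once. A minimal illustration of the idea (the data and pos fields are illustrative, not the benchmark's actual ones):

import java.io.InputStream

class BulkReadStream(data: Array[Byte]) extends InputStream {
  // illustrative in-memory state, not the benchmark's actual fields
  private var pos = 0

  override def read(): Int =
    if (pos >= data.length) -1
    else { val b = data(pos) & 0xff; pos += 1; b }

  // Copy as much as is available in one call instead of one byte at a time.
  override def read(buf: Array[Byte], off: Int, len: Int): Int =
    if (len == 0) 0
    else if (pos >= data.length) -1
    else {
      val n = math.min(len, data.length - pos)
      System.arraycopy(data, pos, buf, off, n)
      pos += n
      n
    }
}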

@rossabaker (Member) commented
I started a Pull-based implementation that avoids the queues and quickly ran into about a 3x performance loss. If I change requestBody to be equivalent to reader, so that the only difference is creating and releasing an (unused in this impl) sequential Dispatcher per request, we are already in a mess!

[info] ServletIoBenchmarks.reader                 1000  100000  thrpt    2  470.326          ops/s
[info] ServletIoBenchmarks.requestBody            1000  100000  thrpt    2  153.507          ops/s

What I don't know is how much a decline in these benchmarks translates to a decline in real-world throughput. But I don't think we can compete with reader with a dispatcher-per-request.

@TimWSpence (Author) commented Jan 31, 2023

@rossabaker sorry I let this fall off my radar. Is there anything I can do to help get this whole performance thing over the line? I wasn't sure if you were planning to carry on with this PR or revert to the previous implementation.

@rossabaker (Member) commented
I got distracted by life and a couple of other projects this week. Yeah... the current one is not only not performant, but strikingly incorrect. My inclination is that we should revert without breaking binary compatibility, and meanwhile keep exploring a way to clean it up and remain performant.

@rossabaker mentioned this pull request on Feb 2, 2023
@rossabaker (Member) commented
Alright, we can release #175 to undo the damage, and then keep exploring better ways via this PR.
