Benchmarks and improve performance of ServletIo#requestBody
#171
base: series/0.23
Conversation
`concurrently` introduces an fs2 interruptible scope, which means that each pull from the queue forks a fiber. The GC impact of this is significant.
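For context, here is a minimal sketch of that shape, assuming fs2 3.x and cats-effect 3 (the `q` and `pump` names are illustrative stand-ins, not the actual `ServletIo` internals):

```scala
import cats.effect.IO
import cats.effect.std.Queue
import fs2.{Chunk, Stream}

// The read side is fed by a queue; the pump (e.g. the servlet callbacks)
// fills it with chunks and terminates it with None.
def bodyStream(
    q: Queue[IO, Option[Chunk[Byte]]],
    pump: Stream[IO, Nothing]
): Stream[IO, Byte] =
  // `concurrently` opens an interruptible scope so it can cancel `pump`
  // when the read side finishes; inside that scope, each `q.take` backing
  // the stream is an interruptible eval, i.e. a forked fiber per pull.
  Stream.fromQueueNoneTerminatedChunk(q).concurrently(pump)
```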
This looks like major progress, though I'm awfully disappointed that it still falls so far short. We may need to revert the whole queue concept.
The `unsafeRunAndForget` does seem dubious here with the parallel dispatcher. I'm now concerned we may have a critical correctness bug in the current version.
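As a rough illustration of the ordering concern, a sketch assuming cats-effect 3.4+ (`enqueueChunk` and the queue shape are hypothetical stand-ins for the servlet callback wiring):

```scala
import cats.effect.IO
import cats.effect.std.{Dispatcher, Queue}
import fs2.Chunk

// Called from a ReadListener-style callback on the container thread.
def enqueueChunk(
    dispatcher: Dispatcher[IO], // parallel: forks per submission
    q: Queue[IO, Option[Chunk[Byte]]],
    chunk: Chunk[Byte]
): Unit =
  // With Dispatcher.parallel[IO], two back-to-back submissions may run out
  // of order; Dispatcher.sequential[IO] runs them strictly in submission order.
  dispatcher.unsafeRunAndForget(q.offer(Some(chunk)))
```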
benchmarks/src/main/scala/org/http4s/servlet/ServletIoBenchmarks.scala
This should be a unit test, but I think we've got a bigger problem than performance with that parallel dispatcher: 4c9eb64. Specifically, I'm seeing it reading an empty body! [Edit: no, sequential is failing that way, too. I think I need to look at this with fresh eyes tomorrow.]
Yeah it's a shame because the new code is much easier to understand. I wonder if part of the problem is that we're effectively double-queuing each chunk? Because there's a queue inside the sequential dispatcher as well.
Yeah, it seems like it shouldn't work to me. But there may be a critical correctness bug in my understanding 😅 Also, thanks very much for fixing the spacing on the benchmarks! 😂
I went ahead and made it a per-request dispatcher. Quick-and-dirty as of 9210893:
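Roughly, the per-request wiring could look like this (an illustrative sketch, not the commit itself; `handle` stands in for serving one request):

```scala
import cats.effect.IO
import cats.effect.std.Dispatcher

// One short-lived sequential dispatcher per request: submissions from the
// servlet callbacks run in order on a single worker fiber.
def withRequestDispatcher[A](handle: Dispatcher[IO] => IO[A]): IO[A] =
  Dispatcher.sequential[IO].use(handle)
```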
[stack profile output]
Note that the above stack profile is on Java 8; we get a different picture on Java 17. We should be targeting 8 with the library for maximal support, but we should be optimizing for 17.
[Java 17 stack profile output]
@@ -86,6 +86,20 @@ class ServletIoBenchmarks {
      readListener.onAllDataRead()
      result
    }

    override def read(buf: Array[Byte]) = {
🤦 I should have realized that the default is one byte at a time via `read()`.
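The pitfall, sketched with a hypothetical benchmark-style stub (not the actual `ServletIoBenchmarks` code): `java.io.InputStream.read(Array[Byte], Int, Int)` defaults to looping over the single-byte `read()`, so a stub that only overrides `read()` serves the body one byte at a time.

```scala
import javax.servlet.ServletInputStream

// Left abstract: isFinished/isReady/setReadListener omitted for brevity.
abstract class StubInputStream(data: Array[Byte]) extends ServletInputStream {
  private var pos = 0

  override def read(): Int =
    if (pos < data.length) { val b = data(pos) & 0xff; pos += 1; b }
    else -1

  // Bulk copy instead of inheriting the default byte-at-a-time loop.
  override def read(buf: Array[Byte], off: Int, len: Int): Int =
    if (pos >= data.length) -1
    else {
      val n = math.min(len, data.length - pos)
      System.arraycopy(data, pos, buf, off, n)
      pos += n
      n
    }
}
```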
I started a Pull-based implementation that avoids the queues and quickly ran into about a 3x performance loss. If I change […]

What I don't know is how much a decline in these benchmarks translates to a decline in real-world throughput. But I don't think we can compete with […]
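A rough sketch of what such a Pull-based loop might look like (a reconstruction under assumptions, not the actual branch: it uses a blocking read and ignores the async ReadListener protocol for brevity):

```scala
import cats.effect.IO
import fs2.{Chunk, Pull, Stream}
import java.io.InputStream

// Read straight from the InputStream inside the effect: no queue, no
// dispatcher, one buffer reused across reads.
def requestBodyPull(in: InputStream, chunkSize: Int): Stream[IO, Byte] = {
  def loop(buf: Array[Byte]): Pull[IO, Byte, Unit] =
    Pull.eval(IO.blocking(in.read(buf))).flatMap { n =>
      if (n < 0) Pull.done
      else Pull.output(Chunk.array(buf.take(n))) >> loop(buf) // take copies the bytes read
    }
  loop(new Array[Byte](chunkSize)).stream
}
```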
@rossabaker sorry I let this fall off my radar. Is there anything I can do to help get this whole performance thing over the line? I wasn't sure if you were planning to carry on with this PR or revert to the previous implementation.
I got distracted by life and a couple of other projects this week. Yeah... the current one is not only not performant, but strikingly incorrect. My inclination is that we should revert without breaking binary compatibility, and meanwhile keep exploring a way to clean it up and remain performant.
Alright, we can release #175 to undo the damage, and then keep exploring better ways via this PR.
Background
We merged some dependency updates and noticed a spike in CPU usage and in GC frequency/duration. We managed to narrow it down to `http4s-servlet:0.23.12` (`.13` is also affected). With some git bisection and local snapshots, we managed to pin it down to #50.

Changes in this PR
This PR adds a benchmark to compare the performance of `requestBody` and `reader`, and makes some suggested improvements to `requestBody` off the back of that. The benchmarks do show a very significant reduction in throughput and an increase in allocations when using `requestBody`. The most significant thing that I found was that the `requestBody` benchmark was allocating an enormous number of `int[]` (35-40% of allocations) because we are spawning 3 additional fibers per chunk: one in the parallel `Dispatcher`, and two because the use of `Stream#concurrently` means that `Pull.eval(q.take)` is an interruptible eval. Consequently I made the following changes:

1. Use a sequential `Dispatcher` (I was actually confused how the current code works, as there are no ordering guarantees for `q.submit(chunk).unsafeRunAndForget` when running on the parallel `Dispatcher`?) to eliminate one fiber allocation. I'm only forcing sequential in the benchmarks thus far, but this very much supports "Create a sequential dispatcher for each request and response" (#170).
2. Replace `concurrently` with `flatMap` so that `Pull.eval(q.take)` is no longer interruptible, and hence eliminate the other fiber allocations (see the sketch after this list).

Although `requestBody` still gives ~50% less throughput and allocates about 10X more, I thought that every little helps.
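To make change 2 concrete, a minimal sketch assuming fs2 3.x and cats-effect 3 (`register` is an illustrative stand-in for installing the read listener, not the actual `ServletIo` code):

```scala
import cats.effect.IO
import cats.effect.std.Queue
import fs2.{Chunk, Stream}

def body(
    register: IO[Unit],
    q: Queue[IO, Option[Chunk[Byte]]]
): Stream[IO, Byte] =
  Stream.eval(register).flatMap { _ =>
    // No concurrent scope here, so Pull.eval(q.take) is a plain
    // (uninterruptible) eval: no fiber is forked per pulled chunk.
    Stream.fromQueueNoneTerminatedChunk(q)
  }
```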
Benchmark results

Current version

[benchmark output]

This PR

[benchmark output]
[Edited by @rossabaker to make the benchmarks monospace]