Reuse the TCP socket and buffer IO to improve performance #32
Conversation
wrt. tuple messages: I think it's a good idea to stop using NamedTuples. These carry symbols in their types, and symbols don't have a fixed size, making them less efficient to serialize.
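The point about NamedTuples can be checked directly. This is an illustrative sketch (not from the PR): the field names live in the NamedTuple's type as Symbols, so they get serialized with every message, while a plain Tuple of the same values carries less type information.

```julia
# Compare serialized sizes of a plain Tuple vs a NamedTuple with the same
# values: the NamedTuple's type carries its field names as Symbols, which
# must be serialized too.
using Serialization

function serialized_length(x)
    buf = IOBuffer()
    serialize(buf, x)
    position(buf)   # number of bytes written
end

plain = serialized_length((1, 2.0, "three"))
named = serialized_length((a = 1, b = 2.0, c = "three"))
@assert named > plain   # the NamedTuple payload is strictly larger
```

The gap grows with the number and length of the field names, and it is paid on every message sent.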
I made a type ... Then I was wondering how Distributed got this speed, and I found out that Julia has this secret function: ... You can call this on your ... I think it's best to use a ...
Hi @fonsp,
Ooh thanks @habemus-papadum, that's nice to hear! I had a fun time writing it, and it felt good to reach the same performance as Distributed, so I'm not sad to find out that ... What do you think about using ...?
@fonsp Your last commit, which removes ...
I was also going to suggest ... Doing all of this in a robust cross-platform way can be hard. Libuv provides a lot of the hard work, but you may end up finding that adding an extra layer like https://nng.nanomsg.org/ may help avoid a lot of complexity and re-inventing the wheel. I don't think it makes sense to worry about this now, but you can always ping me if you get curious and want a high-level description.
whoops
I also implemented remote channels, and for the implementation, I cheated and looked at the Distributed implementation. Theirs was surprisingly easy: they don't have any special communication dedicated to channels; it is just built with simple remote calls.

The main difference is in "eagerness": in Distributed, channel values are not sent to the host as they come in, but as they are requested with take!:

julia> using Distributed
julia> p = Distributed.addprocs(1)
1-element Vector{Int64}:
2
julia> c = Distributed.RemoteChannel(() -> eval(:(c = Channel{Any}(3))), 2)
RemoteChannel{Channel{Any}}(2, 1, 8)
julia> Distributed.remotecall_eval(Main, 2, :(put!(c, 123)))
123
julia> Distributed.remotecall_eval(Main, 2, :(put!(c, 99)))
99
julia> Distributed.remotecall_eval(Main, 2, :(isready(c)))
true
julia> take!(c)
123
julia> Distributed.remotecall_eval(Main, 2, :(isready(c)))
true
julia> take!(c)
99
julia> Distributed.remotecall_eval(Main, 2, :(isready(c)))
false

This also affects the blocking behaviour: if a channel is full on the worker side, it won't empty on its own; it needs the host to take! values out first:

julia> Distributed.remotecall_eval(Main, 2, :(put!(c, 1)))
1
julia> Distributed.remotecall_eval(Main, 2, :(put!(c, 2)))
2
julia> Distributed.remotecall_eval(Main, 2, :(put!(c, 3)))
3
julia> Distributed.remotecall_eval(Main, 2, :(put!(c, 4)))
^C^C^C^C^CWARNING: Force throwing a SIGINT
ERROR: InterruptException:

In the Distributed code, you can see that their implementation is similar to mine. One problem with my implementation might be performance: I used ...
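The channels-on-top-of-remote-calls idea described above can be sketched roughly like this. Malt.remote_eval_wait and Malt.remote_eval_fetch are Malt's real API; the LazyRemoteChannel type and make_lazy_channel helper are hypothetical names for illustration, not the PR's actual implementation.

```julia
# A sketch of a lazy remote channel built on plain remote eval, mirroring
# Distributed's approach: no dedicated channel messages, values stay on the
# worker until the host asks for them.
import Malt

struct LazyRemoteChannel
    worker::Malt.Worker   # Malt's worker handle
    name::Symbol          # name of a Channel variable defined on the worker
end

function make_lazy_channel(w::Malt.Worker, size::Integer = 32)
    name = gensym(:channel)
    # Define the Channel in the worker's Main module.
    Malt.remote_eval_wait(w, :($name = Channel{Any}($size)))
    LazyRemoteChannel(w, name)
end

# put!, take!, and isready are just remote evals of the ordinary Channel API.
Base.put!(c::LazyRemoteChannel, x) =
    Malt.remote_eval_wait(c.worker, :(put!($(c.name), $(QuoteNode(x)))))
Base.take!(c::LazyRemoteChannel) =
    Malt.remote_eval_fetch(c.worker, :(take!($(c.name))))
Base.isready(c::LazyRemoteChannel) =
    Malt.remote_eval_fetch(c.worker, :(isready($(c.name))))
```

Because put! on the host blocks on a remote eval, a full worker-side channel blocks the host caller exactly as in the Distributed transcript above.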
This PR is almost ready! 🎉 The only thing left is:
@habemus-papadum or @Pangoraw, maybe you could take a look? My questions are: ...
@fonsp
Actually, I see there is a slowdown, but not as bad on GHA -- will look closer.
Thanks for investigating, Nehal! If this turns out to be a blocker, then we could also try switching to a different IPC method already (#32 (comment)) rather than solving this TCP issue, since you said that that might be relatively easy to get going.
I'll be around at 12 noon EST if the dev call is happening -- changing from TCP sockets to Unix domain sockets should be very quick (something as easy as make ...)
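The swap mentioned above really is small at the API level: Julia's Sockets stdlib can listen on a filesystem path (a Unix domain socket on Unix, a named pipe on Windows) instead of a TCP port. A minimal sketch, with the path and messages invented for illustration:

```julia
# Swap a TCP port for a filesystem path: same accept/read/write API,
# but no TCP handshake or port allocation.
using Sockets

path = tempname()                 # e.g. "/tmp/jl_xxxxxx"; any unused path works
server = Sockets.listen(path)     # instead of Sockets.listen(port)

@async begin
    sock = accept(server)
    write(sock, "hello from server\n")
    close(sock)
end

client = Sockets.connect(path)    # instead of Sockets.connect("127.0.0.1", port)
println(readline(client))
close(client)
close(server)
```

Since both ends speak the same IO interface, the rest of the messaging code would not need to change.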
Nice! Let's discuss during the dev call.
It looks like bc8d408 added the overhead on benchmark 3 :o Let's experiment with reverting that!
@habemus-papadum note to myself: try fifo, try removing serialization.
Co-Authored-By: Nehal Patel <[email protected]> Co-Authored-By: Paul Berg <[email protected]>
bc8d408 indeed caused the extra overhead in Expr 3. Right now we match Distributed in every benchmark!
On Julia 1.9, we are 4x faster with Malt on Expr 3. It looks like this is because Julia 1.9 made this benchmark 4x slower, compared to Julia 1.8.

(benchmark screenshots for Julia 1.8 and Julia 1.9 omitted)

Reported this in JuliaLang/julia#48938
Now we have #32 (comment) again 😵💫
I don't really understand what could be causing this overhead.
Yeah, that's weird... maybe it's a more Julian performance issue due to typing and optimization etc.? bc8d408 also fixed #32 (comment), right? Maybe we can see how the Julia devs solve JuliaLang/julia#48938 and learn something from that, in case it's related?
I wrote a simple workaround for JuliaLang/julia#48938 🙂: replaced ...
Awesome! The only thing left is the small slowdown on Expr 3 on Windows nightly... hmmmmmmm, can we just ignore it?
Going to merge this! I will leave the last performance issue for a separate PR.
Heyy, tests passed on ...
Before this PR, we set up a new TCP connection for each round trip (like remotecall plus its response), and we suspected that this caused #24. This PR uses the same functionality as before, but it reuses the TCP socket. Often, this meant wrapping things in while true end.

Strangely enough, this did not improve our overhead... it doubled it 😵💫
Buffered IO
The real trick was suggested in the Pluto community call today by @habemus-papadum: we need to buffer the IO! See https://juliapluto.github.io/weekly-call-notes/2023/01-24/notes.html
I still want to test if this PR was even necessary, perhaps buffering the IO is already enough, and the overhead of TCP handshake is small enough.
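The buffering idea can be sketched as follows. This is a minimal illustration, not Malt's actual code: serializing straight to the socket issues many tiny writes (one per serialized fragment, each a libuv write), while staging the message in an in-memory IOBuffer first means the socket sees a single large write per message.

```julia
# Stage each serialized message in an IOBuffer so the socket receives
# one big write instead of many small ones.
using Serialization, Sockets

function send_message(socket::IO, msg)
    buf = IOBuffer()
    serialize(buf, msg)         # many small writes hit only the in-memory buffer
    write(socket, take!(buf))   # one big write hits the actual socket
    flush(socket)
end
```

The same trick applies on the read side: reading the socket through a buffer avoids one syscall per deserialized fragment.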
New messaging protocol
I also changed the messaging protocol into something new, described in the new messages.md file. The changes are because:

- When deserialization fails, the catch block can keep reading bytes until it has found the boundary. This allows us to continue using the socket.

TODO
- BufferedIO. 🥲
- The latest task, to interrupt on Windows, is not working right now, because we process all messages synchronously. This means that the worker can only receive the from_host_interrupt message after it processed all previous work, so there is nothing to interrupt.
- zeros(UInt8, 50_000_000) does not pass our benchmark tests before Julia 1.8: it is 1.5x slower than the Distributed stdlib. Also see #32 (comment)
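The catch-until-boundary idea from the protocol section above can be sketched like this. The sentinel bytes and function name are made up for illustration; the real format lives in messages.md.

```julia
# Each message ends with a known sentinel, so after a failed deserialize the
# reader can skip ahead to the next boundary and keep using the same socket.
const MSG_BOUNDARY = UInt8[0x01, 0x62, 0x6e, 0x64]  # made-up sentinel bytes

# Read and discard bytes until a full boundary sequence has been consumed.
function skip_to_boundary(io::IO)
    matched = 0
    while matched < length(MSG_BOUNDARY)
        b = read(io, UInt8)
        if b == MSG_BOUNDARY[matched + 1]
            matched += 1
        elseif b == MSG_BOUNDARY[1]
            matched = 1
        else
            matched = 0
        end
    end
end
```

A receive loop would then wrap its deserialize call in try/catch and call skip_to_boundary in the catch branch instead of tearing down the connection.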