This repository has been archived by the owner on Jun 27, 2022. It is now read-only.

Added connection timeouts. #9

Open · wants to merge 12 commits into base: cloneable-socket

Conversation

@ned14 commented Jun 22, 2015

Your earlier code doesn't retry a socket read that receives EINTR, and I need connections to die if they are not live. I therefore wrapped all reads in a loop governed by CONNECTION_TIMEOUT, which is 20 secs, as it is for TCP on Linux.
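
To illustrate the idea (this is a simplified sketch of mine, not the exact code in the commits): retry a read that fails with EINTR, but give up once an overall 20-second deadline has passed.

    use std::io::{self, ErrorKind, Read};
    use std::time::{Duration, Instant};

    /// Retry reads interrupted by a signal (EINTR maps to ErrorKind::Interrupted),
    /// but fail with TimedOut once the overall connection timeout has elapsed.
    fn read_with_timeout<R: Read>(reader: &mut R, buf: &mut [u8]) -> io::Result<usize> {
        let connection_timeout = Duration::from_secs(20); // as for TCP on Linux
        let deadline = Instant::now() + connection_timeout;
        loop {
            match reader.read(buf) {
                Err(ref e) if e.kind() == ErrorKind::Interrupted => {
                    // EINTR: retry unless the overall deadline has passed.
                    if Instant::now() >= deadline {
                        return Err(io::Error::new(ErrorKind::TimedOut, "connection timed out"));
                    }
                }
                other => return other,
            }
        }
    }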

Niall Douglas (s [underscore] sourceforge {at} nedprod [dot] com) added 2 commits June 22, 2015 18:53
@meqif (Owner) commented Jun 23, 2015

Well, this is awkward. I also have a branch to handle the timeout cases I mentioned in #8, with commits I held back from that merge for further tests and refinement. I did so because I will have to focus exclusively on my thesis for at least a couple of days.

My approach was slightly different from yours, based on the number of retries rather than the time, but either should be fine.

After a quick read, I have a couple of concerns about some of the proposed changes:

  • MAX_CONGESTION_TIMEOUT is supposed to be 60 seconds, per the LEDBAT RFC (emphasis mine):

    The CTO is a gating mechanism that ensures exponential backoff of
    sending rate under heavy congestion, and it may be implemented with
    or without a timer. An implementation choosing to avoid timers may
    consider using a "next-time-to-send" variable, set based on the CTO,
    to control the earliest time a sender may transmit without receiving
    any ACKs. A maximum value MAY be placed on the CTO, and if placed,
    it MUST be at least 60 seconds.

  • I think the way you split part of close into send_fin pushes too much responsibility onto the latter: contrary to what its name suggests, it also flushes the send buffer.

  • That split may even result in infinite recursion (although I may have misread the diff): flush calls recv, which will always call send_fin at the end (which I see no reason to do), which in turn will also call flush, and so on.

dd60097 looks okay, apart from some minor style quibbles.

As I said previously, it'll be a couple of days until I can review this pull request properly. Sorry.

@ned14 (Author) commented Jun 23, 2015

Firstly, you should prioritise your schooling ahead of all else. I have no problem if you throw away everything I do; indeed, I would expect it, given my lack of experience in Rust. Besides, it's your library: you have the right to rewrite any pull request anyone sends you, or simply to refuse it; I regularly do so with pull requests against my own libraries and the Boost libraries I maintain. So long as the ability to time out a stale connection exists some time in the next six months, I don't think MaidSafe cares either.

Regarding the congestion timeout, there are really two issues here. The first is that a retry timeout is very different to a connection timeout. If I understand your code correctly, the congestion timeout is a retry timeout, and you were using a multiple of it as a connection timeout. I've simply replaced that with an absolute timeout, as the same loop can also handle EINTR, so it solves both problems. If you look up maximum timeouts for TCP you will find a substantial difference in philosophy about when a connection should be considered dead, with Linux being particularly aggressive.

I think in your library you should simply provide a settable option. You can default it to minutes if you like, so long as it can be overridden. In crust we'll need it in seconds to reduce the DDoS attack surface.
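
For concreteness, the kind of option I mean looks something like this; the type and method names here are hypothetical, not rust-utp's actual API:

    use std::time::Duration;

    // Hypothetical stand-in for the socket type, just to show the shape of the option.
    pub struct UtpSocket {
        connection_timeout: Duration,
    }

    impl UtpSocket {
        /// Override the default connection timeout; crust would set this to a few
        /// seconds, while a server with patient peers might prefer minutes.
        pub fn set_connection_timeout(&mut self, timeout: Duration) {
            self.connection_timeout = timeout;
        }
    }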

What I'm about to do next is to change your use of select() to use a "cancellation fd" which I suppose should be a pipe. When I write a byte to the pipe, the blocking select will immediately exit. I'll use that to implement ungraceful connection shutdown as a way of immediately cancelling all blocking reads and listens.
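
Roughly the shape of that plan, as an illustrative sketch (assuming the libc crate; not the code I'll actually commit): select() watches both the UTP socket and the read end of a pipe, and any thread can write a byte to the pipe to make a blocked select() return immediately.

    extern crate libc;

    use std::io;
    use std::os::unix::io::RawFd;

    /// The "cancellation fd": writing to `write_fd` wakes any select() that is
    /// watching `read_fd`.
    struct Canceller {
        read_fd: RawFd,
        write_fd: RawFd,
    }

    impl Canceller {
        fn new() -> io::Result<Canceller> {
            let mut fds = [0 as libc::c_int; 2];
            if unsafe { libc::pipe(fds.as_mut_ptr()) } != 0 {
                return Err(io::Error::last_os_error());
            }
            Ok(Canceller { read_fd: fds[0], write_fd: fds[1] })
        }

        /// Called from any thread to abort blocking reads.
        fn cancel(&self) {
            let byte = [1u8];
            unsafe { libc::write(self.write_fd, byte.as_ptr() as *const libc::c_void, 1) };
        }
    }

    /// Block until the socket is readable or the canceller fires.
    /// Returns true if we were cancelled rather than woken by data.
    fn wait_readable_or_cancelled(socket_fd: RawFd, canceller: &Canceller) -> io::Result<bool> {
        unsafe {
            let mut readfds: libc::fd_set = std::mem::zeroed();
            libc::FD_ZERO(&mut readfds);
            libc::FD_SET(socket_fd, &mut readfds);
            libc::FD_SET(canceller.read_fd, &mut readfds);
            let nfds = std::cmp::max(socket_fd, canceller.read_fd) + 1;
            if libc::select(nfds, &mut readfds, std::ptr::null_mut(),
                            std::ptr::null_mut(), std::ptr::null_mut()) < 0 {
                return Err(io::Error::last_os_error());
            }
            Ok(libc::FD_ISSET(canceller.read_fd, &mut readfds))
        }
    }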

Once I have that in place, crust should work on UTP, and I suppose MaidSafe will reallocate me to new work. Anyway, I'll send that new code to you as additional pull requests, but honestly, don't worry about them until your studies are done. And feel absolutely free to ignore and/or rewrite anything I send. I am well aware that my Rust skills are poor; indeed, I still can't write lifetimes that actually compile, as I haven't had the time to practise.

Niall Douglas (s [underscore] sourceforge {at} nedprod [dot] com) added 2 commits June 23, 2015 17:27
@meqif (Owner) commented Jun 23, 2015

Thank you for understanding. :)

You are correct in your analysis that the congestion timeout is a retry timeout. According to the LEDBAT specification, this timeout doubles every time it is exceeded during a receive operation, and it has a ceiling of 60 seconds (which was the point I tried to make in my previous comment).

I intentionally reused that same rule for the connection timeout (despite the difference in situations), admitting at most 5 timeouts in connect (i.e., waiting at most a little over a minute before giving up [1]), partly out of laziness, partly because of the philosophical difference you mentioned.
However, the timeouts in connect do not affect the socket's congestion timeout: notice that it is the local timeout variable syn_timeout that doubles after each timeout, not self.congestion_timeout.

    I think in your library you should simply provide a settable option. You can default it to minutes if you like, so long as it can be overridden. In crust we'll need it in seconds to reduce the DDoS attack surface.

I see your point.

I hope your changes go smoothly and you don't have to fight the compiler too much. The compiler can be unforgiving at times, especially when lifetimes are involved (and they tend to proliferate like bacteria once you need them!).

[1]: A total of 6 attempts: 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds.
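
To make [1] concrete, the retry scheme looks roughly like this (an illustrative sketch, not the crate's actual connect()):

    use std::time::Duration;

    /// Illustrative only: try the SYN handshake with a timeout that starts at one
    /// second and doubles after every expiry, giving up after the sixth failed
    /// attempt (1 + 2 + 4 + 8 + 16 + 32 = 63 seconds in total).
    fn connect_with_backoff<F>(mut try_handshake: F) -> Result<(), &'static str>
    where
        F: FnMut(Duration) -> bool, // true once a SYN-ACK arrives within the timeout
    {
        let max_syn_timeouts = 5;
        let mut syn_timeout = Duration::from_secs(1);
        let mut timeouts = 0;
        loop {
            if try_handshake(syn_timeout) {
                return Ok(());
            }
            timeouts += 1;
            if timeouts > max_syn_timeouts {
                return Err("connect timed out");
            }
            // Only the local syn_timeout doubles; the socket's congestion_timeout
            // is left untouched.
            syn_timeout *= 2;
        }
    }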

Niall Douglas (s [underscore] sourceforge {at} nedprod [dot] com) added 2 commits June 24, 2015 18:38
…don't necessarily exit, rather they check to see if the connection got closed in another thread, and if so then they might exit.
…esult, and you need to check again :(. Anyway fixes a spurious hang on Linux.
@ned14 (Author) commented Jun 26, 2015

@meqif Ricardo it looks like my time with MaidSafe will be coming to an end in the next few weeks, so can I check what your schedule will be this coming month? When do you submit your thesis?

@meqif (Owner) commented Jun 27, 2015

@ned14 My schedule is not entirely rigid, except for the meetings with my thesis adviser and deadlines related to paper submissions. The deadline for submitting the thesis is late September, although the plan is to finish it by late July.

I will try to carve out as much time as I can (within reason) to make and integrate the needed changes in the library, so you can get crust working with "vanilla" rust-utp before your time is up.

@ned14 (Author) commented Jun 28, 2015

@meqif It currently looks like my time with MaidSafe will end somewhere between mid July and the end of July. I'd like to deliver working UTP support by then, so I plan to keep monkey-patching my fork of rust-utp with temporary code to get crust and the rest of the MaidSafe codebase working on UTP. That should round out my time with MaidSafe nicely.

What I'd suggest to you is that you draw up a contract proposal for a refactoring of rust-utp that is thread-safe and re-entrant, and send it along with an estimate of hours and your desired hourly rate to MaidSafe. That way, instead of doing the needed changes for free, you get paid.

Regarding a suitable hourly rate to propose, it's negotiable based on the strength of your CV and how much the client wants your time. I'll assume you won't be familiar with this (apologies if you are and this is patronising), so, taking off any MaidSafe hat, I can tell you a fresh graduate working remotely from home on open-source programming with no commercial programming experience might be worth maybe €20/hour in the EU. With three years' commercial programming experience that might reach €30-35/hour. With ten years and a strong open-source portfolio in your CV it's about €50/hour. Remember that working remotely comes with about a 30-35% rate discount compared with being on-site, but equally there are no commuting costs. The big bucks arise only if you aren't writing open source and/or are on-site; if you're working for a US hedge fund on proprietary trading algorithms you would start negotiating from €150/hour upwards, but then your liability insurance and work enjoyment are substantially affected.

Also, as a rule American clients pay more than European clients, but you'll find a markedly increased pressure to deliver code as quickly as possible, and they really dislike you taking time off. In Europe we tend to prefer a work-life balance and you are not punished culturally for saying so; in the US, if you don't claim to live only to work, you're viewed as a slacker and will be treated accordingly.

Anyway, that's my recommendation to you. I certainly would advise looking at this situation as an opportunity to earn some money and get valuable commercial programming experience onto your CV rather than working ad hoc for free. Besides, I'd assume MaidSafe will need someone to replace me :) and it might as well be you.

@ned14 (Author) commented Jun 28, 2015

@meqif To keep you in the loop, I've spent today investigating and prototyping how to solve a problem in crust with UTP, which is that we're running out of file descriptors during the unit testing. The problem is caused by my use of a pipe to work around the inability to cancel blocking reads, so I am causing 3x the file descriptor consumption over normal. The crust unit tests configure two hundred connections, so that's 600 fds. I'll admit I have no idea how that then exceeds the 1024 maximum and causes the hard fault, but the real point is that three fds per UTP connection is unacceptable.

I've decided after a bit of research that on POSIX at least, I'll combine pselect() and pthread_kill() to let one thread unblock reads occurring in other threads. That should dispense with the need for a pipe. On Windows none of this is a problem as everything there is natively async, so cancelling reads is a cinch.

I'll get to work on that on Monday. An alternative for you, in a later ground-up refactor of rust-utp, is to add magic unblocking datagram support: to unblock reads happening in other threads, you ping them with the magic datagram, taking care that this can't be abused as a DDoS attack. A zero-sized datagram also works on some platforms, if I remember correctly, but not all.
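
For what it's worth, a sketch of that alternative (my illustration; as noted, zero-sized datagrams are not portable everywhere): a clone of the socket sends an empty datagram to the socket's own bound address, which wakes a recv_from() blocked in another thread.

    use std::net::UdpSocket;
    use std::thread;

    fn main() -> std::io::Result<()> {
        let socket = UdpSocket::bind("127.0.0.1:0")?;
        let local_addr = socket.local_addr()?;
        let waker = socket.try_clone()?;

        let reader = thread::spawn(move || {
            let mut buf = [0u8; 1500];
            // Blocks until any datagram arrives, including our wake-up ping.
            let (len, from) = socket.recv_from(&mut buf).unwrap();
            if len == 0 && from == local_addr {
                println!("unblocked by the magic datagram");
            }
        });

        // From another thread: ping ourselves with a zero-sized datagram.
        waker.send_to(&[], local_addr)?;
        reader.join().unwrap();
        Ok(())
    }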

Niall Douglas (s [underscore] sourceforge {at} nedprod [dot] com) added 2 commits June 29, 2015 20:16
…tion. Works very nicely, though there is obviously a spurious loop in rust-utp because it's occasionally timing out. I'll nail that tomorrow.
@meqif (Owner) commented Jun 30, 2015

@ned14 Sorry for the late response. I'm very grateful for your advice. :)

I'm sure you tried this, but I want to check for peace of mind: did you try calling shutdown(socket.as_raw_fd(), SHUT_RD)? It should work across all threads and dispense with sending signals, and it should also avoid the spurious loop you mention in the e74d496 commit message.
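
Concretely, I mean something like this (a sketch assuming the libc crate on a Unix platform; whether it actually unblocks a UDP read portably is exactly the open question):

    extern crate libc;

    use std::net::UdpSocket;
    use std::os::unix::io::AsRawFd;

    /// Ask the kernel to disallow further receives on the socket; a thread
    /// blocked in recv_from() should then return with an error.
    fn unblock_reads(socket: &UdpSocket) {
        unsafe {
            libc::shutdown(socket.as_raw_fd(), libc::SHUT_RD);
        }
    }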

@ned14 (Author) commented Jul 1, 2015

@meqif I've rebooted into Windows to write the Windows support, but I am pretty sure that shutdown on UDP sockets is not portable. If you look at the POSIX spec for shutdown, http://pubs.opengroup.org/onlinepubs/009695399/functions/shutdown.html, you will note the language assumes a connection-based transport. https://bugs.python.org/issue19530 suggests calling shutdown on UDP sockets is a nest of vipers.

Regarding the loop problem, it is due to two things: (i) I'm force-shutting down the reader thread too soon by breaking reads when the writer thread exits, occasionally losing packets, which causes the unit test to hang until timeout; and (ii) there is a race, due to a lack of synchronisation, in the graceful close() sequencing, caused by me force-closing the connection in (i) from the other side, so depending on timing it can hang. The force close is necessary because, once the writer thread has exited, there is no code available to pump the recv() function, as the reader thread is still blocked on reads. The existing code tries to work around that with "breaking reads", which really means the reader thread sets a socket timeout of 1 ns, i.e. it loop-polls for new packets, constantly checking whether the connection state is closed, and thereby pumps recv for us.

But to be honest, I'm just grafting bad hack atop bad hack here, onto a library not designed for concurrent read/write (no offense intended). What rust-utp really needs is a ground-up redesign around concurrent operations, and that is what I'd suggest you send MaidSafe a quote for. If you choose mio as your base you'd make your life much easier. I'd even wonder if porting mio to Windows IOCP is the least-cost path forwards; why mio isn't already on Windows IOCP is beyond me :(

To be honest, if I were implementing mio I'd take Boost.ASIO and wrap it with the mio Rust API. mio would thus be solved for all platforms quickly and easily, and you wouldn't be constantly hammering down weird async bugs, as ASIO has already invested a decade in doing that for you; indeed, ASIO is entering the official C++ standard in the next release. I suspect wrapping a C++ library isn't considered idiomatic Rust, though :(

@meqif (Owner) commented Jul 1, 2015

@ned14 Yes, I don't expect shutdown to work on Windows either; I only considered using it on the *nixes, since apparently Windows has a safer way of achieving the same result.
I noticed that on the manual page for shutdown, but the "full-duplex" bit stood out to me. Even though UDP isn't connection-oriented, calling shutdown on a UDP socket should unblock the thread and force the receive call to return with a failure. That seemed reasonable enough for the time being.
Further investigation yielded this discussion. Ultimately, shutdown looks like a dead end. :(

Carl (mio's author) mentioned here that he's working on Windows support in another library, but I can't seem to find it in his public repositories. There's also rust-iocp, which looks unmaintained. At a glance, Peter's winapi-rs looks promising for the Windows side of things.

I'm not sure how idiomatic that approach (wrapping Boost) would be, but I agree it's a viable path nonetheless.

@ned14 (Author) commented Jul 1, 2015

@meqif Note that Boost.ASIO doesn't require any of the rest of Boost; it only needs a C++11 compiler. One of those comes with LLVM :), so Rust is sorted :)

ASIO's enormous benefit is that it is already debugged. All those weird Windows quirks, and quirks across XP, Win7 and WinRT, are handled and worked around. Any new Rust library would likely take at least three to five years to achieve similar reliability, as indeed ASIO originally did. Indeed, Microsoft themselves use ASIO internally, as it's easier than dealing with their own bugs, plus it targets all their platform stacks equally.

I also think such a wrapper would be highly idiomatic Rust in practice, as Rust does most of what C++ does, so patterns such as:

    buffer_sequence bufferseq = ...;
    socket.async_write(bufferseq, [](error_code ec, size_t bytes) { /* completion handler */ });

... in Rust would be the highly idiomatic:

    let bufferseq = /* some slice */;
    socket.async_write(bufferseq, |ec, bytes| { /* completion handler */ });

My single biggest concern would be how to maintain such wrappers. There is no tooling for automating Rust C++ wrappers at all, so you'd probably have to wrap ASIO in an automatically generated C API using SWIG/CppComponents and then build a Rust mirror using that C API.

Anyway, these are just fanciful musings on my part. One daydreams after spending hours getting pthread signals to unblock reads!

Niall Douglas (s [underscore] sourceforge {at} nedprod [dot] com) and others added 3 commits July 1, 2015 18:56
…the non-blocking sockets emulation in Winsock2.

Note that the Windows emulation of POSIX non-blocking sockets has the unfortunate habit of returning WSAEWOULDBLOCK everywhere, which is not correctly trapped by Rust's libc as WouldBlock, thus causing some of the rust-utp unit tests to fail. If you wrap all socket operations with a map_err() lambda trapping and handling WSAEWOULDBLOCK, it should work perfectly. crust, however, passes beautifully on Windows now, so I'm pressing onwards without fixing up the rest of rust-utp (sorry!).
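
For illustration, the wrapping described above might look something like this (a sketch for illustration only, not the fork's actual code): map the raw Winsock error 10035 (WSAEWOULDBLOCK) onto io::ErrorKind::WouldBlock before the rest of the code sees it.

    use std::io;

    /// Treat raw OS error 10035 (WSAEWOULDBLOCK) as WouldBlock, which is what the
    /// POSIX-style code paths expect.
    fn normalise_wouldblock<T>(result: io::Result<T>) -> io::Result<T> {
        result.map_err(|e| {
            if e.raw_os_error() == Some(10035) {
                io::Error::new(io::ErrorKind::WouldBlock, e)
            } else {
                e
            }
        })
    }
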
vinipsmaker pushed a commit to vinipsmaker/rust-utp that referenced this pull request Dec 29, 2015