-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UDP reliability for the DHT #44
Comments
There are several things to consider before doing anything:
I may be wrong on this however and am open to feedback. |
Reliability has been considerably improve in f9224c2, the main issue being that kademlia is not thread safe and all access must be done from the twisted thread. I will confirm that this is being done everywhere and change it if not, otherwise I maintain my position given in the previous comment. |
I mostly agree with you but the tests are still breaking randomly and I'm not sure what else would cause that other than critical packets going astray. If you look at some of the Twisted test code for UDP there's parts where they send up to 60 packets because they assume the inherent unreliability of the connection even on loop back. As it stands: the current DHT isn't yet reliable enough for file transfers as a single dropped packet would ruin the whole handshake. You said that this state should be handled in the file transfer code but I don't agree with that since reliability is still important for key, value stores and any of the other multitude of protocols we build on top of the DHT which simply can't fail. I speculate the problem is that on local interfaces (and also WAN links) the rate of UDP failing is actually lower than what most people expect -- say 98% success rate. But obviously, when you're looping a bunch of calls to test interactions many times in the unit tests the chances of all the tests passing are much, much, much lower (since you need everything to work.)
You're right about that but I'm not sure its currently setup to rebroadcast on failure. It has a simple timeout mechanism from what I can see. Maybe this would be a perfect place to modify in order to test whether unreliable packets are the cause? |
@robertsdotpm It is not the job of rpcudp to ensure reliability, that is your job to do on a higher level. Relay messages and rpcudp work on a best effort only any code write should work regardless of dropped packets, if not its wrong. Adding a rebroadcast mechanism to rpcudp will help but should not be required. You should implement a rebroadcast after a time out with exponential backoff to handles this, just as I do with the monitor code https://github.com/Storj/storjnode/blob/master/storjnode/network/monitor.py#L99 The only way to ensure reliability over UDP is to basally implement LEDBAT (which we may do for the final product), but now is not the time to do such fundamental changes when our main goal is to get a MVP implementation of the stack. |
Alright, this is a simple solution for now |
So basically: "It's not a bug, it's a feature", right Fabian? |
Essentially yes, each layer makes promises of what it does and does not do. UDP and RPCUDP promise best effort only and not reliability. We must account for this in our code and to do otherwise would be wrong. Shifting the burden of reliability would also be wrong as the protocols are not made for it and you would basically be changing the entire protocol. Simply adding reliability is not possible it must be designed from the beginning to deliver this. LEDBAT promises this and is designed to do this over UDP. We may end up using only LEDBAT with encryption for exactly that reason, but for now we must keep our goals in sight instead of falling victim to feature creep. |
Adding more bootstrapping nodes seems to have decreased the empty node problem on startup. I wouldn't be surprised if its also more reliable now for real world tests |
Certain situations can result in the need for successfully relaying UDP messages being critical: i.e. if there aren't many nodes in the routing table failure to any could prevent communication from being established.
Current thoughts:
The text was updated successfully, but these errors were encountered: