UDP reliability for the DHT #44

robertsdotpm · 2015-12-06T11:25:38Z

In some situations, successfully relaying UDP messages is critical: for example, if the routing table holds only a few nodes, a failed message to any of them could prevent communication from being established.

Current thoughts:

Some kind of ACK protocol implemented in rpcudp
Could also make ACK conditional based on routing table size / number of outgoing packets sent for one query
Rebroadcasts (quite noisy, ACK might be better.)
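
The conditional ACK idea above could look roughly like the sketch below. It is only an illustration: the function name, its parameters, and both thresholds are hypothetical and are not part of rpcudp or storjnode.

```python
def should_request_ack(routing_table_size, packets_sent_for_query,
                       min_table_size=8, min_fanout=3):
    """Decide whether a message is important enough to ask for an ACK.

    Hypothetical policy: the thresholds are illustrative placeholders,
    not values taken from rpcudp or storjnode.
    """
    # Few known peers: losing a single message may leave no alternative route.
    if routing_table_size < min_table_size:
        return True
    # Low fan-out: the query has little redundancy from parallel requests.
    return packets_sent_for_query < min_fanout
```

With these placeholder thresholds, a node that knows only two peers always asks for an ACK, while a node with a full routing table that fans a query out to three or more peers does not.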

F483 · 2015-12-06T16:28:59Z

There are several things to consider before doing anything:

Why are packets actually being dropped in the tests? I have the feeling this has less to do with the network and more to do with Twisted, or possibly our incorrect usage of it. I think this because more packets are dropped when you decrease the time to wait for a stable network in the tests. This indicates it may be a local threading issue: for example, some buffer fills up while the thread is busy, which causes drops.
rpcudp already has an ACK mechanism, but this breaks down for relay messages, as you can only ACK one hop. I am also not sure I like rpcudp's ACK mechanism, as it adds complexity and requires 20 bytes of packet overhead for a message ID (which is a problem).
An ACK system requires extra state which may be an attack vector and also increases the protocol complexity.
Do we really want to change anything before we add network-level authentication and end-to-end encryption? This may cause us to have to redo the work and delay things unnecessarily. Our current top priority is to get a working (not perfect) network interface that is stable, so we can move on to higher layers and improve it later. Others are waiting for an interface to code against, and when we have several layers, team management becomes easier.
In general I am for keeping the entire protocol stateless and fault tolerant regarding dropped packets, for the sake of simplicity and reliability. Adding an ACK mechanism will only increase complexity. All of the DHT functions can deal with a dropped packet (nodes find each other eventually and data is re-keyed on the hour). Any key feature that needs reliability (data transfer) should maintain the state itself instead of having the lower level be changed and forced upon everything else.

I may be wrong on this however and am open to feedback.

CC @gamedevsam @gordonwritescode @robertsdotpm

F483 · 2015-12-21T15:09:11Z

Reliability has been considerably improved in f9224c2, the main issue being that kademlia is not thread safe and all access must be done from the Twisted thread.

I will confirm that this is being done everywhere and change it if not, otherwise I maintain my position given in the previous comment.
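
For readers unfamiliar with the constraint, the pattern is thread confinement: all access to the non-thread-safe object is funnelled through one thread. In Twisted this is normally done with `reactor.callFromThread` (or `twisted.internet.threads.blockingCallFromThread` when the caller needs the result). The sketch below illustrates the same idea with the standard library only; `OwnerThread` is a hypothetical helper and not part of storjnode or Twisted.

```python
import queue
import threading


class OwnerThread:
    """Runs every submitted call on one dedicated thread (hypothetical helper)."""

    def __init__(self):
        self._calls = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Only this loop ever touches the confined state.
        while True:
            item = self._calls.get()
            if item is None:  # sentinel pushed by stop()
                return
            func, args, box, done = item
            try:
                box["value"] = func(*args)
            except Exception as exc:
                box["error"] = exc
            finally:
                done.set()

    def call(self, func, *args):
        """Run func(*args) on the owner thread and wait for its result.

        Must not be called from the owner thread itself, or it deadlocks.
        """
        box, done = {}, threading.Event()
        self._calls.put((func, args, box, done))
        done.wait()
        if "error" in box:
            raise box["error"]
        return box["value"]

    def stop(self):
        self._calls.put(None)
        self._thread.join()
```

Any thread can then call `owner.call(table_update, key, value)`, and the update itself always executes on the single owner thread.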

robertsdotpm · 2015-12-26T07:19:06Z

I mostly agree with you, but the tests are still breaking randomly and I'm not sure what else would cause that other than critical packets going astray. If you look at some of the Twisted test code for UDP, there are parts where they send up to 60 packets because they assume the inherent unreliability of the connection, even on loopback. As it stands, the current DHT isn't yet reliable enough for file transfers, as a single dropped packet would ruin the whole handshake.

You said that this state should be handled in the file transfer code, but I don't agree with that, since reliability is still important for key-value stores and any of the other multitude of protocols we build on top of the DHT which simply can't fail. I speculate the problem is that on local interfaces (and also WAN links) the UDP failure rate is actually lower than what most people expect -- say a 98% success rate. But obviously, when you're looping a bunch of calls to test interactions many times in the unit tests, the chances of all the tests passing are much, much lower (since you need everything to work).
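
To make the compounding effect concrete, here is a small calculation. It assumes packet losses are independent and takes the 98% figure above as the per-packet success rate; both are simplifying assumptions, not measurements.

```python
def all_succeed_probability(per_packet_success, packet_count):
    """Probability that every one of packet_count independent packets arrives."""
    return per_packet_success ** packet_count


# Print the chance that a test needing n packets sees no loss at all.
for n in (1, 10, 50, 100):
    print(n, round(all_succeed_probability(0.98, n), 3))
```

Under these assumptions, a test that needs 10 packets to arrive passes about 82% of the time, one that needs 50 passes about 36% of the time, and one that needs 100 passes only about 13% of the time.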

rpcudp already has a ACK mechanism

You're right about that, but I'm not sure it's currently set up to rebroadcast on failure. It has a simple timeout mechanism from what I can see. Maybe this would be a perfect place to modify in order to test whether unreliable packets are the cause?

F483 · 2015-12-26T13:06:16Z

@robertsdotpm It is not the job of rpcudp to ensure reliability; that is your job to do at a higher level.

Relay messages and rpcudp work on a best-effort basis only. Any code you write should work regardless of dropped packets; if not, it's wrong. Adding a rebroadcast mechanism to rpcudp will help but should not be required.

You should implement a rebroadcast after a timeout with exponential backoff to handle this, just as I do in the monitor code: https://github.com/Storj/storjnode/blob/master/storjnode/network/monitor.py#L99
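
A minimal sketch of rebroadcast with exponential backoff is shown below. It is not the code from monitor.py; `send_with_backoff`, `send`, `wait_for_reply`, and the default timings are all hypothetical names and values chosen for illustration.

```python
def send_with_backoff(send, wait_for_reply, max_attempts=5,
                      initial_timeout=0.5, factor=2.0):
    """Resend a message until a reply arrives or attempts run out.

    send:           hypothetical callable that transmits the message once.
    wait_for_reply: hypothetical callable taking a timeout in seconds and
                    returning the reply, or None if nothing arrived in time.
    Returns the reply, or None after max_attempts unanswered sends.
    """
    timeout = initial_timeout
    for _ in range(max_attempts):
        send()
        reply = wait_for_reply(timeout)
        if reply is not None:
            return reply
        # No reply: wait longer before the next rebroadcast.
        timeout *= factor
    return None
```

With the defaults, the waits are 0.5, 1, 2, 4, and 8 seconds. This version blocks the caller while waiting, so inside a Twisted application the same schedule would more likely be driven by timers such as `reactor.callLater` than by a blocking call.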

The only way to ensure reliability over UDP is to basically implement LEDBAT (which we may do for the final product), but now is not the time to make such fundamental changes when our main goal is to get an MVP implementation of the stack.

robertsdotpm · 2015-12-27T00:30:21Z

Alright, this is a simple solution for now

robertsdotpm · 2015-12-27T01:12:00Z

So basically: "It's not a bug, it's a feature", right Fabian?

F483 · 2015-12-27T14:34:05Z

Essentially yes, each layer makes promises of what it does and does not do. UDP and RPCUDP promise best effort only and not reliability. We must account for this in our code and to do otherwise would be wrong.

Shifting the burden of reliability would also be wrong as the protocols are not made for it and you would basically be changing the entire protocol.

Simply adding reliability is not possible; it must be designed in from the beginning. LEDBAT promises this and is designed to do this over UDP. We may end up using only LEDBAT with encryption for exactly that reason, but for now we must keep our goals in sight instead of falling victim to feature creep.

robertsdotpm · 2015-12-30T09:10:33Z

Adding more bootstrapping nodes seems to have decreased the empty node problem on startup. I wouldn't be surprised if it's also more reliable now for real-world tests.

F483 added help wanted question labels Dec 6, 2015

super3 removed the question label Dec 16, 2015
