UDP reliability for the DHT #44

Open
robertsdotpm opened this issue Dec 6, 2015 · 8 comments

Comments

@robertsdotpm
Contributor

In certain situations, successfully relaying UDP messages becomes critical: e.g. if there aren't many nodes in the routing table, a failed send to any one of them could prevent communication from being established.

Current thoughts:

  • Some kind of ACK protocol implemented in rpcudp
  • Could also make ACK conditional based on routing table size / number of outgoing packets sent for one query
  • Rebroadcasts (quite noisy, ACK might be better.)
@F483
Contributor

F483 commented Dec 6, 2015

There are several things to consider before doing anything:

  • Why are packets actually being dropped in the tests? I have the feeling this has less to do with the network and more to do with Twisted, or possibly our incorrect usage of it. I think this because more packets drop when you decrease the time to wait for a stable network in the tests. This indicates it may be a local threading issue, for example some buffer fills up while the thread is busy and causes drops.
  • rpcudp already has an ACK mechanism, but this breaks down for relay messages as you can only ACK one hop. I am also not sure if I like rpcudp's ACK mechanism, as it adds complexity and requires 20 bytes of packet overhead for a message id (which is a problem).
  • An ACK system requires extra state which may be an attack vector and also increases the protocol complexity.
  • Do we really want to change anything before we add network level authentication and end to end encryption? This may cause us to have to redo this work and delay things unnecessarily. Our current top priority is to get a working (not perfect) network interface that is stable, so we can move on to higher layers and improve it later. Others are waiting for an interface to code against, and when we have several layers team management becomes easier.
  • In general I am for keeping the entire protocol stateless and fault tolerant regarding dropped packets, for the sake of simplicity and reliability. Adding an ACK mechanism will only increase complexity. All of the DHT functions can deal with a dropped packet (nodes find each other eventually and data is re keyed on the hour). Any key feature that needs reliability (data transfer) should maintain the state itself instead of having the lower level be changed and forced upon everything else.

I may be wrong on this however and am open to feedback.

CC @gamedevsam @gordonwritescode @robertsdotpm

@F483
Contributor

F483 commented Dec 21, 2015

Reliability has been considerably improved in f9224c2, the main issue being that kademlia is not thread safe and all access must be done from the Twisted thread.

I will confirm that this is being done everywhere and change it if not, otherwise I maintain my position given in the previous comment.

@robertsdotpm
Contributor Author

I mostly agree with you, but the tests are still breaking randomly and I'm not sure what else would cause that other than critical packets going astray. If you look at some of the Twisted test code for UDP, there are parts where they send up to 60 packets because they assume the inherent unreliability of the connection even on loopback. As it stands, the current DHT isn't yet reliable enough for file transfers, as a single dropped packet would ruin the whole handshake.

You said that this state should be handled in the file transfer code, but I don't agree with that, since reliability is still important for key, value stores and any of the other multitude of protocols we build on top of the DHT which simply can't fail. I speculate the problem is that on local interfaces (and also WAN links) the UDP success rate is actually lower than what most people expect, say 98%. But obviously, when you're looping a bunch of calls to test interactions many times in the unit tests, the chances of all the tests passing are much, much, much lower (since you need everything to work).
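
To put rough numbers on that, here is a minimal sketch. It assumes each packet succeeds independently with the same probability; the 98% figure is only the guess from above, not a measurement, and the function name is made up for illustration.

```python
def all_succeed_probability(per_packet_success: float, packets: int) -> float:
    """Chance that every one of `packets` independent sends gets through."""
    return per_packet_success ** packets


if __name__ == "__main__":
    # 0.98 is an assumed per-packet success rate, not a measured value.
    for n in (1, 10, 50, 100, 500):
        print(n, round(all_succeed_probability(0.98, n), 4))
```

Under these assumptions, a test run that depends on a hundred or so packets all arriving would pass only a small fraction of the time, even though each individual send almost always works.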

"rpcudp already has an ACK mechanism"

You're right about that, but I'm not sure it's currently set up to rebroadcast on failure. It has a simple timeout mechanism from what I can see. Maybe this would be a perfect place to modify in order to test whether unreliable packets are the cause?

@F483
Contributor

F483 commented Dec 26, 2015

@robertsdotpm It is not the job of rpcudp to ensure reliability; that is your job to do on a higher level.

Relay messages and rpcudp work on a best effort basis only. Any code you write should work regardless of dropped packets; if not, it's wrong. Adding a rebroadcast mechanism to rpcudp will help but should not be required.

You should implement a rebroadcast after a timeout with exponential backoff to handle this, just as I do with the monitor code https://github.com/Storj/storjnode/blob/master/storjnode/network/monitor.py#L99

The only way to ensure reliability over UDP is to basically implement LEDBAT (which we may do for the final product), but now is not the time to make such fundamental changes when our main goal is to get an MVP implementation of the stack.
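
A rough sketch of that pattern, for illustration only: it is not the code from monitor.py, and `send`, `got_reply`, and the default delays are hypothetical placeholders. It also uses a blocking sleep to stay self-contained; under Twisted the wait would be scheduled on the reactor (e.g. with `reactor.callLater`) instead.

```python
import time


def send_with_backoff(send, got_reply, max_attempts=5,
                      base_delay=0.5, factor=2.0, sleep=time.sleep):
    """Resend until a reply shows up, doubling the wait after each attempt.

    `send` and `got_reply` are hypothetical callables supplied by the caller:
    `send()` fires one best-effort message, `got_reply()` reports whether a
    response has been seen. Returns the attempt number that succeeded, or
    None if every attempt timed out.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        send()              # best effort, may be silently dropped
        sleep(delay)        # wait out the current timeout
        if got_reply():
            return attempt
        delay *= factor     # exponential backoff before the next resend
    return None
```

With the defaults above the waits are 0.5, 1, 2, 4 and 8 seconds, so the caller gives up after roughly 15 seconds without any state being added to rpcudp itself.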

@robertsdotpm
Contributor Author

Alright, this is a simple solution for now

@robertsdotpm
Contributor Author

So basically: "It's not a bug, it's a feature", right Fabian?

@F483
Contributor

F483 commented Dec 27, 2015

Essentially yes, each layer makes promises of what it does and does not do. UDP and RPCUDP promise best effort only and not reliability. We must account for this in our code and to do otherwise would be wrong.

Shifting the burden of reliability would also be wrong as the protocols are not made for it and you would basically be changing the entire protocol.

Simply adding reliability is not possible; a protocol must be designed from the beginning to deliver it. LEDBAT promises this and is designed to do it over UDP. We may end up using only LEDBAT with encryption for exactly that reason, but for now we must keep our goals in sight instead of falling victim to feature creep.

@robertsdotpm
Contributor Author

Adding more bootstrapping nodes seems to have decreased the empty node problem on startup. I wouldn't be surprised if it's also more reliable now for real world tests.

3 participants