Skip to content
This repository has been archived by the owner on Jan 6, 2020. It is now read-only.

Logic flaws (?) on PunchedUdpSocket error handling #33

Closed
2 tasks done
vinipsmaker opened this issue Feb 16, 2016 · 11 comments
Closed
2 tasks done

Logic flaws (?) on PunchedUdpSocket error handling #33

vinipsmaker opened this issue Feb 16, 2016 · 11 comments

Comments

@vinipsmaker
Copy link
Contributor

@vinipsmaker vinipsmaker changed the title Not all errors should be fatal Logic flaws (?) on PunchedUdpSocket algorithm Feb 16, 2016
@vinipsmaker vinipsmaker changed the title Logic flaws (?) on PunchedUdpSocket algorithm Logic flaws (?) on PunchedUdpSocket error handling Feb 16, 2016
@canndrew
Copy link
Contributor

It should be an error if we fail to send to ​all​ addresses, not just a single address: https://github.com/maidsafe/nat_traversal/blob/8207018ff4332157975765b7fe1eaee893d020f4/src/punched_udp_socket.rs#L208
This error shouldn't abort the whole attempt: https://github.com/maidsafe/nat_traversal/blob/8207018ff4332157975765b7fe1eaee893d020f4/src/punched_udp_socket.rs#L247

Maybe. If we're having trouble writing to the socket then something has gone wrong. Is there any specific IO error that would be better to handle by continuing? Possibly ErrorKind::Interrupted but even then it might make more sense to exit the whole operation.

I really wish the standard library used more precise error types instead of this crappy io::Error everywhere so that I knew exactly what errors could occur here and could handle them accordingly.

We shouldn't abort operation ...

In that case we've received an ACK so we know that the peer has received our packets. We could send back our own ACK before exiting but I don't think it would help.

... before we send at least one packet to the other peer (current code looks very racy to me).

I don't know what you mean by racy here but re-reading that code I think it should send the ACK more than twice and wait more than 100ms.

deadline += time::Duration::milliseconds(DELAY_BETWEEN_RESENDS_MS) is not thread::sleep (i.e. no delays between attempts are happening at all; removing PeriodicSender was a bad idea)

The sleep is the call to recv_until. That loop processes incoming packets until the next deadline and blocks when waiting for new packets to arrive.

@vinipsmaker
Copy link
Contributor Author

Maybe. If we're having trouble writing to the socket then something has gone wrong. Is there any specific IO error that would be better to handle by continuing? Possibly ErrorKind::Interrupted but even then it might make more sense to exit the whole operation.

Well, one of the addresses that we get may be unreachable and we can get a failure just to one of the addresses. We're trying to punch a hole and this includes searching which address is usable. If we find one address is unusable it doesn't make sense to abort the hole operation.

I really wish the standard library used more precise error types instead of this crappy io::Error everywhere so that I knew exactly what errors could occur here and could handle them accordingly.

I disagree. These Rust APIs are low-level and behavior depends a lot on operating system.

We shouldn't abort operation ...

In that case we've received an ACK so we know that the peer has received our packets.

Sorry. I checked again and there is no problem with current algorithm.

@vinipsmaker
Copy link
Contributor Author

The sleep is the call to recv_until. That loop processes incoming packets until the next deadline and blocks when waiting for new packets to arrive.

I missed that. Thanks.

But shouldn't failing to decode the packet also imply in increase the timeout (i.e. break) like failing to receive a packet does? I mean, a misbehaving node could send invalid packets and keep you in infinite loop because the timeout never is recalculated.

Also, timeout calculate looks very inaccurate for me. If packet is received before timeout is reached, we should not increase elapsed time by time::Duration::milliseconds(DELAY_BETWEEN_RESENDS_MS). We should just update deadline to SteadyTime::now().

@canndrew
Copy link
Contributor

Well, one of the addresses that we get may be unreachable and we can get a failure just to one of the addresses. We're trying to punch a hole and this includes searching which address is usable. If we find one address is unusable it doesn't make sense to abort the hole operation.

What do we check for in this case? I still think a ErrorKind::WouldBlock or ErrorKind::Interrupted should abort the function. I don't know what other variants can occur.

I disagree. These Rust APIs are low-level and behavior depends a lot on operating system.

They could have enough enum variants to cover all the possible failures on all operating systems, and make those variants more descriptive and still avoid having bogus variants like AddrNotAvailable for a UdpSocket::send_to

But shouldn't failing to decode the packet also imply in increase the timeout (i.e. break) like failing to receive a packet does? I mean, a misbehaving node could send invalid packets and keep you in infinite loop because the timeout never is recalculated.

recv_until exits with Ok(None) if the absolute time reaches the deadline. The loop will keep reentering the recv_until call and eventually the deadline will be reached.

Also, timeout calculate looks very inaccurate for me. If packet is received before timeout is reached, we should not increase elapsed time by time::Duration::milliseconds(DELAY_BETWEEN_RESENDS_MS). We should just update deadline to SteadyTime::now().

If we receive a valid packet we never use the deadline again before exiting the function. If we receive an invalid packet we still don't exit the loop until the deadline is reached. Setting the deadline to SteadyTime::now() will cause recv_until to always exit immediately.

@vinipsmaker
Copy link
Contributor Author

Well, one of the addresses that we get may be unreachable and we can get a failure just to one of the addresses. We're trying to punch a hole and this includes searching which address is usable. If we find one address is unusable it doesn't make sense to abort the hole operation.

What do we check for in this case? I still think a ErrorKind::WouldBlock or ErrorKind::Interrupted should abort the function. I don't know what other variants can occur.

I didn't want to illustrate/give-examples because behavior is system-dependent. The only sane cross-platform behavior is to ignore the error and try again (withing a timeout ofc). However, the reason why I didn't want to illustrate the problem is because you may conclude that we should ignore some specific errors and release a "fix" because the fix work on our machines, but the error isn't fixed and we may see random failures on the wild (without a chance to debug). You already started the sentence with "What do we check for..." suggesting that you might do exactly that.

Anyway, I stumbled upon this kind of issue when I was working on rust-utp stabilization and one of the errors that can happen only on one of the addresses is 10054 (WSAECONNRESET). This is an example of a specific issue (and very rare, I think), but the real problem is the behavior we're coding.

We promised to test several addresses to find one that works. But what we're doing is finding one that does NOT working and aborting the search operation. This is completely the opposite of what we promised.

They could have enough enum variants to cover all the possible failures on all operating systems [...]

I find this very unlikely, but it doesn't matter what I think here because this discussion is OFF-TOPIC for this thread/issue.

recv_until exits with Ok(None) if the absolute time reaches the deadline. The loop will keep reentering the recv_until call and eventually the deadline will be reached.

Some misbehaving node could keep sending invalid packets that will cause the packet cbor decoding to fail and we'll never leave this loop. This type of timeout is very inaccurate and a shame to have such fragile implementation (even if very unlikely to happen).

If we receive a valid packet we never use the deadline again before exiting the function. If we receive an invalid packet we still don't exit the loop until the deadline is reached. Setting the deadline to SteadyTime::now() will cause recv_until to always exit immediately.

You're right. I'm still "eating" this code so I missing some parts of the behavior still.

@canndrew
Copy link
Contributor

I find this very unlikely, but it doesn't matter what I think here because this discussion is OFF-TOPIC for this thread/issue.

Agreed it's off-topic but anyway.. I don't share your pessimism. Version 0.0.1 of this design might panic when it sees WSAECONNRESET from a udp socket recv_from. But someone would report the bug and an error variant would get added for it. Eventually all the weird OS-specific behaviours like this could be accounted for.

Back on topic.. Okay I'll make the error handling a bit more detailed and not abort on unknown errors.

Some misbehaving node could keep sending invalid packets that will cause the packet cbor decoding to fail and we'll never leave this loop. This type of timeout is very inaccurate and a shame to have such fragile implementation (even if very unlikely to happen).

This couldn't happen though. recv_until exits immediately if the deadline has been reached. It doesn't even perform a read. Once the deadline is reached the loop will exit no matter what.

@vinipsmaker
Copy link
Contributor Author

This couldn't happen though. recv_until exits immediately if the deadline has been reached. It doesn't even perform a read. Once the deadline is reached the loop will exit no matter what.

I checked again. This time I realized my mistake. I was interpreting the code was using the "wait_for" primitive, but the actual primitive being used was "wait_until". I can see now that the timeout logic is pretty much correct.

@vinipsmaker
Copy link
Contributor Author

#34 fixes "It should be an error if we fail to send to ​all​ addresses, not just a single address: https://github.com/maidsafe/nat_traversal/blob/8207018ff4332157975765b7fe1eaee893d020f4/src/punched_udp_socket.rs#L208"

@canndrew
Copy link
Contributor

This error shouldn't abort the whole attempt: https://github.com/maidsafe/nat_traversal/blob/8207018ff4332157975765b7fe1eaee893d020f4/src/punched_udp_socket.rs#L247

What should we do in this case? There's no point returning back to the loop because we know that we're trying to message the correct address.

@vinipsmaker
Copy link
Contributor Author

This error shouldn't abort the whole attempt: https://github.com/maidsafe/nat_traversal/blob/8207018ff4332157975765b7fe1eaee893d020f4/src/punched_udp_socket.rs#L247

What should we do in this case? There's no point returning back to the loop because we know that we're trying to message the correct address.

We should try a few more times (e.g. until our timeout is reached or maybe after 10 attempts are made).

@canndrew
Copy link
Contributor

mk, here's a PR: #38
It keeps trying until it times out or successfully sends twice. It won't timeout until it's tried to send at least twice.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants