Skip to content
This repository has been archived by the owner on Nov 15, 2019. It is now read-only.

Adds uTP network test #454

Merged
merged 4 commits into from
Jan 26, 2016

Conversation

vinipsmaker
Copy link
Contributor

Review on Reviewable

@maidsafe-highfive
Copy link

r? @inetic

(maidsafe_highfive has picked a reviewer for you, use r? to override)

@vinipsmaker
Copy link
Contributor Author

Continued work from #451

@vinipsmaker vinipsmaker changed the title Adds uTP network test (currently failing) Adds uTP network test Dec 28, 2015
@vinipsmaker
Copy link
Contributor Author

This test seems to be hanging on Windows, but actually the Windows support on master is already broken and shows the exact same issue. In summary, this PR might not be affecting Windows support at all.

I'll leave it to you whether to accept/reject this PR as is or to fix the the Windows hanging issue prior to the addition of the test-network-utp.

@vinipsmaker
Copy link
Contributor Author

I'll leave it to you whether to accept/reject this PR as is or to fix the the Windows hanging issue prior to the addition of the test-network-utp.

I changed my mind. Test is failing on Windows. Better to fix one problem at a time instead accumulating'em all.

@vinipsmaker
Copy link
Contributor Author

There is some issue under Windows:

Received unknown event OnConnect(Err(Error { repr: Custom(Custom { kind: Other, error: StringError("handshake failed") }) }), 0)

@vinipsmaker
Copy link
Contributor Author

The tracked down Windows error:

Error setting read timeout: Error { repr: Os { code: 10093, message: "Either the application has not called WSAStartup, or WSAStartup failed." } }

Probable origin: https://github.com/maidsafe/rust-utp/blob/ffa531c0c58e078baa60c36661fd185825739813/src/socket.rs#L576

@vinipsmaker
Copy link
Contributor Author

Probable origin: https://github.com/maidsafe/rust-utp/blob/ffa531c0c58e078baa60c36661fd185825739813/src/socket.rs#L576

I was wrong about that.

Error setting read timeout: Error { repr: Os { code: 10093, message: "Either the application has not called WSAStartup, or WSAStartup failed." } }

I'm confused. It's like I'm getting a different error every time.

Windows support seems to be much less robust than Linux/OSX.

@vinipsmaker
Copy link
Contributor Author

I investigated the issue more and tracked down to this function, which is weird. It's weird because the error is "An existing connection was forcibly closed by the remote host.".

UDP is connection-less, so why this error? I found an explanation on StackOverflow. So I think the correct behaviour is to ignore this errors, as UDP is connection-less and not being able to receive now doesn't mean not able to receive a few seconds later. I'll make a test and find out.

@vinipsmaker
Copy link
Contributor Author

I'll make a test and find out.

Just did a small test and it was not enough, but I didn't get much info. Now I'll investigate Err(e) values from the send_to function.

@vinipsmaker
Copy link
Contributor Author

A part of the crust workflow in the introduced test is like (note that some parts only exist in crust and the test will indirectly and implicitly invoke such actions):

  1. Establish a uTP connection (involves exchanging UDP messages). This is happening successfully.
  2. Moves the uTP connection to another thread that will continuously wait for new UDP/uTP messages. This is failing and eventually the connection times out.
  3. Start handshake exchange to confirm the new connection establishment. Then expose the connection to the user (in this case the test).
  4. The connection times out, so no connection is confirmed and the test fails without even exchanging all messages.

I can confirm the uTP connection is established as the multiplexer thread is created. I only got this far by making rust-utp ignore the error codes 10054 (WSAECONNRESET) and 10040 (WSAEMSGSIZE) on Windows.

WSAECONNRESET looks safe to ignore, but WSAEMSGSIZE is really weird.

For now, I'll research if moving UDP socket descriptors among threads can cause any problem on Windows.

And it's troublesome to debug some error without having direct access to the operating system. Maybe next year I'll get a new laptop and install Windows on my current/old laptop.

@vinipsmaker
Copy link
Contributor Author

For now, I'll research if moving UDP socket descriptors among threads can cause any problem on Windows.

Before continuing to investigate this, I did one more test just to confirm that the errors only start after the objects are moved among threads. Now my suspicion has grow stronger.

@vinipsmaker
Copy link
Contributor Author

  • I haven't found much (actually I found nothing) on StackOverflow (or Google...) about threading and UDP problems under Windows.
  • I stopped getting 10040 error after increasing buffer size.
  • There are some "Failed to decode CBOR" errors happening before connection times out. This is one thing I could investigate.
  • I could also refactor crust support to rust-utp to restrict the "uTP thread" to be the same from connection establishment until application exit. And then check if the error persists.

@vinipsmaker
Copy link
Contributor Author

After a long time debugging, I found something interesting.

receive_handshake from transport.rs fails at CBOR decoding because "Missing field \'remote_ip\' in map object.". Now I need to know if it's a bug thanks to corrupted stream from rust-utp or if it is the act of sending of a wrong structure within crust side (which would be very obscure, as TCP is working and the error only happens on Windows).

Also, even if we're facing this bug now, it doesn't mean it is THE bug, as other changes are required to make crust working properly under Windows for UDP support (ignoring 10054 error...).

I think it's unlikely I'll finish the fix before Jan 4th, but maybe next week I can bury these bugs for good.

@vinipsmaker
Copy link
Contributor Author

I found something interesting.

First a summary of the important part of the test to understand my finding:

  1. The test create several nodes.
  2. Each node will:
    1. Repeat the following for every other node:
      1. Connect to other node.
        1. Exchange handshake.
      2. Send message (it's coded as a UserBlob variant somewhere deeper at crust level).
      3. Wait for message to arrive.
      4. Once all messages are received, exit.

Now consider only a pair of nodes. We can have the following:

  1. First node is "faster" (race?). It'll send handshake, receive handshake and then send UserBlob.
  2. The second node will try to receive a handshake (which should be the first CBOR structure that is passed in the channel), but it receives a UserBlob.
    1. The remote_ip field missing.
    2. I actually changed rust-cbor to print the value of the map variable when this error happens. In the CI output, this is {"fields": Array([Array([Unsigned(UInt8(105)), Unsigned(UInt8(77)), Unsigned(UInt8(69)), Unsigned(UInt8(83)), Unsigned(UInt8(83)), Unsigned(UInt8(65)), Unsigned(UInt8(71)), Unsigned(UInt8(69)), Unsigned(UInt8(32)), Unsigned(UInt8(48))])]), "variant": Unicode("UserBlob")}.

Some observations:

  • The error only happens on Windows.
  • Several connections succeed and handshake is established successfully.
  • The rust-utp stream is NOT corrupted. That's why CBOR is successfully decoded (but an unexpected structure is found). So it's very very very unlikely this error is happening on rust-utp side. At least initially, we should focus on debugging on the crust side.

In short, there might be a race within crust. UserBlob is being written/received before handshake.

@vinipsmaker
Copy link
Contributor Author

@inetic mentioned that he also was getting out-of-order packets (at least I understood this) and (implicitly) suggested me to try his fork: https://github.com/inetic/rust-utp/tree/master

But using his fork I got this error (WSAEMSGSIZE) when calling recv_from within utp_wrapper.rs: https://ci.appveyor.com/project/MaidSafe-QA/crust/build/1.0.532/job/pd4ehiqnngm643mh#L781

Then I proceed to merge my changes and these other two @inetic's commits:

And the test passed (but CI machine hanged). I'll do a few more tests before cleaning the code for release.

@vinipsmaker
Copy link
Contributor Author

I got a new error, EndOfStream, which is weird because the connection was supposed to be established at this point. Gotta try disable some of @inetic's commits to check if error persists.

@vinipsmaker
Copy link
Contributor Author

So... both commits are required to fix the out-of-order packets issue, but then we have a new issue to fix.

@vinipsmaker
Copy link
Contributor Author

So... both commits are required to fix the out-of-order packets issue, but then we have a new issue to fix.

Just a quick correction. Actually, I was seeing an UserBlob before a Handshake, but I have not verified that a Handshake was being received, so it could very well be a cause of a lost packet being incorrectly handled and corrupting stream (i.e. a bug within rust-utp). After discussing with @inetic, he argued that he did not fix an out-of-order issue. Anyways...

@vinipsmaker
Copy link
Contributor Author

I stopped getting 10040 error after increasing buffer size.

I checked again and I saw that actually still happens. It just happens to be very rare.

Also, it doesn't prevent the test from succeeding. Therefore, I reverted the commit that increases buffer size and tested again. The frequency of this error increased, but again, it didn't prevent the test from succeeding. The commit will remain disabled, as it represents no improvement.

@vinipsmaker
Copy link
Contributor Author

My progress in this task is stalled until branch master is fixed.

@vinipsmaker vinipsmaker force-pushed the vinipsmaker-test-utp2 branch from 73118c1 to 7def39c Compare January 20, 2016 20:51
@vinipsmaker
Copy link
Contributor Author

My progress in this task is stalled until branch master is fixed.

I fixed branch master and now I'm back to this problem.

@vinipsmaker
Copy link
Contributor Author

uTP seems stable when we use Peter's patchset.

However, I went a little further to be sure just in case:

I'll investigate why the new changes make some of the old tests fail before merging the branch.

@vinipsmaker
Copy link
Contributor Author

Just because I'm used to find races, I've run the rust-utp tests again a few more times to be sure only the following tests were failing:

  • socket::test::test_acks_on_socket
  • socket::test::test_handle_packet
  • socket::test::test_receive_unexpected_reply_type_on_connect

So... I'll start investigating one of these.

@vinipsmaker
Copy link
Contributor Author

My work on maidsafe-archive/rust-utp#16 PR is done. So, as soon as that PR is merged, this very one can also be merged.

@vinipsmaker
Copy link
Contributor Author

maidsafe-archive/rust-utp#16 was merged.

I added a dummy commit to trigger a new CI build, but it seems the uTP test failed on Windows:

I'll have to investigate why.
😞

@vinipsmaker
Copy link
Contributor Author

I'll have to investigate why.

It seems if the bugfixes2 branch from rust-utp is used alone, tests pass. However, if rust-utp branch is merged into master, something goes terribly wrong. I'll start the investigation from here (which commit from master that is not on bugfixes2 causes the failure?).

@vinipsmaker vinipsmaker force-pushed the vinipsmaker-test-utp2 branch from 95acde4 to 7def39c Compare January 26, 2016 11:46
@vinipsmaker
Copy link
Contributor Author

I'll have to investigate why.

It seems if the bugfixes2 branch from rust-utp is used alone, tests pass. However, if rust-utp branch is merged into master, something goes terribly wrong. I'll start the investigation from here (which commit from master that is not on bugfixes2 causes the failure?).

I was wrong. The CI builds on this PR failed because I forgot to release a new crate for temp_utp. Cargo.toml gets the latest released crate, not the latest commit on the master branch.

CI tests are passing now. PR is ready to be merged.

@vinipsmaker
Copy link
Contributor Author

r? @dirvine

self.connection_map.clone()));
self.acceptors.push(acceptor);

// FIXME: Instead of hardcoded wrapping in loopback V4, the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed the return of the bind should be used in these cases Ok for now though

@vinipsmaker
Copy link
Contributor Author

@dirvine: all these pieces of code you're commenting related to make an uTP listener just mirror the current TCP listener in master. Code is almost the same for current TCP. And I think it's wise to leave as is to avoid confusion.

I'll create a new merge/PR for the new_refactor branch after this PR is merged.

@dirvine
Copy link
Contributor

dirvine commented Jan 26, 2016

Sounds good, we are still toiling with compilation in new_refactor branch but yes good idea. Merging now.

dirvine added a commit that referenced this pull request Jan 26, 2016
@dirvine dirvine merged commit c452674 into maidsafe-archive:master Jan 26, 2016
@vinipsmaker vinipsmaker deleted the vinipsmaker-test-utp2 branch January 26, 2016 14:13
@vinipsmaker
Copy link
Contributor Author

This one took quite a while:

And at least three developers have worked on this issue.

@vinipsmaker vinipsmaker restored the vinipsmaker-test-utp2 branch January 29, 2016 18:29
@vinipsmaker vinipsmaker deleted the vinipsmaker-test-utp2 branch January 29, 2016 18:41
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants