Every 5th TCP connection fails on Arduino Giga R1 if we don't fully read the HTTP content sent from the server each time #937
... at 4.5s delay it takes 8 calls before it fails, an interesting multiple of 4. Any hints? I tried moving the WiFiClient globally but that swaps -3005 errors for -3003 errors on the first re-use:
31s timeout this time. Since I've included the code this should replicate well. |
... I am also guessing this 30s timeout is in HttpClient since there's no delay on the return from the failed connect. |
Note: calling client.stop() does not resolve the issue |
The maximum number of TCP connections is 4, as defined in the file "mbed_config.h".
I don't understand why it's so low. Need to recompile locally to have more connections available. |
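For context, a minimal sketch of what that limit looks like; the exact symbol and file location are assumptions based on Mbed OS naming, raising it means rebuilding the core locally, and (as noted below) it only postpones the problem if sockets are leaking:

```cpp
// Hypothetical excerpt from the generated mbed_config.h for the Giga variant.
// MBED_CONF_LWIP_SOCKET_MAX caps how many lwIP sockets can exist at once.
#define MBED_CONF_LWIP_SOCKET_MAX 4   // stock value: only 4 concurrent sockets

// To experiment, one could raise it and rebuild the core locally, e.g.:
// #define MBED_CONF_LWIP_SOCKET_MAX 8
```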
I never have 4 tcp sockets open but I am confident this limit is at the core of the issue. Some clean up isn’t being executed. Increasing the socket count isn’t going to help though as it just puts the issue off. |
@schnoberts1 can you try these changes? https://github.com/arduino/ArduinoCore-mbed/pull/912/files |
Of course @JAndrassy. I've applied the patch and the issue still remains. |
@schnoberts1 the 30s timeout comes from the Http library; you can get rid of it by adding a timeout setting. I'm still investigating the sockets: it looks like before complete deletion they can be put in a wait state and closed afterwards. |
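As a sketch of the kind of override being described (assuming the setting in question is ArduinoHttpClient's HTTP response timeout, which defaults to 30 s and can be shortened per client instance):

```cpp
#include <WiFi.h>
#include <ArduinoHttpClient.h>

WiFiClient wifi;
HttpClient http(wifi, "example.org", 80);   // placeholder server

void setup() {
  // Assumption: the 30 s wait is ArduinoHttpClient's response timeout.
  http.setHttpResponseTimeout(5000);        // give up after 5 s instead of 30 s
}

void loop() {}
```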
does the HttpClient send the "Connection: close" header? |
@JAndrassy yes, HttpClient sends the "Connection: close" header. I think I've found the root cause, and it is server side: httpforever.com keeps sending TCP Keep-Alive messages, and thus connections are not closed. Using a different server like example.org, connections are closed correctly and the NSAPI_ERROR_NO_SOCKET error is not generated. |
Good find! We close the socket though, so I would have thought LWIP would have sent a FIN and either got a FIN ACK and closed the underlying socket or timed out waiting for one. I wonder, is LWIP not sending the FIN?
|
Here is what I get with Wireshark. I'm now wondering whether the TCP Window Full messages are somehow involved |
I get the same NSAPI_ERROR_NO_SOCKET using another http site, sneaindia.com. TCP Window Full messages are also present there. snea.pcapng.zip |
Interesting. I can see on the example.org PCAP that the Giga sends a GET, then it gets the data and an HTTP 200 response followed immediately by a FIN. The Giga then sends an ACK for data up to sequence number 537 and then an ACK for data up to 1619 (which includes the FIN that got sent to the Giga). The Giga never sends a FIN in that packet capture; it should. I think the server is then in FIN_WAIT2, but since the HTTP server has closed the socket it'll time it out.
In the httpforever it gets interesting.
The Giga sends the GET (split across two packets weirdly). Packet 24 has the GET and packet 25 has / HTTP/1.1 Host: httpforever.com
The same happens again, the Giga sends User-Agent in one packet then : Arduino/2.2.0 \nConnection: close in the other.
At this point it's the same as the example.org capture really. A key difference is that the example.org payload is less than the advertised window (2144), so it never needs to write a packet with Window Full set.
The server sends the data with the Connection: close also in the header. Not all the data is in the packet so the next packet is more of the data but the TCP Window Full is set.
This is because the advertised window from the Giga is 2144 and it’s sent exactly 2144 bytes across those two packets. It needs an ACK now.
The Giga ACKs the two packets.
The server sends a 536 byte packet so it can’t be finished, the content length was around 5k. The Giga acks it but now its window is down to 536 bytes.
The server fills the window and tags the packet accordingly.
The Giga announces a zero window (I don’t know why, it’s ACKed everything - but then I am not a TCP guru).
The Giga then RSTs the connection. At this point it should be closing the underlying socket, since it's an abort; it does not need to wait for anything from the server. However, it doesn't abort. The server sends a TCP keep-alive after the RST and the Giga responds to it! I don't think that is right.
You are right, it could be the Window Full, but the response from the Giga seems odd at this stage. I do not understand why it continues to process responses after sending a RST.
|
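A side note on the 2144-byte advertised window seen above: it is consistent with a receive window of four MSS-sized segments, which is a common lwIP default. That default is an assumption about the Giga's lwIP build, but the arithmetic matches the capture:

```cpp
// Assumption: the Giga's lwIP build uses TCP_WND = 4 * TCP_MSS.
constexpr int kDefaultMss = 536;              // lwIP's stock TCP_MSS
constexpr int kDefaultWnd = 4 * kDefaultMss;  // = 2144, the window advertised in the capture
```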
Increasing TCP MSS to 1460 seems to fix this issue using httpforever.com |
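For anyone reproducing the MSS change: the option names below are stock lwIP ones, but whether this is the exact hook used in the Mbed core build is an assumption, and changing it requires rebuilding the core locally. With the same 4x relationship the advertised window grows to 5840 bytes, which matches the later capture:

```cpp
// Hypothetical lwipopts.h-level override (the Mbed config macro feeding it is a guess):
#define TCP_MSS 1460              // lwIP default is 536
#define TCP_WND (4 * TCP_MSS)     // = 5840-byte advertised window
```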
I do think though this might be some toxic mix of issues.
When you call contentLength() on the HttpClient it (quite rightly) does not read the entire data stream, so if your content is longer than the advertised window the server is going to stall and not send anything else (which again is what you want). If the LWIP stack isn't handling that well and then doesn't implement RST properly, this may well be the cause.
So, what does the packet capture look like with the larger MSS? Do the Window Full flags still get set and the RST still get sent, but LWIP ends up in a better state?
|
Here is the capture with increased MSS; no TCP Window Full occurs. mss-increased.pcapng.zip |
Doesn't this mean the increase to MSS is simply hiding an underlying problem?
|
I will take a look at it tomorrow as well.
|
FYI, in addition to the mbed conf settings mentioned here #966 (comment), I've also set |
Hi @schnoberts1, I ran your sketch on my setup for 5 mins with a 5 sec delay, 5 mins with a 1 sec delay and for 2 mins with a 100 ms delay - output attached. 100ms 20:48:23.390 -> HI |
This makes sense as I think the increased MSS results in the issue being masked. |
Just re-ran your sketch (even though after bedtime) with the stock mbed lib and just the SOCKET_BUFFER_SIZE override. It works. Please try |
That's a good find. I will try tomorrow. I bet this is because, with the extra socket buffer, it's not advertising a smaller TCP window since there's room for the payload. Thank you very much. I think I will spin up a local web server and test it against payload size.
|
just increasing |
My apologies, you are correct. It must not have used the stock lib (it was late) |
I looked at the packet capture (increased MSS). The increased MSS results in a larger window being advertised by the Giga. It advertises a 5840-byte window, which is larger than the payload that's being sent to it, so we get no Zero Window. It also means the http server can send the last bit of data with the FIN flag set, so when the RST/ACK is sent the socket is presumably already half closed because the Giga got a FIN.
Without this increase the server can't send all the data, the FIN is never sent, and the RST/ACK goes out on an open connection. The server doesn't abort its side of the connection and sends Keep-Alives (which I don't think is right, but the TCP stack is supposed to behave well when things aren't quite right). The Giga's stack is processing these keep-alives despite the fact it has aborted the connection. I don't think this is correct. The result is the socket being held open incorrectly.
I propose this is the issue, and what we see with increased MSS (etc.) is a mask. I can't build a custom mbed build at the moment, but I imagine testing this hypothesis would involve an http server with keep-alive that responds like the one I chose, but sends a much larger http body, triggering the zero window, no FIN etc. To sort of demonstrate this: if I don't just call contentLength() but instead drain the connection by calling responseBody(), the problem goes away, because the FIN can be sent, the RST is on a half-closed socket and the Keep-Alives aren't sent.
The reason I keep coming back to this 'MSS is a mask' is because I think there's a defect here, and the defect isn't "wrong MSS" but rather wrong behaviour under the specific situations the small MSS creates relative to this web server. Does that make sense?
You might ask why I care? Well, it's because the stack is not freeing resources when it should (in my opinion), and the internet is full of servers that do weird things TCP-wise. I want my system to be robust in the face of those, and I'm not convinced that the MSS change is really addressing the core issue. That said, a more standard MSS for a machine like the Giga, which has plenty of memory, does seem like a good change irrespective. |
It makes perfect sense. I've done other tests with the increased MSS, trying to download bigger files... know what? The issue comes back again, so I agree with you: increasing MSS is masking the issue and is not the fix. |
I think at this stage it might be worth confirming with the LWIP crowd what they expect to happen in the situation given the PCAP showing the failure mode. Maybe I am wrong and this is the expected behaviour and it's fine because there's a timeout that handles it eventually. To some extent I think you could triage this issue away if you were inclined to on the basis of:
... but it'd be nice to get a response from the LWIP community (I assume this is where the issue might lie, or perhaps not.... skimming the LWIP code it does look like resource freeing happens on a RST and also shortly after in its background thread that cleans up dead connections - that's an every 0.5s/0.25s interval thread though, not a 5s interval). I should probably change the title of this ticket to be clearer about the situation. BTW: Obv when I fully drain the connection by calling responseBody() on HttpClient the problem doesn't occur. |
... I guess the counter-argument is that there are plenty of situations where you might want to abandon your connection while the server is still writing data, e.g. when validation of the content fails. But we have a workaround: always consume everything until you think it's safe to close the connection (i.e. you are likely caught up), even in that situation. Obviously that won't work for streaming data, I imagine. |
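To make that workaround concrete, here is a minimal sketch using the stock ArduinoHttpClient API (the server name is a placeholder): drain the body with responseBody() before stopping, so the peer can finish sending and FIN, rather than closing with data still queued.

```cpp
#include <WiFi.h>
#include <ArduinoHttpClient.h>

WiFiClient wifi;
HttpClient http(wifi, "example.org", 80);   // placeholder server

void fetchOnce() {
  if (http.get("/") != 0) return;           // connection / request failed

  int status = http.responseStatusCode();
  int length = http.contentLength();        // reading headers leaves the body unread

  // Workaround: consume the whole body so the server can finish and send its FIN,
  // instead of closing with data still in flight (which triggers the RST path).
  String body = http.responseBody();

  http.stop();
  (void)status; (void)length; (void)body;
}
```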
Also, I do appreciate the effort put into looking at this. It's helped a lot. |
Platform: Giga R1
Arduino Core Mbed 4.1.5
HttpClient: 0.6.1
Here's a sequence of events with a crudely instrumented MbedClient. We connect using ArduinoHttpClient every 1s. Every 5th connection fails with error -3005 from
static_cast<TCPSocket *>(sock)->open(getNetwork());
in the MbedClient::connect() call. When it fails it takes 30s to return. This is in version 4.1.5. In version 4.1.1 it would never recover and would continue to fail until device reboot; there would be no 30s delay. This issue occurs on any http server I've tried. I can continue to connect to the server used in the code with Python on my Mac with no problems. If I extend the delay so I connect to the server once every 5s the problem goes away. At 4s it reappears.
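A rough approximation of the loop described above, as a sketch rather than the original attachment (credentials and the server are placeholders); it deliberately reads only the status and content length, never the body:

```cpp
#include <WiFi.h>
#include <ArduinoHttpClient.h>

const char ssid[] = "your-ssid";   // placeholder
const char pass[] = "your-pass";   // placeholder

void setup() {
  Serial.begin(115200);
  WiFi.begin(ssid, pass);
  while (WiFi.status() != WL_CONNECTED) delay(500);
}

void loop() {
  WiFiClient wifi;                                  // fresh client each pass, as in the report
  HttpClient http(wifi, "httpforever.com", 80);

  int err = http.get("/");
  Serial.print("get: ");
  Serial.println(err);                              // failures here correspond to the -3005 seen
                                                    // inside the instrumented MbedClient::connect()
  if (err == 0) {
    Serial.println(http.responseStatusCode());
    Serial.println(http.contentLength());           // note: the body is never drained
  }
  http.stop();

  delay(1000);                                      // connect roughly every 1s
}
```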