Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Every 5th TCP connection fails on Arduino Giga R1 if we don't fully read the HTTP content sent from the server each time #937

Open
schnoberts1 opened this issue Aug 15, 2024 · 31 comments · May be fixed by #966

Comments

@schnoberts1
Copy link

schnoberts1 commented Aug 15, 2024

Platform: Giga R1
Arduino Core Mbed 4.1.5
HttpClient: 0.6.1

Here's a sequence of events with a crudely instrumented MbedClient. We connect using ArduinoHttpClient every 1s. Every 5th connection fails with error -3005 from static_cast<TCPSocket *>(sock)->open(getNetwork()); in the MbedClient::connect() call. When it fails it takes 30s to return. This is in version 4.1.5. In version 4.1.1 it would never recover and continue to fail until device reboot. There would be no 30s delay. This issue occurs on any http server I've tried. I can continue to connect to the server in the code with python on my Mac with no problems. If I extend the delay so I connect to the server once every 5s the problem goes away. At 4s it reappears.

22:10:12.229 -> Fetching content length from httpforever.com:80
22:10:12.393 -> Return code from connect0
22:10:13.213 -> took=993ms length: 5124
22:10:15.229 -> Fetching content length from httpforever.com:80
22:10:15.394 -> Return code from connect0
22:10:16.216 -> took=979ms length: 5124
22:10:18.198 -> Fetching content length from httpforever.com:80
22:10:18.365 -> Return code from connect0
22:10:19.225 -> took=1008ms length: 5124
22:10:21.203 -> Fetching content length from httpforever.com:80
22:10:21.366 -> Return code from connect0
22:10:22.222 -> took=1004ms length: 5124
22:10:24.202 -> Fetching content length from httpforever.com:80
22:10:24.202 -> Return code from TCPSocket::open-3005
22:10:54.295 -> took=30099ms length: -1

code

#include <Arduino.h>
#include <WiFi.h>
#include <HttpClient.h>

void setup() {
  Serial.begin(115200);
  delay(1000);
  Serial.println("HI");
  // put your setup code here, to run once:
  auto wifi = WiFiInterface::get_default_instance();
  wifi->set_dhcp(true);
  wifi->set_credentials("ZZZZ", "ZZZZ", nsapi_security_t::NSAPI_SECURITY_WPA2);
  Serial.println(wifi->connect());
}

void loop() {
  // put your main code here, to run repeatedly:
  WiFiClient client;
 
   HttpClient httpClient(client, "httpforever.com", 80);
   Serial.println("Fetching content length from httpforever.com:80");
   auto then = millis();
   httpClient.beginRequest();
   httpClient.get("/");
   httpClient.endRequest();
   auto len = httpClient.contentLength(); // Doesn't read all content
   Serial.print("took="); Serial.print((millis() - then)); Serial.print("ms "); Serial.print("length: "); Serial.println(len);
   delay(5000);
}
@schnoberts1
Copy link
Author

schnoberts1 commented Aug 15, 2024

... at 4.5s delay it takes 8 calls before it fails, an interesting multiple of 4. Any hints? I tried moving WifiClient globally but that swaps -3005 errors for -3003 errors on the first re-use:

22:25:36.246 -> Fetching content length from httpforever.com:80
22:25:36.448 -> Return code from connect0
22:25:37.243 -> took=1001ms length: 5124
22:25:38.260 -> Fetching content length from httpforever.com:80
22:25:38.260 -> Return code from TCPSocket::open-3003
22:26:09.251 -> took=31000ms length: -1

31s timeout this time.

Since I've included the code this should replicate well.

@schnoberts1
Copy link
Author

... I am also guessing this 30s timeout is in HttpClient since there's no delay on the return from the failed connect.

@schnoberts1
Copy link
Author

Note: calling client.stop() does not resolve the issue

@stopandchatchfire
Copy link

stopandchatchfire commented Sep 5, 2024

The maximum number of TCP connections is 4.

As defined in the file: "mbed_config.h"

#define MBED_CONF_LWIP_TCP_SOCKET_MAX 4

I don't understand why it's so low. Need to recompile locally to have more connections available.

@schnoberts1
Copy link
Author

schnoberts1 commented Sep 5, 2024

I never have 4 tcp sockets open but I am confident this limit is at the core of the issue. Some clean up isn’t being executed. Increasing the socket count isn’t going to help though as it just puts the issue off.

@JAndrassy
Copy link
Contributor

@schnoberts1
Copy link
Author

schnoberts1 commented Sep 9, 2024

Of course @JAndrassy. I've applied the patch and the issue still remains.

@pennam
Copy link
Contributor

pennam commented Oct 1, 2024

@schnoberts1 the 30s timeout comes from the Http library, you can get rid of it adding httpClient.setHttpResponseTimeout(1000); to your sketch.

I'm still investigating about the NSAPI_ERROR_NO_SOCKET return value but i agree it is related to the MBED_CONF_LWIP_TCP_SOCKET_MAX define and in the way lwip TCP connections are deleted https://github.com/arduino/mbed-os/blob/e0cad5c4277b3c5ee9cb01a9529df9333612bbeb/connectivity/lwipstack/lwip/src/api/lwip_api_lib.c#L233

It looks that before complete deletion they can be put in a waitstate and closed afterwards

https://github.com/arduino/mbed-os/blob/e0cad5c4277b3c5ee9cb01a9529df9333612bbeb/connectivity/lwipstack/lwip/src/api/lwip_api_msg.c#L913

@JAndrassy
Copy link
Contributor

does the HttpClient send the "Connection: close" header?

@pennam
Copy link
Contributor

pennam commented Oct 1, 2024

@JAndrassy yes HttpClient sends the "Connection: close" header.

I think I've found the rootcause, and it is server side, in fact httpforever.com keeps sending TCP Keep-Alive messages, and thus connections are not closed. Using a different server like example.org connections are closed correctly and the NSAPI_ERROR_NO_SOCKET error is not generated..

@schnoberts1
Copy link
Author

schnoberts1 commented Oct 1, 2024 via email

@pennam
Copy link
Contributor

pennam commented Oct 1, 2024

Here is what i get with wireshark, I'm now thinking if the TCP Window Full messages are somehow involved
wireshark.zip

@pennam
Copy link
Contributor

pennam commented Oct 1, 2024

I get the same NSAPI_ERROR_NO_SOCKET using anothe http site sneaindia.com TCP Window Full
messages are also present.
snea.pcapng.zip

@schnoberts1
Copy link
Author

schnoberts1 commented Oct 1, 2024 via email

@pennam pennam linked a pull request Oct 1, 2024 that will close this issue
@pennam
Copy link
Contributor

pennam commented Oct 1, 2024

Increasing TCP MSS to 1460 seems to fix this issue using httpforever.com

@schnoberts1
Copy link
Author

schnoberts1 commented Oct 1, 2024 via email

@pennam
Copy link
Contributor

pennam commented Oct 1, 2024

Here it is the capture with increased MSS, no TCP Window Full occurs
mss-increased.pcapng.zip

@schnoberts1
Copy link
Author

schnoberts1 commented Oct 1, 2024 via email

@schnoberts1
Copy link
Author

schnoberts1 commented Oct 1, 2024 via email

@megacct
Copy link

megacct commented Oct 1, 2024

FYI, In addition to the mbed conf settings mentioned here #966 (comment), I've also set
compiler.c.extra_flags=-DSOCKET_BUFFER_SIZE=2048
compiler.cpp.extra_flags=-DSOCKET_BUFFER_SIZE=2048
in my platform local

@megacct
Copy link

megacct commented Oct 1, 2024

Hi @schnoberts1, I ran your sketch on my setup for 5mins with 5 sec delay, 5mins with 1 sec delay and for 2mins with 100ms delay - output attached.
output.zip

100ms

20:48:23.390 -> HI
20:48:29.958 -> 0
20:48:29.958 -> Fetching content length from httpforever.com:80
20:48:31.140 -> took=1178ms length: 5124
20:48:31.233 -> Fetching content length from httpforever.com:80
20:48:31.852 -> took=645ms length: 5124
20:48:31.947 -> Fetching content length from httpforever.com:80
20:48:32.608 -> took=641ms length: 5124
20:48:32.702 -> Fetching content length from httpforever.com:80
20:48:33.361 -> took=649ms length: 5124
20:48:33.455 -> Fetching content length from httpforever.com:80
20:48:34.117 -> took=646ms length: 5124
20:48:34.212 -> Fetching content length from httpforever.com:80
20:48:34.825 -> took=642ms length: 5124
20:48:34.966 -> Fetching content length from httpforever.com:80
20:48:35.579 -> took=645ms length: 5124
20:48:35.673 -> Fetching content length from httpforever.com:80
20:48:36.337 -> took=650ms length: 5124
20:48:36.432 -> Fetching content length from httpforever.com:80
20:48:37.094 -> took=647ms length: 5124
20:48:37.189 -> Fetching content length from httpforever.com:80
20:48:37.851 -> took=650ms length: 5124
20:48:37.946 -> Fetching content length from httpforever.com:80
20:48:38.608 -> took=652ms length: 5124
20:48:38.702 -> Fetching content length from httpforever.com:80
20:48:39.364 -> took=671ms length: 5124
20:48:39.460 -> Fetching content length from httpforever.com:80
20:48:40.120 -> took=654ms length: 5124
20:48:40.215 -> Fetching content length from httpforever.com:80
20:48:40.877 -> took=647ms length: 5124
20:48:40.972 -> Fetching content length from httpforever.com:80
20:48:41.628 -> took=645ms length: 5124
20:48:41.723 -> Fetching content length from httpforever.com:80
20:48:42.385 -> took=665ms length: 5124
20:48:42.481 -> Fetching content length from httpforever.com:80
20:48:43.144 -> took=651ms length: 5124
20:48:43.240 -> Fetching content length from httpforever.com:80
20:48:43.903 -> took=648ms length: 5124
20:48:43.998 -> Fetching content length from httpforever.com:80
20:48:44.659 -> took=654ms length: 5124
20:48:44.752 -> Fetching content length from httpforever.com:80
20:48:45.367 -> took=644ms length: 5124
20:48:45.460 -> Fetching content length from httpforever.com:80
20:48:46.123 -> took=647ms length: 5124
20:48:46.217 -> Fetching content length from httpforever.com:80
20:48:46.879 -> took=646ms length: 5124
20:48:46.975 -> Fetching content length from httpforever.com:80
20:48:47.636 -> took=649ms length: 5124
20:48:47.731 -> Fetching content length from httpforever.com:80
20:48:48.392 -> took=644ms length: 5124
20:48:48.485 -> Fetching content length from httpforever.com:80
20:48:49.147 -> took=653ms length: 5124
20:48:49.241 -> Fetching content length from httpforever.com:80
20:48:49.855 -> took=650ms length: 5124
20:48:49.997 -> Fetching content length from httpforever.com:80
20:48:50.610 -> took=654ms length: 5124
20:48:50.751 -> Fetching content length from httpforever.com:80
20:48:51.364 -> took=646ms length: 5124
20:48:51.459 -> Fetching content length from httpforever.com:80
20:48:52.119 -> took=645ms length: 5124
20:48:52.214 -> Fetching content length from httpforever.com:80
20:48:52.876 -> took=650ms length: 5124
20:48:52.971 -> Fetching content length from httpforever.com:80
20:48:53.633 -> took=650ms length: 5124
20:48:53.728 -> Fetching content length from httpforever.com:80
20:48:54.389 -> took=645ms length: 5124
20:48:54.483 -> Fetching content length from httpforever.com:80
20:48:55.141 -> took=654ms length: 5124
20:48:55.233 -> Fetching content length from httpforever.com:80
20:48:55.897 -> took=652ms length: 5124
20:48:55.992 -> Fetching content length from httpforever.com:80
20:48:56.652 -> took=649ms length: 5124
20:48:56.746 -> Fetching content length from httpforever.com:80
20:48:57.360 -> took=651ms length: 5124
20:48:57.501 -> Fetching content length from httpforever.com:80
20:48:58.119 -> took=646ms length: 5124
20:48:58.260 -> Fetching content length from httpforever.com:80
20:48:58.876 -> took=654ms length: 5124
20:48:59.018 -> Fetching content length from httpforever.com:80
20:48:59.674 -> took=653ms length: 5124
20:48:59.768 -> Fetching content length from httpforever.com:80

@schnoberts1
Copy link
Author

This makes sense as I think the increased MSS results in the issue being masked.

@megacct
Copy link

megacct commented Oct 1, 2024

Just re-ran your sketch (even thoguh after bedtime) with stock mbedlib and just the SOCKET_BUFFER_SIZE override. It works. Please try
p.s. should probably be larger and declared static in arduino::MbedClient::readSocket for better perf but I've other things on the go

@schnoberts1
Copy link
Author

schnoberts1 commented Oct 1, 2024 via email

@pennam
Copy link
Contributor

pennam commented Oct 2, 2024

just increasing SOCKET_BUFFER_SIZE seems to not fix the issue in my case

@megacct
Copy link

megacct commented Oct 2, 2024

My apologies, you are correct. It must not have used the stock lib (it was late)

@schnoberts1
Copy link
Author

schnoberts1 commented Oct 2, 2024

I looked at the packet capture (increased MSS). The increased MSS size results in a larger window being advertised by the Giga. It advertises a 5840 byte window which is larger than the payload that's being sent to it. We get no ZeroWindow. It also means that the http server can send the last bit of data with the FIN state set so when the RST/ACK is sent the socket is already presumably half closed as the Giga got a FIN.

Without this increase the server can't send all the data, the FIN state is not sent, the RST/ACK is sent on an open connection. The Server doesn't abort its side of the connection and sends Keep Alives (which I don't think is right, but the TCP stack is supposed to behave well when things aren't quire right). The Giga's stack is processing these keep alives despite the fact it's aborted the connection. I don't think this is correct. The result is the socket being held open incorrectly. I propose this is the issue and what we see with increased MSS (etc) is a mask. I can't build a custom mbed build at the moment, but I imagine testing this hypothesis would involve an http server with keep-alive that responds like the one I chose, but sends a much larger http body content, triggering the zero window, no FIN etc.

To sort of demonstrate this, if I don't just call contentLength() but instead drain the connection by calling responseBody() then the problem goes away, because the FIN can be sent, RST is on a half closed socket and the Keep Alives aren't set.

The reason I keep coming back to this 'MSS is a mask' is because I think there's a defect here and the defect isn't "wrong MSS" but rather wrong behaviour under the specific situations the small MSS creates relative to this web server.

Does that make sense?

You might ask why I care? Well, it's because the stack is not freeing resources when it should (in my opinion) and the internet is full of servers that do weird things TCP wise and I want my system to be robust in the face of those and I'm not convinced that the MSS is really addressing the core issue.

That being said, a more standard MSS for a machine like the Giga which has plenty of memory does seem like a good change irrespective.

@pennam
Copy link
Contributor

pennam commented Oct 2, 2024

Does that make sense?

it makes perfectly sense, I've done other tests with the increased MSS and trying to download bigger files... know what? the issue comes back again, so i agree with you increasing MSS is masking the issue and is not the fix.

@schnoberts1
Copy link
Author

I think at this stage it might be worth confirming with the LWIP crowd what they expect to happen in the situation given the PCAP showing the failure mode. Maybe I am wrong and this is the expected behaviour and it's fine because there's a timeout that handles it eventually.

To some extent I think you could triage this issue away if you were inclined to on the basis of:

  1. The connection is torn down eventually.
  2. Always read the entire stream from the server and if you don't then don't expect the stack to free resources in a timely fashion.
  3. It is a somewhat contrived example.

... but it'd be nice to get a response from the LWIP community (I assume this is where the issue might lie, or perhaps not.... skimming the LWIP code it does look like resource freeing happens on a RST and also shortly after in its background thread that cleans up dead connections - that's an every 0.5s/0.25s interval thread though, not a 5s interval).

I should probably change the title of this ticket to be clearer about the situation.

BTW: Obv when I fully drain the connection by calling responseBody() on HttpClient the problem doesn't occur.

@schnoberts1 schnoberts1 changed the title Every 5th TCP connection fails on Arduino Giga R1. Every 5th TCP connection fails on Arduino Giga R1 if we don't fully read the HTTP content sent from the server each time Oct 2, 2024
@schnoberts1
Copy link
Author

schnoberts1 commented Oct 2, 2024

... I guess the counter argument is that there are plenty of situations where you might want to abandon your connection while the server is still writing data, e.g. validation of content fails but we have a workaround, always consume everything up until you think it's safe to close the connection (i.e. you are likely caught up), even in that situation. Obv won't work for streaming data I imagine.

@schnoberts1
Copy link
Author

Also, I do appreciate the effort put into looking at this. It's helped a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants