Improvements needed to handling of socket connection lost #946

colinl · 2024-06-06T08:42:00Z

Description

Currently, in the client, if the socket disconnects then it pops up Connection Lost alert and enters a sequence of repeated attempts to reconnect. Initially it tries every 5 seconds, for a period of 60 seconds, then it drops back to trying every 30 seconds, After a total of 5 minutes it gives up entirely, leaving the Connection Lost, Trying to reconnect message on screen, which is not correct as it is no longer retrying. Some other action needs to be taken at this time. In addition there are some reasons for connection lost that will not recover (such as an authorisation token expiring), the retry sequence is not appropriate in these cases.

I am currently attempting to identify possible causes of disconnection and how to identify them. I will update this as I build the list, and will propose appropriate recovery action.

The first class of failure is for situations where attempting to reconnect for period is entirely appropriate. This includes network failures that prevent access to the server, and problems with node red itself that prevent the connection from succeeding, for example the server is stopped or is in the process of being restarted.
The retry algorithm described above has two problems:

If a laptop is suspended then the network may be shut down and the connection may fail before node red is suspended. If it is left suspended for more than 5 minutes then the retry algorithm notices that more than 5 minutes has elapsed since the connection failed, so does not try again. I believe a similar thing may happen on Android if the the OS decides to put the app to sleep when the screen is off or the app is put in the background. To avoid this I propose to change the retry algorithm from time based to retry count based, so it will try 12 times at 5 second intervals, then another 8 times at 30 second intervals.
If the connection is not restored within the 5 minutes then the user is left with a notification saying the app is trying to reconnect, but in fact it is not trying. I propose that after the 5 minutes, rather than just hanging, then we should force a page reload. I will do some tests to see what the effect of this is when the connection still cannot be restored.

The second class of failures is for situations where it is permanently not possible to reconnect. For these there is no point retrying, the solution is to immediately force a page reload. An examples of this is where an authorisation token has expired and it is necessary to reload in order to load the logon page. The only situation I have been able to test for this is a system that uses Cloudflare Zero Trust to connect to the server, and in that case, when the token expires, after a little while a socket.on('connect_error', (err) => { is triggered, where the error contains {"code": "parser error"}. I propose to trap that, and immediately force a page reload for that situation. There may well be other such permanent failure modes, but unless anyone can identify them now I don't see what we can do other than wait for them to be identified.

Have you provided an initial effort estimate for this issue?

I am no FlowFuse team member

The text was updated successfully, but these errors were encountered:

arturv2000 · 2024-06-06T10:23:29Z

I think since I use basic authentication directly with Node-Red, I don't think I have issues with token expiring. At least if I force a page refresh because some widgets stop working no login is required from the browser.

colinl · 2024-06-06T12:41:20Z

OK, thanks @arturv2000

colinl · 2024-06-07T08:27:51Z

I have updated the Description with my findings and proposal. I will generate a PR to implement them.

cstns · 2024-06-07T08:38:25Z

The {"code": "parser error} on connect_error seems to be a bit odd! Is there anything else on the error object other than the code key?

colinl · 2024-06-07T08:43:13Z

No, that is all. Could it be that the server is again responding with the cloudflare logon page? So effectively the same as the JSON parser error we were getting when trying to parse the _setup response in the same situation?

cstns · 2024-06-07T08:54:03Z

That is very unfortunate! That would be my best guess as well, expecting to receive a json or related type of response and receiving text, html or something unexpected.

We we would need some more info to go on. Could you add a global event listener and print out all the messages socket.io is brokering? Something similar to

socket.onAny((event, ...args) => {
    console.log(`Received event: ${event}`, args);
});

What would interest us is the event and args right before the parser error. I hope it will shed some light on what's happening because we're kind of shooting in the dark

colinl · 2024-06-07T11:00:20Z

I put that in, but it doesn't show anything useful. The connection initially keeps going, even though the token has expired, presumably because it is an already open connection. It is not until the connection is dropped for some reason that there is an issue. The code attempts to reconnect and fails. For example, if I trigger it by suspending the laptop briefly then I see

index-CYywKa1H.js:231 
 
 GET https://nodered.clanlaw.org.uk/dashboard/socket.io/?EIO=4&transport=polling&t=O_oLHRK net::ERR_CONNECTION_CLOSED

index-CYywKa1H.js:239 
 SIO connect error: 
Error: timeout
    at index-CYywKa1H.js:231:55308
 err: {}

index-CYywKa1H.js:239 
 SIO connect error: 
Error: server error
    at lt.onPacket (index-CYywKa1H.js:231:36803)
    at Emitter.emit (index-CYywKa1H.js:231:20587)
    at Polling.onPacket (index-CYywKa1H.js:231:22781)
    at r (index-CYywKa1H.js:231:24971)
    at Array.forEach (<anonymous>)
    at Polling.onData (index-CYywKa1H.js:231:25024)
    at Emitter.emit (index-CYywKa1H.js:231:20587)
    at Request.onLoad (index-CYywKa1H.js:231:27661)
    at a.onreadystatechange (index-CYywKa1H.js:231:27088)
 err: {"code":"parser error"}

The final line there is shown because I have modified the connect_error log to

console.error('SIO connect error:', err, `err: ${JSON.stringify(err)}`)

The only way I know to recover from this is to reload the page.

cstns · 2024-06-07T11:52:07Z

It wouldn't be enough to deduce what's happening. Is there anything logged on the NR side when these errors occur?

If we can't get anything out of NR logs I think the only solution left would be the initial option of hard reloading the page.

colinl · 2024-06-07T12:40:22Z

Nothing useful in the server log. Even at Trace level in node red and NODE_ENV=development for the dashboard, all I get is
7 Jun 13:36:11 - [info] [ui-base:Dashboard] Disconnected rRwjJ9_VueSmXX22AAAS due to transport close

cstns · 2024-06-07T12:56:27Z

Would a better course of action be to force a page reload only in case of a disconnect ws message combined with authentication required.

It would prevent unauthorized user access to stale dashboard data and break off redirect loops because presumably the page would be redirected to the authentication provider. The tricky part in this situation would be how to find out if the NR instance requires authentication. A happy scenario would be to have access to this information in the response on the _setup call.

With stuff like this I'd seek @joepavitt's guidance as my knowledge of NR and the NR-dashboard intricacies is somewhat limited.

colinl · 2024-06-07T13:27:30Z

Would a better course of action be to force a page reload only in case of a disconnect ws message combined with authentication required.

The check on the error code, and potential page reload, would only happen if we had previously been connected, so this is a reconnect, not an initial connect, so that should stop any possibility of repeated load on startup. I don't know whether it is possible for the server to know that authentication is required. I guess that would be something in the socket connection code. If we could have an indication that the attempt to setup the socket has been prevented because of the redirection to the logon page that would be conclusive.

A happy scenario would be to have access to this information in the response on the _setup call.

The _setup call is not being invoked here, this is purely an attempt to reconnect to the socket. The _setup call was before the disconnection.

In the meantime I will submit a PR for the the class 1 failures described in the description. That is, how to reconnect after a disconnection where re-authorisation is not involved.

colinl · 2024-06-08T19:45:58Z

I have managed to capture a 'parser error' in the console and from the network activity in the browser tools I can see what is happening. It is in fact very simple.
After the authorisation token has expired the dashboard carries on as normal, using the existing socket connection, until the connection is broken for some reason. Then when the reconnect is attempted I can see that the socket.connect() call in function reconnect() does a GET from https://nodered.mydomain.org.uk/dashboard/socket.io/?EIO=4&transport=polling&t=O_ufrkf.
As the token has expired, Cloudflare responds with a 302 Found status, redirecting to https://mydomain.cloudflareaccess.com/cdn-cgi/access/login/nodered.mydomain.org.uk?kid=9df892976242106878f0a65fd2c4543695b45472395325aba3f8277a33c5e44c&redirect_url=%2Fdashboard%2Fsocket.io%2F%3FEIO%3D4%26transport%3Dpolling%26t%3DO_ufrkf&meta=eyJraWQiOiIyZmE3OTM1NWY2YzdlYjRmNTdhY....
Which is the cloudflare login web page, which is not what socket.io is expecting, so it returns the parser error failure.

cstns · 2024-06-09T09:27:42Z

That is an awesome result!

If we can detect that 302 redirect we can safely force a page reload!

colinl · 2024-06-09T09:30:59Z

I think that would have to be down in the socket code somewhere.
I don't see the argument against forcing a reload when parser error is seen. Even if it isn't caused by exactly this problem, what else can we do?

cstns · 2024-06-09T10:55:46Z

You have a good point, I can't think of anything else at the moment.

colinl · 2024-06-09T11:22:23Z

I will do a PR for that. It is working well for me here. Anyone with any other ideas can chip in there.

joepavitt · 2024-06-11T08:05:37Z

Thanks @colinl - I'm playing catch up on this since returning today from holiday, can you reference this issue in any PR you open please so that it gets linked.

colinl · 2024-06-11T10:00:41Z

I have done so from the first one #950. I am waiting for that to be merged before doing the second as they are in the same bit of code.

joepavitt · 2024-06-11T10:09:37Z

You can open still open the PR, just put a comment in the first bit of the PR that it's dependent upon the other one first :)

colinl · 2024-06-12T10:07:32Z

Completed by PR #962

colinl added the task A piece of work that isn't necessarily tied to a specific Epic or Story. label Jun 6, 2024

colinl mentioned this issue Jun 6, 2024

PWA-Chrome issue #937

Closed

colinl mentioned this issue Jun 8, 2024

Use retry count based algorithm for socket reconnections #950

Merged

9 tasks

colinl mentioned this issue Jun 12, 2024

Force page reload on socket connect error 'parser error' #962

Merged

9 tasks

colinl closed this as completed Jun 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improvements needed to handling of socket connection lost #946

Improvements needed to handling of socket connection lost #946

colinl commented Jun 6, 2024 •

edited

Loading

arturv2000 commented Jun 6, 2024

colinl commented Jun 6, 2024

colinl commented Jun 7, 2024

cstns commented Jun 7, 2024

colinl commented Jun 7, 2024

cstns commented Jun 7, 2024

colinl commented Jun 7, 2024 •

edited

Loading

cstns commented Jun 7, 2024

colinl commented Jun 7, 2024

cstns commented Jun 7, 2024

colinl commented Jun 7, 2024

colinl commented Jun 8, 2024

cstns commented Jun 9, 2024

colinl commented Jun 9, 2024

cstns commented Jun 9, 2024

colinl commented Jun 9, 2024

joepavitt commented Jun 11, 2024 •

edited

Loading

colinl commented Jun 11, 2024

joepavitt commented Jun 11, 2024

colinl commented Jun 12, 2024

Improvements needed to handling of socket connection lost #946

Improvements needed to handling of socket connection lost #946

Comments

colinl commented Jun 6, 2024 • edited Loading

Description

Have you provided an initial effort estimate for this issue?

arturv2000 commented Jun 6, 2024

colinl commented Jun 6, 2024

colinl commented Jun 7, 2024

cstns commented Jun 7, 2024

colinl commented Jun 7, 2024

cstns commented Jun 7, 2024

colinl commented Jun 7, 2024 • edited Loading

cstns commented Jun 7, 2024

colinl commented Jun 7, 2024

cstns commented Jun 7, 2024

colinl commented Jun 7, 2024

colinl commented Jun 8, 2024

cstns commented Jun 9, 2024

colinl commented Jun 9, 2024

cstns commented Jun 9, 2024

colinl commented Jun 9, 2024

joepavitt commented Jun 11, 2024 • edited Loading

colinl commented Jun 11, 2024

joepavitt commented Jun 11, 2024

colinl commented Jun 12, 2024

colinl commented Jun 6, 2024 •

edited

Loading

colinl commented Jun 7, 2024 •

edited

Loading

joepavitt commented Jun 11, 2024 •

edited

Loading