Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements needed to handling of socket connection lost #946

Closed
colinl opened this issue Jun 6, 2024 · 20 comments
Closed

Improvements needed to handling of socket connection lost #946

colinl opened this issue Jun 6, 2024 · 20 comments
Labels
task A piece of work that isn't necessarily tied to a specific Epic or Story.

Comments

@colinl
Copy link
Contributor

colinl commented Jun 6, 2024

Description

Currently, in the client, if the socket disconnects then it pops up Connection Lost alert and enters a sequence of repeated attempts to reconnect. Initially it tries every 5 seconds, for a period of 60 seconds, then it drops back to trying every 30 seconds, After a total of 5 minutes it gives up entirely, leaving the Connection Lost, Trying to reconnect message on screen, which is not correct as it is no longer retrying. Some other action needs to be taken at this time. In addition there are some reasons for connection lost that will not recover (such as an authorisation token expiring), the retry sequence is not appropriate in these cases.

I am currently attempting to identify possible causes of disconnection and how to identify them. I will update this as I build the list, and will propose appropriate recovery action.

The first class of failure is for situations where attempting to reconnect for period is entirely appropriate. This includes network failures that prevent access to the server, and problems with node red itself that prevent the connection from succeeding, for example the server is stopped or is in the process of being restarted.
The retry algorithm described above has two problems:

  1. If a laptop is suspended then the network may be shut down and the connection may fail before node red is suspended. If it is left suspended for more than 5 minutes then the retry algorithm notices that more than 5 minutes has elapsed since the connection failed, so does not try again. I believe a similar thing may happen on Android if the the OS decides to put the app to sleep when the screen is off or the app is put in the background. To avoid this I propose to change the retry algorithm from time based to retry count based, so it will try 12 times at 5 second intervals, then another 8 times at 30 second intervals.
  2. If the connection is not restored within the 5 minutes then the user is left with a notification saying the app is trying to reconnect, but in fact it is not trying. I propose that after the 5 minutes, rather than just hanging, then we should force a page reload. I will do some tests to see what the effect of this is when the connection still cannot be restored.

The second class of failures is for situations where it is permanently not possible to reconnect. For these there is no point retrying, the solution is to immediately force a page reload. An examples of this is where an authorisation token has expired and it is necessary to reload in order to load the logon page. The only situation I have been able to test for this is a system that uses Cloudflare Zero Trust to connect to the server, and in that case, when the token expires, after a little while a socket.on('connect_error', (err) => { is triggered, where the error contains {"code": "parser error"}. I propose to trap that, and immediately force a page reload for that situation. There may well be other such permanent failure modes, but unless anyone can identify them now I don't see what we can do other than wait for them to be identified.

Have you provided an initial effort estimate for this issue?

I am no FlowFuse team member

@colinl colinl added the task A piece of work that isn't necessarily tied to a specific Epic or Story. label Jun 6, 2024
@colinl colinl mentioned this issue Jun 6, 2024
@arturv2000
Copy link
Contributor

I think since I use basic authentication directly with Node-Red, I don't think I have issues with token expiring. At least if I force a page refresh because some widgets stop working no login is required from the browser.

@colinl
Copy link
Contributor Author

colinl commented Jun 6, 2024

OK, thanks @arturv2000

@colinl
Copy link
Contributor Author

colinl commented Jun 7, 2024

I have updated the Description with my findings and proposal. I will generate a PR to implement them.

@cstns
Copy link
Contributor

cstns commented Jun 7, 2024

The {"code": "parser error} on connect_error seems to be a bit odd! Is there anything else on the error object other than the code key?

@colinl
Copy link
Contributor Author

colinl commented Jun 7, 2024

No, that is all. Could it be that the server is again responding with the cloudflare logon page? So effectively the same as the JSON parser error we were getting when trying to parse the _setup response in the same situation?

@cstns
Copy link
Contributor

cstns commented Jun 7, 2024

That is very unfortunate! That would be my best guess as well, expecting to receive a json or related type of response and receiving text, html or something unexpected.

We we would need some more info to go on. Could you add a global event listener and print out all the messages socket.io is brokering? Something similar to

socket.onAny((event, ...args) => {
    console.log(`Received event: ${event}`, args);
});

What would interest us is the event and args right before the parser error. I hope it will shed some light on what's happening because we're kind of shooting in the dark

@colinl
Copy link
Contributor Author

colinl commented Jun 7, 2024

I put that in, but it doesn't show anything useful. The connection initially keeps going, even though the token has expired, presumably because it is an already open connection. It is not until the connection is dropped for some reason that there is an issue. The code attempts to reconnect and fails. For example, if I trigger it by suspending the laptop briefly then I see

index-CYywKa1H.js:231 
 
 GET https://nodered.clanlaw.org.uk/dashboard/socket.io/?EIO=4&transport=polling&t=O_oLHRK net::ERR_CONNECTION_CLOSED

index-CYywKa1H.js:239 
 SIO connect error: 
Error: timeout
    at index-CYywKa1H.js:231:55308
 err: {}

index-CYywKa1H.js:239 
 SIO connect error: 
Error: server error
    at lt.onPacket (index-CYywKa1H.js:231:36803)
    at Emitter.emit (index-CYywKa1H.js:231:20587)
    at Polling.onPacket (index-CYywKa1H.js:231:22781)
    at r (index-CYywKa1H.js:231:24971)
    at Array.forEach (<anonymous>)
    at Polling.onData (index-CYywKa1H.js:231:25024)
    at Emitter.emit (index-CYywKa1H.js:231:20587)
    at Request.onLoad (index-CYywKa1H.js:231:27661)
    at a.onreadystatechange (index-CYywKa1H.js:231:27088)
 err: {"code":"parser error"}

The final line there is shown because I have modified the connect_error log to

console.error('SIO connect error:', err, `err: ${JSON.stringify(err)}`)

The only way I know to recover from this is to reload the page.

@cstns
Copy link
Contributor

cstns commented Jun 7, 2024

It wouldn't be enough to deduce what's happening. Is there anything logged on the NR side when these errors occur?

If we can't get anything out of NR logs I think the only solution left would be the initial option of hard reloading the page.

@colinl
Copy link
Contributor Author

colinl commented Jun 7, 2024

Nothing useful in the server log. Even at Trace level in node red and NODE_ENV=development for the dashboard, all I get is
7 Jun 13:36:11 - [info] [ui-base:Dashboard] Disconnected rRwjJ9_VueSmXX22AAAS due to transport close

@cstns
Copy link
Contributor

cstns commented Jun 7, 2024

Would a better course of action be to force a page reload only in case of a disconnect ws message combined with authentication required.

It would prevent unauthorized user access to stale dashboard data and break off redirect loops because presumably the page would be redirected to the authentication provider. The tricky part in this situation would be how to find out if the NR instance requires authentication. A happy scenario would be to have access to this information in the response on the _setup call.

With stuff like this I'd seek @joepavitt's guidance as my knowledge of NR and the NR-dashboard intricacies is somewhat limited.

@colinl
Copy link
Contributor Author

colinl commented Jun 7, 2024

Would a better course of action be to force a page reload only in case of a disconnect ws message combined with authentication required.

The check on the error code, and potential page reload, would only happen if we had previously been connected, so this is a reconnect, not an initial connect, so that should stop any possibility of repeated load on startup. I don't know whether it is possible for the server to know that authentication is required. I guess that would be something in the socket connection code. If we could have an indication that the attempt to setup the socket has been prevented because of the redirection to the logon page that would be conclusive.

A happy scenario would be to have access to this information in the response on the _setup call.

The _setup call is not being invoked here, this is purely an attempt to reconnect to the socket. The _setup call was before the disconnection.

In the meantime I will submit a PR for the the class 1 failures described in the description. That is, how to reconnect after a disconnection where re-authorisation is not involved.

@colinl
Copy link
Contributor Author

colinl commented Jun 8, 2024

I have managed to capture a 'parser error' in the console and from the network activity in the browser tools I can see what is happening. It is in fact very simple.
After the authorisation token has expired the dashboard carries on as normal, using the existing socket connection, until the connection is broken for some reason. Then when the reconnect is attempted I can see that the socket.connect() call in function reconnect() does a GET from https://nodered.mydomain.org.uk/dashboard/socket.io/?EIO=4&transport=polling&t=O_ufrkf.
As the token has expired, Cloudflare responds with a 302 Found status, redirecting to https://mydomain.cloudflareaccess.com/cdn-cgi/access/login/nodered.mydomain.org.uk?kid=9df892976242106878f0a65fd2c4543695b45472395325aba3f8277a33c5e44c&redirect_url=%2Fdashboard%2Fsocket.io%2F%3FEIO%3D4%26transport%3Dpolling%26t%3DO_ufrkf&meta=eyJraWQiOiIyZmE3OTM1NWY2YzdlYjRmNTdhY....
Which is the cloudflare login web page, which is not what socket.io is expecting, so it returns the parser error failure.

@cstns
Copy link
Contributor

cstns commented Jun 9, 2024

That is an awesome result!

If we can detect that 302 redirect we can safely force a page reload!

@colinl
Copy link
Contributor Author

colinl commented Jun 9, 2024

I think that would have to be down in the socket code somewhere.
I don't see the argument against forcing a reload when parser error is seen. Even if it isn't caused by exactly this problem, what else can we do?

@cstns
Copy link
Contributor

cstns commented Jun 9, 2024

You have a good point, I can't think of anything else at the moment.

@colinl
Copy link
Contributor Author

colinl commented Jun 9, 2024

I will do a PR for that. It is working well for me here. Anyone with any other ideas can chip in there.

@joepavitt
Copy link
Collaborator

joepavitt commented Jun 11, 2024

Thanks @colinl - I'm playing catch up on this since returning today from holiday, can you reference this issue in any PR you open please so that it gets linked.

@colinl
Copy link
Contributor Author

colinl commented Jun 11, 2024

I have done so from the first one #950. I am waiting for that to be merged before doing the second as they are in the same bit of code.

@joepavitt
Copy link
Collaborator

You can open still open the PR, just put a comment in the first bit of the PR that it's dependent upon the other one first :)

@colinl
Copy link
Contributor Author

colinl commented Jun 12, 2024

Completed by PR #962

@colinl colinl closed this as completed Jun 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
task A piece of work that isn't necessarily tied to a specific Epic or Story.
Projects
None yet
Development

No branches or pull requests

4 participants