-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improvements needed to handling of socket connection lost #946
Comments
I think since I use basic authentication directly with Node-Red, I don't think I have issues with token expiring. At least if I force a page refresh because some widgets stop working no login is required from the browser. |
OK, thanks @arturv2000 |
I have updated the Description with my findings and proposal. I will generate a PR to implement them. |
The |
No, that is all. Could it be that the server is again responding with the cloudflare logon page? So effectively the same as the JSON parser error we were getting when trying to parse the _setup response in the same situation? |
That is very unfortunate! That would be my best guess as well, expecting to receive a json or related type of response and receiving text, html or something unexpected. We we would need some more info to go on. Could you add a global event listener and print out all the messages socket.io is brokering? Something similar to socket.onAny((event, ...args) => {
console.log(`Received event: ${event}`, args);
}); What would interest us is the event and args right before the parser error. I hope it will shed some light on what's happening because we're kind of shooting in the dark |
I put that in, but it doesn't show anything useful. The connection initially keeps going, even though the token has expired, presumably because it is an already open connection. It is not until the connection is dropped for some reason that there is an issue. The code attempts to reconnect and fails. For example, if I trigger it by suspending the laptop briefly then I see
The final line there is shown because I have modified the connect_error log to
The only way I know to recover from this is to reload the page. |
It wouldn't be enough to deduce what's happening. Is there anything logged on the NR side when these errors occur? If we can't get anything out of NR logs I think the only solution left would be the initial option of hard reloading the page. |
Nothing useful in the server log. Even at Trace level in node red and NODE_ENV=development for the dashboard, all I get is |
Would a better course of action be to force a page reload only in case of a disconnect ws message combined with authentication required. It would prevent unauthorized user access to stale dashboard data and break off redirect loops because presumably the page would be redirected to the authentication provider. The tricky part in this situation would be how to find out if the NR instance requires authentication. A happy scenario would be to have access to this information in the response on the _setup call. With stuff like this I'd seek @joepavitt's guidance as my knowledge of NR and the NR-dashboard intricacies is somewhat limited. |
The check on the error code, and potential page reload, would only happen if we had previously been connected, so this is a reconnect, not an initial connect, so that should stop any possibility of repeated load on startup. I don't know whether it is possible for the server to know that authentication is required. I guess that would be something in the socket connection code. If we could have an indication that the attempt to setup the socket has been prevented because of the redirection to the logon page that would be conclusive.
The _setup call is not being invoked here, this is purely an attempt to reconnect to the socket. The _setup call was before the disconnection. In the meantime I will submit a PR for the the class 1 failures described in the description. That is, how to reconnect after a disconnection where re-authorisation is not involved. |
I have managed to capture a 'parser error' in the console and from the network activity in the browser tools I can see what is happening. It is in fact very simple. |
That is an awesome result! If we can detect that 302 redirect we can safely force a page reload! |
I think that would have to be down in the socket code somewhere. |
You have a good point, I can't think of anything else at the moment. |
I will do a PR for that. It is working well for me here. Anyone with any other ideas can chip in there. |
Thanks @colinl - I'm playing catch up on this since returning today from holiday, can you reference this issue in any PR you open please so that it gets linked. |
I have done so from the first one #950. I am waiting for that to be merged before doing the second as they are in the same bit of code. |
You can open still open the PR, just put a comment in the first bit of the PR that it's dependent upon the other one first :) |
Completed by PR #962 |
Description
Currently, in the client, if the socket disconnects then it pops up Connection Lost alert and enters a sequence of repeated attempts to reconnect. Initially it tries every 5 seconds, for a period of 60 seconds, then it drops back to trying every 30 seconds, After a total of 5 minutes it gives up entirely, leaving the Connection Lost, Trying to reconnect message on screen, which is not correct as it is no longer retrying. Some other action needs to be taken at this time. In addition there are some reasons for connection lost that will not recover (such as an authorisation token expiring), the retry sequence is not appropriate in these cases.
I am currently attempting to identify possible causes of disconnection and how to identify them. I will update this as I build the list, and will propose appropriate recovery action.
The first class of failure is for situations where attempting to reconnect for period is entirely appropriate. This includes network failures that prevent access to the server, and problems with node red itself that prevent the connection from succeeding, for example the server is stopped or is in the process of being restarted.
The retry algorithm described above has two problems:
The second class of failures is for situations where it is permanently not possible to reconnect. For these there is no point retrying, the solution is to immediately force a page reload. An examples of this is where an authorisation token has expired and it is necessary to reload in order to load the logon page. The only situation I have been able to test for this is a system that uses Cloudflare Zero Trust to connect to the server, and in that case, when the token expires, after a little while a
socket.on('connect_error', (err) => {
is triggered, where the error contains{"code": "parser error"}
. I propose to trap that, and immediately force a page reload for that situation. There may well be other such permanent failure modes, but unless anyone can identify them now I don't see what we can do other than wait for them to be identified.Have you provided an initial effort estimate for this issue?
I am no FlowFuse team member
The text was updated successfully, but these errors were encountered: