-
Notifications
You must be signed in to change notification settings - Fork 289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More control over http keepalive and sockjs timeouts #264
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just a couple of nits.
Should we note the disable_websockets on;
caveat on the config settings?
} | ||
|
||
sockjs_disconnect_delay { | ||
desc "How long the SockJS server should wait between HTTP requests before considering the client to be disconnected."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you note the default?
* Add additional configuration directives `http_keepalive_timeout`, | ||
`sockjs_heartbeat_delay`, and `sockjs_disconnect_delay` to allow working | ||
with very slow connections and large SockJS payloads. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth clarifying whether or not any of these default values changed from the previous release. (I'm curious)
And/or forbid larger-than-default delays when websockets are enabled, as that's almost certainly a misconfiguration? |
Filed an issue on sockjs-node to make the websocket timeout configurable: sockjs/sockjs-node#219 |
@trestletech I'm not going to forbid or warn on larger-than-default delays with websockets enabled, for now. That's a great suggestion, but I'm not comfortable with the magnitude of the change at the moment (the logic for interpreting The verbiage suggestions have gone in. I'm going to merge now so we can have a build for that customer, but if you have other suggestions I'm happy to take them. Thanks! |
A user reported disconnects when users on slow internet connections try to use apps that have large outputs (it's unrelated to how long the outputs take to calculate, it's purely the payload size that matters).
Repro
I was able to repro using the following app, plus using commands to slow down the network interface on Linux:
This app sends a ~22MB payload to the client (which is ignored) before rendering the plot.
Run this command on the server to slow down one of the network interfaces (tested on Ubuntu):
(Replace with your interface name, e.g.
eth0
. If you don't know your network interface names, call ifconfig. Don't use thelo
interface, that's the loopback and affects not only client-to-SS communications but also SS-to-R, which seems to result in a different failure mode.)To turn off the network interface slowness use this command:
Because we are not using the loopback adapter, you have to use a browser that's on a different machine (or if you're using a Linux VM, then a browser on the host OS will do).
You will get different results depending on what SockJS transport you use:
ERR_INCOMPLETE_CHUNKED_ENCODING
after 45-50 seconds, I believe. Client then attempts to auto-reconnect but the session is gone.I didn't test any other transports.
Root cause
There are several different timeouts that work in concert to disrupt this scenario. However, the bottom line is that SockJS was not designed for large amounts of latency in messages being received. There is an issue filed against sockjs-node that points to the underlying problem, which I'll describe.
From the server side, SockJS needs to know when a client connection is dead, so it can close resources that it's holding from the server side and let the application know it should execute whatever end-of-session logic it has. With xhr-streaming and xhr-polling, this obviously requires a timeout in some way; because the client will serially issue multiple xhr-streaming or (especially) xhr-polling requests over the course of a single connection, there are bound to be times when one request has ended and we're not sure whether another one is ever going to arrive. SockJS has a
disconnect_delay
option that lets you control how long to wait, the default is 5 seconds. This means that when the client ends an xhr-streaming/xhr-polling request, by default it has to issue a new request with 5 seconds or else it will be considered disconnected.The problem is that the 5 second timer begins not when the current request disconnects, but when the last (asynchronous) message send begins. So if that message takes a long time to deliver, as in this case, the timer will have long since expired even if the client quickly starts a new xhr request.
With websockets, there's one continuous socket the whole time, so there isn't this particular problem. However, there's (apparently) a similar situation, where the client may have gone away but we never got a TCP
FIN
orRST
or whatever; SockJS wants to know in a timely manner when this situation has occurred. The websocket transport is unique among the transport types in that when a heartbeat packet is sent from server to client (by default every 25 seconds), there is a hardcoded time limit of 10 seconds by which the client must acknowledge the heartbeat by sending a message back to the server. Similar to the xhr problem, this 10 second timer starts when the heartbeat is queued, not when it makes it to the client; and in our repro case above, it's going to be a couple of minutes between when the heartbeat is queued and when it is delivered.Finally, there's one more timeout that affects the scenario above. In main.js we were calling
socket.setTimeout(45*1000)
on every HTTP request we receive. This was intended to be an http keepalive timeout, so idle sockets don't stay connected to us too long. However, this timeout doesn't seem to work the way I expected; it terminates 45 seconds after data is written to the socket object, regardless of how long it takes for the data to actually be sent. (It was hard to find the right Node.js issue for this problem as there seems to be some different bugged behavior that used to happen when you use the http library on top, but this seems like it may be it: nodejs/node#3319)Workaround
This pull request merely lets the admin configure these timeouts by hand. Whatever the longest payload-delivery is expected, at least that number of seconds should be set for
sockjs_disconect_delay
. Andhttp_keepalive_timeout
should be that number plus another buffer, like 20% or 20 seconds, whichever is less.Also,
disable_websockets on;
should be used, since the heartbeat response timeout for websockets is hardcoded and can't be changed.For the repro above, this works for me:
Potential future work
response#end
andsocket#write
both have callbacks that, in this test case at least, fire when the data is fully sent on the socket (so in this case, 2 minutes after write). So it seems like it'd be possible to implement thehttp_keepalive_timeout
the "right" way, by monkeypatching the socket'swrite
andsetTimeout
methods.In multiple conversations with @trestletech and now @jmcphers we've questioned whether SockJS and websockets in general are worth the trouble, given how smoothly RStudio's long-polling works. They might not be, and switching to long-polling would be much simpler and give us much more control over these kinds of issues. However, there is the big disadvantage for SS/SSP that the number of outstanding long-polling requests you can have open to a single hostname are quite limited (usually 6), while the limit for WebSockets is much larger (like 255). It may be that HTTP/2 will solve this problem for us, seems worth investigating.