Too-large values of request_pool and put_concurrency yield 'disconnected' errors #616
Comments
Increasing pb_backlog to 64 in Riak's app.config is a workaround. The behavior was introduced in commit 3a07c25. However, it's not actually a bug per se: this scenario can happen often enough. For example, if Riak dies for some reason or is otherwise unavailable to CS, the same error will crop up. The best thing is probably to handle it and return a 503. We will be mumbling about it today. |
Does the test complete fine before 3a07c25? |
My guess is no. In that case it probably crashes the proc as you suggest in that commit. I will give it a try though. |
It actually completes with no errors on release/1.3 |
I didn't notice, but shino also confirmed that. |
Steps to fix this type of issue:
|
New data: lowering the request_pool to 128 also avoids this error. Debugging further, but it seems that the poolboy procs never actually reach the full 400 connections during initialization. <3 netstat |
I finally found the issue. When you have a large number of connection attempts and a low backlog on the accept process, the kernel sends RST packets that tell riakc_pb_client to close the socket and re-attempt the connection. However, if you attempt many more connections than the backlog allows at the same time (thundering herd), the kernel actually rate limits RST packets in order to slow down attackers. Therefore all 400 processes think they are connected to Riak, and only get a RST when they try to use the connection. What we would like to happen, when testing locally, is for the clients to know they are not connected, close the socket, and retry the connection. That way, when the test is run, we don't get {error, disconnected}. On my OSX machine the limit was set at 250, and hence 400 connections and a backlog of 5 cause this issue every time. Log messages show up in /var/log/system.log that look like the following:
There are 2 ways to fix this issue.
Issue closed. I'm out. Peace. |
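The behavior wished for above (clients notice they are not really connected, close the socket, and retry) can be sketched roughly as follows. This is a minimal illustration in Python, not the riakc or Riak CS code: `connect_verified`, its `connect`/`ping`/`close` callables, and the `Disconnected` exception are hypothetical names standing in for a real client and its `{error, disconnected}` result.

```python
import time


class Disconnected(Exception):
    """Hypothetical stand-in for an {error, disconnected} result."""


def connect_verified(connect, ping, close=lambda conn: None,
                     attempts=5, base_delay=0.05):
    """Open a connection and prove it is usable before handing it out.

    connect() returns a connection object, ping(conn) does one cheap
    round trip and raises Disconnected if the peer reset the socket,
    and close(conn) releases it. All three are supplied by the caller.
    """
    for attempt in range(attempts):
        conn = connect()
        try:
            ping(conn)
            return conn
        except Disconnected:
            # The TCP handshake appeared to succeed, but the first real
            # use failed: drop the socket and back off before retrying,
            # so a herd of clients does not reconnect in lockstep.
            close(conn)
            time.sleep(base_delay * (2 ** attempt))
    raise Disconnected(f"no usable connection after {attempts} attempts")
```

The point of the round trip is that a successful connect alone proves nothing when the RST was suppressed; only the first use reveals the dead socket, so the check happens before the connection is handed to a worker.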
Not to flog a dead horse or comment on a closed ticket ... client connections between CS <-> Riak can fail at inconvenient times. The crazy OS BS that you found (and which I'd never suspected existed) is only one of those ways. IMO, CS needs to behave better during those inconvenient cases. I suspect that the problems as yet unresolved by #519 cover a subset of "behave better". Having said that, I've also created #660 |
Ha. @slfritchie I may have been a bit overzealous in closing this ticket, considering it was Friday evening before the long weekend :) However, there is still an open question of what to do here. I'm not convinced that trying to figure out every place where {error, disconnected} occurs and fixing it is the way to go. That goes doubly so after reading the source ;) While it's simple enough to fix this for the obvious block_server error on the first block, as @reiddraper mentioned to me, if this happens on later blocks the header has already been sent. In this case a crash seems reasonable. I'll try to get some more discussion/consensus going and maybe I'll make some changes to handle these more gracefully. But in general, if your local riak instance isn't connected to CS something has gone horribly wrong. Putting much more effort into this seems to have diminishing returns. |
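The first-block versus later-block distinction discussed above can be illustrated with a small sketch. It is Python rather than the Riak CS Erlang code, and `serve_object`, `fetch_block`, and `Disconnected` are hypothetical names; it assumes an object has at least one block and that the response header goes out once the first block has been fetched.

```python
class Disconnected(Exception):
    """Hypothetical stand-in for an {error, disconnected} result."""


def serve_object(fetch_block, n_blocks):
    """Return (status, chunks, aborted) for an object of n_blocks >= 1.

    fetch_block(i) returns block i or raises Disconnected.
    """
    try:
        first = fetch_block(0)
    except Disconnected:
        # Nothing has been sent yet, so a clean 503 is still possible.
        return 503, [], False

    # From here on the 200 header is considered sent.
    chunks = [first]
    for i in range(1, n_blocks):
        try:
            chunks.append(fetch_block(i))
        except Disconnected:
            # Too late to change the status code: all that is left is
            # to abort the stream (the "crash seems reasonable" case).
            return 200, chunks, True
    return 200, chunks, False
```

A failure on block 0 yields a retryable 503, while a failure on any later block can only truncate a response that already carries a 200.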
Using Riak + Riak CS on my Mac in single node dev configuration, plus these changes to the Riak CS vm.args (-env ERL_MAX_PORTS 16000) and these to app.config:
Then I see the following errors when running basho_bench with a 100% insert workload.
... and lots of complaints from basho_bench:
Here is the basho_bench config that I used: