About ClientPolicy.MinConnectionsPerNode vs server's proto-fd-idle-ms #374

Open

xqzhang2015 opened this issue May 25, 2022 · 17 comments

xqzhang2015 commented May 25, 2022

Client

  • Go 1.15.3
  • Aerospike: v4.5.0

Issue

Here is the drop-idle-connections logic.

// DropIdle closes all idle connections.

For example, if we set ClientPolicy.MinConnectionsPerNode to 5, those 5 idle connections may never get a chance to be checked for idle timeout. Of course, this only happens when Aerospike access is not busy.

For these idle conns, once one has exceeded the server's proto-fd-idle-ms, the Aerospike server may proactively close it without the client realizing it. The client will then get an error like:

Node BB9353C99FXXX 10.y.y.y:3000: write tcp 10.x.x.x:46554->10.y.y.y:3000: write: broken pipe
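To illustrate the setup that triggers this, here is a minimal sketch (the host address is a placeholder, and the exact module path depends on the client version):

package main

import (
	"log"
	"time"

	as "github.com/aerospike/aerospike-client-go"
)

func main() {
	policy := as.NewClientPolicy()
	policy.MinConnectionsPerNode = 5      // pool always keeps at least 5 conns per node
	policy.IdleTimeout = 55 * time.Second // client-side idle limit; should stay below proto-fd-idle-ms

	client, err := as.NewClientWithPolicy(policy, "10.y.y.y", 3000)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Under low traffic, the 5 minimum connections can sit unused longer than
	// the server's proto-fd-idle-ms; the server reaps them, and the next write
	// on such a stale socket fails with "write: broken pipe".
}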

Question

Is it possible to fully iterate the conn pool and clean up the idle ones?


khaf commented May 25, 2022

You being on v1.x is the bigger news to me :)
The assumption has been that those connections are dropped and the command is usually retried. Is that not the case?


xqzhang2015 commented May 25, 2022

Yes, it's retried.

Let me provide more details:

  1. The broken pipe log comes from this line:
    logger.Logger.Debug("Node " + cmd.node.String() + ": " + err.Error())
  2. The command is Get():
    func (clnt *Client) Get(policy *BasePolicy, key *Key, binNames ...string) (*Record, error) {
  3. The command policy uses the default values: MaxRetries=2, SleepBetweenRetries=1ms.
    Because the conn pool is 5 (so every attempt can pick up a stale conn) and the initial execution plus the retries make 3 attempts in total, I get 3 error logs, each 1ms apart:
I0520 02:44:42.754393 command.go:2169] [aerospike] Node BB9353C99F5580E 10.y.y.y:3000: write tcp 10.x.x.x:57552->10.y.y.y:3000: write: broken pipe
I0520 02:44:42.755625 command.go:2169] [aerospike] Node BB90D6223B88B0A 10.y2.y2.y2:3000: write tcp 10.x.x.x:54522->10.y2.y2.y2:3000: write: broken pipe
I0520 02:44:42.756857 command.go:2169] [aerospike] Node BB9353C99F5580E 10.y3.y3.y3:3000: write tcp 10.x.x.x:33098->10.y3.y3.y3:3000: write: broken pipe

Then the last "write: broken pipe" error is returned by Client.Get().
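For reference, those defaults spelled out explicitly. This is a sketch continuing from the client in the earlier snippet; the namespace, set, and key names are placeholders:

readPolicy := as.NewPolicy()
readPolicy.MaxRetries = 2                             // 1 initial attempt + 2 retries = 3 executions
readPolicy.SleepBetweenRetries = 1 * time.Millisecond // matches the 1ms spacing of the logs above

key, err := as.NewKey("test", "demo", "user-1")
if err != nil {
	log.Fatal(err)
}
rec, err := client.Get(readPolicy, key)
if err != nil {
	// With 5 potentially stale pooled conns, all 3 attempts can hit
	// "write: broken pipe", and the last error surfaces here.
	log.Printf("Get failed after retries: %v", err)
}
_ = rec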

xqzhang2015 (Author) commented:

@khaf any further comment?


khaf commented May 30, 2022

I think I understand the issue, I just need to implement the solution. ETA this week.

xqzhang2015 (Author) commented:

@khaf thanks so much. Looking forward to good news.


xqzhang2015 commented Jun 28, 2022

@khaf any update on this thread?

We recently encountered another issue because of this.

Issue

Our Aerospike cluster has 48 nodes. One client sets ClientPolicy.MinConnectionsPerNode to 100, so the expected number of connections to the cluster is 4800, and indeed nodeStats.ConnectionsOpen reports 4800.
But netstat -apn | grep ":3000" | wc -l shows only about ~1300.
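One way to observe the discrepancy from the client side (a hedged sketch: Client.Stats() exists in recent client versions, but the exact keys in the returned map vary by version, so treat the field layout as an assumption):

stats, err := client.Stats()
if err != nil {
	log.Fatal(err)
}
for node, nodeStats := range stats {
	// Compare the client's open-connections counter per node against
	// what the OS actually reports, e.g.:
	//   netstat -apn | grep ":3000" | wc -l
	log.Printf("node %s: %v", node, nodeStats)
}
// A large gap suggests the pool still counts connections that the
// server has already reaped.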

Analysis

We set min-conns to handle peak traffic. Under normal traffic we don't need that many conns, so some of them sit idle past the idle timeout. But because of min-conns, those conns never get a chance to be closed, even once they have idle-timed out. They are only touched again at the next traffic peak, which may then produce Aerospike access errors or trigger the creation of many new conns.

Proposal

  • Does it work to simply iterate all heaps and drop the idle-timed-out conns? Is that a good option? (A self-contained sketch of the idea follows after the snippet.)
// Proposed: walk every heap, repeatedly dropping from its tail until
// no more idle-timed-out connections remain there.
func (h *connectionHeap) DropIdle() {
	for i := 0; i < len(h.heaps); i++ {
		for h.heaps[i].DropIdleTail() {
		}
	}
}
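To make the idea concrete, here is a self-contained sketch of what a DropIdleTail step could look like; the types and field names are illustrative stand-ins, not the client's actual internals:

package main

import (
	"fmt"
	"time"
)

// Illustrative stand-ins for the client's internal types.
type conn struct {
	lastUsed time.Time
}

type singleConnectionHeap struct {
	conns       []*conn // index 0 is the tail: the least recently used conn
	idleTimeout time.Duration
}

// DropIdleTail removes the least-recently-used connection if it has been
// idle longer than idleTimeout. It returns true when it dropped one, so a
// caller can loop until the tail is fresh again.
func (h *singleConnectionHeap) DropIdleTail() bool {
	if len(h.conns) == 0 {
		return false
	}
	tail := h.conns[0]
	if time.Since(tail.lastUsed) < h.idleTimeout {
		return false // every newer conn is fresher; nothing left to drop
	}
	h.conns = h.conns[1:] // close and discard the stale tail
	return true
}

func main() {
	h := &singleConnectionHeap{
		idleTimeout: time.Second,
		conns: []*conn{
			{lastUsed: time.Now().Add(-5 * time.Second)}, // stale
			{lastUsed: time.Now()},                       // fresh
		},
	}
	for h.DropIdleTail() { // the loop from the proposal above
	}
	fmt.Println("remaining conns:", len(h.conns)) // prints: remaining conns: 1
}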


khaf commented Jun 28, 2022

Sorry, another high-priority issue popped up that I had to take care of. I'll release the fix for this issue tomorrow.

xqzhang2015 (Author) commented:

@khaf thanks for the update. We are using both v4.x and v5.x clients. Looking forward to the release.


khaf commented Jun 30, 2022

I haven't forgotten about this ticket. The solutions that I came up with this week have not been satisfactory. Since this is an important feature of the client and I'll have to port it forward to the newer versions (and quite possibly to other clients as well), I need to take my time. But this is my current active issue, so I will solve it before moving on to other work.
Tell me, are you using Go modules yet? Should I make a v4 branch with module support for the client?

xqzhang2015 (Author) commented:

@khaf Yes, we're using Go mod. BTW, it seems the v4 branch currently works well with Go mod.


khaf commented Jul 13, 2022

OK, after writing and testing the code for this in the client, it turned out in an engineering meeting that the correct resolution for this issue is to set the server idle time to 0, so the server never reaps the connections on its side. Please let me know how that goes.
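For reference, that server setting lives in the service context of aerospike.conf (this snippet is an example, not a full config):

service {
    proto-fd-idle-ms 0    # 0 disables server-side reaping of idle client connections
}

It can also be changed dynamically, e.g. with asinfo -v 'set-config:context=service;proto-fd-idle-ms=0', if your deployment allows it.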


khaf commented May 31, 2023

Closing this. Feel free to reopen or file a new ticket in case you have further questions.

khaf closed this as completed May 31, 2023
xqzhang2015 (Author) commented:

@khaf We just re-encountered this issue recently. Previously we had just disabled the conn pool in not-so-busy environments.

I understand the idea of setting the server idle timeout to 0.
But there is a challenge: our Aerospike cluster is shared by multiple services and teams, so it is hard to make sure that every service/team closes its idle connections promptly and uses system resources efficiently.

Is it possible to re-evaluate the solution?

BTW, if the current solution is still the preferred one, is it possible to add a comment to IdleTimeout or MinConnectionsPerNode, to avoid further misunderstandings?

Thanks

khaf reopened this Jul 31, 2023

khaf commented Jul 31, 2023

I reopened the issue, so that I can track it. I don't know if the product or engineering will accept any change or addition to the current solution, but I'll bring it up.

For the comments on ClientPolicy, what would you like to be added for clarification?

ps: What version of the client are you currently on?

xqzhang2015 (Author) commented:

@khaf sorry for the late reply.

For the comment, something like: "When MinConnectionsPerNode is enabled, the server side must set the idle time to 0."

And my APP version is v4.5.0.
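Expanded into a Go doc comment on the field, the suggested note might read like this (proposed wording, not the client's current documentation):

// MinConnectionsPerNode sets the minimum number of connections the pool
// keeps open per server node, regardless of how long they sit idle.
//
// NOTE: Because these minimum connections are exempt from the client's
// IdleTimeout reaping, the server's proto-fd-idle-ms should be set to 0
// when MinConnectionsPerNode > 0; otherwise the server may close them
// while idle, and subsequent commands will fail (and be retried) with
// errors such as "write: broken pipe".
MinConnectionsPerNode int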

xqzhang2015 (Author) commented:

@khaf any update for this thread?

nkonopinski commented:
Any update here? I'm running client 6.13 and seeing the number of connections drop below MinConnectionsPerNode.
The server is CE 6.3.0.1 with proto-fd-idle-ms set to the default of 0.
Connections drop too low, and when traffic spikes my requests time out.
