Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to random load balance to >1 server nodes #631

Open
wants to merge 3 commits into
base: dev
Choose a base branch
from

Conversation

kamran-m
Copy link

This closes #630

I have also verified the change for a client => server with low number of client/server nodes, where the imbalance was more severe (note that here I used peerConnectionCount>= numPeers=5, which is the extreme case but it's enough to verify the solution):

screen shot 2017-06-16 at 10 51 25 pm

@CLAassistant
Copy link

CLAassistant commented Jun 17, 2017

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@kamran-m kamran-m requested a review from prashantv June 21, 2017 21:37
@blampe blampe requested review from kriskowal and akshayjshah June 21, 2017 23:03
Copy link
Contributor

@prashantv prashantv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution @kamran-m

Some changes I'd like to see:

  • Benchmarking to show the performance before and after this change, since currently there's some perf regressions.
  • Tests that cover all the different cases (peerConnectionCount = 0, peerConnectionCount = 1, peerConnectionCount < available peers, peerConnectionCount > available peers, etc)

peer.go Outdated
@@ -175,8 +186,8 @@ func (l *PeerList) Remove(hostPort string) error {
return nil
}
func (l *PeerList) choosePeer(prevSelected map[string]struct{}, avoidHost bool) *Peer {
var psPopList []*peerScore
var ps *peerScore
var chosenPSList []*peerScore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

presize chosenPSList to avoid a bunch of allocations on each selection, especially when it's > 1

peer.go Outdated
ps.chosenCount.Inc()
return ps.Peer
}

func randomSampling(psList []*peerScore) *peerScore {
peerRand := trand.NewSeeded()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please do not create a new rand for every single peer selection request.

peer.go Outdated
}

ps := randomSampling(chosenPSList)
if ps == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why would we need this check?

@kamran-m
Copy link
Author

Thanks for the review @prashantv , I addressed your comments. I also added some benchmark tests with these results:

BenchmarkGetPeerWithPeerConnectionCount1  	 5000000	       348 ns/op
BenchmarkGetPeerWithPeerConnectionCount10 	 1000000	      1490 ns/op

and also used the same test in the old code to compare with peerConnectionCount = 1 case above:

BenchmarkGetPeerOld-8   	 5000000	       267 ns/op

Please let me know if the 267 => 348 is of any concern.

@akshayjshah akshayjshah removed their request for review July 5, 2017 16:40
@kriskowal kriskowal removed their request for review July 5, 2017 23:53
Copy link
Contributor

@prashantv prashantv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay in reviewing @kamran-m, I was OOO and was catching up after I got back.

I've been thinking a little bit about whether this is the right way to do it -- I don't know we should put every strategy into the TChannel core library. For example, see another strategy that we'd like: #640

If instead of this change, we just ensured that TChannel had 2 connected peers, then:

  • No changes are needed to TChannel
  • You get the standard peer selection (least pending, with round-robin fallback if the scores are the same).

That would likely have even better load distribution.

peer.go Outdated
@@ -48,6 +47,9 @@ var (
// ErrNoNewPeers indicates that no previously unselected peer is available.
ErrNoNewPeers = errors.New("no new peer available")

// ErrZeroPeerConnectionCount indicates that the peer connection count is set to zero.
ErrZeroPeerConnectionCount = errors.New("peer connection count must be greater than 0")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason to export this error -- once you export an error, it's part of the public API and we can never change it. There shouldn't be a good reason for a user to compare a returned error against this, so it shouldn't be exported.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense!


for i := 0; i < b.N; i++ {
peer, _ := ch.Peers().Get(nil)
if peer == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this ever happen? If not, maybe we should do a t.Fatal instead of a println

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't, I just added it to guard against compiler optimization not to artificially lower the runtime of the benchmark. Changed it to Fatal!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't, I just added it to guard against compiler optimization not to artificially lower the runtime of the benchmark. Changed it to Fatal!

peer_test.go Outdated

func BenchmarkGetPeerWithPeerConnectionCount10(b *testing.B) {
numPeers := 10
peerConnectionCount := uint32(10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of copying + pasting the whole benchmark, abstract common logic out into a function, and then call it with different args for the peerConnectionCount and maybe even numPeers

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

}{
// the higher `peerConnectionCount` is, the smoother the impact of uneven scores
// become as we are random sampling among `peerConnectionCount` peers
{numPeers: 10, peerConnectionCount: 1, distMin: 1000, distMax: 1000},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add peerConnectionCount: 2, i imagine this will be a pretty small value normally?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed as the impact of it get's smaller with every extra connection. Added the case for 2!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed as the impact of it get's smaller with every extra connection. Added the case for 2!

for _, tc := range testCases {
// Selected is a map from rank -> [peer, count]
// It tracks how often a peer gets selected at a specific rank.
selected := make([]map[string]int, tc.numPeers)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you make a slice here, why can't we just create a singel map in the loop for numIterations, and do the testDistribution call right there too? Makes 3 loops 1 loop, removes a slice, and simplifies the test a little.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if I follow. This is a similar test as TestPeerSelectionRanking where we are checking the distribution of the ranking after numIterations which means we have to do the checking outside of the loop. Or am I missing something here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if I follow. This is a similar test as TestPeerSelectionRanking where we are checking the distribution of the ranking after numIterations which means we have to do the checking outside of the loop. Or am I missing something here?

peer_test.go Outdated
b.ResetTimer()

for i := 0; i < b.N; i++ {
peer, _ := ch.Peers().Get(nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should think about testing the impact of passing a non-nil blacklist with peerConnectionCount > 1

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean if prevSelected actually has some selected peers? I am assuming if a client uses peerConnectionCount > 1, it should keep it into account that the prevSelected peers is going to be a bigger list. Or do you have another specific issue in mind?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean if prevSelected actually has some selected peers? I am assuming if a client uses peerConnectionCount > 1, it should keep it into account that the prevSelected peers is going to be a bigger list. Or do you have another specific issue in mind?

Adding possibility to set `peerConnectionCount` so that a single client
node can do a random load balancing among `peerConnectionCount` top
nodes using the score calculator and peer heap.
@kamran-m kamran-m force-pushed the rand-load-balance branch from f1d327f to f8f2139 Compare August 8, 2017 06:47
@kamran-m
Copy link
Author

kamran-m commented Aug 8, 2017

I agree with the idea that the core should not be strategy aware. As I commented on #630 I think all the logic regarding peer connection count, score strategy, peer heap, etc should move to something like PeerSelectionStrategy.

I am not sure why you think with 2 connected peers we won't need this core change and also get a better load distribution. This change is basically letting the client decide how many connected peers they want. So in cases, like ours, were we have a very low number of server/client nodes, it might actually be helpful to use 5 connected peers as you can verify with the stats I posted above.

Let me know your thoughts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Load balance traffic to N>1 server nodes
3 participants