Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Business calls keep failing on non running instance using load-balancer=least-response-time #233

Closed
vsevel opened this issue Feb 10, 2022 · 10 comments · Fixed by #234
Closed

Comments

@vsevel
Copy link
Contributor

vsevel commented Feb 10, 2022

My discovery returns 2 service instances:

  • 8080 (running)
  • 8090 (not running)

My client application repeatedly calls this service. The first call succeeds, the subsequent calls all fail.

To reproduce:

mvn clean install
java -Dquarkus.http.port=8080 -jar target/quarkus-app/quarkus-run.jar
curl http://localhost:8080/hello/stork => SUCCESS
curl http://localhost:8080/hello/stork => KO
curl http://localhost:8080/hello/stork => KO
curl http://localhost:8080/hello/stork => KO
...

see https://github.com/vsevel/stork-issue/tree/issue_233

@michalszynkiewicz
Copy link
Collaborator

Isn't it an issue of non preserving the service instance ids?
The code below seems to assing a new id on every service discovery refresh

    private List<ServiceInstance> map(String list) {
        return Arrays.stream(list.split(",")).map(this::createInstance).collect(Collectors.toList());
    }

    private ServiceInstance createInstance(String hostPort) {
        String[] split = hostPort.split(":");
        String host = split[0];
        int port = Integer.parseInt(split[1]);
        return new DefaultServiceInstance(ServiceInstanceIds.next(), host, port, false);
    }

@michalszynkiewicz
Copy link
Collaborator

hmm... with 600s refresh period the refresh should not be called at all

@vsevel
Copy link
Contributor Author

vsevel commented Feb 10, 2022

The code below seems to assing a new id on every service discovery refresh

what is the contract on service ids?
I tried to cache the service instance id across refresh per host-port:

    MyServiceDiscoveryProviderConfiguration config;
...
    private ServiceInstance createInstance(String hostPort) {

        String[] split = hostPort.split(":");
        String host = split[0];
        int port = Integer.parseInt(split[1]);

        if (!ids.containsKey(hostPort)) {
            ids.putIfAbsent(hostPort, ServiceInstanceIds.next());
        }
        long id = ids.get(hostPort);

        return new DefaultServiceInstance(id, host, port, false);
    }

but the test keeps failing.

on the first call (instance 0, which is up) the call succeeds
on the second call (instance 1, which is down) the call fails, and we store newTotalTime=60000.0 and newWeightSum=1.0
on the third call we choose the failing instance because:

  • instance 0 (up): score=3.39907167E7
  • instance 1 (down): score=60000.0, which is the errorPenalty

I think may be the issue is that the errorPenalty is in ms, whereas the timeInNs is in nanos

@vsevel
Copy link
Contributor Author

vsevel commented Feb 10, 2022

I confirm that by changing that line in the LeastResponseTimeLoadBalancer fixes the problem:
long errorPenalty = DurationUtils.parseDuration(config.getErrorPenalty()).toNanos();

note that caching the service id is probably a bad idea because an instance that failed once would be ignored forever I suspect. changing the service ids after each refresh will loose the stats, but allow failed instances to be retried.

the way we handled this in our impl was that failed instance would go on the "not_ok" list for 1 minute with no calls dispatch to it (unless there were no instances available at all), and come back into the "ok" list after spending this minute on the side.
if this instance was still failing. we would put it back into the "not_ok" list, and continue the cycle ...
since we were recognizing the idempotent situations, we would not mind trying out an instance that had failed, as long as we would not try to use it more than once per minute, until there was one successful call. then the instance would go back into the "ok" pool.

the the difference is that our default refresh period (3 mins) was unrelated to our purgatory period (1 min).
right now, the only way we get our failed instance to be retried is to reset stats, and my understanding is that the way to do this is to change the service ids during the discovery refresh. but it does not have to.

anyway, I am going to do a PR for the nanos issue.

@michalszynkiewicz
Copy link
Collaborator

michalszynkiewicz commented Feb 10, 2022

I confirm that by changing that line in the LeastResponseTimeLoadBalancer fixes the problem: long errorPenalty = DurationUtils.parseDuration(config.getErrorPenalty()).toNanos();

note that caching the service id is probably a bad idea because an instance that failed once would be ignored forever I suspect. changing the service ids after each refresh will loose the stats, but allow failed instances to be retried.

it's up to load balancer to decide when to reuse an instance that failed. The serviceIds should be preserved to let LB get a chance to make that decision

the way we handled this in our impl was that failed instance would go on the "not_ok" list for 1 minute with no calls dispatch to it (unless there were no instances available at all), and come back into the "ok" list after spending this minute on the side. if this instance was still failing. we would put it back into the "not_ok" list, and continue the cycle ... since we were recognizing the idempotent situations, we would not mind trying out an instance that had failed, as long as we would not try to use it more than once per minute, until there was one successful call. then the instance would go back into the "ok" pool.

the the difference is that our default refresh period (3 mins) was unrelated to our purgatory period (1 min). right now, the only way we get our failed instance to be retried is to reset stats, and my understanding is that the way to do this is to change the service ids during the discovery refresh. but it does not have to.

anyway, I am going to do a PR for the nanos issue.

great, thank you!

@vsevel
Copy link
Contributor Author

vsevel commented Feb 10, 2022

it's up to load balancer to decide when to reuse an instance that failed. The serviceIds should be preserved to let LB get a chance to make that decision

so I understand this is a must to preserve those ids.
I put the code back.

however, least-response-time seems to be designed to never retry a previously failed instance (unless all other instances would fail).
is this correct?
is this by design (considered a characteristic of this load balancer), or a non optimal behavior that shall be fixed?
that would not work for me, because we would be loosing load balancing after the first failure. is that correct?

@michalszynkiewicz
Copy link
Collaborator

here is how the least-response-time works now: #225

E.g. if your error-penalty is set to 60s, and declining-factor is 0.9, after another 60 other calls made, it's significance will be as if it was ca 100 ms response time at the most recent call.
So in time, the significance of the data collected in the past is lowered, hence even service instances that were failing, or had a long response time, a while back, will be retried in the future.

@michalszynkiewicz
Copy link
Collaborator

Both parameters are configurable

@vsevel
Copy link
Contributor Author

vsevel commented Feb 10, 2022

interesting. so if you have a high rate of outgoing calls (e.g. 20 per sec) the failed instance will be retried more often than if you have a low rate. I like better the predictable nature of the 1 minute penalty box, but this is not a big concern for me.

the ability to do automatic retries when possible (which means that the load balancer should be able to provide a list of prioritized eligible instances, and instruct on the situations where calls can be retried or not) is what I would be looking at next.

I am not an expert on MP fault tolerance, but one thing I would like to avoid is having all developers to annotate their @GET services with @Retry just because they know those services are idempotent as defined by the http spec. And even if we were asking them to do it, they would not know how many retries to set, because only the client side load balancing component knows how many instances are worth retrying.

I will expand those thoughts in #232

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants