NewExponentialJitterBackoff calculates maxBackoffAttempts to a negative number #2227

Closed
minj131 opened this issue Aug 11, 2023 · 3 comments

@minj131
minj131 commented Aug 11, 2023

Describe the bug

Passing any value less than one second, such as retry.NewExponentialJitterBackoff(100 * time.Millisecond), results in a negative maxBackoffAttempts when using NewExponentialJitterBackoff.

Expected Behavior

The retryer should retry with a jittered delay, not the max backoff, until the max backoff attempt count is reached.

Current Behavior

Output logs

on attempt 1 which is greater than maxBackoffAttempts -3: waiting for 100ms backoff
retryDelay is  100ms
on attempt 2 which is greater than maxBackoffAttempts -3: waiting for 100ms backoff
retryDelay is  100ms
on attempt 3 which is greater than maxBackoffAttempts -3: waiting for 100ms backoff
retryDelay is  100ms
on attempt 4 which is greater than maxBackoffAttempts -3: waiting for 100ms backoff
retryDelay is  100ms

Reproduction Steps

I create a custom retryer like below

retryer := retry.NewStandard(func(o *retry.StandardOptions) {
	o.MaxAttempts = 5
	o.Backoff = retry.NewExponentialJitterBackoff(100 * time.Millisecond)
})
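
For reference, here is a minimal sketch of how such a retryer gets wired into a client config (illustrative wiring, not the exact code from my service):

package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/config"
)

func main() {
	// Build a fresh standard retryer per client with the 100ms max backoff.
	cfg, err := config.LoadDefaultConfig(context.TODO(),
		config.WithRetryer(func() aws.Retryer {
			return retry.NewStandard(func(o *retry.StandardOptions) {
				o.MaxAttempts = 5
				o.Backoff = retry.NewExponentialJitterBackoff(100 * time.Millisecond)
			})
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	_ = cfg // the config is then passed to a service client, e.g. s3.NewFromConfig(cfg)
}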

NewExponentialJitterBackoff constructs the ExponentialJitterBackoff as

// NewExponentialJitterBackoff returns an ExponentialJitterBackoff configured
// for the max backoff.
func NewExponentialJitterBackoff(maxBackoff time.Duration) *ExponentialJitterBackoff {
	return &ExponentialJitterBackoff{
		maxBackoff: maxBackoff,
		maxBackoffAttempts: math.Log2(
			float64(maxBackoff) / float64(time.Second)),
		randFloat64: rand.CryptoRandFloat64,
	}
}

And retries run this logic

// BackoffDelay returns the duration to wait before the next attempt should be
// made. Returns an error if unable to get a duration.
func (j *ExponentialJitterBackoff) BackoffDelay(attempt int, err error) (time.Duration, error) {
	if attempt > int(j.maxBackoffAttempts) {
		fmt.Println("on attempt", attempt, int(j.maxBackoffAttempts))
		return j.maxBackoff, nil
	}

	b, err := j.randFloat64()
	if err != nil {
		return 0, err
	}

	// [0.0, 1.0) * 2 ^ attempts
	ri := int64(1 << uint64(attempt))
	delaySeconds := b * float64(ri)

	return timeconv.FloatSecondsDur(delaySeconds), nil
}
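
For scale, here is a standalone trace of what that jitter branch computes when it is reached (not SDK code; it assumes a fixed b = 0.5 in place of the crypto-random value):

package main

import (
	"fmt"
	"time"
)

// jitterDelay mirrors the arithmetic of the jitter branch above:
// delaySeconds = b * 2^attempt, with b drawn from [0.0, 1.0).
func jitterDelay(attempt int, b float64) time.Duration {
	ri := int64(1) << uint64(attempt)
	return time.Duration(b * float64(ri) * float64(time.Second))
}

func main() {
	for attempt := 1; attempt <= 4; attempt++ {
		fmt.Println(attempt, jitterDelay(attempt, 0.5)) // 1s, 2s, 4s, 8s
	}
}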

However, math.Log2(float64(100 * time.Millisecond) / float64(time.Second)) evaluates to a negative number, so this check

if attempt > int(j.maxBackoffAttempts) {
	fmt.Println("on attempt", attempt, int(j.maxBackoffAttempts))
	return j.maxBackoff, nil
}

always evaluates to true, so every retry waits the full maxBackoff.
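
A quick standalone check of the truncation and comparison involved (again not SDK code):

package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	maxBackoffAttempts := math.Log2(float64(100*time.Millisecond) / float64(time.Second))
	fmt.Println(maxBackoffAttempts)          // -3.321928094887362
	fmt.Println(int(maxBackoffAttempts))     // -3 (conversion truncates toward zero)
	fmt.Println(1 > int(maxBackoffAttempts)) // true: even the first attempt takes the max-backoff branch
}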

Possible Solution

No response

Additional Information/Context

No response

AWS Go SDK V2 Module Versions Used

github.com/aws/aws-sdk-go-v2 v1.17.8
github.com/aws/aws-sdk-go-v2/config v1.18.21
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.13.2
github.com/aws/aws-sdk-go-v2/service/s3 v1.32.0
github.com/aws/smithy-go v1.13.5

Compiler and Version used

go version go1.20.4 darwin/amd64

Operating System and version

macOS Ventura 13.4

@minj131 minj131 added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Aug 11, 2023
@RanVaknin RanVaknin assigned RanVaknin and ajredniwja and unassigned RanVaknin Aug 14, 2023
@RanVaknin RanVaknin assigned RanVaknin and unassigned ajredniwja Sep 1, 2023
@RanVaknin
Contributor

Hi @minj131 ,

Thanks for opening this issue; this is obviously a bug:

package main

import (
	"fmt"
	"math"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws/retry"
)

func main() {
	backoff := retry.NewExponentialJitterBackoff(100 * time.Millisecond)
	_ = backoff // maxBackoffAttempts is unexported, so reproduce its calculation directly:
	fmt.Println("maxBackoffAttempts:", math.Log2(float64(100*time.Millisecond)/float64(time.Second)))
}
// results in
// maxBackoffAttempts: -3.321928094887362

I'm more curious about the use case: why would you want a max backoff as low as 100ms? I'd probably throttle my requests in the seconds range rather than milliseconds.

Thanks,
Ran~

@RanVaknin RanVaknin added p3 This is a minor priority issue queued This issue is on the AWS team's backlog and removed needs-triage This issue or PR still needs to be triaged. labels Sep 1, 2023
@lucix-aws
Contributor

lucix-aws commented Nov 27, 2023

Your observation of the behavior is accurate, but the value being calculated as < 0 doesn't really have a detrimental effect on the overall retry loop.

In the strictest sense, the retry delay for a given attempt in seconds is supposed to be calculated like so:

t_i = min(b * r^i, MAX_BACKOFF)
// i = attempt count
// b = random float in [0.0, 1.0)
// r = 2 (the exponential base)
// MAX_BACKOFF configured by the caller

Evidently we previously applied an optimization here: precompute the attempt count at which the delay reaches the maximum, so that the arithmetic can be skipped past that point.

In doing so we've introduced a quirk relative to the specified behavior: we "stick" to the maximum once we've crossed that number of attempts. This deviates from the specification, since b could conceivably be small enough that t_i sometimes evaluates to less than MAX_BACKOFF even at large values of i. A side effect is that any max backoff below time.Second causes us to snap to that value immediately, as you've observed, because the Log2 calculation is < 0 for any max backoff less than time.Second.
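
For illustration, a sketch of what the strict, per-attempt calculation would look like (not our actual implementation; math/rand stands in for the crypto-random source):

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// strictBackoffDelay computes t_i = min(b * 2^i, maxBackoff) on every call,
// so a small enough b can still yield a sub-maximum delay at large i.
func strictBackoffDelay(attempt int, maxBackoff time.Duration) time.Duration {
	b := rand.Float64() // [0.0, 1.0)
	d := time.Duration(b * float64(uint64(1)<<uint64(attempt)) * float64(time.Second))
	if d > maxBackoff {
		return maxBackoff
	}
	return d
}

func main() {
	for attempt := 1; attempt <= 5; attempt++ {
		fmt.Println(attempt, strictBackoffDelay(attempt, 100*time.Millisecond))
	}
}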

Regardless, I view this as a minor and established quirk. Changing it now within this major version would affect the cadence of retried calls, and I find it hard to justify doing so at this point.

@lucix-aws lucix-aws closed this as not planned Nov 27, 2023
@lucix-aws lucix-aws removed queued This issue is on the AWS team's backlog bug This issue is a bug. p3 This is a minor priority issue labels Nov 27, 2023