Support per-operation timeouts in waiters #1862
Also, I think this has something to do with the |
Hi @gh-ghuang, After investigating this issue I'm pretty confident this is not an SDK-related bug. The waiter will hang until a response is given from the server, whereas the Retryer will retry 2 additional times if a retryable error is returned:
So the service has returned an error. You are expecting the retryer to retry more times, but what you really want to do is investigate why the service is returning an error. I suggest you do that either by enabling retryer logging:

```go
cfg, err := config.LoadDefaultConfig(context.TODO(),
	config.WithRegion("us-east-1"),
	config.WithClientLogMode(aws.LogRetries),
)
if err != nil {
	log.Fatalf("unable to load SDK config, %v", err)
}
```

or by putting a break point here and seeing what kind of error the service has returned.
The waiter will hang for the amount of time you tell it to (in your case 15 min) or until it gets a response from the service API. Let me know if that helps, |
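To make the mechanics above concrete, here is a simplified sketch of how a waiter applies maxWaitDur. This is only an illustration of the shape of the wait loop, not the SDK's actual generated code, and the allTerminated helper is hypothetical: one context deadline is shared by every attempt, so a single DescribeInstances call that never returns can consume the entire wait budget.

```go
package sketch

import (
	"context"
	"errors"
	"time"

	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

// waitSketch mimics the overall shape of a waiter's Wait: one deadline derived
// from maxWaitDur is shared by every attempt.
func waitSketch(ctx context.Context, client ec2.DescribeInstancesAPIClient,
	params *ec2.DescribeInstancesInput, maxWaitDur time.Duration) error {

	ctx, cancel := context.WithTimeout(ctx, maxWaitDur)
	defer cancel()

	for {
		// Blocks until the service responds or ctx hits its deadline.
		out, err := client.DescribeInstances(ctx, params)
		if err == nil && allTerminated(out) {
			return nil
		}

		// Sleep before the next attempt unless the overall deadline has already passed.
		select {
		case <-ctx.Done():
			return errors.New("exceeded max wait time for InstanceTerminated waiter")
		case <-time.After(15 * time.Second):
		}
	}
}

// allTerminated is a hypothetical stand-in for the waiter's state matcher.
func allTerminated(out *ec2.DescribeInstancesOutput) bool {
	for _, r := range out.Reservations {
		for _, i := range r.Instances {
			if i.State == nil || string(i.State.Name) != "terminated" {
				return false
			}
		}
	}
	return true
}
```

This is why, as discussed further down the thread, bounding each individual attempt requires either HTTP client timeouts, middleware, or wrapping the client.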
We pass 15 minutes as the maxWaitDur. Should maxWaitDur be used as the context timeout for a particular request? I wonder if it would be better if the per-request timeout were limited to, let's say, 30s. (Maybe we can create a per-request ctx.) |
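For illustration, the "per request ctx" idea amounts to something like the sketch below; describeOnce and the 30-second value are placeholders for discussion, not an existing SDK option.

```go
// Bound a single DescribeInstances attempt with its own deadline,
// independent of the waiter's overall maxWaitDur.
func describeOnce(ctx context.Context, client ec2.DescribeInstancesAPIClient,
	in *ec2.DescribeInstancesInput) (*ec2.DescribeInstancesOutput, error) {

	attemptCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	return client.DescribeInstances(attemptCtx, in)
}
```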
@RanVaknin I think the behavior of hanging a single request during a wait until the maxWaitDur elapses is unexpected. And what could be the underlying issue that would cause the request to take 15m to finish? Is it an expected SLA for this API to take this long? |
@gh-ghuang ,
Thanks, |
Here are some log traces I found after enabling logging. For some reason, the retry/request log lines are missing; the following is what I got when I only have LogWaitAttempts enabled:
Please note the 15 mins of elapsed time between the two log lines, which is the maxWaitDur we set.
No, not used in this case.
No, we don't have custom http configs, we just use the default. |
Hi @gh-ghuang, Thanks for your response. This is odd. Are you sure you are enabling logging correctly?

```go
func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO(),
		config.WithRegion("us-east-1"),
		config.WithClientLogMode(aws.LogRetries|aws.LogRequestWithBody|aws.LogResponseWithBody),
	)
	if err != nil {
		log.Fatalf("unable to load SDK config, %v", err)
	}

	client := ec2.NewFromConfig(cfg)
	waiter := ec2.NewInstanceTerminatedWaiter(client, func(options *ec2.InstanceTerminatedWaiterOptions) {
		options.LogWaitAttempts = true
	})

	err = waiter.Wait(context.Background(), &ec2.DescribeInstancesInput{InstanceIds: []string{"instanceID"}}, 15*time.Minute)
	if err != nil {
		panic(err)
	}
	fmt.Println("terminated successfully...")
}
```

It's a shot in the dark without logging, but here are some steps you can try:
You might want to try to configure your client HTTP options like so (here `http` appears to be the SDK's `aws/transport/http` package and `http2` an alias for the standard library `net/http`):

```go
client := ec2.NewFromConfig(cfg, func(options *ec2.Options) {
	options.HTTPClient = http.NewBuildableClient().WithDialerOptions(func(dialer *net.Dialer) {
		dialer.Timeout = 5 * time.Second
	}).WithTransportOptions(func(transport *http2.Transport) {
		transport.TLSHandshakeTimeout = 5 * time.Second
	})
})
```

Try to play around with the timeouts of the dialer and/or the TLS handshake timeout. Maybe that will help? Like I said, until we have logs I can't really tell. |
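One more knob worth knowing about, beyond what was suggested above: the dialer and TLS handshake timeouts only cover connection setup, while the standard net/http Transport's ResponseHeaderTimeout bounds how long the client waits for the server to start responding after the request is sent. A hedged sketch, assuming the same package aliases as the snippet above:

```go
client := ec2.NewFromConfig(cfg, func(options *ec2.Options) {
	options.HTTPClient = http.NewBuildableClient().WithTransportOptions(func(transport *http2.Transport) {
		// Fail the attempt if the server never starts sending response headers.
		transport.ResponseHeaderTimeout = 10 * time.Second
	})
})
```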
Does this log line look like what it should be when enabled by LogWaitAttempts? Is an empty request ID suspicious to you? Re: client config. |
Also, I have filed a support ticket, and attached the request ID in one of the failure cases. I will update here for findings from that support case. |
A support ticket with AWS? Those eventually end up in our internal queue, so I don't know if that's very helpful. Can you please share some of your code? I'm not sure why you are not seeing any logs. Thanks, |
I figured out why the log lines are missing. We have our own logger, and I wasn't setting the right log level. Please allow me some time to reproduce this; I will share the log lines when I hit this again. Somehow the latency has improved and this happens less frequently this week. I will keep running it until I hit one. |
I was able to reproduce this with (aws.LogRetries|aws.LogRequestWithBody|aws.LogResponseWithBody) enabled.
It's definitely hanging, and I don't think the request log here provides enough insight into why the request hangs. I think without a per-request timeout setting inside the waiter, this will happen from time to time and we end up spending the entire context timeout on just one attempt. |
Hi @gh-ghuang, What you are asking for is quite difficult for me to test because your resource setup is very complicated. In theory you should add a middleware stage after the last stage of the client request construction (the finalize step). In the middleware we can set the context timeout to 10 seconds, so whenever the client makes an API request it is capped at 10 seconds.

```go
myMiddleware := middleware.FinalizeMiddlewareFunc("MyTestMiddleware",
	func(ctx context.Context, input middleware.FinalizeInput, next middleware.FinalizeHandler,
	) (
		output middleware.FinalizeOutput, metadata middleware.Metadata, err error,
	) {
		// Derive the 10s deadline from the incoming ctx so the waiter's overall
		// maxWaitDur cancellation still propagates to the request.
		ctx2, cancel := context.WithTimeout(ctx, 10*time.Second)
		defer cancel()
		return next.HandleFinalize(ctx2, input)
	})

client := ec2.NewFromConfig(cfg)
waiter := ec2.NewInstanceTerminatedWaiter(client, func(options *ec2.InstanceTerminatedWaiterOptions) {
	options.LogWaitAttempts = true
	options.APIOptions = append(options.APIOptions, func(stack *middleware.Stack) error {
		return stack.Finalize.Add(myMiddleware, middleware.After)
	})
})
```

Let me know if this helps. Thanks, |
@gh-ghuang, By the way, I received your internal ticket. Since we are corresponding here, would you be able to close that one with the support team? Thanks, |
This issue has not received a response in 1 week. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled. |
I don't think the middleware solution is graceful enough for us, since it changes the behavior for all requests using that client. In our code base we construct one client and reuse it throughout the entire execution, so effectively it's kinda the same as configuring at the HTTP client level. I still think the waiter should provide an option to time out each attempt, separate from the context timeout. IIUC, the way the AWS Go SDK uses the context timeout works perfectly fine for a single request, but for a compound operation that potentially makes multiple requests under a single context timeout, not having the ability to control the per-attempt timeout is very counter-intuitive. Re: the AWS support ticket, I'd like to keep that one open for a little bit. |
Hi @gh-ghuang, I will discuss it with the team and see if we can add this as a feature request to the backlog. But in the spirit of full transparency: since there's a workaround, and since there is no bandwidth right now for new feature requests, if this is added to the backlog it will be addressed some time in the future. Just a friendly reminder, the SDK is an open source platform and we accept pull requests from community members. If you are able to push a fix for that desired behavior yourself, I strongly encourage you to do so. I will give my update to the technical account manager handling your internal case. Thank you very much, |
@RanVaknin I closed the support ticket in favor of moving the discussion here. Thanks for making this a feature request; I understand that it will sit in the backlog and be addressed in the future, and it's okay for us to wait for the AWS team to add this to the SDK in a future version. |
That's not the case in @RanVaknin's example, though - the middleware there is applied to the waiter's APIOptions, not the whole client. The imposed timeout will only be enforced on API calls made by the waiter, which accomplishes the desired behavior. I recognize this issue has shifted to a feature request for being able to configure a per-operation timeout in waiters, but it's very much possible to get that behavior today without direct configuration support from us, and it can absolutely be done without applying a blanket timeout on the client. Much like @RanVaknin said, I'm concerned that we're talking past the real issue here, i.e. the fact that individual DescribeInstances calls appear to hang for this long in the first place. @ghuang0570 I realize this issue has sat for a while - was there any followup on this on your end? |
Aside, this can be done without having to deal with middleware injection - remember the client itself is an input to the waiter, so we can just wrap one to enforce the timeout:

```go
// timed wraps an EC2 client and bounds each DescribeInstances call.
type timed struct {
	client ec2.DescribeInstancesAPIClient
}

func (t *timed) DescribeInstances(ctx context.Context, in *ec2.DescribeInstancesInput, opts ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second) // or make it configurable
	defer cancel()
	return t.client.DescribeInstances(ctx, in, opts...)
}

waiter := ec2.NewInstanceTerminatedWaiter(&timed{client: client})
```
|
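For completeness, a hedged usage sketch of the wrapper above: the instance ID is a placeholder, and the 15-minute budget is just the value discussed in this thread. Each attempt is capped at 10 seconds by the wrapper while the overall wait still honors maxWaitDur.

```go
// Assumes cfg was loaded via config.LoadDefaultConfig as in the earlier snippets.
client := ec2.NewFromConfig(cfg)
waiter := ec2.NewInstanceTerminatedWaiter(&timed{client: client}, func(options *ec2.InstanceTerminatedWaiterOptions) {
	options.LogWaitAttempts = true
})

// Each DescribeInstances attempt is bounded to 10s; the wait as a whole to 15 minutes.
err := waiter.Wait(context.Background(),
	&ec2.DescribeInstancesInput{InstanceIds: []string{"i-0123456789abcdef0"}}, // hypothetical instance ID
	15*time.Minute)
if err != nil {
	log.Fatalf("wait failed: %v", err)
}
```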
This issue has not received a response in 1 week. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled. |
Describe the bug
We are using InstanceTerminatedWaiter to wait for a few ASG-managed instances to be in "terminated" state after we delete the ASG. However, we discovered that this wait operation often times out with an error of `exceeded max wait time for InstanceTerminated waiter`. Here are log lines from our code that hits this issue (with LogWaitAttempts enabled). The max timeout is set to 15 minutes; however, we can only see 3 attempts during the first minute of waiting. Then the next log line is 15 minutes later, with an error message about max wait time exceeded.
We've cross-checked our CloudTrail events and confirmed that the 3 logged attempts are the only DescribeInstances API calls that were made.
Expected Behavior
We expect the waiter to make more attempts during the 15-minute max timeout and to eventually detect all instances being terminated.
Current Behavior
The InstanceTerminatedWaiter only attempts 2 or 3 times, then it seems to get stuck and eventually times out. There are a few cases where the InstanceTerminatedWaiter actually worked, but that's happening less frequently for us.
Reproduction Steps
Possible Solution
No response
Additional Information/Context
Here's the code snippet of how we use the waiter:
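The snippet itself isn't shown above; as a hedged sketch based only on details mentioned elsewhere in this issue (InstanceTerminatedWaiter, LogWaitAttempts, the 15-minute maxWaitDur), the call probably looked roughly like this - the function shape, the instanceIDs variable, and the error handling are assumptions, not the reporter's actual code:

```go
// Hedged reconstruction; instanceIDs and the surrounding function are assumed.
func waitForTermination(ctx context.Context, cfg aws.Config, instanceIDs []string) error {
	client := ec2.NewFromConfig(cfg)

	waiter := ec2.NewInstanceTerminatedWaiter(client, func(options *ec2.InstanceTerminatedWaiterOptions) {
		options.LogWaitAttempts = true
	})

	// On timeout this returns "exceeded max wait time for InstanceTerminated waiter".
	return waiter.Wait(ctx, &ec2.DescribeInstancesInput{InstanceIds: instanceIDs}, 15*time.Minute)
}
```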
AWS Go SDK V2 Module Versions Used
github.com/aws/[email protected]
Compiler and Version used
go version go1.19.1
Operating System and version
Debian 11 (bullseye)/amd64