Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CRT doesn’t handle graceful shutdown properly #561

Closed
nicolas-laduguie opened this issue Dec 6, 2022 · 12 comments
Closed

CRT doesn’t handle graceful shutdown properly #561

nicolas-laduguie opened this issue Dec 6, 2022 · 12 comments
Labels
feature-request A feature should be added or improved. p2 This is a standard priority issue

Comments

@nicolas-laduguie
Copy link

nicolas-laduguie commented Dec 6, 2022

Describe the bug
We are using AWS Java CRT in the context of AWS sdk V2 with transfer manager for files uploads.
When our Spring boot application is shutdown gracefully, if a multipart upload is started through TransferManager (developer preview) during the graceful shutdown, the transfer is initiated but gets stuck.

Expected Behavior

A multipart upload through TransferManager should finish successfully even though it's started during a graceful shutdown.

Current Behavior

The multipart upload starts as we can see logs like :

2022-11-30 14:34:35.205  INFO 116 --- [rding-upload-14] s.a.a.t.s.p.LoggingTransferListener      : [] Transfer initiated...
2022-11-30 14:34:35.208  INFO 116 --- [rding-upload-14] s.a.a.t.s.p.LoggingTransferListener      : [] |                    | 0.0%
2022-11-30 14:34:35.233 DEBUG 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Creating an interceptor chain that will apply interceptors in the following order: [software.amazon.awssdk.core.internal.interceptor.HttpChecksumRequiredInterceptor@1366a521, software.amazon.awssdk.core.internal.interceptor.SyncHttpChecksumInTrailerInterceptor@113458ec, software.amazon.awssdk.core.internal.interceptor.HttpChecksumValidationInterceptor@275648d5, software.amazon.awssdk.core.internal.interceptor.AsyncRequestBodyHttpChecksumTrailerInterceptor@308d51cb, software.amazon.awssdk.core.internal.interceptor.HttpChecksumInHeaderInterceptor@16e4b556, software.amazon.awssdk.awscore.interceptor.HelpfulUnknownHostExceptionInterceptor@25a47c5e, software.amazon.awssdk.awscore.eventstream.EventStreamInitialRequestInterceptor@30a198df, software.amazon.awssdk.awscore.interceptor.TraceIdExecutionInterceptor@6af88409, software.amazon.awssdk.services.s3.endpoints.internal.S3ResolveEndpointInterceptor@5985d205, software.amazon.awssdk.services.s3.endpoints.internal.S3EndpointAuthSchemeInterceptor@5f4503e7, software.amazon.awssdk.services.s3.endpoints.internal.S3RequestSetEndpointInterceptor@639e0fb0, software.amazon.awssdk.services.s3.internal.handlers.EnableChunkedEncodingInterceptor@4d680706, software.amazon.awssdk.services.s3.internal.handlers.DisableDoubleUrlEncodingInterceptor@3a8af154, software.amazon.awssdk.services.s3.internal.handlers.EnableTrailingChecksumInterceptor@4da797d9, software.amazon.awssdk.services.s3.internal.handlers.CreateMultipartUploadRequestInterceptor@5e57f252, software.amazon.awssdk.services.s3.internal.handlers.GetObjectInterceptor@538a9ea9, software.amazon.awssdk.services.s3.internal.handlers.ExceptionTranslationInterceptor@3ee7d5dd, software.amazon.awssdk.transfer.s3.internal.ApplyUserAgentInterceptor@4f8f8487, software.amazon.awssdk.services.s3.internal.handlers.AsyncChecksumValidationInterceptor@2eab3a46, software.amazon.awssdk.services.s3.internal.handlers.GetBucketPolicyInterceptor@295396ec, software.amazon.awssdk.services.s3.internal.handlers.PutObjectInterceptor@299af726, software.amazon.awssdk.services.s3.internal.handlers.SyncChecksumValidationInterceptor@1a82e52e, software.amazon.awssdk.services.s3.internal.handlers.DecodeUrlEncodedResponseInterceptor@6d8c8532, software.amazon.awssdk.services.s3.internal.handlers.CreateBucketInterceptor@585d8a99, software.amazon.awssdk.services.s3.internal.handlers.CopySourceInterceptor@26807573, software.amazon.awssdk.services.s3.internal.crt.DefaultS3CrtAsyncClient$ValidateRequestInterceptor@2ba64f18, software.amazon.awssdk.services.s3.internal.crt.DefaultS3CrtAsyncClient$AttachHttpAttributesExecutionInterceptor@228aea4b]
2022-11-30 14:34:35.238 DEBUG 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Interceptor 'software.amazon.awssdk.services.s3.endpoints.internal.S3EndpointAuthSchemeInterceptor@5f4503e7' modified the message with its modifyRequest method.
2022-11-30 14:34:35.239 TRACE 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Old: PutObjectRequest(Bucket=bucket, Key=key, Tagging=tags)
New: PutObjectRequest(Bucket=bucket, Key=key, Tagging=tags)
2022-11-30 14:34:35.239 DEBUG 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Interceptor 'software.amazon.awssdk.transfer.s3.internal.ApplyUserAgentInterceptor@4f8f8487' modified the message with its modifyRequest method.
2022-11-30 14:34:35.240 TRACE 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Old: PutObjectRequest(Bucket=bucket, Key=key, Tagging=tags)
New: PutObjectRequest(Bucket=bucket, Key=key, Tagging=tags)
2022-11-30 14:34:35.257 DEBUG 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Interceptor 'software.amazon.awssdk.services.s3.endpoints.internal.S3RequestSetEndpointInterceptor@639e0fb0' modified the message with its modifyHttpRequest method.
2022-11-30 14:34:35.257 TRACE 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Old: DefaultSdkHttpFullRequest(httpMethod=PUT, protocol=http, host=172.18.0.7, port=4566, encodedPath=encodedpath, headers=[Content-Length, Content-Type, x-amz-tagging], queryParameters=[])
New: DefaultSdkHttpFullRequest(httpMethod=PUT, protocol=http, host=172.18.0.7, port=4566, encodedPath=/bucketencodedpath, headers=[Content-Length, Content-Type, x-amz-tagging], queryParameters=[])
2022-11-30 14:34:35.257 DEBUG 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Interceptor 'software.amazon.awssdk.services.s3.internal.handlers.PutObjectInterceptor@299af726' modified the message with its modifyHttpRequest method.
2022-11-30 14:34:35.257 TRACE 116 --- [rding-upload-14] s.a.a.c.i.ExecutionInterceptorChain      : [] Old: DefaultSdkHttpFullRequest(httpMethod=PUT, protocol=http, host=172.18.0.7, port=4566, encodedPath=/bucketencodedpath, headers=[Content-Length, Content-Type, x-amz-tagging], queryParameters=[])
New: DefaultSdkHttpFullRequest(httpMethod=PUT, protocol=http, host=172.18.0.7, port=4566, encodedPath=/bucketencodedpath, headers=[Content-Length, Content-Type, Expect, x-amz-tagging], queryParameters=[])
2022-11-30 14:34:35.257 DEBUG 116 --- [rding-upload-14] software.amazon.awssdk.request           : [] Sending Request: DefaultSdkHttpFullRequest(httpMethod=PUT, protocol=http, host=172.18.0.7, port=4566, encodedPath=/bucketencodedpath, headers=[amz-sdk-invocation-id, Content-Length, Content-Type, Expect, User-Agent, x-amz-tagging], queryParameters=[])

But then nothing happens, so seems like the multipart upload is stuck.

On CRT side, we can see error logs like :

[ERROR] [2022-12-06T15:47:19Z] [00007f821ae37700] [S3MetaRequest] - id=0x7f82a0009090 Could not read from body stream.
[ERROR] [2022-12-06T15:47:19Z] [00007f821ae37700] [S3MetaRequest] - id=0x7f82a0009090 Could not prepare request 0x7f8278012fb0 due to error 9216 (Attempt to use a JVM that has already been destroyed). 

That show CRT has detected beginning of JVM destroy, but not managing graceful shutdown.

Reproduction Steps

Spring boot 2.5.12
AWS SDK 2.18.3
s3-transfer-manager 2.18.3-PREVIEW

Possible Solution

No response

Additional Information/Context

No response

AWS Java SDK version used

2.18.3

JDK version used

17

Operating System and version

Docker Linux Ubuntu Focal 20

@bretambrose
Copy link
Contributor

I don't think your expectation is achievable. There comes a point in shutdown where it is not possible for native code to call back into the JVM and we do not know of anything more accurate (timepoint-wise) than the JVM shutdown hook. If you have a different understanding, then there might be something we could do.

One lesser possibility that comes to mind is potentially a registry of incomplete CRT futures that could at least be failed-en-masse by the CRT's JVM shutdown hook, which would at least unblock waiters.

@nicolas-laduguie
Copy link
Author

nicolas-laduguie commented Feb 10, 2023

IMHO, this is too early during an application shutdown to bind to a shutdown hook, because modern applications are doing/can do things still during the shutdown.
Taking concrete example :

  • Application serving HTTP calls, and uploading files to S3
  • This application uses transfer manager V2 and CRT Java behind the scene
  • Application receives a request and starts upload
  • During the upload, application receives shutdown signal (because of autoscale down or whatever)
  • Application starts shutting down, so all JVM shutdown hooks are executed including the CRT one
  • The CRT will not be gracefully shutdown, even though the upload is ongoing

It can even be the case where request received (upload not initiated yet) -> shutdown received -> upload initiated -> CRT resources no more available -> failure

A solution would be to give developers the choice to use the CRT defined shutdown hook, or disable the hook via a property, and expose a stop method that the developer would need to call during the shutdown.
That would be the easiest way to me.

@bretambrose
Copy link
Contributor

That might be a reasonable compromise.

@faucct
Copy link

faucct commented Jul 31, 2023

One lesser possibility that comes to mind is potentially a registry of incomplete CRT futures that could at least be failed-en-masse by the CRT's JVM shutdown hook, which would at least unblock waiters.

Is not it possible for them to be failed explicitly by closing the HTTP-client?

@bretambrose
Copy link
Contributor

Revisiting this, we've thrown around a couple of ideas that might help here; I may take a stab at them shortly:

  1. Experiment with having the shutdown hook's first execution do nothing but schedule a followup shutdown hook that does the actual firewall between native and the JVM. The hope is that shutdown hook registration is a list append so this "hack" would delay CRT shutdown until after all other hooks that were previously registered (pre-shutdown) had a chance to run.
  2. For rare cases where the above doesn't solve the problem (user code keeps resubmitting new shutdown hooks after shutdown starts), add a static API to disable the CRT shutdown hook as well as one to invoke the CRT shutdown manually. This would let a user from whom our default approach is inadequate to shutdown the CRT when they are ready for it to be shut down.

@nicolas-laduguie
Copy link
Author

Thank you for feedback.
The solution 2 is enough for us because in general with Java and spring boot ecosystem, it is easy to control all the shutdown flow, and it would be easy to call the CRT shutdown manually at the right time.
Would you have a timeline for this fix?

@bretambrose
Copy link
Contributor

Don't have a timeline but I can probably squeeze it in over the next week. A cursory google search just demonstrated that proposal (1) is a bad idea as it indicates there's no real fifo order (and they're concurrent too) to shutdown hook registration vs. execution.

I'm wondering if ref-counting may make sense here. So every call to disable_crt_shutdown() requires a corresponding call to shutdown_crt() and then the call that makes the ref count zero (almost always the first one, but supports complex cases where multiple dependencies might be trying to control the shutdown time of the CRT. In that case, the last user shutdown hook would "win" and the CRT would shutdown after its invocation, which would be the desired behavior).

@bretambrose
Copy link
Contributor

Here's a quick sketch (ie no testing yet):

#672

The basic contract is that anyone who wants to delay/control the CRT shutdown calls enableManualShutdown and when it's safe (from their perspective) to shutdown the CRT, they call manualShutdown. When the last matching call to manualShutdown is invoked, then the CRT actually gets shutdown.

@Drareeg
Copy link

Drareeg commented Jan 2, 2024

@bretambrose Do you have an update on this issue?
We are in a very similar situation as @nicolas-laduguie

@bretambrose
Copy link
Contributor

bretambrose commented Jan 6, 2024

I might be able to squeeze this into next oncall (next week).

Would the proposed solution PR work for you?

@Drareeg
Copy link

Drareeg commented Jan 9, 2024

Yes

@bretambrose
Copy link
Contributor

This was addressed a while back by #672 and released in v0.29.7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request A feature should be added or improved. p2 This is a standard priority issue
Projects
None yet
Development

No branches or pull requests

7 participants