IO Exception when listing s3 objects in bucket #1214
Comments
Hello, I created a new Micronaut project (version 4.3.1) and tested the behavior again. It still fails occasionally with the same error, although the fresh application only needs 2 attempts instead of 6. I can't explain why, since the dependencies and setup are the same as far as I can tell. I don't know whether it's worth looking into.
Hi, thanks for the report! I've set up a local application using Micronaut, but I can't replicate this behavior. Can you please configure logging to capture the failure and share the logs?
Hello, sorry for the late response. Here are the requested logs :) As always, after the sixth request it somehow gets through. However, as mentioned before, it does not happen a lot - every 20 requests or so.
We're experiencing the same exception w/ S3 (in our case, it's on GetObject - we aren't doing a ListObjects call in this use case). Context:
Below is the entire request that retried (TRACE-level output); you can see the initial request, the "call failed" (there is no response, just the EOFException), the retry setup, establishing a new connection, and a second attempt. It seems like the initial connection is somehow closed.
Well, took a bit to find a minimal reproducer, and boy is it minimal - this snippet causes retries due to the connection being closed.
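A hedged approximation of such a reproducer, assuming a GetObject-based flow like the one described above; the bucket, key, and region are placeholders rather than the original snippet:

import aws.sdk.kotlin.services.s3.S3Client
import aws.sdk.kotlin.services.s3.model.GetObjectRequest
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlin.time.Duration.Companion.seconds

// Hypothetical reproducer: the second call reuses a pooled connection that
// S3 has already closed after ~5s of idleness, which triggers a retry.
fun main() = runBlocking {
    S3Client.fromEnvironment { region = "us-east-1" /* placeholder */ }.use { s3 ->
        val request = GetObjectRequest {
            bucket = "example-bucket" // placeholder
            key = "example-key"       // placeholder
        }
        // First call establishes a pooled connection.
        s3.getObject(request) { println("first call: ${it.contentLength} bytes") }
        // Idle longer than the ~5s after which S3 closes the connection,
        // but well under the SDK's default 60s idle timeout.
        delay(7.seconds)
        // Second call goes out on the now-closed connection and gets retried.
        s3.getObject(request) { println("second call: ${it.contentLength} bytes") }
    }
}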
I've been able to reproduce the issue with the code you've given @cloudshiftchris and I've begun investigating possible causes and solutions.
Additional context (perhaps you've seen this already), perhaps not related - the application never exits. Even with a single request (no retry / failure) there is a thread stuck in the background. EDIT: ok, it does exit - after about a minute or so, presumably when some timeout happens.
Update: S3 appears to be closing idle connections faster than the default SDK idle timeout. In my testing, I've observed S3 closes unused connections after ~5 seconds. The SDK's default idle timeout is 60 seconds. The OkHttp engine should be able to handle that gracefully but it appears not to be. I'm still performing tests to verify the exact sequence of connection events to determine if there is a bug in OkHttp, the SDK, or S3. That said, there are a few workarounds I've been able to verify:
Can you give these a try in your environment and verify whether they work for you?
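A minimal sketch of the idle-timeout workaround as it's applied elsewhere in this thread - lowering the OkHttp engine's connection idle timeout below the ~5 seconds after which S3 appears to close idle connections; the region and exact value are illustrative:

import aws.sdk.kotlin.services.s3.S3Client
import kotlin.time.Duration.Companion.seconds

// Sketch: evict pooled connections before S3's observed ~5s idle cutoff,
// so the client doesn't try to reuse a connection the server already closed.
val s3 = S3Client {
    region = "us-east-1" // placeholder
    httpClient {
        connectionIdleTimeout = 3.seconds
    }
}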
Awesome, thanks @ianbotsf. We'll redeploy this morning (with an idle timeout of 3.seconds) and validate.
@ianbotsf we tested with the Java SDK v2 (it doesn't fail) - in looking through the debug logs & Apache HttpClient code there's the below check that HttpClient does (in the context of validating a connection prior to borrowing it from the pool); for the second attempt (after waiting for 10s) it hits org.apache.http.impl.BHttpConnectionBase#isStale
...will have the other results from adjusting the idle timeout shortly.
@ianbotsf using the below compensates for the stale connection w/o retries:
We've put that in one use case (component) - we don't plan to propagate it elsewhere at the moment (we've discovered that other components have the same issue), as the 'default' S3Client configuration should just work once this issue is fixed.
@cloudshiftchris : We're having the same issue (client configuration below):

import aws.sdk.kotlin.services.s3.S3Client
import aws.sdk.kotlin.services.s3.model.GetObjectRequest
import aws.sdk.kotlin.services.s3.model.PutObjectRequest
import aws.smithy.kotlin.runtime.content.ByteStream
import aws.smithy.kotlin.runtime.content.toByteArray
import kotlin.time.Duration.Companion.seconds // needed for 3.seconds

// Regions here presumably comes from the AWS Java SDK (com.amazonaws.regions.Regions)
private val amazonS3ClientAsync: S3Client = S3Client {
    region = Regions.DEFAULT_REGION.getName()
    httpClient {
        connectionIdleTimeout = 3.seconds
    }
}

We're using AWS Kotlin SDK 1.3.43
@sandrine-bedard that isn't a fix, it's a workaround (largely for diagnostic purposes) until the SDK team can determine what an appropriate fix looks like. The SDK should "just work" in a default configuration.
@cloudshiftchris : Sorry, I didn't use the correct word. I meant "workaround". Can you confirm this is the correct workaround?
Hard to know if it is correct for your use case and preferences; we encountered situations where an S3 client that hasn't been used in > 5s would retry on subsequent requests (and, under load, the number of retries would exceed the default of 3 and throw an exception). Aside from confirming that the workaround does address our issue, we've opted not to deploy it widely, as we have many different components that would need to be adjusted (most aren't materially impacted by this issue, since the retry limit masks it); an updated SDK, once available with a fix, will roll through with Dependabot.
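For completeness, the other stopgap alluded to here (raising the retry budget so a run of stale connections doesn't exhaust it) might look roughly like the following, assuming the SDK version in use exposes the retryStrategy config block; it masks the symptom rather than fixing the cause:

import aws.sdk.kotlin.services.s3.S3Client

// Sketch: allow more attempts so several stale pooled connections in a row
// don't exhaust the retry budget. This hides the symptom, not the cause.
val s3 = S3Client {
    region = "us-east-1" // placeholder
    retryStrategy {
        maxAttempts = 6 // enough headroom for the failure pattern described in this issue
    }
}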
@ianbotsf you've likely gotten well into this already - this code in
...is problematic. It doesn't seem practical to handle these outside of the client; for example, these two scenarios are indistinguishable:
...the point being that dealing with a connection-closed exception after the fact is problematic - we can't distinguish between "the connection was closed because it was stale before we sent the request" and "the connection was closed after we sent the request", hence "connection closed" is not safely retryable for non-idempotent operations. It isn't clear how OkHttpClient detects/retries these internally; imo, the OkHttp connection pool should skip over stale connections (since that isn't really a connection attempt), with retries kicking in only after a connection is borrowed/established. There's related discussion here: square/okhttp#7007, but the premise is flawed:
...like the server knows it may crash in 10 seconds! Servers/networks can/will do all kinds of weird and wonderful things - the client should never trust that any arbitrary server behaves correctly. In this case - if a pooled connection is closed, well, don't attempt to send anything on it - simply grab another connection from the pool (and validate that) or establish a new connection. EDIT: there's some connection validation logic in OkHttp's connection pool but it has flawed premises: okhttp3.internal.connection.RealConnection#isHealthy
okhttp3.internal.connection.CallConnectionUser#doExtensiveHealthChecks
Why only validate idle connections older than ten seconds? The subsequent validations aren't expensive relative to dealing with exceptions later. Apache HttpClient uses a 2s default (configurable). (It's a sad day when we're comparing unfavorably to Apache HttpClient...) And the kicker is that all of that mess repeats against the SDK retry count - if there are multiple connections in the pool that have gone stale (the server has closed them), each one counts as a retry attempt, until you're out of attempts. That doesn't bode well for real-world scenarios where the server aggressively closes idle connections (S3), or a server/load balancer/network blip drops some or all connections.
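For readers following along, a rough paraphrase of the reuse-time health check being criticized; this is a sketch of the logic as I read it in OkHttp 4.x, not actual OkHttp code:

// Paraphrase of the pool-reuse health check, not actual OkHttp source.
// doExtensiveChecks is true only for non-GET requests; the socket probe
// is skipped entirely unless the connection has been idle for >= 10 seconds.
fun isHealthy(
    idleMillis: Long,
    doExtensiveChecks: Boolean,
    socketLooksOpen: Boolean,
    probeSocket: () -> Boolean
): Boolean {
    if (!socketLooksOpen) return false                // obviously dead sockets are rejected
    if (doExtensiveChecks && idleMillis >= 10_000) {  // the 10s threshold questioned above
        return probeSocket()                          // short read with a tiny timeout
    }
    return true                                       // otherwise the connection is assumed healthy
}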
@ianbotsf as suspected above, there's an idempotency issue. Reproducer:
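A hedged approximation of that reproducer, based on the description later in the thread (a PutObject, a ~7 second pause, then another PutObject); the bucket, key, and region are placeholders, not the original snippet:

import aws.sdk.kotlin.services.s3.S3Client
import aws.sdk.kotlin.services.s3.model.PutObjectRequest
import aws.smithy.kotlin.runtime.content.ByteStream
import kotlinx.coroutines.delay
import kotlinx.coroutines.runBlocking
import kotlin.time.Duration.Companion.seconds

// Hypothetical reproducer for the idempotency concern: the second PutObject
// goes out on a pooled connection S3 closed during the pause, so a mutating
// request ends up being retried even though it may already have been sent.
fun main() = runBlocking {
    S3Client { region = "us-east-1" /* placeholder */ }.use { s3 ->
        val request = PutObjectRequest {
            bucket = "example-bucket" // placeholder
            key = "example-key"       // placeholder
            body = ByteStream.fromString("hello")
        }
        s3.putObject(request) // establishes the pooled connection
        delay(7.seconds)      // > ~5s S3 idle cutoff, < OkHttp's 10s health-check threshold
        s3.putObject(request) // reuses the closed connection and gets retried
    }
}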
Latest updates from my investigation:
I'll update this issue when I get more info or results. |
Latest updates:
Once I make progress on the SDK-level retry approach, we'll decide among the available options for the fix. |
thanks @ianbotsf. On this point:
There's a reproducer above (also included below) that shows S3.PutObject failing w/ a 7s delay - greater than the 5s for S3 to close the connection, less than OkHttp's 10s minimum idle time for stale connection checking (that only applies to non-GET requests).
On this point:
The 'connection closed' exceptions are caught after the request is sent, making it impossible to distinguish between a connection that was closed while processing the request versus a stale connection from the pool. This has implications for idempotency of retries (esp. when combined with the above reproducer showing mutating operations failing). Glad to hear that it may be possible to use an SDK-hosted connection monitor! This would be cleaner if OkHttp made the connection pool validation logic more correct / configurable.
@ianbotsf and @cloudshiftchris : My team implemented a 30s timeout, and we still see the same error with this configuration:

private val amazonS3ClientAsync: S3Client = S3Client {
    region = Regions.DEFAULT_REGION.getName()
    httpClient {
        connectionIdleTimeout = 30.seconds
    }
}

Anything else we can try, or should we just roll back this non-blocking client entirely (and revert to the Java SDK)?
@sandrine-bedard, setting the connection idle timeout to 30 seconds will not resolve the issue with S3, which automatically closes connections as quickly as ~5 seconds after they go idle. The present workarounds are still the ones listed above:
We're continuing to investigate proper fixes for this so that workarounds are no longer needed.
@ianbotsf : Thanks for the quick reply. I can't set …
Apologies, …
Latest updates:
A connection idle monitoring solution to address this is now in PR and being reviewed. |
The PR has been approved and merged and will be available in an upcoming release. A new configuration parameter for polling idle connections has been added to the OkHttp config:

S3Client.fromEnvironment {
    httpEngine(OkHttpEngine) {
        connectionIdlePollingInterval = 200.milliseconds
    }
}

When set to a non-null value, the engine periodically polls idle connections so that stale ones are detected before being reused.
The new poller was released today in SDK version 1.3.66. |
I was able to get it working with the following dependencies in my build.gradle.kts:

dependencies {
    implementation("aws.sdk.kotlin:s3:1.3.69")
    implementation("aws.smithy.kotlin:http-client-engine-okhttp:1.3.20")
}

And here's the code that sets up the S3 client:

import aws.sdk.kotlin.services.s3.S3Client
import aws.smithy.kotlin.runtime.http.engine.okhttp.OkHttpEngine
import kotlin.time.Duration.Companion.milliseconds

fun main() {
    val s3Client = S3Client {
        httpClient(OkHttpEngine) {
            connectionIdlePollingInterval = 200.milliseconds
        }
    }
}
Describe the bug
When executing the code below, the call to S3 randomly fails with an IO exception (the exception is included below). I cannot put a precise number on it, but I guess this happens once every 20 requests or so. It can be "fixed" by raising the attempts to more than 5, because when this happens it always fails exactly 5 times, with the 6th attempt somehow going through.
I cannot explain why this occurs, or why it always fails 5 times with the 6th attempt going through.
with this factory
the resulting exception
Expected behavior
The call should not randomly fail exactly 5 times when this error occurs.
Current behavior
As explained above, the error shown below is thrown.
Here is the output with more attempts set, going through:
Steps to Reproduce
The framework in use is Micronaut, version 4.2.2. The S3 bucket itself has several objects in it and returns 10 elements for the prefix key.
The code was started locally via Gradle (version 8.4) and invoked through the Micronaut controller via Postman. Note that the same behavior is present in the cloud environment.
with this factory (a hypothetical sketch of the setup follows below)
with this build.gradle.kts file
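A hypothetical sketch of the kind of setup described (a Micronaut factory providing an S3Client bean and a listing call with a prefix); all names, the region, and the bucket are placeholders, not the reporter's actual code:

import aws.sdk.kotlin.services.s3.S3Client
import aws.sdk.kotlin.services.s3.model.ListObjectsV2Request
import io.micronaut.context.annotation.Factory
import jakarta.inject.Singleton

// Hypothetical factory: provides a single S3Client bean for injection.
@Factory
class S3ClientFactory {
    @Singleton
    fun s3Client(): S3Client = S3Client {
        region = "eu-central-1" // placeholder region
    }
}

// Hypothetical listing call of the kind described in the bug report.
suspend fun listUserObjects(s3: S3Client): List<String> {
    val response = s3.listObjectsV2(ListObjectsV2Request {
        bucket = "example-bucket"  // placeholder
        prefix = "example-prefix/" // placeholder; returns ~10 objects in the report
    })
    return response.contents?.mapNotNull { it.key } ?: emptyList()
}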
Possible Solution
No response
Context
I want to show users the S3 objects they have saved in the cloud via a web app. However, at some point the request fails randomly (both locally and in the cloud).
AWS Kotlin SDK version used
1.0.54
Platform (JVM/JS/Native)
Java 21.0.2 with Gradle 8.4
Operating System and version
Windows 10 Pro