
OkHttp: Retry on SocketException: Broken pipe #962

Closed
berlix opened this issue Sep 26, 2023 · 9 comments

Labels
bug This issue is a bug. closed-for-staleness p2 This is a standard priority issue

Comments

berlix commented Sep 26, 2023

Describe the bug

I am aware of the fix in #896. However, even using SDK versions that include the fix (AWS Kotlin SDK 0.32.0-beta), we run into frequent, unhandled networking exceptions when talking to just about any AWS service, and especially S3.

Expected Behavior

To retry the request, so that application code doesn't have to.

Current Behavior

The request is not retried, and the exception is propagated up to the application:

java.net.SocketException: Broken pipe (Write failed)
    at java.net.SocketOutputStream.socketWrite0(SocketOutputStream.java)
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:150)
    at sun.security.ssl.SSLSocketOutputRecord.deliver(SSLSocketOutputRecord.java:345)
    at sun.security.ssl.SSLSocketImpl$AppOutputStream.write(SSLSocketImpl.java:1309)
    at okio.OutputStreamSink.write(JvmOkio.kt:57)
    at okio.AsyncTimeout$sink$1.write(AsyncTimeout.kt:102)
    at okio.RealBufferedSink.emitCompleteSegments(RealBufferedSink.kt:255)
    at okio.RealBufferedSink.write(RealBufferedSink.kt:146)
    at okhttp3.internal.http1.Http1ExchangeCodec$KnownLengthSink.write(Http1ExchangeCodec.kt:278)
    at okio.ForwardingSink.write(ForwardingSink.kt:29)
    at okhttp3.internal.connection.Exchange$RequestBodySink.write(Exchange.kt:223)
    at okio.RealBufferedSink.emitCompleteSegments(RealBufferedSink.kt:255)
    at okio.RealBufferedSink.write(RealBufferedSink.kt:146)
    at aws.smithy.kotlin.runtime.http.engine.okhttp.InstrumentedSink.write(MetricsInterceptor.kt:49)
    at okio.RealBufferedSink.emitCompleteSegments(RealBufferedSink.kt:255)
    at okio.RealBufferedSink.write(RealBufferedSink.kt:185)
    at okhttp3.internal._RequestBodyCommonKt$commonToRequestBody$1.writeTo(-RequestBodyCommon.kt:36)
    at aws.smithy.kotlin.runtime.http.engine.okhttp.InstrumentedRequestBody.writeTo(MetricsInterceptor.kt:69)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.kt:64)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:34)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:84)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:65)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at aws.smithy.kotlin.runtime.http.engine.okhttp.MetricsInterceptor.intercept(MetricsInterceptor.kt:30)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:205)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:537)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(Thread.java:829)

aws.smithy.kotlin.runtime.http.HttpException: java.net.SocketException: Broken pipe (Write failed)
    at aws.smithy.kotlin.runtime.http.engine.okhttp.OkHttpEngine.roundTrip(OkHttpEngine.kt:158)
    at aws.smithy.kotlin.runtime.http.engine.okhttp.OkHttpEngine$roundTrip$1.invokeSuspend(OkHttpEngine.kt)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:584)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:793)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:697)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:684)

As a side note: we tried switching to the CRT client, but there we've also observed at least one instance of an exception that looks retryable:

aws.smithy.kotlin.runtime.http.HttpException: AWS_ERROR_HTTP_CONNECTION_CLOSED: The connection has closed or is closing.; crtErrorCode=2058 
    at aws.smithy.kotlin.runtime.http.engine.crt.SdkStreamResponseHandler.onResponseComplete(SdkStreamResponseHandler.kt:176)
    at aws.sdk.kotlin.crt.http.HttpRequestUtilKt$asJniStreamResponseHandler$1.onResponseComplete(HttpRequestUtil.kt:59)
    at software.amazon.awssdk.crt.http.HttpStreamResponseHandlerNativeAdapter.onResponseComplete(HttpStreamResponseHandlerNativeAdapter.java:58)

Steps to Reproduce

We observe this mostly (but not only) in long-running tasks that make many concurrent requests to AWS services.

Possible Solution

To consider java.net.SocketException retryable.

Your Environment

  • Smithy Kotlin version used: 0.27.1
  • Platform (JVM/JS/Native): JVM
  • Operating System and version: ECS Fargate Linux platform 1.4.0
berlix added the bug label Sep 26, 2023
aajtodd (Contributor) commented Sep 27, 2023

Thanks for the issue.

we run into frequent, unhandled networking exceptions when talking to just about any AWS service, and especially S3.

I'd be interested in understanding what types of exceptions you are seeing. Are they all the same as the one you included in this issue (e.g. java.net.SocketException: Broken pipe (Write failed)) or are there others? Do they differ by service?


To consider java.net.SocketException retryable.

The issue is that knowing whether a network exception is retryable, generically, for any operation depends on whether the operation is idempotent. Take the example given: you hit something like a closed socket (perhaps the server closed it, or there was some other network issue). What if this failure happened after the server had already modified state? How do we know it's safe to retry at that point?

The answer is that we don't (unless the operation is marked as @idempotent). We don't special-case idempotent operations yet; that would be a good addition/feature request.

It's possible we may be able to special-case S3's retry policy to be more relaxed, but again we'd need a good way to classify which networking errors are retryable.


In the meantime you can always customize the retry policy for the SDK to retry whatever kind of error you want. This doesn't require changing the RetryStrategy, just the policy for what errors are considered retryable.

e.g.

// Imports shown for completeness; exact package locations may vary by SDK version.
import aws.sdk.kotlin.runtime.http.retries.AwsDefaultRetryPolicy
import aws.sdk.kotlin.services.s3.S3Client
import aws.smithy.kotlin.runtime.retries.policy.RetryDirective
import aws.smithy.kotlin.runtime.retries.policy.RetryErrorType
import aws.smithy.kotlin.runtime.retries.policy.RetryPolicy
import java.io.IOException
import kotlinx.coroutines.runBlocking

// Treat any IOException as a transient (retryable) error; defer to the default policy otherwise.
class CustomRetryPolicy : RetryPolicy<Any?> {
    override fun evaluate(result: Result<Any?>): RetryDirective {
        return when (result.exceptionOrNull()) {
            is IOException -> RetryDirective.RetryError(RetryErrorType.Transient)
            else -> AwsDefaultRetryPolicy.evaluate(result)
        }
    }
}

fun main(): Unit = runBlocking {
    val s3 = S3Client.fromEnvironment {
        retryPolicy = CustomRetryPolicy()
    }

    // ... use the client as usual ...
}

Note: You'll need a dependency on aws.sdk.kotlin:aws-http:<sdk-version> to access AwsDefaultRetryPolicy
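
For reference, that dependency declaration might look roughly like the following in a Gradle Kotlin DSL build script (a minimal sketch; <sdk-version> is the same placeholder as above):

dependencies {
    implementation("aws.sdk.kotlin:aws-http:<sdk-version>")
}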

berlix (Author) commented Sep 28, 2023

Thank you for your response. I see the problem with non-idempotent requests, and agree that not all errors (or all kinds of SocketException) should be considered retryable.

And thanks for the hint about the RetryPolicy, we'll try that.

I'd be interested in understanding what types of exceptions you are seeing

I don't have good stats right now, because some of our services still use older SDK versions, but I believe that among the recent versions, the SocketException with Broken pipe (Write failed) is by far the one we see most, mostly from S3 and sometimes from CloudWatch (PutMetrics). I see you already use some hacks where you check the exception's message to determine retryability; it seems to me that "Write failed" does qualify as retryable, so one idea could be to add that as a discriminator (a rough sketch follows).
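
For illustration only, such a discriminator might be layered onto a custom policy like the one suggested above; note that the "Write failed" message text is JVM-implementation-specific, so this is a heuristic sketch rather than an SDK API:

import aws.sdk.kotlin.runtime.http.retries.AwsDefaultRetryPolicy
import aws.smithy.kotlin.runtime.retries.policy.RetryDirective
import aws.smithy.kotlin.runtime.retries.policy.RetryErrorType
import aws.smithy.kotlin.runtime.retries.policy.RetryPolicy
import java.net.SocketException

// Retry only SocketExceptions whose message indicates a failed write; defer to the default otherwise.
class BrokenPipeRetryPolicy : RetryPolicy<Any?> {
    override fun evaluate(result: Result<Any?>): RetryDirective {
        val ex = result.exceptionOrNull()
        return if (ex is SocketException && ex.message?.contains("Write failed") == true) {
            // Matching on message text is fragile, but it's the only discriminator available here.
            RetryDirective.RetryError(RetryErrorType.Transient)
        } else {
            AwsDefaultRetryPolicy.evaluate(result)
        }
    }
}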

One thing we also see regularly, though I suppose it's unrelated and just a bug in OkHttp, is an IllegalStateException: Unbalanced enter/exit from inside OkHttp. That was why we tried switching to the CRT client. Unfortunately, the connection-related exception from the CRT client (as seen in my initial comment) doesn't seem to be very specific, and doesn't say whether it was a write or a read that failed.

ianbotsf (Contributor) commented Apr 3, 2024

Hi @berlix, are you still encountering socket exceptions when using OkHttp with more recent versions of the SDK?

madisp commented Apr 3, 2024

Googling aws-kotlin-sdk "Unbalanced enter/exit" brought me here :)

@ianbotsf I'm getting Caused by: java.lang.IllegalStateException: Unbalanced enter/exit when spawning multiple SQS consumers; test: https://gist.github.com/madisp/c3c1e04ad0ef7c34ed66b5545606c846. It seems to happen reliably on my local M1 MacBook when there are 10 or more parallel consumers (see the sketch after this comment for the rough shape).

(this is on aws.sdk.kotlin:sqs-jvm:1.1.10)

This might be a test coroutines issue though?
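
For context, a hypothetical sketch of the parallel-consumer shape described above (not the linked gist; the queue URL and the CONSUMERS constant are placeholders):

import aws.sdk.kotlin.services.sqs.SqsClient
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.runBlocking

const val CONSUMERS = 10 // reportedly fails more often at 10+ parallel consumers

fun main(): Unit = runBlocking {
    val sqs = SqsClient.fromEnvironment()
    // Spawn CONSUMERS concurrent receive calls against the same client/connection pool.
    (1..CONSUMERS).map {
        async {
            sqs.receiveMessage {
                queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/test-queue" // placeholder
                waitTimeSeconds = 20
            }
        }
    }.awaitAll()
    sqs.close()
}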

ianbotsf (Contributor) commented Apr 3, 2024

Thanks for the test code @madisp, but I cannot reproduce the failure on my local Windows machine or my cloud EC2 Linux machine. It's likely that parallelism and timing differences make this difficult to reproduce across execution environments.

I see that OkHttp recently fixed an unbalanced enter/exit exception caused by rapid release/re-acquisition of connections (square/okhttp#7381). If you have stable repro code, it would be interesting to see if it still fails on OkHttp 5.0.0-alpha.12 (smithy-kotlin currently uses 5.0.0-alpha.11). If not then we can prioritize upgrading the OkHttp dependency to the latest alpha version.

madisp commented Apr 4, 2024

Yup, still getting it with 5.0.0-alpha.12. One thing that makes it easier to reproduce is the number of MESSAGES and CONSUMERS; can you try increasing these to 1000 and 100 respectively?

ianbotsf added the p2 label Apr 4, 2024
berlix (Author) commented Apr 10, 2024

Apologies for the late response. We haven't observed that specific exception ("Broken pipe") with SDK versions 1.0.13 or 1.1.13. We do still regularly get the following exception, at least when downloading objects from S3 - no clue if it's related:

aws.smithy.kotlin.runtime.http.HttpException: java.io.IOException: unexpected end of stream on https://s3.eu-west-1.amazonaws.com/...
    at aws.smithy.kotlin.runtime.http.engine.okhttp.OkHttpEngine.roundTrip(OkHttpEngine.kt:158)
    at aws.smithy.kotlin.runtime.http.engine.okhttp.OkHttpEngine$roundTrip$1.invokeSuspend(OkHttpEngine.kt)
    at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
    at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
    at kotlinx.coroutines.scheduling.CoroutineScheduler.runSafely(CoroutineScheduler.kt:584)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.executeTask(CoroutineScheduler.kt:793)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.kt:697)
    at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.kt:684)
Caused by: java.io.IOException: unexpected end of stream on https://s3.eu-west-1.amazonaws.com/...
    at okhttp3.internal.http1.Http1ExchangeCodec.readResponseHeaders(Http1ExchangeCodec.kt:209)
    at okhttp3.internal.connection.Exchange.readResponseHeaders(Exchange.kt:111)
    at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.kt:95)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.kt:34)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.kt:95)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.kt:84)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.kt:65)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at aws.smithy.kotlin.runtime.http.engine.okhttp.MetricsInterceptor.intercept(MetricsInterceptor.kt:30)
    at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.kt:109)
    at okhttp3.internal.connection.RealCall.getResponseWithInterceptorChain$okhttp(RealCall.kt:205)
    at okhttp3.internal.connection.RealCall$AsyncCall.run(RealCall.kt:537)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.EOFException: \n not found: limit=0 content=…
    at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.kt:335)
    at okhttp3.internal.http1.HeadersReader.readLine(HeadersReader.kt:29)
    at okhttp3.internal.http1.Http1ExchangeCodec.readResponseHeaders(Http1ExchangeCodec.kt:179)
    ... 18 common frames omitted    

cloudshiftchris commented

The above exception is perhaps related to awslabs/aws-sdk-kotlin#1214

lauzadis (Contributor) commented

As cloudshiftchris said above, we believe this issue is related to S3 closing connections earlier than expected, which OkHttp does not handle by default.

We've added a new "connection monitoring" feature to detect when a connection is closed prematurely and prevent its use. You can enable this on your S3 client by configuring a non-null connectionIdlePollingInterval on the engine:

// Requires: import aws.smithy.kotlin.runtime.http.engine.okhttp.OkHttpEngine
//           import kotlin.time.Duration.Companion.milliseconds
S3Client.fromEnvironment {
    httpEngine(OkHttpEngine) {
        // Poll idle connections every 200 ms so remotely-closed connections are detected before reuse.
        connectionIdlePollingInterval = 200.milliseconds
    }
}

Please enable this and reach out if you have any more issues!

lauzadis added the closing-soon label Dec 10, 2024
github-actions bot added the closed-for-staleness label and removed the closing-soon label Dec 15, 2024