-
Notifications
You must be signed in to change notification settings - Fork 656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: add FAQ/troubleshooting and operation timing guide #2434
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
--- | ||
title: "Frequently Asked Questions" | ||
linkTitle: "FAQ / Troubleshooting" | ||
description: "Answers to some commonly-asked questions about the {{% alias sdk-go %}}" | ||
weight: 9 | ||
--- | ||
|
||
### How do I configure my SDK's HTTP client? Are there any guidelines or best practices? | ||
|
||
We are unable to provide guidance to customers on how to configure their HTTP | ||
workflow in a manner that is most effective for their particular workload. The | ||
answer to this is the product of a multivariate equation, with input factors | ||
including but not limited to: | ||
|
||
* the network footprint of the application (TPS, throughput, etc.) | ||
* the services being used | ||
* the compute characteristics of the deployment | ||
* the geographical nature of the deployment | ||
* the desired application behavior or needs of the application itself (SLAs, | ||
timings, etc.) | ||
|
||
### How should I configure operation timeouts? | ||
|
||
Much like the previous question, it depends. Elements to consider here include | ||
the following: | ||
|
||
* All of the above factors concerning HTTP client config | ||
* Your own application timing or SLA constraints (e.g. if you yourself serve | ||
traffic to other consumers) | ||
|
||
**The answer to this question should almost NEVER be based on pure empirical | ||
observation of upstream behavior** - e.g. "I made 1000 calls to this operation, | ||
it took at most 5 seconds so I will set the timeout based on that with a safety | ||
factor of 2x to 10 seconds". Environment conditions can change, services can | ||
temporarily degrade, and these types of assumptions can become wrong without | ||
warning. | ||
|
||
### Requests made by the SDK are timing out or taking too long, how do I fix this? | ||
|
||
We are unable to assist with extended or timed-out operation calls due to | ||
extended time spent on the wire. "Wire time" in the SDK is defined as any of | ||
the following: | ||
* Time spent in an SDK client's `HTTPClient.Do()` method | ||
* Time spent in `Read()`s on an HTTP response body that has been forwarded to | ||
the caller (e.g. `GetObject`) | ||
|
||
If you are experiencing issues due to operation latency or timeouts, your first | ||
course of action should be to obtain telemetry of the SDK operation lifecycle | ||
to determine the timing breakdown between time spent on the wire and the | ||
surrounding overhead of the operation. See the guide on | ||
[timing SDK operations]({{< ref "/docs/faq/timing-operations.md" >}}), | ||
which contains a reusable code snippet that can achieve this. | ||
|
||
### How do I fix a `read: connection reset` error? | ||
|
||
The SDK retries any errors matching the `connection reset` pattern by default. | ||
This will cover error handling for most operations, where the operation's HTTP | ||
response is fully consumed and deserialized into its modeled result type. | ||
|
||
However, this error can still occur in a context **outside** of the retry loop: | ||
certain service operations directly forward the API's HTTP response body to the | ||
caller to be consumed from the wire directly via `io.ReadCloser` (e.g. | ||
`GetObject`'s object payload). You may encounter this error when performing a | ||
`Read` on the response body. | ||
|
||
This error indicates that your host, the service or any intermediary party | ||
(e.g. NAT gateways, proxies, load balancers) closed the connection while | ||
attempting to read the response. | ||
|
||
This can occur for several reasons: | ||
* You did not consume the response body for some time after the response itself | ||
was received (after the service operation was called). **We recommend you | ||
consume the HTTP response body as soon as possible for these types of | ||
operations.** | ||
* You did not close a previously-received response body. This can cause | ||
connection resets on certain platforms. **You MUST close any `io.ReadCloser` | ||
instances provided in an operation's response, regardless of whether you | ||
consume its contents.** | ||
|
||
Beyond that, try running a tcpdump for an affected connection at the edge of | ||
your network (e.g. after any proxies that you control). If you see that the AWS | ||
endpoint seems to be sending a TCP RST, you should use the AWS support console | ||
to open a case against the offending service. Be prepared to provide request | ||
IDs and specific timestamps of when the issue occured. | ||
|
||
### Why am I getting "invalid signature" errors when using an HTTP proxy with the SDK? | ||
|
||
The signature algorithm for AWS services (generally sigv4) is tied to the | ||
serialized request's headers, more specifically most headers prefixed with | ||
`X-`. Proxies are prone to modifying the outgoing request by adding additional | ||
forwarding information (often via an `X-Forwarded-For` header) which | ||
effectively breaks the signature that the SDK calculated. | ||
|
||
If you're using an HTTP proxy and experiencing signature errors, you should | ||
work to capture the request **as it appears outgoing from the proxy** and | ||
determine whether it is different. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,250 @@ | ||
--- | ||
title: "Timing SDK operations" | ||
linkTitle: "Timing Operations" | ||
description: "How to perform basic instrumentation in the {{% alias sdk-go %}} to time SDK operations" | ||
weight: 1 | ||
--- | ||
|
||
When debugging timeout / latency issues in the SDK, it is critical to identify | ||
the components of the operation lifecycle which are taking more time to execute | ||
than expected. As a starting point, you will generally need to inspect the | ||
timing breakdown between the overall operation call and the HTTP call itself. | ||
|
||
The following sample program implements a basic instrumentation probe in terms | ||
of `smithy-go` middleware for SQS clients and demonstrates how it is used. The | ||
probe emits the following information for each operation call: | ||
|
||
* AWS request ID | ||
* service ID | ||
* operation name | ||
* operation invocation time | ||
* http call time | ||
|
||
Each emitted message is prefixed with a unique (to a single operation) | ||
"invocation ID" which is set at the beginning of the handler stack. | ||
|
||
The entry point for instrumentation is exposed as `WithOperationTiming`, which | ||
is parameterized to accept a message handling function which will receive | ||
instrumentation "events" as formatted strings. `PrintfMSGHandler` is provided | ||
as a convenience which will simply dump messages to stdout. | ||
|
||
The service used here is interchangeable - ALL service client options accept | ||
`APIOptions` and an `HTTPClient` as configuration. For example, | ||
`WithOperationTiming` could instead be declared as: | ||
|
||
```go | ||
func WithOperationTiming(msgHandler func(string)) func(*s3.Options) | ||
func WithOperationTiming(msgHandler func(string)) func(*dynamodb.Options) | ||
// etc. | ||
``` | ||
|
||
If you change it, be sure to change the signature of the function it returns as | ||
well. | ||
|
||
```go | ||
import ( | ||
"context" | ||
"fmt" | ||
"log" | ||
"net/http" | ||
"sync" | ||
"time" | ||
|
||
awsmiddleware "github.com/aws/aws-sdk-go-v2/aws/middleware" | ||
awshttp "github.com/aws/aws-sdk-go-v2/aws/transport/http" | ||
"github.com/aws/aws-sdk-go-v2/config" | ||
"github.com/aws/aws-sdk-go-v2/service/sqs" | ||
"github.com/aws/smithy-go/middleware" | ||
smithyrand "github.com/aws/smithy-go/rand" | ||
) | ||
|
||
// WithOperationTiming instruments an SQS client to dump timing information for | ||
// the following spans: | ||
// - overall operation time | ||
// - HTTPClient call time | ||
// | ||
// This instrumentation will also emit the request ID, service name, and | ||
// operation name for each invocation. | ||
// | ||
// Accepts a message "handler" which is invoked with formatted messages to be | ||
// handled externally, you can use the declared PrintfMSGHandler to simply dump | ||
// these values to stdout. | ||
func WithOperationTiming(msgHandler func(string)) func(*sqs.Options) { | ||
return func(o *sqs.Options) { | ||
o.APIOptions = append(o.APIOptions, addTimingMiddlewares(msgHandler)) | ||
o.HTTPClient = &timedHTTPClient{ | ||
client: awshttp.NewBuildableClient(), | ||
msgHandler: msgHandler, | ||
} | ||
} | ||
} | ||
|
||
// PrintfMSGHandler writes messages to stdout. | ||
func PrintfMSGHandler(msg string) { | ||
fmt.Printf("%s\n", msg) | ||
} | ||
|
||
type invokeIDKey struct{} | ||
|
||
func setInvokeID(ctx context.Context, id string) context.Context { | ||
return middleware.WithStackValue(ctx, invokeIDKey{}, id) | ||
} | ||
|
||
func getInvokeID(ctx context.Context) string { | ||
id, _ := middleware.GetStackValue(ctx, invokeIDKey{}).(string) | ||
return id | ||
} | ||
|
||
// Records the current time, and returns a function to be called when the | ||
// target span of events is completed. The return function will emit the given | ||
// span name and time elapsed to the given message consumer. | ||
func timeSpan(ctx context.Context, name string, consumer func(string)) func() { | ||
start := time.Now() | ||
return func() { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. for my own curiosity, whats the advantage of using a higher-order function here? when it looks like all use cases in this code sample of where There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's not immediately called because of defer, I've clarified how it's used in a doc comment. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i meant the second set of parentheses (i.e. the second invocation). the second invocation is called immediately after the first invocation. i understand the first invocation is delayed by defer. but the comment clears it up, thanks. |
||
elapsed := time.Now().Sub(start) | ||
consumer(fmt.Sprintf("[%s] %s: %s", getInvokeID(ctx), name, elapsed)) | ||
} | ||
} | ||
|
||
type timedHTTPClient struct { | ||
client *awshttp.BuildableClient | ||
msgHandler func(string) | ||
} | ||
|
||
func (c *timedHTTPClient) Do(r *http.Request) (*http.Response, error) { | ||
defer timeSpan(r.Context(), "http", c.msgHandler)() | ||
|
||
resp, err := c.client.Do(r) | ||
if err != nil { | ||
return nil, fmt.Errorf("inner client do: %v", err) | ||
} | ||
|
||
return resp, nil | ||
} | ||
|
||
type addInvokeIDMiddleware struct { | ||
msgHandler func(string) | ||
} | ||
|
||
func (*addInvokeIDMiddleware) ID() string { return "addInvokeID" } | ||
|
||
func (*addInvokeIDMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) ( | ||
out middleware.InitializeOutput, md middleware.Metadata, err error, | ||
) { | ||
id, err := smithyrand.NewUUID(smithyrand.Reader).GetUUID() | ||
if err != nil { | ||
return out, md, fmt.Errorf("new uuid: %v", err) | ||
} | ||
|
||
return next.HandleInitialize(setInvokeID(ctx, id), in) | ||
} | ||
|
||
type timeOperationMiddleware struct { | ||
msgHandler func(string) | ||
} | ||
|
||
func (*timeOperationMiddleware) ID() string { return "timeOperation" } | ||
|
||
func (m *timeOperationMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) ( | ||
middleware.InitializeOutput, middleware.Metadata, error, | ||
) { | ||
defer timeSpan(ctx, "operation", m.msgHandler)() | ||
return next.HandleInitialize(ctx, in) | ||
} | ||
|
||
type emitMetadataMiddleware struct { | ||
msgHandler func(string) | ||
} | ||
|
||
func (*emitMetadataMiddleware) ID() string { return "emitMetadata" } | ||
|
||
func (m *emitMetadataMiddleware) HandleInitialize(ctx context.Context, in middleware.InitializeInput, next middleware.InitializeHandler) ( | ||
middleware.InitializeOutput, middleware.Metadata, error, | ||
) { | ||
out, md, err := next.HandleInitialize(ctx, in) | ||
|
||
invokeID := getInvokeID(ctx) | ||
requestID, _ := awsmiddleware.GetRequestIDMetadata(md) | ||
service := awsmiddleware.GetServiceID(ctx) | ||
operation := awsmiddleware.GetOperationName(ctx) | ||
m.msgHandler(fmt.Sprintf(`[%s] requestID = "%s"`, invokeID, requestID)) | ||
m.msgHandler(fmt.Sprintf(`[%s] service = "%s"`, invokeID, service)) | ||
m.msgHandler(fmt.Sprintf(`[%s] operation = "%s"`, invokeID, operation)) | ||
|
||
return out, md, err | ||
} | ||
|
||
func addTimingMiddlewares(mh func(string)) func(*middleware.Stack) error { | ||
return func(s *middleware.Stack) error { | ||
if err := s.Initialize.Add(&timeOperationMiddleware{msgHandler: mh}, middleware.Before); err != nil { | ||
return fmt.Errorf("add time operation middleware: %v", err) | ||
} | ||
if err := s.Initialize.Add(&addInvokeIDMiddleware{msgHandler: mh}, middleware.Before); err != nil { | ||
return fmt.Errorf("add invoke id middleware: %v", err) | ||
} | ||
if err := s.Initialize.Insert(&emitMetadataMiddleware{msgHandler: mh}, "RegisterServiceMetadata", middleware.After); err != nil { | ||
return fmt.Errorf("add emit metadata middleware: %v", err) | ||
} | ||
return nil | ||
} | ||
} | ||
|
||
func main() { | ||
cfg, err := config.LoadDefaultConfig(context.Background()) | ||
if err != nil { | ||
log.Fatal(fmt.Errorf("load default config: %v", err)) | ||
} | ||
|
||
svc := sqs.NewFromConfig(cfg, WithOperationTiming(PrintfMSGHandler)) | ||
|
||
var wg sync.WaitGroup | ||
|
||
for i := 0; i < 6; i++ { | ||
wg.Add(1) | ||
go func() { | ||
defer wg.Done() | ||
|
||
_, err = svc.ListQueues(context.Background(), nil) | ||
if err != nil { | ||
fmt.Println(fmt.Errorf("list queues: %v", err)) | ||
} | ||
}() | ||
} | ||
wg.Wait() | ||
} | ||
``` | ||
|
||
A sample output of this program: | ||
|
||
``` | ||
[e9a801bb-c51d-45c8-8e9f-a202e263fde8] http: 192.24067ms | ||
[e9a801bb-c51d-45c8-8e9f-a202e263fde8] requestID = "dbee3082-96a3-5b23-adca-6d005696fa94" | ||
[e9a801bb-c51d-45c8-8e9f-a202e263fde8] service = "SQS" | ||
[e9a801bb-c51d-45c8-8e9f-a202e263fde8] operation = "ListQueues" | ||
[e9a801bb-c51d-45c8-8e9f-a202e263fde8] operation: 193.098393ms | ||
[0740f0e0-953e-4328-94fc-830a5052e763] http: 195.185732ms | ||
[0740f0e0-953e-4328-94fc-830a5052e763] requestID = "48b301fa-fc9f-5f1f-9007-5c783caa9322" | ||
[0740f0e0-953e-4328-94fc-830a5052e763] service = "SQS" | ||
[0740f0e0-953e-4328-94fc-830a5052e763] operation = "ListQueues" | ||
[0740f0e0-953e-4328-94fc-830a5052e763] operation: 195.725491ms | ||
[c0589832-f351-4cc7-84f1-c656eb79dbd7] http: 200.52383ms | ||
[444030d0-6743-4de5-bd91-bc40b2b94c55] http: 200.525919ms | ||
[c0589832-f351-4cc7-84f1-c656eb79dbd7] requestID = "4a73cc82-b47b-56e1-b327-9100744e1b1f" | ||
[c0589832-f351-4cc7-84f1-c656eb79dbd7] service = "SQS" | ||
[c0589832-f351-4cc7-84f1-c656eb79dbd7] operation = "ListQueues" | ||
[c0589832-f351-4cc7-84f1-c656eb79dbd7] operation: 201.214365ms | ||
[444030d0-6743-4de5-bd91-bc40b2b94c55] requestID = "ca1523ed-1879-5610-bf5d-7e6fd84cabee" | ||
[444030d0-6743-4de5-bd91-bc40b2b94c55] service = "SQS" | ||
[444030d0-6743-4de5-bd91-bc40b2b94c55] operation = "ListQueues" | ||
[444030d0-6743-4de5-bd91-bc40b2b94c55] operation: 201.197071ms | ||
[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] http: 206.449568ms | ||
[12b2b39d-df86-4648-a436-ff0482d13340] http: 206.526603ms | ||
[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] requestID = "64229710-b552-56ed-8f96-ca927567ec7b" | ||
[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] service = "SQS" | ||
[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] operation = "ListQueues" | ||
[079e8dbd-bb93-43ab-89e5-a7bb392b86a5] operation: 207.252357ms | ||
[12b2b39d-df86-4648-a436-ff0482d13340] requestID = "76d9cbc0-07aa-58aa-98b7-9642c79f9851" | ||
[12b2b39d-df86-4648-a436-ff0482d13340] service = "SQS" | ||
[12b2b39d-df86-4648-a436-ff0482d13340] operation = "ListQueues" | ||
[12b2b39d-df86-4648-a436-ff0482d13340] operation: 207.360621ms | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: i think
consumed from the wire directly
may be more clarified if you contrasted it with the normal case