I am using the lambda proxy to discover my cluster topology with ClusterEndpointRefreshAgent. Sometimes, when the lambda is invoked during its initialization period, it will time out. When this happens, the refresh agent throws an error which suggests that it is not properly handling/parsing the error response from the lambda service. There is no indication in the client logs of what the error actually was; determining that it was a timeout required looking into the CloudWatch log stream for the lambda proxy.

Software versions used

software.amazon.neptune:gremlin-client:2.0.5

Expected behavior
Server-side lambda errors (of any type, not just the timeout error I observed) are reported/logged on the client
Timeout errors are retried
Actual behavior
The client fails to handle the response, and the observed error does not indicate the error reported by the server. It appears this is because the refresh agent is not checking for an error response: it immediately tries to parse any response into a NeptuneClusterMetadata object without examining InvokeResult.statusCode or InvokeResult.functionError.
It does not look like the Neptune Gremlin client retry configuration would retry on a timeout. The Lambda documentation suggests that the AWS SDK may retry this, so it may not be necessary for the Neptune Gremlin client to also retry. It was unclear from the client logs alone whether any retry occurred - the server-side CloudWatch logs did not show another request until 15s after the timed-out request ended.
It's not clear what error is actually being returned. From looking at the documentation, a 408 RequestTimeoutException is a possibility.
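To make the missing check concrete, here is a minimal sketch of invoking the proxy and inspecting the result before parsing it, assuming the AWS SDK for Java v1 Lambda client (AWSLambda, InvokeRequest, InvokeResult). The function name and the parseClusterMetadata helper are hypothetical placeholders, not the actual gremlin-client code.

```java
import com.amazonaws.services.lambda.AWSLambda;
import com.amazonaws.services.lambda.AWSLambdaClientBuilder;
import com.amazonaws.services.lambda.model.InvokeRequest;
import com.amazonaws.services.lambda.model.InvokeResult;

import java.nio.charset.StandardCharsets;

public class LambdaProxyCheckSketch {

    public static void main(String[] args) {
        AWSLambda lambda = AWSLambdaClientBuilder.defaultClient();

        // Hypothetical function name for the cluster-topology proxy lambda.
        InvokeRequest request = new InvokeRequest().withFunctionName("neptune-endpoint-info");

        InvokeResult result = lambda.invoke(request);
        String payload = StandardCharsets.UTF_8.decode(result.getPayload()).toString();

        // A function timeout can still come back with a 200 status code, so statusCode
        // alone is not enough: functionError must be inspected as well.
        if (result.getFunctionError() != null || result.getStatusCode() != 200) {
            throw new RuntimeException(String.format(
                    "Lambda proxy returned an error (statusCode=%d, functionError=%s): %s",
                    result.getStatusCode(), result.getFunctionError(), payload));
        }

        // Only parse the payload into cluster metadata once the invocation is known to have succeeded.
        System.out.println(parseClusterMetadata(payload));
    }

    // Placeholder for the client's NeptuneClusterMetadata parsing.
    private static String parseClusterMetadata(String payload) {
        return payload;
    }
}
```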
Thanks for reporting this. I've added a check in the client to catch the timeout and retry. (Timeouts still return 200 OK, so I had to check the functionError and payload for the details.) See 852843e
I'd be interested to know what was causing the Lambda to time out during its initialization period. Do you have any insight into the cause?
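As a rough illustration of catching the timeout and retrying, here is a sketch under the same SDK assumptions as above. It is not the actual change in 852843e, which should be consulted for the real implementation; the retry count and the check on the payload's "timed out" message are assumptions.

```java
import com.amazonaws.services.lambda.AWSLambda;
import com.amazonaws.services.lambda.model.InvokeRequest;
import com.amazonaws.services.lambda.model.InvokeResult;

import java.nio.charset.StandardCharsets;

final class RetryingInvokeSketch {

    private static final int MAX_ATTEMPTS = 3; // assumption: the real retry policy may differ

    static String invokeWithRetry(AWSLambda lambda, InvokeRequest request) {
        String lastError = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            InvokeResult result = lambda.invoke(request);
            String payload = StandardCharsets.UTF_8.decode(result.getPayload()).toString();

            if (result.getFunctionError() == null) {
                return payload; // success: payload contains the cluster metadata JSON
            }

            // A timeout arrives as a function error alongside a 200 status code; the payload
            // carries the "Task timed out" detail (an assumption about the message format).
            lastError = payload;
            if (!payload.contains("timed out")) {
                break; // surface non-timeout function errors immediately
            }
        }
        throw new RuntimeException("Lambda proxy invocation failed: " + lastError);
    }
}
```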
@iansrobinson That's great to hear, thanks for the quick response.
Unfortunately, I don't have any insight into the cause of the timeout. I assumed it may just be typical of a request arriving at a lambda that needs to be initialized before it can respond. I can say that when we first created the refresh lambda, we missed the timeout value (15s) in the recommended CloudFormation template and so ended up with the default lambda timeout of 3s. When we updated to the 15s timeout, we observed this error much less frequently (just three occurrences in the past several days), and this fix looks like it will eliminate the impact of even the few remaining timeouts.
Here is what we saw in our lambda logs in CloudWatch - not much detail:
2024-04-14T22:21:57.525Z f72b4258-adaa-4fea-8644-527d6142effe Task timed out after 15.02 seconds
And this was always associated with:
INIT_START Runtime Version: java:11.v36 Runtime Version ARN: arn:aws:lambda:xxx