ClusterEndpointsRefreshAgent does not handle lambda errors #4

Open
derekguist opened this issue Apr 11, 2024 · 2 comments
Comments

@derekguist

I am using the lambda proxy to discover my cluster topology with ClusterEndpointsRefreshAgent. Sometimes, when the lambda is invoked during its initialization period, it will time out. When this happens, the refresh agent throws an error which suggests it is not properly handling/parsing the error response from the lambda service. There is no indication in the client logs of what the error actually was; determining that it was a timeout required looking into the CloudWatch log stream for the lambda proxy.

Software versions used

Expected behavior

  • Server-side lambda errors (of any type, not just the timeout error I observed) are reported/logged on the client
  • Timeout errors are retried

Actual behavior

  • The client fails to handle the response, and the observed error does not indicate the error reported by the server. This appears to be because the refresh agent does not check for an error response: it immediately tries to parse any response into a NeptuneClusterMetadata object without examining InvokeResult.statusCode or InvokeResult.functionError.
  • It does not look like the Neptune Gremlin client retry configuration would retry on a timeout. The lambda documentation suggests that the AWS SDK may retry this, so it may not be necessary for the Neptune Gremlin client to also retry. It was unclear from the client logs alone whether any retry occurred - the server-side CloudWatch logs did not show another request until 15s after the timed-out request ended.

It's not clear what error is actually being returned; from the documentation, a 408 RequestTimeoutException looks like a possibility.
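The missing check described above can be sketched as follows. This is a hedged illustration, not the actual refresh agent code: the `InvokeResult` class here is a simplified stand-in for the AWS SDK type of the same name, and `parseClusterMetadata` is a hypothetical placeholder for the point where the agent parses the payload into a NeptuneClusterMetadata object.

```java
public class RefreshSketch {

    // Simplified stand-in for the AWS SDK's InvokeResult (illustrative only)
    static class InvokeResult {
        final int statusCode;
        final String functionError; // set (e.g. "Unhandled") when the Lambda itself fails
        final String payload;

        InvokeResult(int statusCode, String functionError, String payload) {
            this.statusCode = statusCode;
            this.functionError = functionError;
            this.payload = payload;
        }
    }

    static String parseClusterMetadata(InvokeResult result) {
        // The check the refresh agent appears to be missing: a Lambda timeout
        // still comes back with HTTP 200, but with functionError set and an
        // error document in the payload.
        if (result.functionError != null) {
            throw new IllegalStateException(
                "Lambda proxy returned an error (" + result.functionError + "): " + result.payload);
        }
        if (result.statusCode != 200) {
            throw new IllegalStateException("Unexpected status code: " + result.statusCode);
        }
        // Only now is it safe to parse the payload as cluster metadata
        return result.payload;
    }

    public static void main(String[] args) {
        InvokeResult ok = new InvokeResult(200, null, "{\"endpoints\":[]}");
        System.out.println(parseClusterMetadata(ok));

        InvokeResult timedOut = new InvokeResult(200, "Unhandled",
            "{\"errorMessage\":\"Task timed out after 15.02 seconds\"}");
        try {
            parseClusterMetadata(timedOut);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a check like this, the client log would at least surface the server-side error message instead of a parse failure.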

@iansrobinson

iansrobinson commented Apr 15, 2024

Hi @derekguist

Thanks for reporting this. I've added a check in the client to catch the timeout and retry. (Timeouts still return 200 OK, so I had to check the functionError and payload for the details.) See 852843e
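The catch-and-retry behaviour described here can be sketched roughly as below. Again this is an illustration with simplified stand-in types, not the actual code in 852843e; the retry count and the `invokeWithRetry` helper are assumptions for the sake of the example.

```java
import java.util.function.Supplier;

public class RetrySketch {

    // Simplified stand-in for the AWS SDK's InvokeResult (illustrative only)
    static class InvokeResult {
        final String functionError;
        final String payload;

        InvokeResult(String functionError, String payload) {
            this.functionError = functionError;
            this.payload = payload;
        }
    }

    static String invokeWithRetry(Supplier<InvokeResult> invoke, int maxAttempts) {
        InvokeResult last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            last = invoke.get();
            // A timed-out invocation still returns 200 OK, so the only signal
            // is functionError plus the error details in the payload.
            if (last.functionError == null) {
                return last.payload;
            }
            System.out.println("Attempt " + attempt + " failed: " + last.payload + " - retrying");
        }
        throw new IllegalStateException(
            "Gave up after " + maxAttempts + " attempts: " + (last == null ? "no attempts made" : last.payload));
    }

    public static void main(String[] args) {
        // Simulate one timeout followed by a successful response
        int[] calls = {0};
        String result = invokeWithRetry(() -> {
            calls[0]++;
            return calls[0] == 1
                ? new InvokeResult("Unhandled", "{\"errorMessage\":\"Task timed out\"}")
                : new InvokeResult(null, "{\"endpoints\":[]}");
        }, 3);
        System.out.println(result);
    }
}
```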

I'd be interested to know what was causing the Lambda to timeout during its initialization period. Do you have any insights into what might be causing the timeout?

Thanks

ian

@derekguist
Author

@iansrobinson That's great to hear, thanks for the quick response.

Unfortunately I don't have any insight into the cause of the timeout - I assumed it may just be typical for a request that arrives while the lambda still needs to initialize before responding. I can say that when we first created the refresh lambda, we missed the timeout value (15s) in the recommended CloudFormation template and so ended up with the default lambda timeout of 3s. When we updated to the 15s timeout, we observed this error much less frequently (just three occurrences in the past several days), and this fix looks like it will eliminate the impact of even the few remaining timeouts.
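For anyone else hitting this: the timeout is the `Timeout` property on the function resource in the template. A fragment along these lines shows where it lives (the logical name and handler here are illustrative, not from the actual recommended template):

```yaml
NeptuneEndpointsLambda:            # illustrative logical name
  Type: AWS::Lambda::Function
  Properties:
    Runtime: java11                # matches the java:11 runtime in the INIT_START log below
    Timeout: 15                    # CloudFormation's default is 3 seconds - too short for a Java cold start
```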

Here is what we saw in our lambda logs in CloudWatch - not much detail:

2024-04-14T22:21:57.525Z f72b4258-adaa-4fea-8644-527d6142effe Task timed out after 15.02 seconds

And this was always associated with:

INIT_START Runtime Version: java:11.v36 Runtime Version ARN: arn:aws:lambda:xxx
