ClusterEndpointsRefreshAgent does not handle lambda errors #4

Open
derekguist opened this issue Apr 11, 2024 · 2 comments
Comments

@derekguist

I am using the lambda proxy to discover my cluster topology with ClusterEndpointsRefreshAgent. Sometimes, when the lambda is invoked during its initialization period, it will time out. When this happens, the refresh agent throws an error which suggests it is not properly handling/parsing the error response from the lambda service. There is no indication in the client logs of what the error actually was; determining that it was a timeout required looking into the CloudWatch log stream for the lambda proxy.

Software versions used

Expected behavior

  • Server-side lambda errors (of any type, not just the timeout error I observed) are reported/logged on the client
  • Timeout errors are retried

Actual behavior

  • The client fails to handle the response, and the observed error does not indicate the error reported by the server. This appears to be because the refresh agent does not check for an error response: it immediately tries to parse any response into a NeptuneClusterMetadata object without examining InvokeResult.statusCode or InvokeResult.functionError.
  • It does not look like the Neptune Gremlin client retry configuration would retry on a timeout. The lambda documentation suggests that the AWS SDK may retry this, so it may not be necessary for the Neptune Gremlin client to also retry. It was unclear from the client logs alone whether any retry occurred - the server-side CloudWatch logs did not show another request until 15s after the timed-out request ended.

It's not clear what error is actually being returned; from the documentation, a 408 RequestTimeoutException looks like a possibility.
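The missing check described above can be sketched as follows. This is a hedged illustration, not the actual refresh agent code: the `InvokeResult` class here is a simplified stand-in for the AWS SDK type of the same name, and `parseClusterMetadata` is a hypothetical placeholder for the point where the agent parses the payload into a NeptuneClusterMetadata object.

```java
public class RefreshSketch {

    // Simplified stand-in for the AWS SDK's InvokeResult (illustrative only)
    static class InvokeResult {
        final int statusCode;
        final String functionError; // set (e.g. "Unhandled") when the Lambda itself fails
        final String payload;

        InvokeResult(int statusCode, String functionError, String payload) {
            this.statusCode = statusCode;
            this.functionError = functionError;
            this.payload = payload;
        }
    }

    static String parseClusterMetadata(InvokeResult result) {
        // The check the refresh agent appears to be missing: a Lambda timeout
        // still comes back with HTTP 200, but with functionError set and an
        // error document in the payload.
        if (result.functionError != null) {
            throw new IllegalStateException(
                "Lambda proxy returned an error (" + result.functionError + "): " + result.payload);
        }
        if (result.statusCode != 200) {
            throw new IllegalStateException("Unexpected status code: " + result.statusCode);
        }
        // Only now is it safe to parse the payload as cluster metadata
        return result.payload;
    }

    public static void main(String[] args) {
        InvokeResult ok = new InvokeResult(200, null, "{\"endpoints\":[]}");
        System.out.println(parseClusterMetadata(ok));

        InvokeResult timedOut = new InvokeResult(200, "Unhandled",
            "{\"errorMessage\":\"Task timed out after 15.02 seconds\"}");
        try {
            parseClusterMetadata(timedOut);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a check like this, the client log would at least surface the server-side error message instead of a parse failure.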

@iansrobinson

iansrobinson commented Apr 15, 2024

Hi @derekguist

Thanks for reporting this. I've added a check in the client to catch the timeout and retry. (Timeouts still return 200 OK, so I had to check the functionError and payload for the details.) See 852843e
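The catch-and-retry behaviour described here can be sketched roughly as below. Again this is an illustration with simplified stand-in types, not the actual code in 852843e; the retry count and the `invokeWithRetry` helper are assumptions for the sake of the example.

```java
import java.util.function.Supplier;

public class RetrySketch {

    // Simplified stand-in for the AWS SDK's InvokeResult (illustrative only)
    static class InvokeResult {
        final String functionError;
        final String payload;

        InvokeResult(String functionError, String payload) {
            this.functionError = functionError;
            this.payload = payload;
        }
    }

    static String invokeWithRetry(Supplier<InvokeResult> invoke, int maxAttempts) {
        InvokeResult last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            last = invoke.get();
            // A timed-out invocation still returns 200 OK, so the only signal
            // is functionError plus the error details in the payload.
            if (last.functionError == null) {
                return last.payload;
            }
            System.out.println("Attempt " + attempt + " failed: " + last.payload + " - retrying");
        }
        throw new IllegalStateException(
            "Gave up after " + maxAttempts + " attempts: " + (last == null ? "no attempts made" : last.payload));
    }

    public static void main(String[] args) {
        // Simulate one timeout followed by a successful response
        int[] calls = {0};
        String result = invokeWithRetry(() -> {
            calls[0]++;
            return calls[0] == 1
                ? new InvokeResult("Unhandled", "{\"errorMessage\":\"Task timed out\"}")
                : new InvokeResult(null, "{\"endpoints\":[]}");
        }, 3);
        System.out.println(result);
    }
}
```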

I'd be interested to know what was causing the Lambda to timeout during its initialization period. Do you have any insights into what might be causing the timeout?

Thanks

ian

@derekguist
Author

@iansrobinson That's great to hear, thanks for the quick response.

Unfortunately I don't have any insight into the cause of the timeout - I assumed it may just be typical for a request that arrives while the lambda still needs to initialize before responding. I can say that when we first created the refresh lambda, we missed the timeout value (15s) in the recommended CloudFormation template and so ended up with the default lambda timeout of 3s. When we updated to the 15s timeout, we observed this error much less frequently (just three occurrences in the past several days), and this fix looks like it will eliminate the impact of even the few remaining timeouts.
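For anyone else hitting this: the timeout is the `Timeout` property on the function resource in the template. A fragment along these lines shows where it lives (the logical name and handler here are illustrative, not from the actual recommended template):

```yaml
NeptuneEndpointsLambda:            # illustrative logical name
  Type: AWS::Lambda::Function
  Properties:
    Runtime: java11                # matches the java:11 runtime in the INIT_START log below
    Timeout: 15                    # CloudFormation's default is 3 seconds - too short for a Java cold start
```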

Here is what we saw in our lambda logs in CloudWatch - not much detail:

2024-04-14T22:21:57.525Z f72b4258-adaa-4fea-8644-527d6142effe Task timed out after 15.02 seconds

And this was always associated with:

INIT_START Runtime Version: java:11.v36 Runtime Version ARN: arn:aws:lambda:xxx
