
fix(assetlibrary): change connection life cycle between Lambda and Neptune #168

Merged 1 commit into aws:main on Oct 23, 2023

Conversation

@yuma124 (Contributor) commented on Oct 6, 2023

Description

Modified the Asset Library Lambda function to use a single connection between the Lambda function and Neptune throughout the Lambda lifecycle.

According to Neptune’s documentation, when accessing Neptune from Lambda it is recommended to create one connection for the lifetime of the Lambda execution context and reuse it across invocations.
Opening and closing a Neptune connection for every query causes performance problems under high throughput.
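
A minimal sketch of this pattern, assuming the names quoted later in this review (neptuneUrl, _conn, and the connection options appear in the diff excerpts below; the class name is illustrative):

import { driver } from 'gremlin';

// Sketch of the connection-reuse pattern: the DAO caches one
// DriverRemoteConnection at instance scope so warm Lambda invocations
// reuse it instead of opening a new connection per query.
class BaseDaoFull {
    private _conn: driver.DriverRemoteConnection | null = null;

    constructor(private readonly neptuneUrl: string) {}

    protected getConnection(): driver.DriverRemoteConnection {
        if (this._conn === null) {
            this._conn = new driver.DriverRemoteConnection(this.neptuneUrl, {
                mimeType: 'application/vnd.gremlin-v2.0+json',
                pingEnabled: false,
                connectOnStartup: false,
            });
        }
        return this._conn;
    }
}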

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (existing code being refactored)
  • This change includes a documentation update

Submission Checklist

  • CI dry-run passing
  • Integration tests passing locally
  • Change logs generated
Additional Notes

It would be nice to have a mechanism that retries when an error occurs. Currently, AssetLibrary users need to inspect the details of the error and decide for themselves whether it can be retried; a sketch of what such a helper might look like follows.
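
A hypothetical sketch of such a retry helper; nothing like this exists in the PR, and withRetry, isRetryable, and the backoff parameters are all illustrative:

// Hypothetical retry helper, not part of this PR.
async function withRetry<T>(op: () => Promise<T>, maxAttempts = 3, baseDelayMs = 100): Promise<T> {
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await op();
        } catch (err) {
            lastError = err;
            if (!isRetryable(err) || attempt === maxAttempts) {
                throw err;
            }
            // Exponential backoff before the next attempt.
            await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)));
        }
    }
    throw lastError;
}

// Callers would decide here which errors are transient; today AssetLibrary
// users have to make this judgment themselves from the error details.
function isRetryable(err: unknown): boolean {
    return err instanceof Error && /1006|ECONNRESET|timeout/i.test(err.message);
}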

@yuma124 force-pushed the fix_assetlibrary_connection_lifecycle branch from 4c0d1c3 to 2b32125 on October 6, 2023 at 08:43
@ts-amz (Contributor) commented on Oct 6, 2023

Thank you for the PR. We are aware of the issue and will review it.

@canavandl (Contributor) left a comment

I deployed this change and successfully ran the existing integ tests. That, plus a passing dry run (unit tests + lint), suggests that this change works as intended.

logger.debug(`base.full.dao getConnection: create new connection:`);
this._conn = new driver.DriverRemoteConnection(this.neptuneUrl, {
    mimeType: 'application/vnd.gremlin-v2.0+json',
    pingEnabled: false,
Contributor:

I am trying to find further documentation on this, but we should enable ping/pong messages if possible; that is a common practice for validating that a connection is still active.

@canavandl (Contributor) commented on Oct 23, 2023:

Since the AssetLibrary service targets Lambda rather than a long-running instance, I'd argue that we want to maintain the existing pingEnabled: false behavior. Note that the ping interval appears to be an hour (https://github.com/apache/tinkerpop/blob/master/gremlin-javascript/src/main/javascript/gremlin-javascript/lib/driver/connection.js#L46C27-L46C36).

Contributor:

Good point. I would usually err on the side of caution, since the timeouts are configurable and could conceivably be set to 15 minutes, but that's probably not necessary here.

    pingEnabled: false,
    connectOnStartup: false,
});
this._conn.addListener('close', (code: number, message: string) => {
Contributor:

I notice there's only a "close" event handled here. As far as I can tell from the docs, the driver uses WebSockets (though they don't link to it, so I'm not 100% sure). That library also emits an "error" event that we should account for, unless GremlinJS handles that event internally and doesn't re-emit it. Are you able to confirm or deny whether this is the case? A sketch of what handling both events might look like is below.
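
For illustration only, assuming the driver re-emits the underlying websocket "error" event (the open question above); attachConnectionListeners and onConnectionLost are illustrative names:

import { driver } from 'gremlin';

// Hypothetical sketch: attach both 'close' and 'error' listeners so that
// either event invalidates the cached connection.
function attachConnectionListeners(
    conn: driver.DriverRemoteConnection,
    onConnectionLost: () => void
): void {
    conn.addListener('close', (code: number, message: string) => {
        console.info(`connection close: code: ${code}, message: ${message}`);
        onConnectionLost();
    });
    conn.addListener('error', (err: Error) => {
        console.error(`connection error: ${err.message}`);
        onConnectionLost(); // drop the cached connection so the next call reconnects
    });
}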

this._conn.addListener('close', (code: number, message: string) => {
    logger.info(`base.full.dao connection close: code: ${code}, message: ${message}`);
    this._conn = null;
    if (code === 1006) {
Contributor:

I'm curious why we call out this specific error and none of the others? https://www.rfc-editor.org/rfc/rfc6455#section-7.4.1

Contributor:

(Good reference, btw.) My reading of the close codes is that 1006 is the only one that seems unrecoverable (i.e., the Gremlin server shut down for abnormal reasons).

@yuma124 (Author):

Other close codes indicate outcomes such as the user intentionally closing the connection, but 1006 occurs on an unintended disconnection. Even when this happens it is not handled by gremlin, and the call ends up waiting until the Lambda times out. Therefore an error has to be thrown in this case; assembled from the fragments quoted above, the handler presumably looks like the sketch below.
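
A self-contained sketch assembled from the fragments quoted in this review; the endpoint URL and error message are illustrative:

import { driver } from 'gremlin';

// On an abnormal close (code 1006) the cached connection is dropped and an
// error is thrown so the invocation fails fast instead of hanging until the
// Lambda timeout.
let conn: driver.DriverRemoteConnection | null = new driver.DriverRemoteConnection(
    'wss://my-neptune-endpoint:8182/gremlin',
    { pingEnabled: false, connectOnStartup: false }
);

conn.addListener('close', (code: number, message: string) => {
    console.info(`connection close: code: ${code}, message: ${message}`);
    conn = null; // force the next query to open a fresh connection
    if (code === 1006) {
        // 1006 = abnormal closure (RFC 6455, section 7.4.1)
        throw new Error(`connection closed abnormally: ${message}`);
    }
});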

@aaronatbissell (Contributor) commented on Jan 24, 2024:

Sorry to comment on a closed pull request, but after upgrading our non-prod environment, I'm seeing a lot of these 1006 error codes. It's causing 500 errors in our rest-api:

[screenshot omitted]

Any thoughts as to why? Is this something we can add better recovery for?

Contributor:

@aaronatbissell sorry that this is happening. Can you share more info about what you're seeing? Error 1006 is an underlying websocket issue that cannot be recovered from. As @yuma124 mentions, an error is thrown to fail faster, but ideally this should not be a common issue.

  • Does it fail 100% of the time, or is it inconsistent?
  • Can you add logging to discover which calls cause the error?
  • It may be possible to get more info on where it's stuck by commenting out the throw so it hangs.
  • Are you seeing any errors from Neptune?
  • On the rest API side, you should catch this and retry, which might also help identify which call is triggering it.
  • It might also be helpful to get any information off the socket connection that tells you its state (e.g. you should be able to get the current number of connections: https://stackoverflow.com/questions/10275667/socket-io-connected-user-count).

@yuma124 (Author):

Yes, it appears to be the same error as described in #178.

Contributor:

@aaronatbissell have you used the fix in #178 and does it solve this connection issue?

Contributor:

I'm just getting back to this. We haven't tried to fix this problem, but we would really like to see it fixed so we can upgrade the Asset Library.

Contributor:

@aaronatbissell have you tried the update in #178 and validated whether or not it fixes the issues you are seeing? If not, are you able to test it?

Contributor:

I haven't tested #178, but as I commented in that PR, I think a better solution would be to open the Neptune connection outside of the Lambda handler.

As far as testing goes, I don't plan on getting around to testing this any time soon... sorry!

@canavandl merged commit 6f0d69a into aws:main on Oct 23, 2023
@canavandl (Contributor):
Thanks for the PR @yuma124 !

@yuma124 (Author) commented on Oct 30, 2023:

Thank you for your reviews.
I'm sorry; I was down with a cold all of last week and couldn't respond to the pull request review.
