
fix(assetlibrary): change connection life cycle between Lambda and Neptune #168

Merged 1 commit into aws:main on Oct 23, 2023

Conversation

@yuma124 (Contributor) commented on Oct 6, 2023

Description

Modified the Asset Library Lambda function to use a single connection between the Lambda function and Neptune throughout the Lambda lifecycle.

According to Neptune’s documentation, when accessing Neptune from Lambda it is recommended to create one connection for the lifetime of the Lambda execution context and reuse it across invocations.
Opening and closing a Neptune connection for every query causes performance problems under high throughput.
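
A minimal sketch of this pattern, assuming the names quoted later in this review (neptuneUrl, _conn, and the connection options appear in the diff excerpts below; the class name is illustrative):

import { driver } from 'gremlin';

// Sketch of the connection-reuse pattern: the DAO caches one
// DriverRemoteConnection at instance scope so warm Lambda invocations
// reuse it instead of opening a new connection per query.
class BaseDaoFull {
    private _conn: driver.DriverRemoteConnection | null = null;

    constructor(private readonly neptuneUrl: string) {}

    protected getConnection(): driver.DriverRemoteConnection {
        if (this._conn === null) {
            this._conn = new driver.DriverRemoteConnection(this.neptuneUrl, {
                mimeType: 'application/vnd.gremlin-v2.0+json',
                pingEnabled: false,
                connectOnStartup: false,
            });
        }
        return this._conn;
    }
}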

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactor (existing code being refactored)
  • This change includes a documentation update

Submission Checklist

  • CI dry-run passing
  • Integration tests passing locally
  • Change logs generated
Additional Notes

It would be nice to have a mechanism that retries when an error occurs. Currently, AssetLibrary users need to inspect the details of the error and decide for themselves whether it can be retried; a sketch of what such a helper might look like follows.
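
A hypothetical sketch of such a retry helper; nothing like this exists in the PR, and withRetry, isRetryable, and the backoff parameters are all illustrative:

// Hypothetical retry helper, not part of this PR.
async function withRetry<T>(op: () => Promise<T>, maxAttempts = 3, baseDelayMs = 100): Promise<T> {
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await op();
        } catch (err) {
            lastError = err;
            if (!isRetryable(err) || attempt === maxAttempts) {
                throw err;
            }
            // Exponential backoff before the next attempt.
            await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)));
        }
    }
    throw lastError;
}

// Callers would decide here which errors are transient; today AssetLibrary
// users have to make this judgment themselves from the error details.
function isRetryable(err: unknown): boolean {
    return err instanceof Error && /1006|ECONNRESET|timeout/i.test(err.message);
}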

@yuma124 force-pushed the fix_assetlibrary_connection_lifecycle branch from 4c0d1c3 to 2b32125 on October 6, 2023 at 08:43
@ts-amz (Contributor) commented on Oct 6, 2023

Thank you for the PR. We are aware of the issue and will review it.

@canavandl (Contributor) left a comment

I deployed this change and successfully ran the existing integ tests. That, plus a passing dry run (unit tests + lint), suggests that this change works as intended.

logger.debug(`base.full.dao getConnection: create new connection:`);
this._conn = new driver.DriverRemoteConnection(this.neptuneUrl, {
    mimeType: 'application/vnd.gremlin-v2.0+json',
    pingEnabled: false,
Contributor:

I am trying to find further documentation on this, but we should enable ping/pong messages if possible; that is a common practice for validating that a connection is still active.

@canavandl (Contributor) commented on Oct 23, 2023:

Since the AssetLibrary service targets Lambda rather than a long-running instance, I'd argue that we want to maintain the existing pingEnabled: false behavior. Note that the ping interval appears to be an hour (https://github.com/apache/tinkerpop/blob/master/gremlin-javascript/src/main/javascript/gremlin-javascript/lib/driver/connection.js#L46C27-L46C36).

Contributor:

Good point. I would usually err on the side of caution, since the timeouts are configurable and could conceivably be set to 15 minutes, but that's probably not necessary here.

    pingEnabled: false,
    connectOnStartup: false,
});
this._conn.addListener('close', (code: number, message: string) => {
Contributor:

I notice there's only a "close" event handled here. As far as I can tell from the docs, the driver uses WebSockets (though they don't link to it, so I'm not 100% sure). That library also emits an "error" event that we should account for, unless GremlinJS handles that event internally and doesn't re-emit it. Are you able to confirm or deny whether this is the case? A sketch of what handling both events might look like is below.
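
For illustration only, assuming the driver re-emits the underlying websocket "error" event (the open question above); attachConnectionListeners and onConnectionLost are illustrative names:

import { driver } from 'gremlin';

// Hypothetical sketch: attach both 'close' and 'error' listeners so that
// either event invalidates the cached connection.
function attachConnectionListeners(
    conn: driver.DriverRemoteConnection,
    onConnectionLost: () => void
): void {
    conn.addListener('close', (code: number, message: string) => {
        console.info(`connection close: code: ${code}, message: ${message}`);
        onConnectionLost();
    });
    conn.addListener('error', (err: Error) => {
        console.error(`connection error: ${err.message}`);
        onConnectionLost(); // drop the cached connection so the next call reconnects
    });
}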

this._conn.addListener('close', (code: number, message: string) => {
    logger.info(`base.full.dao connection close: code: ${code}, message: ${message}`);
    this._conn = null;
    if (code === 1006) {
Contributor:

I'm curious why we call out this specific error and none of the others? https://www.rfc-editor.org/rfc/rfc6455#section-7.4.1

Contributor:

(Good reference, btw.) My reading of the close codes is that 1006 is the only one that seems unrecoverable (i.e., the Gremlin server shut down for abnormal reasons).

@yuma124 (Author):

Other close codes indicate outcomes such as the user intentionally closing the connection, but 1006 occurs on an unintended disconnection. Even when this happens it is not handled by gremlin, and the call ends up waiting until the Lambda times out. Therefore an error has to be thrown in this case; assembled from the fragments quoted above, the handler presumably looks like the sketch below.
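
A self-contained sketch assembled from the fragments quoted in this review; the endpoint URL and error message are illustrative:

import { driver } from 'gremlin';

// On an abnormal close (code 1006) the cached connection is dropped and an
// error is thrown so the invocation fails fast instead of hanging until the
// Lambda timeout.
let conn: driver.DriverRemoteConnection | null = new driver.DriverRemoteConnection(
    'wss://my-neptune-endpoint:8182/gremlin',
    { pingEnabled: false, connectOnStartup: false }
);

conn.addListener('close', (code: number, message: string) => {
    console.info(`connection close: code: ${code}, message: ${message}`);
    conn = null; // force the next query to open a fresh connection
    if (code === 1006) {
        // 1006 = abnormal closure (RFC 6455, section 7.4.1)
        throw new Error(`connection closed abnormally: ${message}`);
    }
});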

@aaronatbissell (Contributor) commented on Jan 24, 2024:

Sorry to comment on a closed pull request, but after upgrading our non-prod environment, I'm seeing a lot of these 1006 error codes. It's causing 500 errors in our rest-api:

[screenshot omitted]

Any thoughts as to why? Is this something we can add better recovery for?

Contributor:

@aaronatbissell sorry that this is happening. Can you share more info about what you're seeing? Error 1006 is an underlying websocket issue that cannot be recovered from. As @yuma124 mentions, an error is thrown to fail faster, but ideally this should not be a common issue.

  • Does it fail 100% of the time, or is it inconsistent?
  • Can you add logging to discover which calls cause the error?
  • It may be possible to get more info on where it's stuck by commenting out the throw so it hangs.
  • Are you seeing any errors from Neptune?
  • On the rest API side, you should catch this and retry, which might also help identify which call is triggering it.
  • It might also be helpful to get any information off the socket connection that tells you its state (e.g. you should be able to get the current number of connections: https://stackoverflow.com/questions/10275667/socket-io-connected-user-count).

@yuma124 (Author):

Yes, it appears to be the same error as described in #178.

Contributor:

@aaronatbissell have you used the fix in #178 and does it solve this connection issue?

Contributor:

I'm just getting back to this. We haven't tried to fix this problem, but we would really like to see it fixed so we can upgrade the Asset Library.

Contributor:

@aaronatbissell have you tried the update in #178 and validated whether or not it fixes the issues you are seeing? If not, are you able to test it?

Contributor:

I haven't tested #178, but as I commented in that PR, I think a better solution would be to open the Neptune connection outside of the Lambda handler.

As far as testing goes, I don't plan on getting around to testing this any time soon... sorry!

@canavandl merged commit 6f0d69a into aws:main on Oct 23, 2023
@canavandl (Contributor):
Thanks for the PR @yuma124 !

@yuma124 (Author) commented on Oct 30, 2023:

Thank you for your reviews.
I'm sorry; I was down with a cold all of last week and couldn't respond to the pull request review.
