-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(assetlibrary): change connection life cycle between Lambda and Neptune #168
fix(assetlibrary): change connection life cycle between Lambda and Neptune #168
Conversation
4c0d1c3
to
2b32125
Compare
Thank you for the PR. We are aware of the issue and will review it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I deployed this change and successfully ran the existing integ tests. That plus a passing dryrun (unit tests + lint) suggests that this change works as intended
logger.debug(`base.full.dao getConnection: create new connection:`); | ||
this._conn = new driver.DriverRemoteConnection(this.neptuneUrl, { | ||
mimeType: 'application/vnd.gremlin-v2.0+json', | ||
pingEnabled: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to find further documentation on this. Though, we should enable ping/pong messages if this is possible. This is a common practice for validating that a connection is still active.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the AssetLibrary service is targeting Lambda and not a long-running instance, I'd argue that we want to maintain the existing pingEnabled: false
behavior. Note that the the ping
interval appears to be an hour (https://github.com/apache/tinkerpop/blob/master/gremlin-javascript/src/main/javascript/gremlin-javascript/lib/driver/connection.js#L46C27-L46C36)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I would usually err on the side of caution, since the timeouts are configurable and could conceivably be set to 15 minutes, but that's probably not necessary here.
pingEnabled: false, | ||
connectOnStartup: false, | ||
}); | ||
this._conn.addListener('close', (code: number, message: string) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice there's only a "close" event here. As far as I can tell from the docs is that it's using WebSockets (though they don't link to it, so I'm not 100%). That lib also emits an "error" event that we should account for. Unless, of course, GremlinJS is handling that event somehow and doesn't emit it. Are you able to confirm/deny whether or not this is the case?
this._conn.addListener('close', (code: number, message: string) => { | ||
logger.info(`base.full.dao connection close: code: ${code}, message: ${message}`); | ||
this._conn = null; | ||
if (code === 1006) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious why we call out this specific error and none of the others? https://www.rfc-editor.org/rfc/rfc6455#section-7.4.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(good reference btw) my reading of the error codes is that 1006 is the only error that seems unrecoverable (i.e. the Gremlin server shut down for abnormal reasons).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other close codes indicate results like the user intentionally closing the connection, but 1006 is a code that occurs due to an unintended disconnection. Even when this happens, it's not handled by gremlin, and it ends up waiting until the Lambda times out. Therefore, an error has to be thrown in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aaronatbissell sorry that this is happening. Can you share more info about what you're seeing? Error 1006 is an underlying websocket issue that cannot be recovered from. As @yuma124 mentions, an error is thrown to fail this faster, but ideally, this should not be a common issue.
- Does it fail 100% of the time or is it inconsistent?
- Can you add logging to discover which calls cause the error?
- It may be possible to get more info on where it's stuck by commenting out the
throw
so it hangs. - Are you seeing any errors from Neptune?
- On the rest API side, you should catch this and retry it, which also might help identifying which call is triggering it.
Also, it might be helpful to get any information off the socket connection that might tell you the state of it (e.g. you should be able to get the current number of connections https://stackoverflow.com/questions/10275667/socket-io-connected-user-count ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it appears to be the same error as described in #178.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aaronatbissell have you used the fix in #178 and does it solve this connection issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just getting back to this. We haven't tried to fix this problem but would really like to see this fixed so we can upgrade asset library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aaronatbissell have you tried the update in #178 and validated whether or not it fixes the issues you are seeing? If not, are you able to test it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @yuma124 ! |
Thank you for your reviews. |
Description
Modified Asset Library Lambda function to use a single connection between Lambda function and Neptune throughout the Lambda lifecycle.
According to Neptune’s documentation, when accessing Neptune from Lambda, it is recommended to create one connection through the Lambda lifecycle and reuse it between invocations.
Opening and closing a Neptune connection per query causes performance issues in a state of high throughput.
Type of change
Submission Checklist
Additional Notes:
I think it would be nice to have a mechanism to retry when there is an error. Now AssetLibrary users need to check the details of the error and decide if it can retry or not.