Node crashes/exits ungracefully on uncaught exceptions/events relating to QUIC connections or QUIC streams #198

tegefaulkes · 2024-06-11T04:46:49Z

Describe the bug

The node seems to be randomly crashing with ErrorQUICConnectionIdleTimeout.

INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection f4ba6d8b49c528090fee5cab01c73326e6b46e58.QUICStream 92:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection f4ba6d8b49c528090fee5cab01c73326e6b46e58.QUICStream 92:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection f4ba6d8b49c528090fee5cab01c73326e6b46e58.QUICStream 96:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection f4ba6d8b49c528090fee5cab01c73326e6b46e58.QUICStream 96:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handling stream with method (nodesConnectionSignalInitial)
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05.QUICStream 144:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05.QUICStream 144:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handled stream with method (nodesConnectionSignalInitial)
INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection f4ba6d8b49c528090fee5cab01c73326e6b46e58.QUICStream 96:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection f4ba6d8b49c528090fee5cab01c73326e6b46e58.QUICStream 96:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient:ErrorQUICConnectionIdleTimeout: QUIC Connection max idle timeout exhausted - 
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05:ErrorQUICConnectionIdleTimeout: QUIC Connection max idle timeout exhausted - 
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05:Stop QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnection [v2kgfdi6dutt0mmv78q58la30mafige86igvpch0dbtnfmb8l70sg@::ffff:159.196.126.185:51410]:Destroying NodeConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient:Destroy QUICClient to [::ffff:159.196.126.185]:51410
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05.QUICStream 144:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05:ErrorQUICConnectionIdleTimeout: QUIC Connection max idle timeout exhausted - 
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05:ErrorQUICConnectionIdleTimeout: QUIC Connection max idle timeout exhausted - 
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05.QUICStream 144:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05:Stopped QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient:Destroyed QUICClient to [::ffff:159.196.126.185]:51410
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnection [v2kgfdi6dutt0mmv78q58la30mafige86igvpch0dbtnfmb8l70sg@::ffff:159.196.126.185:51410]:Destroyed NodeConnection
ErrorQUICConnectionIdleTimeout

cmcdragonkai ➜ matrix-framework-13-ryzen-7040  ➜ ~/Projects/Polykey-CLI
 $

It also can produce - any of these things are being leaked up to the process for it to do an ungraceful exit/crash:

ErrorQUICConnectionIdleTimeout
ErrorNodeConnectionTransportGenericError
TypeError: Invalid state: WritableStream is closed

To Reproduce

Run a Polykey agent.
It will crash on its own after a while.

Expected behavior

Should not crash with ErrorQUICConnectionIdleTimeout

Screenshots

Platform (please complete the following information)

Version: Polykey-CLI 0.4.1.

Additional context

This is related to Polykey-CLI#185 because it appears that a node that is leaking a connection will fail to shutdown gracefully, and can then propagate a problem to a remote node too, causing it to shutdown ungracefully.
Sometimes, the node will also crash with ErrorNodeConnectionTransportGenericError.
The agent will sometimes also fail with ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message. (source: Polykey-CLI#198 (comment))
Create tests for PolykeyAgent stability during random agent-agent interactions #8

Notify maintainers

@tegefaulkes

The text was updated successfully, but these errors were encountered:

linear · 2024-06-11T04:46:55Z

ENG-334 Node crashing with `ErrorQUICConnectionIdleTimeout`

CMCDragonkai · 2024-08-23T03:40:37Z

I've seen this twice now.

It particularly happens when there's a failure on a QUIC connection from the other remote side node. Like if they force close their node, our node then fails.

INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [79.103.253.41:44353].QUICClient:Destroyed QUICClient to [::ffff:79.103.253.41]:44353
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnection [veu6q4pppbeoumv9n3uj819t6c5thr8kb9jshpfco85l4cvst4c6g@::ffff:79.103.253.41:44353]:Destroyed NodeConnection
ErrorQUICConnectionIdleTimeout

Twice when Alexi who is in Greece from the IP:

79.103.253.41 	N/A     	veu6q4pppbeoumv9n3uj819t6c5thr8kb9jshpfco85l4cvst4c6g	58143	55780  	0

Who tried to clone my vault with polykey vaults clone resulted in ErrorRPCTimeout. This is probably due to a slow clone or something.

Then he had to force shutdown his agent, because graceful agent stop was not possible. This is likely due to QUIC again as there are possibly open fds or open connections that hold the process open.

Finally after force closing their agent, and waiting for some time… my agent ended up failing.

You can see the prior connection is exactly the NodeID of the node that was force closed from Greece.

So this error is connected to:

Possibly a slow clone causing an ErrorRPCTimeout, which left a connection IO continue to run without being cancelled.
This "leaked" connection IO is probably causing the agent to not shutdown gracefully. We know this because nodejs will not shutdown when there's IO/fds that are kept open. Otherwise nodejs always shutsdown when there is not an infinite loop.
Then a leaked connection remotely seems to create a problem on the other side, thus causing a ErrorQUICConnectionIdleTimeout. This appears to be an uncaught exception, and results in the entire agent to ungracefully shutdown.

CMCDragonkai · 2024-08-23T03:41:42Z

On the prior attempt:

Cloning on Windows results in ErrorRPCTimeout, everything else worked

Windows agent had to be force closed, it looks like IO is keeping the process open.

The nodes connections:

79.103.253.41 N/A veu6q4pppbeoumv9n3uj819t6c5thr8kb9jshpfco85l4cvst4c6g 33226 52481 0

->

79.103.253.41 N/A veu6q4pppbeoumv9n3uj819t6c5thr8kb9jshpfco85l4cvst4c6g 33226 56108 0
79.103.253.41 N/A veu6q4pppbeoumv9n3uj819t6c5thr8kb9jshpfco85l4cvst4c6g 33226 55531 0
79.103.253.41 N/A veu6q4pppbeoumv9n3uj819t6c5thr8kb9jshpfco85l4cvst4c6g 33226 55695 0

Then upon restarting the agent, and reconnecting to my Linux agent, the vaults clone still failed with ErrorRPCTimeout.

And then, a few minutes later, I get this error on my Linux the agent is now crashed and no longer running:

ErrorNodeConnectionTransportGenericError: Transport received a generic error

What's interesting, is that I find that sometimes… even this is also an uncaught exception:

ErrorNodeConnectionTransportGenericError: Transport received a generic error

This also causes the agent to ungracefully exit. So the first time what I described happen, this was the exception. But the second time, was the ErrorQUICConnectionIdleTimeout.

CMCDragonkai · 2024-08-23T03:42:09Z

I'm going to rename this issue, given that there is now evidence of at least 2 different errors that is being uncaught and causing ungraceful exit.

CMCDragonkai · 2024-08-23T03:45:40Z

This is connected to #185, because that is likely caused by the one side leaking connections. When this occurs, it's a resource/IO/fd leak. And that will cause the graceful shutdown to halt forever.

tegefaulkes · 2024-08-23T04:32:15Z

ErrorQUICConnectionIdleTimeout is the expected error here and should be handled just fine. When an connection is orphaned by one side of the connection then it is cleaned up by the idle timeout.

The ErrorNodeConnectionTransportGenericError error is very likely the problem here. So we'll need to trace why that's thrown and what's causing it exactly.

CMCDragonkai · 2024-08-23T05:51:43Z

Pull error handling always uses standard exception handlers to deal with the problem. This we understand.

Push error handling looks underimplemented. It relies on object-system context error handlers. We need to make sure that QUIC errors are being "captured" by the nodes domain fully so that it doesn't just bubble to the process context error handlers which is likely being defined in the ExitHandler.

This is because we're still working out the architecture of push-flow in MatrixAI/Polykey#444.

The fact that these exceptions are all bubbling up to the process means that errors are not properly handled within NodeConnectionManager.

aryanjassal · 2024-11-20T23:13:00Z

I had left my PC running for over 2 days with Polykey agent running in the background, and I got this error after about 5 hours of running.

ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message

aryanjassal · 2024-11-21T01:59:32Z

I just observed this exact error when I ran this command.

$ npm run polykey -- secrets mkdir vault:aslkfj/a

> [email protected] polykey
> ts-node src/polykey.ts secrets mkdir vault:aslkfj/a

mkdir: cannot create directory aslkfj/a: No such file or directory
ErrorPolykeyCLIMakeDirectory: Failed to create one or more directories - Failed to create one or more directories

This command took about 10 seconds, which is way too long for this, and then the agent crashed with the 0-length message error. I tried this and got a RPC timeout error.

$ npm run polykey -- secrets mkdir vault:aslkfj/asf/asd/fa

> [email protected] polykey
> ts-node src/polykey.ts secrets mkdir vault:aslkfj/asf/asd/fa

ErrorPolykeyCLIUnexpectedError: An unexpected error occured - Thrown 'ErrorRPCTimedOut'
  cause: ErrorRPCTimedOut: RPC has timed out

However, this also seems inconsistent, as trying to replicate this issue again yielded the command taking a while, but succeeding in the end without the agent crashing with QUIC.

CMCDragonkai · 2024-12-05T23:32:21Z

This is now happening immediately upon startup.

I also sometimes get TypeError: Invalid state: WritableStream is closed.

CMCDragonkai · 2024-12-05T23:34:48Z

Upon starting a new agent in a new node path, within a few seconds of starting up, there will be a MASS amount of logs coming from the node connection manager.

Then one of the 2 things happen which causes immediate ungraceful crash:

c0cd44c5f45ce48.QUICStream 57:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:4aa5:9c9d:57f5:a649:1314].QUICClient.QUICConnection a0596496781aabfc4645e86b8c0cd44c5f45ce48.QUICStream 57:Destroyed QUICStream
TypeError: Invalid state: WritableStream is closed

Or:

INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:4aa5:9c9d:57f5:a649:1314].QUICClient.QUICConnection c7b36c81341015360b2e52fad3f36c5cbac553b4.QUICStream 24:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:4aa5:9c9d:57f5:a649:1314].QUICClient.QUICConnection c7b36c81341015360b2e52fad3f36c5cbac553b4:Stopped QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:4aa5:9c9d:57f5:a649:1314].QUICClient:Destroyed QUICClient to [2600:1f16:1f71:7c00:4aa5:9c9d:57f5:a649]:1314
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnection [v6p14qcvvftnnscuavsehu37t22vfvnhse054pbkb3ehemmjsrdh0@2600:1f16:1f71:7c00:4aa5:9c9d:57f5:a649:1314]:Destroyed NodeConnection
ErrorQUICConnectionIdleTimeout

CMCDragonkai · 2024-12-05T23:36:05Z

To replicate all you need to do:

# check out the latest staging
npm run build
./dist/polykey.js agent start --verbose --node-path ./tmp/x
# just wait

It will definitely fail.

CMCDragonkai · 2024-12-05T23:38:28Z

I have another problem and it's a regression that's been created from recent changes to PK CLI too.

When the agent has crashed, and it is ungraceful, it leaves the status.json saying its LIVE.

When you try to run agent status, this isn't able to realize this and it results in:

ERROR:polykey.PolykeyClient.WebSocketClient:ErrorWebSocketConnectionLocal: WebSocket Connection local error - WebSocket could not open due to internal error
ERROR:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0:ErrorWebSocketConnectionLocal: WebSocket Connection local error - WebSocket could not open due to internal error
ErrorPolykeyCLIUnexpectedError: An unexpected error occured - Thrown 'ErrorWebSocketConnectionLocal'
  cause: ErrorWebSocketConnectionLocal: WebSocket could not open due to internal error

Commands should be aware when the agent isn't live and properly report on this.

CMCDragonkai · 2024-12-05T23:43:22Z

I can corroborate this with an initial installation on MacOS.

tegefaulkes · 2024-12-05T23:44:09Z

To replicate all you need to do:
# check out the latest staging
npm run build
./dist/polykey.js agent start --verbose --node-path ./tmp/x
# just wait
It will definitely fail.

I just did this, on my dev laptop. It's working fine.

CMCDragonkai · 2024-12-05T23:45:33Z

Both myself and a friend with an initial installation on MacOS also fails too. Even if you don't see it now, you will see it eventually.

CMCDragonkai · 2024-12-05T23:47:53Z

Again I don't think this has anything to do with the node state. As this is the current temporary node state I created: node.tar.gz.

After several agent start attempts... eventually it doesn't immediately break. But it happens enough times for this to be repeatable.

The event emitted for ErrorQUICConnectionIdleTimeout is leaking through the process.

tegefaulkes · 2024-12-05T23:49:49Z

If its failing on a fresh node then no, it shouldn't be due to state. I'll dig through the js-quic source code. But without being able to reproduce on my end it will take time to find.

CMCDragonkai · 2024-12-05T23:50:21Z

I had left my PC running for over 2 days with Polykey agent running in the background, and I got this error after about 5 hours of running.

ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message

Is this related to MatrixAI/js-rpc#74?

CMCDragonkai · 2024-12-05T23:50:57Z

If its failing on a fresh node then no, it shouldn't be due to state. I'll dig through the js-quic source code. But without being able to reproduce on my end it will take time to find.

It's always obvious when it's about to happen is when there's a LARGE amount of logs.

CMCDragonkai · 2024-12-05T23:51:27Z

Pull error handling always uses standard exception handlers to deal with the problem. This we understand.

Push error handling looks underimplemented. It relies on object-system context error handlers. We need to make sure that QUIC errors are being "captured" by the nodes domain fully so that it doesn't just bubble to the process context error handlers which is likely being defined in the ExitHandler.

This is because we're still working out the architecture of push-flow in MatrixAI/Polykey#444.

The fact that these exceptions are all bubbling up to the process means that errors are not properly handled within NodeConnectionManager.

Read this.

CMCDragonkai · 2024-12-05T23:59:08Z

Because this problem occurs even when the process is idling, it means it does not has relationship with the PK CLI's commands. But primarily through the node to node network, that just starts automatically due to the P2P network needing to do synchronization.

CMCDragonkai · 2024-12-06T00:09:01Z

This shows the craziness of node connection formation and destruction. IT results in over 100% CPU usage over time.

Peek.2024-12-05.18-06.mp4

This is crazy. It should not be this much.

CMCDragonkai · 2024-12-06T00:09:48Z

Why are there so many connections constantly being formed over and over again and being destroyed...

After a long as time... it finally calms down.

CMCDragonkai · 2024-12-06T00:10:52Z

And after all that busy work... it ends up crashing at the end:

INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection e4a1bb3ad9845729f703e0bf0ce5dcd633e239e9.QUICStream 101:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection e4a1bb3ad9845729f703e0bf0ce5dcd633e239e9.QUICStream 101:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handled stream with method (nodesConnectionSignalInitial)
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.171.119:50858].QUICClient.QUICConnection a2d69bb72e649e5bf9def4df6270b2d69a7077ee.QUICStream 45:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.171.119:50858].QUICClient.QUICConnection a2d69bb72e649e5bf9def4df6270b2d69a7077ee.QUICStream 45:Destroyed QUICStream
TypeError: Invalid state: WritableStream is closed

tegefaulkes · 2024-12-06T00:14:34Z

Few questions.

was that the node with existing state or a new one
How many cores do you have?
are you running on power saving or performance mode
Is the laptop plugged in?
What logs do you get when you start and then immediately shutdown?

CMCDragonkai · 2024-12-06T00:17:38Z

It happens with both.

I have 16 cores.

Performance.

Plugged in.

It calms down but then fails after some time.

CMCDragonkai · 2024-12-06T00:18:07Z

It's always the same 2 errors as above.

tegefaulkes · 2024-12-06T00:20:08Z

What's the output of polykey nodes getall?

CMCDragonkai · 2024-12-06T00:25:04Z

Why you need to focus on the leak first: https://chatgpt.com/share/e/6752434d-fbec-8004-aa20-87c4d692b1b8

CMCDragonkai · 2024-12-06T00:25:33Z

What's the output of polykey nodes getall?

nodeIdEncoded                                        	nodeAddress                                                                   	bucketIndex
vddemcs4feqmih9ehuvo8eot1b17m9evv5fb9mi1co4j6aecl4tcg	49.180.228.124-39644                                                          	251
vet1dhoduhkmn4749r8usiopivklr8i4qoh9kjnhtgeg5rie4nvh0	193.82.229.247-51303                                                          	252
vet1dhoduhkmn4749r8usiopivklr8i4qoh9kjnhtgeg5rie4nvh0	193.82.229.247-54979                                                          	252
vet1dhoduhkmn4749r8usiopivklr8i4qoh9kjnhtgeg5rie4nvh0	193.82.229.247-59308                                                          	252
vet1dhoduhkmn4749r8usiopivklr8i4qoh9kjnhtgeg5rie4nvh0	2001:4479:2708:f000:da39:c4bd:cbca:c21e-51303                                 	252
vet1dhoduhkmn4749r8usiopivklr8i4qoh9kjnhtgeg5rie4nvh0	2001:4479:2708:f000:da39:c4bd:cbca:c21e-59308                                 	252
v8k5smi8a5o7j4mhdka50q679op9lctdcepgkuslrhumc0ur793og	13.210.66.119-1314                                                            	253
v8k5smi8a5o7j4mhdka50q679op9lctdcepgkuslrhumc0ur793og	2406:da1c:67c:c000:dfc:12bd:58d8:23dc-1314                                    	253
v8snhcsi5ufh81i9h7sh34kigi74i292i8ecdvo6pacom27rvvk0g	120.18.147.14-1433                                                            	253
v8snhcsi5ufh81i9h7sh34kigi74i292i8ecdvo6pacom27rvvk0g	2405:6e00:459:97b4:1d9a:c50a:438c:cbca-49561                                  	253
v94sr6sjjg2sdvgfm580bihvh8u1cokm9jdlpsc91qdp4j91i1810	159.196.32.250-52503                                                          	253
v94sr6sjjg2sdvgfm580bihvh8u1cokm9jdlpsc91qdp4j91i1810	193.82.229.247-48350                                                          	253
v94sr6sjjg2sdvgfm580bihvh8u1cokm9jdlpsc91qdp4j91i1810	193.82.229.247-59042                                                          	253
v94sr6sjjg2sdvgfm580bihvh8u1cokm9jdlpsc91qdp4j91i1810	2001:4479:2708:f000:da39:c4bd:cbca:c21e-55875                                 	253
vair3u8kqfqoheulk277bp71j5vi80vd2bnavki02ekeounp72e80	13.58.98.187-1314                                                             	253
vair3u8kqfqoheulk277bp71j5vi80vd2bnavki02ekeounp72e80	2600:1f16:1f71:7c00:3593:3a22:674f:f33b-1314                                  	253
vato7n2pt09qji7l985fgi5top4g4sp6j1n2muvccn6jqpgsctdjg	120.18.28.115-2155                                                            	253
vato7n2pt09qji7l985fgi5top4g4sp6j1n2muvccn6jqpgsctdjg	159.196.32.250-44216                                                          	253
vbuotq03fkbngvggf7viv93djfqkt8h6r7ff9f6dln1t19n3d45d0	49.180.228.124-39960                                                          	253
v1curcdsvk3b7jlosp4nps58eo8639vs7ooouq4li9th96nmnidrg	159.196.32.250-47198                                                          	254
v2kgfdi6dutt0mmv78q58la30mafige86igvpch0dbtnfmb8l70sg	159.196.126.185-51410                                                         	254
v2kgfdi6dutt0mmv78q58la30mafige86igvpch0dbtnfmb8l70sg	159.196.126.185-52094                                                         	254
v2kgfdi6dutt0mmv78q58la30mafige86igvpch0dbtnfmb8l70sg	159.196.126.185-57311                                                         	254
v2kgfdi6dutt0mmv78q58la30mafige86igvpch0dbtnfmb8l70sg	159.196.32.250-43642                                                          	254
v3e03reesk64fbasbeun8i5kij6kcmvkbmm7eun36hf2bngs6vn50	159.196.32.250-44038                                                          	254
v3e03reesk64fbasbeun8i5kij6kcmvkbmm7eun36hf2bngs6vn50	159.196.32.250-60661                                                          	254
v4scaap53dun4fv7t5n9om9e1dd9a54dmkmdobnl8a6tqcrcst8v0	159.196.32.250-44893                                                          	254
v6p14qcvvftnnscuavsehu37t22vfvnhse054pbkb3ehemmjsrdh0	2600:1f16:1f71:7c00:4aa5:9c9d:57f5:a649-1314                                  	254
v6p14qcvvftnnscuavsehu37t22vfvnhse054pbkb3ehemmjsrdh0	3.145.86.40-1314                                                              	254
v6p14qcvvftnnscuavsehu37t22vfvnhse054pbkb3ehemmjsrdh0	v6p14qcvvftnnscuavsehu37t22vfvnhse054pbkb3ehemmjsrdh0.mainnet.polykey.com-1314	254
v7i562oegsgl7eorq4f5cqsqu2uo4lkl6dc9q4o96fo2noo6de0c0	125.209.145.77-50148                                                          	254
v7i562oegsgl7eorq4f5cqsqu2uo4lkl6dc9q4o96fo2noo6de0c0	125.209.145.77-52403                                                          	254
v7i562oegsgl7eorq4f5cqsqu2uo4lkl6dc9q4o96fo2noo6de0c0	125.209.145.77-55902                                                          	254
v7i562oegsgl7eorq4f5cqsqu2uo4lkl6dc9q4o96fo2noo6de0c0	125.209.145.77-60968                                                          	254
vg6gldhfdstju8frtbguav2p2svmev85dvpdb34gffmiagpgjf2pg	1.145.84.113-3558                                                             	255
vg6gldhfdstju8frtbguav2p2svmev85dvpdb34gffmiagpgjf2pg	159.196.120.34-56081                                                          	255
vg6gldhfdstju8frtbguav2p2svmev85dvpdb34gffmiagpgjf2pg	159.196.32.250-37722                                                          	255
vg6gldhfdstju8frtbguav2p2svmev85dvpdb34gffmiagpgjf2pg	159.196.32.250-57595                                                          	255
vg6gldhfdstju8frtbguav2p2svmev85dvpdb34gffmiagpgjf2pg	2403:5807:9005:0:0:0:0:a-56081                                                	255
vgcujggsaaq6n55il0ieu31f9bin1mlbrfovn83ph10dtvujj5lrg	159.196.32.250-56392                                                          	255
vgijtpv0h8m1eajeir77g73muq88n5kj0413t6fjdqsv9kt8dq4pg	12.162.124.34-49540                                                           	255
vgijtpv0h8m1eajeir77g73muq88n5kj0413t6fjdqsv9kt8dq4pg	172.56.89.82-7850                                                             	255
vgijtpv0h8m1eajeir77g73muq88n5kj0413t6fjdqsv9kt8dq4pg	75.33.93.225-63978                                                            	255
vncm2mkk41vgp2fmplqiu1je7b2l3v6fhgltlqf5f3f85923ve0j0	13.239.117.143-1314                                                           	255
vncm2mkk41vgp2fmplqiu1je7b2l3v6fhgltlqf5f3f85923ve0j0	2406:da1c:67c:c000:3ed6:cc4b:b6e5:3745-1314                                   	255
vncm2mkk41vgp2fmplqiu1je7b2l3v6fhgltlqf5f3f85923ve0j0	vncm2mkk41vgp2fmplqiu1je7b2l3v6fhgltlqf5f3f85923ve0j0.mainnet.polykey.com-1314	255
vqp7gqai1qhdugsbiun8rv70mmng7rp8hf74kflr7s4b3ed3tvoag	159.196.171.119-50759                                                         	255
vqp7gqai1qhdugsbiun8rv70mmng7rp8hf74kflr7s4b3ed3tvoag	159.196.171.119-50834                                                         	255
vqp7gqai1qhdugsbiun8rv70mmng7rp8hf74kflr7s4b3ed3tvoag	159.196.171.119-50858                                                         	255
vqp7gqai1qhdugsbiun8rv70mmng7rp8hf74kflr7s4b3ed3tvoag	159.196.32.250-39378                                                          	255
vqp7gqai1qhdugsbiun8rv70mmng7rp8hf74kflr7s4b3ed3tvoag	159.196.32.250-56400                                                          	255
vqsq2ca7vjbsiv3paletajdduecm0v36j00jbnbg5eg084bm46g6g	1.40.225.41-34140                                                             	255
vqsq2ca7vjbsiv3paletajdduecm0v36j00jbnbg5eg084bm46g6g	1.40.225.41-46076                                                             	255
vqsq2ca7vjbsiv3paletajdduecm0v36j00jbnbg5eg084bm46g6g	1.40.225.41-52527                                                             	255
vqsq2ca7vjbsiv3paletajdduecm0v36j00jbnbg5eg084bm46g6g	159.196.32.250-52990                                                          	255

aryanjassal · 2024-12-06T06:02:24Z

Brian and I ran a bunch of tests and we added a console.error() statement next to every throw statement which re-throws an error. We did not do this for any statements which create a new error. The resulting log is as follows.

...
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:59598]:Creating forward NodeConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:59598].QUICClient:Create QUICClient to 159.196.32.250:59598
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 184:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 184:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 188:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 188:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 192:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 192:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 196:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 196:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 200:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 200:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 204:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 204:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:52648].QUICClient.QUICConnection fe93fe89989d4bc16707e8fb84717a0de7516b09:Connect QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:53113].QUICClient.QUICConnection 897151cfed8ff7282c45304758cf154a1a033541:Connect QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:53508].QUICClient.QUICConnection cdc381408d56849015fc99a77efa1fed77a54fa0:Connect QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:53935].QUICClient.QUICConnection aad5537fca5a5088e1409ccf1b364d0bec5b0314:Connect QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:59598].QUICClient.QUICConnection f357d83f38be5d73c12574fe9c967c0c8db1e3fa:Connect QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionReverse [v3e03reesk64fbasbeun8i5kij6kcmvkbmm7eun36hf2bngs6vn50@::ffff:1.40.219.171:57401]:Destroyed NodeConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:52648].QUICClient.QUICConnection fe93fe89989d4bc16707e8fb84717a0de7516b09:Start QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:53113].QUICClient.QUICConnection 897151cfed8ff7282c45304758cf154a1a033541:Start QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:53508].QUICClient.QUICConnection cdc381408d56849015fc99a77efa1fed77a54fa0:Start QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:53935].QUICClient.QUICConnection aad5537fca5a5088e1409ccf1b364d0bec5b0314:Start QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:59598].QUICClient.QUICConnection f357d83f38be5d73c12574fe9c967c0c8db1e3fa:Start QUICConnection
ErrorQUICConnectionIdleTimeout

This reveals that no errors were thrown from js-quic, as no error was listed out to the console before getting the ErrorQUICConnectionIdleTimeout. This is likely a leak in Polykey itself.

CMCDragonkai · 2024-12-06T19:34:52Z

Yes, cause the event is emitted from js-quic, but it's Polykey that's ultimately encapsulating the objects, so of course it's leaking inside PK.

CMCDragonkai · 2024-12-06T19:36:20Z

This is a blocker for doing any sort of demos. It happens too often, I want to make sure this investigated and fixed asap.

CMCDragonkai · 2024-12-09T02:15:21Z

@tegefaulkes can you write down your debugging discovery and process here?

Here's another log of the agent breaking after some hours:

INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [120.152.182.211:56840].QUICClient.QUICConnection 26397ebee525a70e2b01c685e2a18117c9569ad4.QUICStream 42889:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [120.152.182.211:56840].QUICClient.QUICConnection 26397ebee525a70e2b01c685e2a18117c9569ad4.QUICStream 42889:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [12.162.124.34:49540].QUICClient:ErrorQUICConnectionLocal: QUIC Connection local error - Failed to start QUIC connection due to start timeout
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [120.152.182.211:56840].QUICClient.QUICConnection 26397ebee525a70e2b01c685e2a18117c9569ad4.QUICStream 25096:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [120.152.182.211:56840].QUICClient.QUICConnection 26397ebee525a70e2b01c685e2a18117c9569ad4.QUICStream 25096:Destroyed QUICStream
TypeError: Invalid state: WritableStream is closed

Notice here it's I believe it's the fact that something is still trying to use a particular stream even though the stream object it's associated with the QUICStream is already destroyed and thus closed.

I think there's a lifecycle issue here.

tegefaulkes · 2024-12-09T04:01:52Z

Debugging so far.

I've traced how the error in question gets passed around. I found some potential promise leaks that I patched up but otherwise nothing stood out as the cause.

I did a sanity check and the error itself is almost certainly getting leaked out of a async context that isn't being awaited properly. We couldn't trace this to anywhere in the quic code and after patching the promises I think the quic code is good. I'm going to did through the Polykey code to see if there are any problems with the code there.

I'm adding a shotgun check that every time we throw an error I'm logging out when and where. It generates a lot of noise but hopefully if we catch it in the act we can trace it back to the last point we saw the error and narrow it down from there.

I've modified the code to force the emission of the error. Doing so doesn't trigger the problem at all. So it seems to purely be a race condition causing the leak. This lends evidence to improper handling of a promise rejecting somewhere in the code.

I've tried messing with timeout configuration to trigger the problem but no such luck there.

I'm continuing with

Shotgun logging of throw errors and waiting to catch it in the act.
Extending the scope of tracing the error through Polykey code.
Globally checking any promise created that can reject potentially leaking the error.

CMCDragonkai · 2024-12-09T17:49:35Z

Can you explain how some of your recent commits is addressing this.

tegefaulkes · 2024-12-09T23:44:55Z

The following changes have been made so far.

Added .catch(() => {}) to certain promises in @matrixai/quic to prevent a rejection leak if they rejected before being handled.
Added .catch(() => {}) to certain promises in polykey to prevent rejection leaks. The main problem was found in the nodes domain and fixed up.
Added ErrorQUICConnectionIdleTimeout to the list of expected connection related errors.
Went through the code and cleaned up the names of some promises while I was reviewing and fixing things.

tegefaulkes · 2024-12-10T01:28:05Z

The ErrorQUICConnectionIdleTimeout should be fully addressed now. I'll close out this issue in a moment. The other details brought up here will be moved to new issues.

tegefaulkes · 2024-12-10T02:47:54Z

Ok, i've split up all of the remaining tasks here into new issues. Closing this issue now.

CMCDragonkai · 2024-12-10T05:06:26Z

The following changes have been made so far.

Added .catch(() => {}) to certain promises in @matrixai/quic to prevent a rejection leak if they rejected before being handled.

Added .catch(() => {}) to certain promises in polykey to prevent rejection leaks. The main problem was found in the nodes domain and fixed up.

Added ErrorQUICConnectionIdleTimeout to the list of expected connection related errors.

Went through the code and cleaned up the names of some promises while I was reviewing and fixing things.

Isn't 1. and 2. a sort of problematic? Whenever I see that, it looks like you're discarding exceptions. That may prevent a leak, but then you're just hiding the error. Surely it needs to be bubbled up to a handler for it to be "handled" which may just mean printing it out in some way as a warning or info message.

CMCDragonkai · 2024-12-10T05:11:56Z

I haven't seen this for a while now:

ErrorNodeConnectionTransportGenericError: Transport received a generic error

Was this ever addressed @tegefaulkes?

tegefaulkes · 2024-12-10T05:26:00Z

Its a void p.catch(() => {}); handler to prevent a rejection of the promise crashing the program before we add a handler for the promise. There are a few places where that could happen.

the ErrorNodeConnectionTransportGenericError: Transport received a generic error has not been addressed. I'll add it to my current issue.

CMCDragonkai · 2024-12-15T21:18:33Z

You did write comments explaining why you added those right? They look weird without context.

tegefaulkes · 2024-12-16T03:20:25Z

I did add comments explaining that they were for preventing unhandled promise rejection leaking errors.

tegefaulkes added the bug Something isn't working label Jun 11, 2024

CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Aug 15, 2024

CMCDragonkai changed the title ~~Node crashing with ErrorQUICConnectionIdleTimeout~~ Node Crashes (Ungraceful Exit) on Uncaught Exceptions relating to Connections ErrorQUICConnectionIdleTimeout and ErrorNodeConnectionTransportGenericError. Aug 23, 2024

CMCDragonkai mentioned this issue Aug 23, 2024

polykey agent stop command not terminating properly #185

Open

tegefaulkes mentioned this issue Aug 23, 2024

Server stream progress updates for RPC calls that have long running async operations #264

Open

CMCDragonkai mentioned this issue Nov 1, 2024

Failed vault clone or vault pull on N1 causes N2 to crash #324

Open

aryanjassal changed the title ~~Node Crashes (Ungraceful Exit) on Uncaught Exceptions relating to Connections ErrorQUICConnectionIdleTimeout and ErrorNodeConnectionTransportGenericError.~~ Node exits ungracefully on uncaught exceptions relating to connections Nov 21, 2024

CMCDragonkai assigned tegefaulkes Dec 5, 2024

CMCDragonkai changed the title ~~Node exits ungracefully on uncaught exceptions relating to connections~~ Node exits ungracefully on uncaught exceptions relating to connections or streams Dec 5, 2024

CMCDragonkai assigned aryanjassal Dec 6, 2024

aryanjassal removed their assignment Dec 10, 2024

This was referenced Dec 10, 2024

Stream Lifecycle Bugs causes Process Exit/Panic MatrixAI/js-quic#128

Open

Better feedback when agent isn't running after ungraceful exit #344

Open

Dashboard needs to include IPv6 addresses for seed nodes MatrixAI/Polykey-Network-Dashboard#29

Open

tegefaulkes closed this as completed Dec 10, 2024

Node crashes/exits ungracefully on uncaught exceptions/events relating to QUIC connections or QUIC streams #198

Node crashes/exits ungracefully on uncaught exceptions/events relating to QUIC connections or QUIC streams #198

Comments

tegefaulkes commented Jun 11, 2024 • edited by CMCDragonkai Loading

Describe the bug

To Reproduce

Expected behavior

Screenshots

Platform (please complete the following information)

Additional context

Notify maintainers

linear bot commented Jun 11, 2024

CMCDragonkai commented Aug 23, 2024

CMCDragonkai commented Aug 23, 2024 • edited Loading

CMCDragonkai commented Aug 23, 2024

CMCDragonkai commented Aug 23, 2024

tegefaulkes commented Aug 23, 2024

CMCDragonkai commented Aug 23, 2024 • edited Loading

aryanjassal commented Nov 20, 2024

aryanjassal commented Nov 21, 2024

CMCDragonkai commented Dec 5, 2024

CMCDragonkai commented Dec 5, 2024

CMCDragonkai commented Dec 5, 2024

CMCDragonkai commented Dec 5, 2024 • edited Loading

CMCDragonkai commented Dec 5, 2024

tegefaulkes commented Dec 5, 2024

CMCDragonkai commented Dec 5, 2024

CMCDragonkai commented Dec 5, 2024 • edited Loading

tegefaulkes commented Dec 5, 2024

CMCDragonkai commented Dec 5, 2024

CMCDragonkai commented Dec 5, 2024

CMCDragonkai commented Dec 5, 2024

CMCDragonkai commented Dec 5, 2024

CMCDragonkai commented Dec 6, 2024

CMCDragonkai commented Dec 6, 2024

CMCDragonkai commented Dec 6, 2024

tegefaulkes commented Dec 6, 2024

CMCDragonkai commented Dec 6, 2024 • edited Loading

CMCDragonkai commented Dec 6, 2024

tegefaulkes commented Dec 6, 2024

CMCDragonkai commented Dec 6, 2024 • edited Loading

CMCDragonkai commented Dec 6, 2024 • edited Loading

aryanjassal commented Dec 6, 2024

CMCDragonkai commented Dec 6, 2024

CMCDragonkai commented Dec 6, 2024

CMCDragonkai commented Dec 9, 2024 • edited Loading

tegefaulkes commented Dec 9, 2024

CMCDragonkai commented Dec 9, 2024

tegefaulkes commented Dec 9, 2024

tegefaulkes commented Dec 10, 2024

tegefaulkes commented Dec 10, 2024

CMCDragonkai commented Dec 10, 2024

CMCDragonkai commented Dec 10, 2024

tegefaulkes commented Dec 10, 2024

CMCDragonkai commented Dec 15, 2024

tegefaulkes commented Dec 16, 2024

tegefaulkes commented Jun 11, 2024 •

edited by CMCDragonkai

Loading

CMCDragonkai commented Aug 23, 2024 •

edited

Loading

CMCDragonkai commented Aug 23, 2024 •

edited

Loading

CMCDragonkai commented Dec 5, 2024 •

edited

Loading

CMCDragonkai commented Dec 5, 2024 •

edited

Loading

CMCDragonkai commented Dec 6, 2024 •

edited

Loading

CMCDragonkai commented Dec 6, 2024 •

edited

Loading

CMCDragonkai commented Dec 6, 2024 •

edited

Loading

CMCDragonkai commented Dec 9, 2024 •

edited

Loading