Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node crashes/exits ungracefully on uncaught exceptions/events relating to QUIC connections or QUIC streams #198

Closed
tegefaulkes opened this issue Jun 11, 2024 · 49 comments
Assignees
Labels
bug Something isn't working r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices

Comments

@tegefaulkes
Copy link
Contributor

tegefaulkes commented Jun 11, 2024

Describe the bug

The node seems to be randomly crashing with ErrorQUICConnectionIdleTimeout.

INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection f4ba6d8b49c528090fee5cab01c73326e6b46e58.QUICStream 92:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection f4ba6d8b49c528090fee5cab01c73326e6b46e58.QUICStream 92:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection f4ba6d8b49c528090fee5cab01c73326e6b46e58.QUICStream 96:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection f4ba6d8b49c528090fee5cab01c73326e6b46e58.QUICStream 96:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handling stream with method (nodesConnectionSignalInitial)
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05.QUICStream 144:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05.QUICStream 144:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handled stream with method (nodesConnectionSignalInitial)
INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection f4ba6d8b49c528090fee5cab01c73326e6b46e58.QUICStream 96:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection f4ba6d8b49c528090fee5cab01c73326e6b46e58.QUICStream 96:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient:ErrorQUICConnectionIdleTimeout: QUIC Connection max idle timeout exhausted - 
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05:ErrorQUICConnectionIdleTimeout: QUIC Connection max idle timeout exhausted - 
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05:Stop QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnection [v2kgfdi6dutt0mmv78q58la30mafige86igvpch0dbtnfmb8l70sg@::ffff:159.196.126.185:51410]:Destroying NodeConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient:Destroy QUICClient to [::ffff:159.196.126.185]:51410
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05.QUICStream 144:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05:ErrorQUICConnectionIdleTimeout: QUIC Connection max idle timeout exhausted - 
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05:ErrorQUICConnectionIdleTimeout: QUIC Connection max idle timeout exhausted - 
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05.QUICStream 144:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient.QUICConnection 0bb985a87456834e1683140788ee89bfc5ef4f05:Stopped QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.126.185:51410].QUICClient:Destroyed QUICClient to [::ffff:159.196.126.185]:51410
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnection [v2kgfdi6dutt0mmv78q58la30mafige86igvpch0dbtnfmb8l70sg@::ffff:159.196.126.185:51410]:Destroyed NodeConnection
ErrorQUICConnectionIdleTimeout

cmcdragonkai ➜ matrix-framework-13-ryzen-7040  ➜ ~/Projects/Polykey-CLI
 $ 

It also can produce - any of these things are being leaked up to the process for it to do an ungraceful exit/crash:

  • ErrorQUICConnectionIdleTimeout
  • ErrorNodeConnectionTransportGenericError
  • TypeError: Invalid state: WritableStream is closed

To Reproduce

  1. Run a Polykey agent.
  2. It will crash on its own after a while.

Expected behavior

Should not crash with ErrorQUICConnectionIdleTimeout

Screenshots

Platform (please complete the following information)

  • Version: Polykey-CLI 0.4.1.

Additional context

Notify maintainers

@tegefaulkes

@tegefaulkes tegefaulkes added the bug Something isn't working label Jun 11, 2024
Copy link

linear bot commented Jun 11, 2024

@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices label Aug 15, 2024
Copy link
Member

I've seen this twice now.

It particularly happens when there's a failure on a QUIC connection from the other remote side node. Like if they force close their node, our node then fails.

INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [79.103.253.41:44353].QUICClient:Destroyed QUICClient to [::ffff:79.103.253.41]:44353
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnection [veu6q4pppbeoumv9n3uj819t6c5thr8kb9jshpfco85l4cvst4c6g@::ffff:79.103.253.41:44353]:Destroyed NodeConnection
ErrorQUICConnectionIdleTimeout

Twice when Alexi who is in Greece from the IP:

79.103.253.41 	N/A     	veu6q4pppbeoumv9n3uj819t6c5thr8kb9jshpfco85l4cvst4c6g	58143	55780  	0

Who tried to clone my vault with polykey vaults clone resulted in ErrorRPCTimeout. This is probably due to a slow clone or something.

Then he had to force shutdown his agent, because graceful agent stop was not possible. This is likely due to QUIC again as there are possibly open fds or open connections that hold the process open.

Finally after force closing their agent, and waiting for some time… my agent ended up failing.

You can see the prior connection is exactly the NodeID of the node that was force closed from Greece.

So this error is connected to:

  1. Possibly a slow clone causing an ErrorRPCTimeout, which left a connection IO continue to run without being cancelled.
  2. This "leaked" connection IO is probably causing the agent to not shutdown gracefully. We know this because nodejs will not shutdown when there's IO/fds that are kept open. Otherwise nodejs always shutsdown when there is not an infinite loop.
  3. Then a leaked connection remotely seems to create a problem on the other side, thus causing a ErrorQUICConnectionIdleTimeout. This appears to be an uncaught exception, and results in the entire agent to ungracefully shutdown.

Copy link
Member

CMCDragonkai commented Aug 23, 2024

On the prior attempt:

Cloning on Windows results in ErrorRPCTimeout, everything else worked

Windows agent had to be force closed, it looks like IO is keeping the process open.

The nodes connections:

79.103.253.41 N/A veu6q4pppbeoumv9n3uj819t6c5thr8kb9jshpfco85l4cvst4c6g 33226 52481 0

->

79.103.253.41 N/A veu6q4pppbeoumv9n3uj819t6c5thr8kb9jshpfco85l4cvst4c6g 33226 56108 0
79.103.253.41 N/A veu6q4pppbeoumv9n3uj819t6c5thr8kb9jshpfco85l4cvst4c6g 33226 55531 0
79.103.253.41 N/A veu6q4pppbeoumv9n3uj819t6c5thr8kb9jshpfco85l4cvst4c6g 33226 55695 0

Then upon restarting the agent, and reconnecting to my Linux agent, the vaults clone still failed with ErrorRPCTimeout.

And then, a few minutes later, I get this error on my Linux the agent is now crashed and no longer running:

ErrorNodeConnectionTransportGenericError: Transport received a generic error

What's interesting, is that I find that sometimes… even this is also an uncaught exception:

ErrorNodeConnectionTransportGenericError: Transport received a generic error

This also causes the agent to ungracefully exit. So the first time what I described happen, this was the exception. But the second time, was the ErrorQUICConnectionIdleTimeout.

@CMCDragonkai
Copy link
Member

I'm going to rename this issue, given that there is now evidence of at least 2 different errors that is being uncaught and causing ungraceful exit.

@CMCDragonkai CMCDragonkai changed the title Node crashing with ErrorQUICConnectionIdleTimeout Node Crashes (Ungraceful Exit) on Uncaught Exceptions relating to Connections ErrorQUICConnectionIdleTimeout and ErrorNodeConnectionTransportGenericError. Aug 23, 2024
@CMCDragonkai
Copy link
Member

This is connected to #185, because that is likely caused by the one side leaking connections. When this occurs, it's a resource/IO/fd leak. And that will cause the graceful shutdown to halt forever.

@tegefaulkes
Copy link
Contributor Author

ErrorQUICConnectionIdleTimeout is the expected error here and should be handled just fine. When an connection is orphaned by one side of the connection then it is cleaned up by the idle timeout.

The ErrorNodeConnectionTransportGenericError error is very likely the problem here. So we'll need to trace why that's thrown and what's causing it exactly.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Aug 23, 2024

Pull error handling always uses standard exception handlers to deal with the problem. This we understand.

Push error handling looks underimplemented. It relies on object-system context error handlers. We need to make sure that QUIC errors are being "captured" by the nodes domain fully so that it doesn't just bubble to the process context error handlers which is likely being defined in the ExitHandler.

This is because we're still working out the architecture of push-flow in MatrixAI/Polykey#444.

The fact that these exceptions are all bubbling up to the process means that errors are not properly handled within NodeConnectionManager.

Copy link
Member

I had left my PC running for over 2 days with Polykey agent running in the background, and I got this error after about 5 hours of running.

ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message

Copy link
Member

I just observed this exact error when I ran this command.

$ npm run polykey -- secrets mkdir vault:aslkfj/a

> [email protected] polykey
> ts-node src/polykey.ts secrets mkdir vault:aslkfj/a

mkdir: cannot create directory aslkfj/a: No such file or directory
ErrorPolykeyCLIMakeDirectory: Failed to create one or more directories - Failed to create one or more directories

This command took about 10 seconds, which is way too long for this, and then the agent crashed with the 0-length message error. I tried this and got a RPC timeout error.

$ npm run polykey -- secrets mkdir vault:aslkfj/asf/asd/fa

> [email protected] polykey
> ts-node src/polykey.ts secrets mkdir vault:aslkfj/asf/asd/fa

ErrorPolykeyCLIUnexpectedError: An unexpected error occured - Thrown 'ErrorRPCTimedOut'
  cause: ErrorRPCTimedOut: RPC has timed out

However, this also seems inconsistent, as trying to replicate this issue again yielded the command taking a while, but succeeding in the end without the agent crashing with QUIC.

@aryanjassal aryanjassal changed the title Node Crashes (Ungraceful Exit) on Uncaught Exceptions relating to Connections ErrorQUICConnectionIdleTimeout and ErrorNodeConnectionTransportGenericError. Node exits ungracefully on uncaught exceptions relating to connections Nov 21, 2024
@CMCDragonkai
Copy link
Member

This is now happening immediately upon startup.

I also sometimes get TypeError: Invalid state: WritableStream is closed.

@CMCDragonkai
Copy link
Member

Upon starting a new agent in a new node path, within a few seconds of starting up, there will be a MASS amount of logs coming from the node connection manager.

Then one of the 2 things happen which causes immediate ungraceful crash:

c0cd44c5f45ce48.QUICStream 57:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:4aa5:9c9d:57f5:a649:1314].QUICClient.QUICConnection a0596496781aabfc4645e86b8c0cd44c5f45ce48.QUICStream 57:Destroyed QUICStream
TypeError: Invalid state: WritableStream is closed

Or:

INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:4aa5:9c9d:57f5:a649:1314].QUICClient.QUICConnection c7b36c81341015360b2e52fad3f36c5cbac553b4.QUICStream 24:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:4aa5:9c9d:57f5:a649:1314].QUICClient.QUICConnection c7b36c81341015360b2e52fad3f36c5cbac553b4:Stopped QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [2600:1f16:1f71:7c00:4aa5:9c9d:57f5:a649:1314].QUICClient:Destroyed QUICClient to [2600:1f16:1f71:7c00:4aa5:9c9d:57f5:a649]:1314
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnection [v6p14qcvvftnnscuavsehu37t22vfvnhse054pbkb3ehemmjsrdh0@2600:1f16:1f71:7c00:4aa5:9c9d:57f5:a649:1314]:Destroyed NodeConnection
ErrorQUICConnectionIdleTimeout

@CMCDragonkai
Copy link
Member

To replicate all you need to do:

# check out the latest staging
npm run build
./dist/polykey.js agent start --verbose --node-path ./tmp/x
# just wait

It will definitely fail.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Dec 5, 2024

I have another problem and it's a regression that's been created from recent changes to PK CLI too.

When the agent has crashed, and it is ungraceful, it leaves the status.json saying its LIVE.

When you try to run agent status, this isn't able to realize this and it results in:

ERROR:polykey.PolykeyClient.WebSocketClient:ErrorWebSocketConnectionLocal: WebSocket Connection local error - WebSocket could not open due to internal error
ERROR:polykey.PolykeyClient.WebSocketClient.WebSocketConnection 0:ErrorWebSocketConnectionLocal: WebSocket Connection local error - WebSocket could not open due to internal error
ErrorPolykeyCLIUnexpectedError: An unexpected error occured - Thrown 'ErrorWebSocketConnectionLocal'
  cause: ErrorWebSocketConnectionLocal: WebSocket could not open due to internal error

Commands should be aware when the agent isn't live and properly report on this.

@CMCDragonkai CMCDragonkai changed the title Node exits ungracefully on uncaught exceptions relating to connections Node exits ungracefully on uncaught exceptions relating to connections or streams Dec 5, 2024
@CMCDragonkai
Copy link
Member

I can corroborate this with an initial installation on MacOS.

@tegefaulkes
Copy link
Contributor Author

To replicate all you need to do:

# check out the latest staging
npm run build
./dist/polykey.js agent start --verbose --node-path ./tmp/x
# just wait

It will definitely fail.

I just did this, on my dev laptop. It's working fine.

@CMCDragonkai
Copy link
Member

Both myself and a friend with an initial installation on MacOS also fails too. Even if you don't see it now, you will see it eventually.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Dec 5, 2024

Again I don't think this has anything to do with the node state. As this is the current temporary node state I created: node.tar.gz.

After several agent start attempts... eventually it doesn't immediately break. But it happens enough times for this to be repeatable.

The event emitted for ErrorQUICConnectionIdleTimeout is leaking through the process.

@tegefaulkes
Copy link
Contributor Author

If its failing on a fresh node then no, it shouldn't be due to state. I'll dig through the js-quic source code. But without being able to reproduce on my end it will take time to find.

@CMCDragonkai
Copy link
Member

I had left my PC running for over 2 days with Polykey agent running in the background, and I got this error after about 5 hours of running.

ErrorQUICStreamInternal: Failed to prime local stream state with a 0-length message

Is this related to MatrixAI/js-rpc#74?

@CMCDragonkai
Copy link
Member

If its failing on a fresh node then no, it shouldn't be due to state. I'll dig through the js-quic source code. But without being able to reproduce on my end it will take time to find.

It's always obvious when it's about to happen is when there's a LARGE amount of logs.

@CMCDragonkai
Copy link
Member

Pull error handling always uses standard exception handlers to deal with the problem. This we understand.

Push error handling looks underimplemented. It relies on object-system context error handlers. We need to make sure that QUIC errors are being "captured" by the nodes domain fully so that it doesn't just bubble to the process context error handlers which is likely being defined in the ExitHandler.

This is because we're still working out the architecture of push-flow in MatrixAI/Polykey#444.

The fact that these exceptions are all bubbling up to the process means that errors are not properly handled within NodeConnectionManager.

Read this.

@CMCDragonkai
Copy link
Member

Because this problem occurs even when the process is idling, it means it does not has relationship with the PK CLI's commands. But primarily through the node to node network, that just starts automatically due to the P2P network needing to do synchronization.

@CMCDragonkai
Copy link
Member

This shows the craziness of node connection formation and destruction. IT results in over 100% CPU usage over time.

Peek.2024-12-05.18-06.mp4

This is crazy. It should not be this much.

@CMCDragonkai
Copy link
Member

Why are there so many connections constantly being formed over and over again and being destroyed...

After a long as time... it finally calms down.

@CMCDragonkai
Copy link
Member

And after all that busy work... it ends up crashing at the end:

INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection e4a1bb3ad9845729f703e0bf0ce5dcd633e239e9.QUICStream 101:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.QUICServer.QUICConnection e4a1bb3ad9845729f703e0bf0ce5dcd633e239e9.QUICStream 101:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.RPCServer:Handled stream with method (nodesConnectionSignalInitial)
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.171.119:50858].QUICClient.QUICConnection a2d69bb72e649e5bf9def4df6270b2d69a7077ee.QUICStream 45:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.171.119:50858].QUICClient.QUICConnection a2d69bb72e649e5bf9def4df6270b2d69a7077ee.QUICStream 45:Destroyed QUICStream
TypeError: Invalid state: WritableStream is closed

@tegefaulkes
Copy link
Contributor Author

Few questions.

  • was that the node with existing state or a new one
  • How many cores do you have?
  • are you running on power saving or performance mode
  • Is the laptop plugged in?
  • What logs do you get when you start and then immediately shutdown?

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Dec 6, 2024

It happens with both.

I have 16 cores.

Performance.

Plugged in.

It calms down but then fails after some time.

@CMCDragonkai
Copy link
Member

It's always the same 2 errors as above.

@tegefaulkes
Copy link
Contributor Author

What's the output of polykey nodes getall?

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Dec 6, 2024

Why you need to focus on the leak first: https://chatgpt.com/share/e/6752434d-fbec-8004-aa20-87c4d692b1b8

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Dec 6, 2024

What's the output of polykey nodes getall?

nodeIdEncoded                                        	nodeAddress                                                                   	bucketIndex
vddemcs4feqmih9ehuvo8eot1b17m9evv5fb9mi1co4j6aecl4tcg	49.180.228.124-39644                                                          	251
vet1dhoduhkmn4749r8usiopivklr8i4qoh9kjnhtgeg5rie4nvh0	193.82.229.247-51303                                                          	252
vet1dhoduhkmn4749r8usiopivklr8i4qoh9kjnhtgeg5rie4nvh0	193.82.229.247-54979                                                          	252
vet1dhoduhkmn4749r8usiopivklr8i4qoh9kjnhtgeg5rie4nvh0	193.82.229.247-59308                                                          	252
vet1dhoduhkmn4749r8usiopivklr8i4qoh9kjnhtgeg5rie4nvh0	2001:4479:2708:f000:da39:c4bd:cbca:c21e-51303                                 	252
vet1dhoduhkmn4749r8usiopivklr8i4qoh9kjnhtgeg5rie4nvh0	2001:4479:2708:f000:da39:c4bd:cbca:c21e-59308                                 	252
v8k5smi8a5o7j4mhdka50q679op9lctdcepgkuslrhumc0ur793og	13.210.66.119-1314                                                            	253
v8k5smi8a5o7j4mhdka50q679op9lctdcepgkuslrhumc0ur793og	2406:da1c:67c:c000:dfc:12bd:58d8:23dc-1314                                    	253
v8snhcsi5ufh81i9h7sh34kigi74i292i8ecdvo6pacom27rvvk0g	120.18.147.14-1433                                                            	253
v8snhcsi5ufh81i9h7sh34kigi74i292i8ecdvo6pacom27rvvk0g	2405:6e00:459:97b4:1d9a:c50a:438c:cbca-49561                                  	253
v94sr6sjjg2sdvgfm580bihvh8u1cokm9jdlpsc91qdp4j91i1810	159.196.32.250-52503                                                          	253
v94sr6sjjg2sdvgfm580bihvh8u1cokm9jdlpsc91qdp4j91i1810	193.82.229.247-48350                                                          	253
v94sr6sjjg2sdvgfm580bihvh8u1cokm9jdlpsc91qdp4j91i1810	193.82.229.247-59042                                                          	253
v94sr6sjjg2sdvgfm580bihvh8u1cokm9jdlpsc91qdp4j91i1810	2001:4479:2708:f000:da39:c4bd:cbca:c21e-55875                                 	253
vair3u8kqfqoheulk277bp71j5vi80vd2bnavki02ekeounp72e80	13.58.98.187-1314                                                             	253
vair3u8kqfqoheulk277bp71j5vi80vd2bnavki02ekeounp72e80	2600:1f16:1f71:7c00:3593:3a22:674f:f33b-1314                                  	253
vato7n2pt09qji7l985fgi5top4g4sp6j1n2muvccn6jqpgsctdjg	120.18.28.115-2155                                                            	253
vato7n2pt09qji7l985fgi5top4g4sp6j1n2muvccn6jqpgsctdjg	159.196.32.250-44216                                                          	253
vbuotq03fkbngvggf7viv93djfqkt8h6r7ff9f6dln1t19n3d45d0	49.180.228.124-39960                                                          	253
v1curcdsvk3b7jlosp4nps58eo8639vs7ooouq4li9th96nmnidrg	159.196.32.250-47198                                                          	254
v2kgfdi6dutt0mmv78q58la30mafige86igvpch0dbtnfmb8l70sg	159.196.126.185-51410                                                         	254
v2kgfdi6dutt0mmv78q58la30mafige86igvpch0dbtnfmb8l70sg	159.196.126.185-52094                                                         	254
v2kgfdi6dutt0mmv78q58la30mafige86igvpch0dbtnfmb8l70sg	159.196.126.185-57311                                                         	254
v2kgfdi6dutt0mmv78q58la30mafige86igvpch0dbtnfmb8l70sg	159.196.32.250-43642                                                          	254
v3e03reesk64fbasbeun8i5kij6kcmvkbmm7eun36hf2bngs6vn50	159.196.32.250-44038                                                          	254
v3e03reesk64fbasbeun8i5kij6kcmvkbmm7eun36hf2bngs6vn50	159.196.32.250-60661                                                          	254
v4scaap53dun4fv7t5n9om9e1dd9a54dmkmdobnl8a6tqcrcst8v0	159.196.32.250-44893                                                          	254
v6p14qcvvftnnscuavsehu37t22vfvnhse054pbkb3ehemmjsrdh0	2600:1f16:1f71:7c00:4aa5:9c9d:57f5:a649-1314                                  	254
v6p14qcvvftnnscuavsehu37t22vfvnhse054pbkb3ehemmjsrdh0	3.145.86.40-1314                                                              	254
v6p14qcvvftnnscuavsehu37t22vfvnhse054pbkb3ehemmjsrdh0	v6p14qcvvftnnscuavsehu37t22vfvnhse054pbkb3ehemmjsrdh0.mainnet.polykey.com-1314	254
v7i562oegsgl7eorq4f5cqsqu2uo4lkl6dc9q4o96fo2noo6de0c0	125.209.145.77-50148                                                          	254
v7i562oegsgl7eorq4f5cqsqu2uo4lkl6dc9q4o96fo2noo6de0c0	125.209.145.77-52403                                                          	254
v7i562oegsgl7eorq4f5cqsqu2uo4lkl6dc9q4o96fo2noo6de0c0	125.209.145.77-55902                                                          	254
v7i562oegsgl7eorq4f5cqsqu2uo4lkl6dc9q4o96fo2noo6de0c0	125.209.145.77-60968                                                          	254
vg6gldhfdstju8frtbguav2p2svmev85dvpdb34gffmiagpgjf2pg	1.145.84.113-3558                                                             	255
vg6gldhfdstju8frtbguav2p2svmev85dvpdb34gffmiagpgjf2pg	159.196.120.34-56081                                                          	255
vg6gldhfdstju8frtbguav2p2svmev85dvpdb34gffmiagpgjf2pg	159.196.32.250-37722                                                          	255
vg6gldhfdstju8frtbguav2p2svmev85dvpdb34gffmiagpgjf2pg	159.196.32.250-57595                                                          	255
vg6gldhfdstju8frtbguav2p2svmev85dvpdb34gffmiagpgjf2pg	2403:5807:9005:0:0:0:0:a-56081                                                	255
vgcujggsaaq6n55il0ieu31f9bin1mlbrfovn83ph10dtvujj5lrg	159.196.32.250-56392                                                          	255
vgijtpv0h8m1eajeir77g73muq88n5kj0413t6fjdqsv9kt8dq4pg	12.162.124.34-49540                                                           	255
vgijtpv0h8m1eajeir77g73muq88n5kj0413t6fjdqsv9kt8dq4pg	172.56.89.82-7850                                                             	255
vgijtpv0h8m1eajeir77g73muq88n5kj0413t6fjdqsv9kt8dq4pg	75.33.93.225-63978                                                            	255
vncm2mkk41vgp2fmplqiu1je7b2l3v6fhgltlqf5f3f85923ve0j0	13.239.117.143-1314                                                           	255
vncm2mkk41vgp2fmplqiu1je7b2l3v6fhgltlqf5f3f85923ve0j0	2406:da1c:67c:c000:3ed6:cc4b:b6e5:3745-1314                                   	255
vncm2mkk41vgp2fmplqiu1je7b2l3v6fhgltlqf5f3f85923ve0j0	vncm2mkk41vgp2fmplqiu1je7b2l3v6fhgltlqf5f3f85923ve0j0.mainnet.polykey.com-1314	255
vqp7gqai1qhdugsbiun8rv70mmng7rp8hf74kflr7s4b3ed3tvoag	159.196.171.119-50759                                                         	255
vqp7gqai1qhdugsbiun8rv70mmng7rp8hf74kflr7s4b3ed3tvoag	159.196.171.119-50834                                                         	255
vqp7gqai1qhdugsbiun8rv70mmng7rp8hf74kflr7s4b3ed3tvoag	159.196.171.119-50858                                                         	255
vqp7gqai1qhdugsbiun8rv70mmng7rp8hf74kflr7s4b3ed3tvoag	159.196.32.250-39378                                                          	255
vqp7gqai1qhdugsbiun8rv70mmng7rp8hf74kflr7s4b3ed3tvoag	159.196.32.250-56400                                                          	255
vqsq2ca7vjbsiv3paletajdduecm0v36j00jbnbg5eg084bm46g6g	1.40.225.41-34140                                                             	255
vqsq2ca7vjbsiv3paletajdduecm0v36j00jbnbg5eg084bm46g6g	1.40.225.41-46076                                                             	255
vqsq2ca7vjbsiv3paletajdduecm0v36j00jbnbg5eg084bm46g6g	1.40.225.41-52527                                                             	255
vqsq2ca7vjbsiv3paletajdduecm0v36j00jbnbg5eg084bm46g6g	159.196.32.250-52990                                                          	255

Copy link
Member

Brian and I ran a bunch of tests and we added a console.error() statement next to every throw statement which re-throws an error. We did not do this for any statements which create a new error. The resulting log is as follows.

...
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:59598]:Creating forward NodeConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:59598].QUICClient:Create QUICClient to 159.196.32.250:59598
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 184:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 184:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 188:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 188:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 192:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 192:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 196:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 196:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 200:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 200:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 204:Create QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [13.239.117.143:1314].QUICClient.QUICConnection 35de6309b2d397b0d2ec955c7410118089649d94.QUICStream 204:Created QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:52648].QUICClient.QUICConnection fe93fe89989d4bc16707e8fb84717a0de7516b09:Connect QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:53113].QUICClient.QUICConnection 897151cfed8ff7282c45304758cf154a1a033541:Connect QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:53508].QUICClient.QUICConnection cdc381408d56849015fc99a77efa1fed77a54fa0:Connect QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:53935].QUICClient.QUICConnection aad5537fca5a5088e1409ccf1b364d0bec5b0314:Connect QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:59598].QUICClient.QUICConnection f357d83f38be5d73c12574fe9c967c0c8db1e3fa:Connect QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionReverse [v3e03reesk64fbasbeun8i5kij6kcmvkbmm7eun36hf2bngs6vn50@::ffff:1.40.219.171:57401]:Destroyed NodeConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:52648].QUICClient.QUICConnection fe93fe89989d4bc16707e8fb84717a0de7516b09:Start QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:53113].QUICClient.QUICConnection 897151cfed8ff7282c45304758cf154a1a033541:Start QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:53508].QUICClient.QUICConnection cdc381408d56849015fc99a77efa1fed77a54fa0:Start QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:53935].QUICClient.QUICConnection aad5537fca5a5088e1409ccf1b364d0bec5b0314:Start QUICConnection
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [159.196.32.250:59598].QUICClient.QUICConnection f357d83f38be5d73c12574fe9c967c0c8db1e3fa:Start QUICConnection
ErrorQUICConnectionIdleTimeout

This reveals that no errors were thrown from js-quic, as no error was listed out to the console before getting the ErrorQUICConnectionIdleTimeout. This is likely a leak in Polykey itself.

@CMCDragonkai
Copy link
Member

Yes, cause the event is emitted from js-quic, but it's Polykey that's ultimately encapsulating the objects, so of course it's leaking inside PK.

@CMCDragonkai
Copy link
Member

This is a blocker for doing any sort of demos. It happens too often, I want to make sure this investigated and fixed asap.

@CMCDragonkai
Copy link
Member

CMCDragonkai commented Dec 9, 2024

@tegefaulkes can you write down your debugging discovery and process here?

Here's another log of the agent breaking after some hours:

INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [120.152.182.211:56840].QUICClient.QUICConnection 26397ebee525a70e2b01c685e2a18117c9569ad4.QUICStream 42889:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [120.152.182.211:56840].QUICClient.QUICConnection 26397ebee525a70e2b01c685e2a18117c9569ad4.QUICStream 42889:Destroyed QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [12.162.124.34:49540].QUICClient:ErrorQUICConnectionLocal: QUIC Connection local error - Failed to start QUIC connection due to start timeout
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [120.152.182.211:56840].QUICClient.QUICConnection 26397ebee525a70e2b01c685e2a18117c9569ad4.QUICStream 25096:Destroy QUICStream
INFO:polykey.PolykeyAgent.NodeConnectionManager.NodeConnectionForward [120.152.182.211:56840].QUICClient.QUICConnection 26397ebee525a70e2b01c685e2a18117c9569ad4.QUICStream 25096:Destroyed QUICStream
TypeError: Invalid state: WritableStream is closed

Notice here it's I believe it's the fact that something is still trying to use a particular stream even though the stream object it's associated with the QUICStream is already destroyed and thus closed.

I think there's a lifecycle issue here.

@tegefaulkes
Copy link
Contributor Author

Debugging so far.

I've traced how the error in question gets passed around. I found some potential promise leaks that I patched up but otherwise nothing stood out as the cause.

I did a sanity check and the error itself is almost certainly getting leaked out of a async context that isn't being awaited properly. We couldn't trace this to anywhere in the quic code and after patching the promises I think the quic code is good. I'm going to did through the Polykey code to see if there are any problems with the code there.

I'm adding a shotgun check that every time we throw an error I'm logging out when and where. It generates a lot of noise but hopefully if we catch it in the act we can trace it back to the last point we saw the error and narrow it down from there.

I've modified the code to force the emission of the error. Doing so doesn't trigger the problem at all. So it seems to purely be a race condition causing the leak. This lends evidence to improper handling of a promise rejecting somewhere in the code.

I've tried messing with timeout configuration to trigger the problem but no such luck there.

I'm continuing with

  1. Shotgun logging of throw errors and waiting to catch it in the act.
  2. Extending the scope of tracing the error through Polykey code.
  3. Globally checking any promise created that can reject potentially leaking the error.

@CMCDragonkai
Copy link
Member

Can you explain how some of your recent commits is addressing this.

@tegefaulkes
Copy link
Contributor Author

The following changes have been made so far.

  1. Added .catch(() => {}) to certain promises in @matrixai/quic to prevent a rejection leak if they rejected before being handled.
  2. Added .catch(() => {}) to certain promises in polykey to prevent rejection leaks. The main problem was found in the nodes domain and fixed up.
  3. Added ErrorQUICConnectionIdleTimeout to the list of expected connection related errors.
  4. Went through the code and cleaned up the names of some promises while I was reviewing and fixing things.

@aryanjassal aryanjassal removed their assignment Dec 10, 2024
@tegefaulkes
Copy link
Contributor Author

The ErrorQUICConnectionIdleTimeout should be fully addressed now. I'll close out this issue in a moment. The other details brought up here will be moved to new issues.

@tegefaulkes
Copy link
Contributor Author

Ok, i've split up all of the remaining tasks here into new issues. Closing this issue now.

@CMCDragonkai
Copy link
Member

The following changes have been made so far.

  1. Added .catch(() => {}) to certain promises in @matrixai/quic to prevent a rejection leak if they rejected before being handled.
  2. Added .catch(() => {}) to certain promises in polykey to prevent rejection leaks. The main problem was found in the nodes domain and fixed up.
  3. Added ErrorQUICConnectionIdleTimeout to the list of expected connection related errors.
  4. Went through the code and cleaned up the names of some promises while I was reviewing and fixing things.

Isn't 1. and 2. a sort of problematic? Whenever I see that, it looks like you're discarding exceptions. That may prevent a leak, but then you're just hiding the error. Surely it needs to be bubbled up to a handler for it to be "handled" which may just mean printing it out in some way as a warning or info message.

@CMCDragonkai
Copy link
Member

I haven't seen this for a while now:

ErrorNodeConnectionTransportGenericError: Transport received a generic error

Was this ever addressed @tegefaulkes?

@tegefaulkes
Copy link
Contributor Author

Its a void p.catch(() => {}); handler to prevent a rejection of the promise crashing the program before we add a handler for the promise. There are a few places where that could happen.

the ErrorNodeConnectionTransportGenericError: Transport received a generic error has not been addressed. I'll add it to my current issue.

@CMCDragonkai
Copy link
Member

You did write comments explaining why you added those right? They look weird without context.

@tegefaulkes
Copy link
Contributor Author

I did add comments explaining that they were for preventing unhandled promise rejection leaking errors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working r&d:polykey:core activity 4 End to End Networking behind Consumer NAT Devices
Development

No branches or pull requests

3 participants