
Registry Agent sometimes does not recover from crossbar disconnect #417

Open
BrianJKoopman opened this issue Dec 13, 2024 · 2 comments
Labels: bug (Something isn't working), needs triage (Cause of bug still unknown, needs investigation)

Comments

@BrianJKoopman (Member) commented Dec 13, 2024

This was encountered at the SO site recently. The logs below include the crash and recovery, which just involved starting up main again.

2024-11-14T16:02:22+0000 start called for main
2024-11-14T16:02:22+0000 main:0 Status is now "starting".
2024-11-14T16:02:22+0000 main:0 Status is now "running".
2024-12-12T23:02:12+0000 dropping connection to peer tcp4:172.43.1.2:8002 with abort=True: WebSocket ping timeout (peer did not respond with pong in time)
2024-12-12T23:02:22+0000 Unhandled error in Deferred:
2024-12-12T23:02:26+0000
2024-12-12T23:02:26+0000 main:0 CRASH: [Failure instance: Traceback: <class 'autobahn.exception.Disconnected'>: Attempt to send on a closed protocol
/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py:873:callback
/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py:980:_startRunCallbacks
/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py:1074:_runCallbacks
/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py:1960:_gotResultInlineCallbacks
--- <exception caught here> ---
/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py:2014:_inlineCallbacks
/opt/venv/lib/python3.10/site-packages/ocs/agents/registry/agent.py:215:main
/opt/venv/lib/python3.10/site-packages/ocs/agents/registry/agent.py:149:_publish_agent_ops
/opt/venv/lib/python3.10/site-packages/ocs/ocs_agent.py:608:publish_to_feed
/opt/venv/lib/python3.10/site-packages/ocs/ocs_feed.py:235:publish_message
/opt/venv/lib/python3.10/site-packages/ocs/ocs_feed.py:135:flush_buffer
/opt/venv/lib/python3.10/site-packages/autobahn/wamp/protocol.py:1543:publish
/opt/venv/lib/python3.10/site-packages/autobahn/wamp/protocol.py:1539:publish
/opt/venv/lib/python3.10/site-packages/autobahn/wamp/websocket.py:137:send
/opt/venv/lib/python3.10/site-packages/autobahn/websocket/protocol.py:2275:sendMessage
]
2024-12-12T23:02:26+0000 main:0 Status is now "done".
2024-12-12T23:02:26+0000 session left: CloseDetails(reason=<wamp.close.transport_lost>, message='WAMP transport was lost without closing the session 5776151733951188 before')
2024-12-12T23:02:26+0000 OCSAgent.onUserError(): "While firing onLeave"
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/txaio/tx.py", line 369, in as_future
    return maybeDeferred(fun, *args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 212, in maybeDeferred
    result = f(*args, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 2287, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 2197, in _cancellableInlineCallbacks
    _inlineCallbacks(None, gen, status, _copy_context())
--- <exception caught here> ---
  File "/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 2014, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "/opt/venv/lib/python3.10/site-packages/ocs/ocs_agent.py", line 268, in onLeave
    self.heartbeat_call.stop()
  File "/opt/venv/lib/python3.10/site-packages/twisted/internet/task.py", line 213, in stop
    assert self.running, "Tried to stop a LoopingCall that was " "not running."
builtins.AssertionError: Tried to stop a LoopingCall that was not running.

2024-12-12T23:02:26+0000 transport disconnected
2024-12-12T23:02:26+0000 waiting for reconnection
2024-12-12T23:02:26+0000 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7f424d6028f0> in 2.277512851549683 seconds.
2024-12-12T23:02:29+0000 transport connected
2024-12-12T23:02:29+0000 session joined: {'authextra': {'x_cb_node': '642fee3f5f32-7',
               'x_cb_peer': 'tcp4:172.43.1.3:37706',
               'x_cb_pid': 14,
               'x_cb_worker': 'worker001'},
 'authid': 'CWGG-99HM-NCGG-YEFC-RAVK-XNQN',
 'authmethod': 'anonymous',
 'authprovider': 'static',
 'authrole': 'iocs_agent',
 'realm': 'test_realm',
 'resumable': False,
 'resume_token': None,
 'resumed': False,
 'serializer': 'msgpack.batched',
 'session': 5483724262252447,
 'transport': {'channel_framing': 'websocket',
               'channel_id': {},
               'channel_serializer': None,
               'channel_type': 'tcp',
               'http_cbtid': None,
               'http_headers_received': None,
               'http_headers_sent': None,
               'is_secure': False,
               'is_server': False,
               'own': None,
               'own_fd': -1,
               'own_pid': 7,
               'own_tid': 7,
               'peer': 'tcp4:172.43.1.2:8002',
               'peer_cert': None,
               'websocket_extensions_in_use': None,
               'websocket_protocol': None}}
2024-12-13T15:18:38+0000 start called for main
2024-12-13T15:18:38+0000 main:1 Status is now "starting".
2024-12-13T15:18:38+0000 main:1 Status is now "running".

ocs version: v0.11.2

BrianJKoopman added the bug and needs triage labels on Dec 13, 2024
@jlashner (Collaborator) commented Dec 13, 2024

So I think the problem here is that the WAMP publish function threw because of the disconnect, which is not usually protected in a try/catch.

I could add a try/catch around the _publish_agent_ops function to try to protect against this, but I think this is a potential issue for any agent with an acq or main function that has unprotected publish calls.
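
For illustration, a guard like that could look roughly as follows (a sketch only: the safe_publish helper and its arguments are made up here, but autobahn.exception.Disconnected and publish_to_feed are the calls that appear in the traceback above):

from autobahn.exception import Disconnected

def safe_publish(agent, feed_name, message):
    """Publish to an OCS feed, tolerating a dropped crossbar connection."""
    try:
        agent.publish_to_feed(feed_name, message)
        return True
    except Disconnected:
        # The WAMP transport is gone; drop this message and rely on the
        # session's reconnect logic instead of crashing the process.
        agent.log.warn("publish to %s failed: transport disconnected" % feed_name)
        return False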

I think a better solution might be to implement some behavior like "restart: always" in the OCSAgent class, where we can tell an OCS agent to always try to restart important processes if they fail for whatever reason.

@BrianJKoopman (Member, Author) commented:

> So I think the problem here is that the WAMP publish function threw because of the disconnect, which is not usually protected in a try/catch.
>
> I could add a try/catch around the _publish_agent_ops function to try to protect against this, but I think this is a potential issue for any agent with an acq or main function that has unprotected publish calls.

I think the problem might actually be in the heartbeat publication's publish_to_feed() call. The AssertionError gets raised because in onLeave() we try to call self.heartbeat_call.stop(), but the task.LoopingCall() is no longer running, likely because of the dropped connection.

I don't think we really care whether the looping call was still running when we try to call stop(); we should just have it ignore this error. It looks like task.LoopingCall.running indicates whether the callable is scheduled to be called or not. I still need to reproduce the error, but perhaps a fix is as simple as:

diff --git a/ocs/ocs_agent.py b/ocs/ocs_agent.py
index 5f17a00..d13366a 100644
--- a/ocs/ocs_agent.py
+++ b/ocs/ocs_agent.py
@@ -265,7 +265,8 @@ class OCSAgent(ApplicationSession):
     def onLeave(self, details):
         self.log.info('session left: {}'.format(details))
         if self.heartbeat_call is not None:
-            self.heartbeat_call.stop()
+            if self.heartbeat_call.running:
+                self.heartbeat_call.stop()

         # Normal shutdown
         if details.reason == "wamp.close.normal":

I'll try to reproduce and test this.
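
As a standalone illustration of that LoopingCall behavior (using task.Clock so no reactor is needed; this is not OCS code, just a demonstration of the .running flag and the assertion seen in the traceback):

from twisted.internet import task

clock = task.Clock()
lc = task.LoopingCall(lambda: None)
lc.clock = clock      # drive the loop from a manual clock instead of the reactor

lc.start(1.0)         # schedules the call; lc.running is now True
assert lc.running

lc.stop()             # fine: cancels the pending call and sets lc.running to False
assert not lc.running

# Calling lc.stop() again at this point, with lc.running already False, raises
# "AssertionError: Tried to stop a LoopingCall that was not running." -- the
# same error onLeave() hits once the heartbeat loop has already stopped.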

> I think a better solution might be to implement some behavior like "restart: always" in the OCSAgent class, where we can tell an OCS agent to always try to restart important processes if they fail for whatever reason.

That's an interesting, but separate, idea. While it would restart the process fine in this case, it doesn't really address the underlying problem of gracefully handling the dropped crossbar connection.
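
For reference, one rough way a policy like that could be sketched (purely hypothetical: run_with_restart, its arguments, and the (ok, message) convention shown here are illustrative, not part of the OCSAgent API):

from twisted.internet import defer, reactor, task

@defer.inlineCallbacks
def run_with_restart(session, process_func, retry_delay=5.0):
    """Re-run a long-lived process function until it exits cleanly.

    process_func(session) is assumed to return or yield an (ok, message)
    pair; an exception or a falsy ok triggers a delayed restart rather
    than letting the process end.
    """
    while True:
        try:
            ok, msg = yield defer.maybeDeferred(process_func, session)
        except Exception as e:
            session.log.error("process crashed: %s; restarting in %ss" % (e, retry_delay))
        else:
            if ok:
                return ok, msg
            session.log.warn("process failed: %s; restarting in %ss" % (msg, retry_delay))
        yield task.deferLater(reactor, retry_delay, lambda: None)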
