This was encountered at the SO site recently. The logs below include the crash and recovery, which just involved starting up main again.
2024-11-14T16:02:22+0000 start called for main
2024-11-14T16:02:22+0000 main:0 Status is now "starting".
2024-11-14T16:02:22+0000 main:0 Status is now "running".
2024-12-12T23:02:12+0000 dropping connection to peer tcp4:172.43.1.2:8002 with abort=True: WebSocket ping timeout (peer did not respond with pong in time)
2024-12-12T23:02:22+0000 Unhandled error in Deferred:
2024-12-12T23:02:26+0000
2024-12-12T23:02:26+0000 main:0 CRASH: [Failure instance: Traceback: <class 'autobahn.exception.Disconnected'>: Attempt to send on a closed protocol
/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py:873:callback
/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py:980:_startRunCallbacks
/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py:1074:_runCallbacks
/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py:1960:_gotResultInlineCallbacks
--- <exception caught here> ---
/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py:2014:_inlineCallbacks
/opt/venv/lib/python3.10/site-packages/ocs/agents/registry/agent.py:215:main
/opt/venv/lib/python3.10/site-packages/ocs/agents/registry/agent.py:149:_publish_agent_ops
/opt/venv/lib/python3.10/site-packages/ocs/ocs_agent.py:608:publish_to_feed
/opt/venv/lib/python3.10/site-packages/ocs/ocs_feed.py:235:publish_message
/opt/venv/lib/python3.10/site-packages/ocs/ocs_feed.py:135:flush_buffer
/opt/venv/lib/python3.10/site-packages/autobahn/wamp/protocol.py:1543:publish
/opt/venv/lib/python3.10/site-packages/autobahn/wamp/protocol.py:1539:publish
/opt/venv/lib/python3.10/site-packages/autobahn/wamp/websocket.py:137:send
/opt/venv/lib/python3.10/site-packages/autobahn/websocket/protocol.py:2275:sendMessage
]
2024-12-12T23:02:26+0000 main:0 Status is now "done".
2024-12-12T23:02:26+0000 session left: CloseDetails(reason=<wamp.close.transport_lost>, message='WAMP transport was lost without closing the session 5776151733951188 before')
2024-12-12T23:02:26+0000 OCSAgent.onUserError(): "While firing onLeave"
Traceback (most recent call last):
File "/opt/venv/lib/python3.10/site-packages/txaio/tx.py", line 369, in as_future
return maybeDeferred(fun, *args, **kwargs)
File "/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 212, in maybeDeferred
result = f(*args, **kwargs)
File "/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 2287, in unwindGenerator
return _cancellableInlineCallbacks(gen)
File "/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 2197, in _cancellableInlineCallbacks
_inlineCallbacks(None, gen, status, _copy_context())
--- <exception caught here> ---
File "/opt/venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 2014, in _inlineCallbacks
result = context.run(gen.send, result)
File "/opt/venv/lib/python3.10/site-packages/ocs/ocs_agent.py", line 268, in onLeave
self.heartbeat_call.stop()
File "/opt/venv/lib/python3.10/site-packages/twisted/internet/task.py", line 213, in stop
assert self.running, "Tried to stop a LoopingCall that was " "not running."
builtins.AssertionError: Tried to stop a LoopingCall that was not running.
2024-12-12T23:02:26+0000 transport disconnected
2024-12-12T23:02:26+0000 waiting for reconnection
2024-12-12T23:02:26+0000 Scheduling retry 1 to connect <twisted.internet.endpoints.TCP4ClientEndpoint object at 0x7f424d6028f0> in 2.277512851549683 seconds.
2024-12-12T23:02:29+0000 transport connected
2024-12-12T23:02:29+0000 session joined: {'authextra': {'x_cb_node': '642fee3f5f32-7',
'x_cb_peer': 'tcp4:172.43.1.3:37706',
'x_cb_pid': 14,
'x_cb_worker': 'worker001'},
'authid': 'CWGG-99HM-NCGG-YEFC-RAVK-XNQN',
'authmethod': 'anonymous',
'authprovider': 'static',
'authrole': 'iocs_agent',
'realm': 'test_realm',
'resumable': False,
'resume_token': None,
'resumed': False,
'serializer': 'msgpack.batched',
'session': 5483724262252447,
'transport': {'channel_framing': 'websocket',
'channel_id': {},
'channel_serializer': None,
'channel_type': 'tcp',
'http_cbtid': None,
'http_headers_received': None,
'http_headers_sent': None,
'is_secure': False,
'is_server': False,
'own': None,
'own_fd': -1,
'own_pid': 7,
'own_tid': 7,
'peer': 'tcp4:172.43.1.2:8002',
'peer_cert': None,
'websocket_extensions_in_use': None,
'websocket_protocol': None}}
2024-12-13T15:18:38+0000 start called for main
2024-12-13T15:18:38+0000 main:1 Status is now "starting".
2024-12-13T15:18:38+0000 main:1 Status is now "running".
ocs version: v0.11.2
So I think the problem here is that the WAMP publish call raised because of the disconnect, and it is not usually protected by a try/except.
I could add a try/except around the _publish_agent_ops function to guard against this, but I think this is a potential issue for any agent with an acq or main function that has unprotected publish calls.
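For illustration, something like the sketch below is what I mean by protecting publish calls in an agent's data loop. Only publish_to_feed and autobahn.exception.Disconnected come from the traceback above; the feed name, payload, read_sensor helper, and loop flag are made up for the example.

```python
import time

from autobahn.exception import Disconnected  # the exception from the crash log


def read_sensor():
    """Hypothetical data source, standing in for real hardware."""
    return 0.0


def acq(self, session, params=None):
    """Sketch of an agent acq method whose publish calls survive a dropped session."""
    while self.take_data:  # illustrative loop flag, not part of ocs
        data = {'block_name': 'temps',
                'timestamp': time.time(),
                'data': {'temp': read_sensor()}}
        try:
            self.agent.publish_to_feed('temperatures', data)
        except Disconnected:
            # The crossbar connection dropped mid-publish; log and keep looping
            # instead of letting the exception propagate and kill the process.
            self.log.warn('publish_to_feed failed: WAMP session disconnected')
        time.sleep(1)
    return True, 'Acquisition exited cleanly.'
```

Whether a failed publish should just be skipped like this, or buffered and retried after reconnection, is a separate design question.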
I think a better solution might be to implement some behavior like "restart: always" in the OCSAgent class, where we can tell an OCS agent to always try to restart important processes if they fail for whatever reason.
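As a very rough sketch of that idea (none of these names exist in ocs; this is only the general shape of a restart-on-failure wrapper around a process function):

```python
import time
import traceback


def run_with_restart(process_func, restart_delay=5.0, max_restarts=None, log=None):
    """Keep re-invoking process_func until it returns success or the restart
    budget is exhausted. process_func is expected to return (ok, message)."""
    attempts = 0
    while True:
        try:
            ok, message = process_func()
            if ok:
                return True, message
        except Exception:
            if log is not None:
                log.error('Process crashed:\n' + traceback.format_exc())
        attempts += 1
        if max_restarts is not None and attempts > max_restarts:
            return False, 'Giving up after %d restart attempts.' % attempts
        time.sleep(restart_delay)  # back off briefly before restarting
```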
> So I think the problem here is that the WAMP publish call raised because of the disconnect, and it is not usually protected by a try/except.
> I could add a try/except around the _publish_agent_ops function to guard against this, but I think this is a potential issue for any agent with an acq or main function that has unprotected publish calls.
I don't think we really care if the looping call was running or not when we try to call stop. We should have it ignore this error. It looks like task.LoopingCall.running indicates whether the callable is scheduled to be called or not. I need to recreate the error, but perhaps a fix is as simple as:
```diff
diff --git a/ocs/ocs_agent.py b/ocs/ocs_agent.py
index 5f17a00..d13366a 100644
--- a/ocs/ocs_agent.py
+++ b/ocs/ocs_agent.py
@@ -265,7 +265,8 @@ class OCSAgent(ApplicationSession):
     def onLeave(self, details):
         self.log.info('session left: {}'.format(details))
         if self.heartbeat_call is not None:
-            self.heartbeat_call.stop()
+            if self.heartbeat_call.running:
+                self.heartbeat_call.stop()
         # Normal shutdown
         if details.reason == "wamp.close.normal":
```
I'll try to reproduce and test this.
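One way to see the failure mode without a full agent is to drive a LoopingCall with twisted.internet.task.Clock; this standalone sketch shows the unguarded stop raising and the proposed guard avoiding it:

```python
from twisted.internet import task


def heartbeat():
    pass


lc = task.LoopingCall(heartbeat)
lc.clock = task.Clock()   # drive the loop manually, no real reactor needed

# Stopping a LoopingCall that was never started raises AssertionError --
# the same failure seen in onLeave above.
try:
    lc.stop()
except AssertionError as err:
    print('unguarded stop:', err)

# The proposed guard: only stop the loop when it is actually running.
lc.start(1.0, now=False)
if lc.running:
    lc.stop()
print('guarded stop ok, running =', lc.running)
```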
> I think a better solution might be to implement some behavior like "restart: always" in the OCSAgent class, where we can tell an OCS agent to always try to restart important processes if they fail for whatever reason.
That's an interesting, but separate, idea. While it would restart the process fine in this case, it doesn't really address the underlying problem of gracefully handling the dropped crossbar connection.