[WIP] fix deadlock in concurrent connection disconnection #302
Conversation
Thanks for your patch!
The new problem introduced with #114 is that some buffers might only have to be created once per input port or once per output port. The first implementation of the new dataflow semantics was implemented without that port-wide connection lock, but that caused race conditions when connecting concurrently because the new shared buffer or data object instances were eventually created by two connections, such that one of them would end up in a de-facto disconnected state. This was fixed by #283, which introduced the per-port connection locks. The
Same issue here: without that lock it can happen that one thread disconnects the port, resulting in the destruction of a shared buffer (ConnInputEndPoint.hpp:91-95), while another thread creates a new connection. That could also lead to race conditions and a channel pipeline that is inconsistent with the port connections stored in the
I agree. Your patch looks good at first sight. I will do some more tests and check why it fails to build on Travis.
const ChannelElementBase::shared_ptr &input = *it++;
input->disconnect(this, false);
removeInput(input.get()); // invalidates input
Inputs::const_iterator found = std::find(inputs.begin(), inputs.end(), channel);
Apparently the iterator needs to be non-const for std::list::splice for compatibility with C++98.
Outputs removedOutput;
{
RTT::os::MutexLock lock(outputs_lock);
Outputs::const_iterator found = std::find(outputs.begin(), outputs.end(), channel);
Apparently the iterator needs to be non-const for std::list::splice for compatibility with C++98.
@@ -243,6 +243,12 @@ namespace RTT { namespace base {
MultipleInputsChannelElementBase::removeInput(input);
}

virtual void removedInputs(Inputs const& inputs)
{
if (find(inputs.begin(), inputs.end(), last) != inputs.end())
missing namespace qualifier: std::find
?
RTT::os::MutexLock lock(outputs_lock);
for (Outputs::iterator it = outputs.begin(); it != outputs.end(); ++it) {
if (it->disconnected)
disconnectedOutputs.splice(disconnectedOutputs.end(), this->outputs, it);
This loop is invalid because removing element `it` from `outputs` in `splice()` invalidates the iterator, at least as an iterator of `outputs`. `it` would need to be incremented before the element is removed from `this->outputs`.
…ring This refactoring led to massive deadlocking on a lot of dynamic disconnection scenarios. See orocos-toolchain/rtt#302
This PR aims at fixing the issue identified in #300. The deadlock essentially happens when disconnecting the same ports from two directions. The easiest way to reproduce it is to disconnect the same connection from both directions at the same time, which reliably leads to a deadlock (associated test in bb6fbc1).
So far, the PR fixes the few easy places. However, a few very large code paths are still executed under lock, since removing the lock broke the tests for reasons that are not obvious to me. I figured you guys might have more insight.
In general, my gut feeling is that the design should be to connect/disconnect a channel from a port under the port lock, but to wait until the lock is released before actually dismantling the channel. This would ensure that the high-latency part (hitting the transports) is done without affecting the component.