[WIP] fix deadlock in concurrent connection disconnection #302
Conversation
Thanks for your patch!
The new problem introduced with #114 is that some buffers might only have to be created once per input port or once per output port. The first implementation of the new dataflow semantics was implemented without that port-wide connection lock, but that caused race conditions when connecting concurrently because the new shared buffer or data object instances were eventually created by two connections, such that one of them would end up in a de-facto disconnected state. This was fixed by #283, which introduced the per-port connection locks. The
Same issue here: without that lock it can happen that one thread disconnects the port, resulting in the destruction of a shared buffer (ConnInputEndPoint.hpp:91-95), while another thread creates a new connection. That could also lead to race conditions and a channel pipeline that is inconsistent with the port connections stored in the
I agree. Your patch looks good at first sight. I will do some more tests and check why it fails to build on Travis.
const ChannelElementBase::shared_ptr &input = *it++;
input->disconnect(this, false);
removeInput(input.get()); // invalidates input
Inputs::const_iterator found = std::find(inputs.begin(), inputs.end(), channel);
Apparently the iterator needs to be non-const for std::list::splice for compatibility with C++98.
Outputs removedOutput;
{
RTT::os::MutexLock lock(outputs_lock);
Outputs::const_iterator found = std::find(outputs.begin(), outputs.end(), channel);
Apparently the iterator needs to be non-const for std::list::splice for compatibility with C++98.
@@ -243,6 +243,12 @@ namespace RTT { namespace base {
MultipleInputsChannelElementBase::removeInput(input);
}

virtual void removedInputs(Inputs const& inputs)
{
if (find(inputs.begin(), inputs.end(), last) != inputs.end())
missing namespace qualifier: std::find
?
RTT::os::MutexLock lock(outputs_lock);
for (Outputs::iterator it = outputs.begin(); it != outputs.end(); ++it) {
if (it->disconnected)
disconnectedOutputs.splice(disconnectedOutputs.end(), this->outputs, it);
This loop is invalid because removing element `it` from `outputs` in `splice()` invalidates the iterator, at least as an iterator of `outputs`. `it` would need to be incremented before the element is removed from `this->outputs`.
…ring This refactoring led to massive deadlocking on a lot of dynamic disconnection scenarios. See orocos-toolchain/rtt#302
This PR aims at fixing the issue identified in #300. The deadlock essentially happens when disconnecting the same ports from two directions. The easiest way to reproduce it is to disconnect the same connection from both directions at the same time, which reliably leads to a deadlock (associated test in bb6fbc1).
So far, the PR fixes the few easy places. However, a few very large code paths are still executed under lock, since removing the lock broke the tests for reasons that are not obvious to me. I figured you guys might have more insight.
In general, my gut feeling is that the design should be to connect/disconnect a channel from a port under the port lock, but to wait until the lock is released before actually dismantling the channel. This would ensure that the high-latency part (hitting the transports) is done without affecting the component.