-
Notifications
You must be signed in to change notification settings - Fork 613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Occasional hang due to deadlock when HALSimWS connection is present while SimDevice
is being created.
#6842
Comments
AsyncFunction needs a mutex because its entire purpose is to signal the loop thread from some other thread, and several member variables are modified on both caller thread(s) and the loop thread. It does look like it should have a recursive_mutex to handle this case (or not hold the lock during the std::apply call, but that's trickier to get right). |
A recursive_mutex would only help if there was only one thread involved in the deadlock, right? Don't the stack traces above I posted indicate that there are 2 threads involved? |
Good point, yeah, it must be a lock inversion issue. In which case we need to release the lock when running the callback to prevent it (either in AsyncFunction or in SimDevice). It's a little unclear if the same AsyncFunction is being used in both threads, but that's the only thing that makes sense. From the stack trace, it looks like thread A has the following locks held, trying to lock AsyncFunction.m_mutex: and Thread B has the following locks held, trying to lock SimDeviceData.m_mutex |
Here's a different pair of stack traces for a different lock inversion deadlock created in the same way: Thread calling
Thread processing new HALSimWS client connection is waiting on
|
My sense is that the best fix will be to fix SimDeviceData to not hold its mutex during callbacks. |
Should ensure a reliable workaround of wpilibsuite/allwpilib#6842. Also refactor some connection closing logic.
SimDevice
is being created.SimDevice
is being created.
SimDevice
is being created.SimDevice
is being created.
Further investigation indicates that the first deadlock above appears to be occurring sometime after the HALSimWS connection has been established, so I've updated the title and comment to reflect that. This also makes the issue harder to workaround because it means that one can't just delay creating Side note: Is there any chance of this getting fixed in a future 2024.x release or will there not be any more 2024.x releases because the season is over? (Also, presumably this should be marked as a bug.) |
No, we will not be making any more 2024.x releases. Our next release will be for 2025 beta. |
Describe the bug
If the HALSimWS server extension is enabled and a client
connectsconnection is present while the robot is creating aSimDevice
, a deadlock can occur.To Reproduce
Due to the race condition involved, it might not be possible to reproduce this reliably, but see below for stack traces showing the deadlock.
Steps to reproduce the behavior:
SimDevice
. (To increase the chances of triggering the issue, it might help to create manySimDevice
s and to create them after some delay to ensure that a HALSimWS client has connected.)While aSimDevice
is being created, connect a HALSimWS client.Expected behavior
The code should continue executing normally (and the device should be created and be visible to the HALSimWS client).
Desktop (please complete the following information):
WPILib Information:
Project Version: unknown
VS Code Version: 1.85.1
WPILib Extension Version: 2024.3.1
C++ Extension Version: 1.19.1
Java Extension Version: 1.32.0
Java Debug Extension Version: 0.57.2024041008
Java Dependencies Extension Version 0.23.6
Java Version: 17
Java Location: /home/brettle/wpilib/2024/jdk
Vendor Libraries:
Additional context
After a deadlock occured, I attached GDB to the java process. Here is the stack trace for the thread attempting to create the
SimDevice
. It appears to be blocking waiting for theAsyncFunction
mutex while running HALSimWS'screateDevice
callback. Note that at this point it is holding a lock on theSimDeviceData
mutex that it acquired inSimDeviceData::CreateDevice
The
AsyncFunction
mutex that the above thread is waiting for is locked by the HALSimWS server thread which is running theDeviceCreated
callback it registered during it's initialization. It appears to be blocked waiting for theSimDeviceData
mutex held by the earlier thread. Here is its stack trace:In short, this appears to be an issue of lock ordering inversion.
As a side note: I'm probably missing something, but it doesn't seem like
AsyncFunction
should need it's own mutex if all of the calls are made from the same thread/loop. If they aren't, what is the reasoning behind that?The text was updated successfully, but these errors were encountered: