Service Call Hangup #143
Comments
As a sanity check, does the system work as expected with rmw_fastrtps_cpp and rmw_cyclonedds_cpp?
I ran the test files in an Iron setup (not in Docker) with … Could it be a temporary issue with …
Apologies for not getting back to you for so long. I've done some more testing and have confirmed on both Rolling and Jazzy that the issue is still present and can be triggered by the same steps as described in the issue. I will say that it may take a fair number of service calls to trigger the hangup, so I would recommend doing something like:
Also, it does not appear to occur with other RMWs. I don't have any other updates on this issue, but just wanted to get back to you guys on this.
@bagelbytes61 thanks for following up. Could I confirm that you tested after #134 was merged? (You'll need to do a clean build and also ensure you kill the daemon with …
@bagelbytes61 I'm unable to reproduce the issue with the latest … I've been continuously making the service call for over 15 minutes now and get a successful response each time. See the video below for a snippet from this test (the warning can be ignored; it is a known issue caused by something else).
ros2_simple_tests.mp4
Hmm, interesting. Could I ask which version of rustc you are using to compile zenoh-c? I have been using 1.75.0, though I just tried 1.78.0 and experienced the same behavior. Also, could I ask about your environment? I am using a fresh Ubuntu 24.04 virtual machine that I …
I'm compiling with … I do have everything built in Release mode. When you start continuously making service calls, do you see the problem immediately or after several minutes? I could try running my test for longer...
Okay, I was able to reproduce it after running for a longer time.
@bagelbytes61 I opened PR jackg0/ros2_simple_tests#3 to add a script that creates a node which makes the service calls indefinitely. I've had this script running for over 2 hours now without any hangups.
We refactored a lot of this earlier this year, so I believe this issue is fixed. Because of that, I'm going to close this out. If that is not the case, please feel free to reopen and we can debug further.
I'd like to preface this by saying that I am not yet sure whether this is an rmw_zenoh issue or an eclipse zenoh/zenoh-c issue.
Some basic system information:
Platform: Ubuntu 24.04
Rustc: 1.75.0
GCC: 9.5.0
ROS2 packages: latest rolling as of 03/27/2024, built from source (will try to reproduce w/ latest and greatest today)
The issue:
It seems that when making service calls to a ROS2 node, the service call will hang and never return (or even segfault, as I have seen, but that was only reproducible with custom debug builds). However, this has only been observed under some specific circumstances. First, the server node in this case must be logging via the RCLCPP_* calls (we use RCLCPP_INFO, for example). Second, there should be a publisher node that the server is subscribed to. We are publishing ~245 KiB at a rate of 50 Hz, so about 12 MiB/s. The second condition isn't strictly necessary, but it seems to help trigger the issue; in my debug builds I didn't need to run the publisher node.
Through my own debugging, it seems that an indefinite wait occurs here (https://github.com/eclipse-zenoh/zenoh/blob/05b9cb459f77693bbf7c89d67265ba1519959814/zenoh/src/net/routing/dispatcher/pubsub.rs#L435), where the reader waits for the writer to release its hold on the RwLock. I have not been able to determine where that write lock is being acquired, if that is in fact what is going on here. However, since the lock is acquired via an RwLockWriteGuard object, ownership should be relinquished as soon as that object is dropped or goes out of scope, at which point the reader is free to acquire the lock. I've also inspected the call stacks of other threads and did not observe any lingering RwLockWriteGuards on the stack.
This problem occurs in both normal operation and shared_memory mode.
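To make the lock behaviour described above concrete, here is a minimal, standalone sketch using std::sync::RwLock. It assumes the routing tables behind the linked pubsub.rs line use a lock with the same semantics as the standard library's RwLock; that has not been checked, and reader_can_proceed is a hypothetical helper, not a zenoh function. It shows three cases: a scoped write guard that blocks readers only while alive, a reader thread that resumes as soon as the guard is dropped, and a leaked guard (std::mem::forget) that keeps the lock write-held even though no RwLockWriteGuard exists on any thread's stack. The third case is only one possibility worth ruling out, not a diagnosis.

```rust
use std::sync::{Arc, RwLock, TryLockError};
use std::thread;
use std::time::Duration;

/// Hypothetical helper: true if a reader could take the lock right now
/// without blocking. The read guard is dropped immediately.
fn reader_can_proceed(lock: &RwLock<Vec<u32>>) -> bool {
    match lock.try_read() {
        Ok(_guard) => true,
        // Held exclusively by a writer: a blocking read() would wait here.
        Err(TryLockError::WouldBlock) => false,
        Err(TryLockError::Poisoned(_)) => panic!("a writer panicked while holding the lock"),
    }
}

fn main() {
    let table = Arc::new(RwLock::new(Vec::<u32>::new()));

    // Case 1: a scoped write guard blocks readers only while it is alive.
    {
        let mut writer = table.write().unwrap();
        writer.push(1);
        assert!(!reader_can_proceed(&table));
    } // RwLockWriteGuard dropped here, releasing the write lock.
    assert!(reader_can_proceed(&table));

    // Case 2: a reader parked in read() resumes once the writer's guard drops.
    let writer = table.write().unwrap();
    let shared = Arc::clone(&table);
    let reader = thread::spawn(move || {
        let len = shared.read().unwrap().len(); // blocks until `writer` is dropped
        len
    });
    thread::sleep(Duration::from_millis(50));
    assert!(!reader.is_finished()); // still waiting on the writer
    drop(writer);
    assert_eq!(reader.join().unwrap(), 1);

    // Case 3: a leaked guard. No RwLockWriteGuard is on any stack, yet the
    // lock stays write-held forever. The lock itself is leaked too, so the
    // still-locked RwLock is never destroyed.
    let stuck: &'static RwLock<Vec<u32>> = Box::leak(Box::new(RwLock::new(Vec::new())));
    std::mem::forget(stuck.write().unwrap()); // unlock never runs
    assert!(!reader_can_proceed(stuck));

    println!("all three cases behaved as described");
}
```

If the hang matches case 3 (or a guard stored inside a long-lived heap object), inspecting thread stacks alone would not reveal the holder, which may be relevant to the observation that no lingering guards were found.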
An interesting note, too, is that publishing directly to /rosout is not sufficient to trigger the bug; logging MUST go through the RCLCPP_* interface.
I will update this issue as I make more debugging progress.
To reproduce:
1. Build and run the Docker container located here: https://github.com/jackg0/ros2_simple_tests
2. Launch the zenoh router:
   ros2 run rmw_zenoh_cpp rmw_zenohd
3. Launch both the simple server and the simple publisher.
4. Trigger the issue by making service calls to the server:
   ros2 service call /simple_server/trigger std_srvs/srv/Trigger
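Because the hang can take many calls to show up, the last step is easier to run unattended. The sketch below repeats the same ros2 service call in a loop and treats any call that has not exited within a timeout as the hang. It is not the script from PR jackg0/ros2_simple_tests#3; run_with_timeout, the 10-second timeout and the default of 10,000 iterations are hypothetical choices made for illustration.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `cmd` and waits up to `timeout` for it to exit.
/// Returns Ok(Some(status)) if it exited, or Ok(None) if it was still
/// running at the deadline (it is then killed and reaped).
fn run_with_timeout(cmd: &mut Command, timeout: Duration) -> io::Result<Option<ExitStatus>> {
    let mut child = cmd.stdout(Stdio::null()).stderr(Stdio::null()).spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            let _ = child.kill(); // it may have exited in the meantime
            child.wait()?; // reap the process so no zombie is left behind
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(20));
    }
}

fn main() {
    // Illustrative defaults (not from the issue): call count and per-call timeout.
    let iterations: u64 = std::env::args()
        .nth(1)
        .and_then(|s| s.parse().ok())
        .unwrap_or(10_000);
    let timeout = Duration::from_secs(10);

    for i in 1..=iterations {
        let mut call = Command::new("ros2");
        call.args(["service", "call", "/simple_server/trigger", "std_srvs/srv/Trigger"]);
        match run_with_timeout(&mut call, timeout) {
            Ok(Some(status)) if status.success() => {}
            Ok(Some(status)) => eprintln!("call {i}: ros2 exited with {status}"),
            Ok(None) => {
                eprintln!("call {i}: no response within {timeout:?}; this looks like the hang");
                return;
            }
            Err(e) => {
                eprintln!("could not run ros2: {e}");
                return;
            }
        }
    }
    println!("{iterations} calls completed without a hang");
}
```

Start the router, server and publisher first: ros2 service call waits for the service to appear, so the timeout would also fire if the server simply is not running.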
I hope this information provides some insight. As I said above, I will be actively debugging this myself and will post updates as they come.