Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Service Call Hangup #143

Closed
bagelbytes61 opened this issue Mar 29, 2024 · 10 comments
Closed

Service Call Hangup #143

bagelbytes61 opened this issue Mar 29, 2024 · 10 comments
Assignees

Comments

@bagelbytes61
Copy link

bagelbytes61 commented Mar 29, 2024

I'd like to preface this with I am not quite sure yet if this is an rmw_zenoh issue or an eclipse zenoh/zenoh-c issue.

Some basic system information :
Platform: Ubuntu 24.04
Rustc: 1.75.0
GCC: 9.5.0
ROS2 packages: latest rolling as of 03/27/2024, built from source (will try to reproduce w/ latest and greatest today)

The issue:
It seems that when making service calls to a ROS2 node that the service call will hangup and not return (or even just segfault as I have seen, but that was only reproducible w/ custom debug builds), however this has only been observed to occur under some specific circumstances. Firstly, the server node in this case must be logging via the RCLCPP_* calls (we use RCLCPP_INFO for example). Secondly, there should be a publisher node that the server is subscribed to. We are publishing ~245 KiB at a rate of 50 Hz, so about 12 MiB/s. The second step isn't strictly necessary, but it seems to help trigger the issue. In my debug builds I didn't need to run the publisher node.

Through my own debugging it seems that an indefinite wait occurs here (https://github.com/eclipse-zenoh/zenoh/blob/05b9cb459f77693bbf7c89d67265ba1519959814/zenoh/src/net/routing/dispatcher/pubsub.rs#L435) waiting for the writer to release its hold of the RwLock. I have been unsuccessful in determining where that writer lock is being acquired, if that is in fact what is going on here, however as acquiring said lock is done via an RwLockWriteGuard object, whenever that object is dropped/goes out of scope, ownership should at that point be relinquished and the reader free to acquire the lock. I've also inspected the call stacks of other threads and did not observe any lingering RwLockWriteGuard's still on the stack.

This problem occurs in both normal operation and shared_memory mode.

An interesting note too is that directly publishing to /rosout is not sufficient to trigger the bug -- logging MUST be via the RCLCPP_* interface.

I will update this issue as I make more debugging progress.

To reproduce:

Build and run the Docker container located here: https://github.com/jackg0/ros2_simple_tests

Launch the zenoh router via ros2 run rmw_zenoh_cpp rmw_zenohd

Launch both the simple server and the simple publisher

Trigger the issue via making service calls to the server with ros2 service call /simple_server/trigger std_srvs/srv/Trigger

I hope that this information can provide some insight and as I said above I will be actively debugging this myself and providing updates as they come.

@Yadunund
Copy link
Member

Yadunund commented Apr 2, 2024

For sanity check, does the system work as expected with rmw_fastrtps_cpp and rmw_cyclonedds_cpp?

@JEnoch
Copy link
Contributor

JEnoch commented Apr 5, 2024

I ran the test files in a Iron setup (not in Docker) with rmw_cyclonedds_cpp and rmw_zenoh_cpp (both rolling and #146 branches).
The ros2 service call /simple_server/trigger std_srvs/srv/Trigger was always successful.

Could it be a temporary issue with rclcpp in Rolling or with your Dockerized environment ?

@clalancette clalancette self-assigned this Apr 18, 2024
@bagelbytes61
Copy link
Author

Apologies for not getting back to you for so long.

I've done some more testing and have confirmed on both Rolling and Jazzy that the issue is still present and can be triggered by the same steps as described in the issue. I will say that it may take a fair bit of service calls to trigger the hang up, so I would recommend doing something like:

while true; do 
ros2 service call /simple_server/trigger std_srvs/srv/Trigger; 
done

Also it does not appear to occur w/ other RMW's.

I don't have any other updates on this issue, but just wanted to get back to you guys on this.

@Yadunund
Copy link
Member

@bagelbytes61 thanks for following up. Could I confirm that you tested after #134 was merged? (You'll need to do a clean build and also ensure you kill the daemon with ros2 daemon stop before that)

@Yadunund
Copy link
Member

@bagelbytes61 I'm unable to reproduce the issue with the latest rolling branch in rmw_zenoh.

I've been continuously making the service call for over 15mins now and am able to get a successful response each time. See video below for a snippet from this test (the warning can be ignored as it is a known issue and caused by something else).

ros2_simple_tests.mp4

@bagelbytes61
Copy link
Author

Hmm. Interesting. Could I ask which version of rustc you are using to compile zenoh-c? I have been using 1.75.0, though I just tried 1.78.0 and experienced the same behavior. Also could I ask about your environment? I am using a fresh Ubuntu 24.04 virtual machine that I apt install'd ros-rolling-desktop onto and built the rmw_zenoh_cpp package with GCC 13.2.0 and rustc 1.75/1.78.

@Yadunund
Copy link
Member

I'm compiling with 1.75.0 rustc.
Clean Ubuntu 24.04 but I have ROS 2 Rolling built from source. I don't see the problem with Iron binaries on 22.04 either.

I do have everything built in Release mode.

When you start continuously making service calls, do you see the problem immediately or after several minutes? I could try running my test for longer...

@Yadunund
Copy link
Member

Okay I was able to reproduce it after running for a longer time.

@Yadunund
Copy link
Member

Yadunund commented May 11, 2024

@bagelbytes61 I opened PR jackg0/ros2_simple_tests#3 to add a script that creates a node which makes the service calls indefinitely. I've had this script running for over 2 hours now without any hangup issues.
There is one big difference between between this script and running ros2 service call ... within a bash while loop. In the former, the ROS context and hence zenoh session is created once and then service calls are made indefinitely. In the latter, the context and session is created each time the ros2 service call command is run and more importantly destructed each time. There are some known issues related to shutting down the zenoh session eg. #170 which could be the underlying cause.

@clalancette
Copy link
Collaborator

We refactored a lot of this earlier this year, so I believe that this issue is fixed. Because of that, I'm going to close this out. If that is not the case, please feel free to reopen and we can debug some more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants