Trying to figure out which tests are flaky and fix, adding debug info… #140

Merged: 29 commits into master from improve-test-reliability on Oct 24, 2023

Conversation

@Carter12s (Collaborator) commented Oct 5, 2023

Description

Trying to make CI more robust

Fixes

Closes: #122
Closes: #141

Checklist

  • Update CHANGELOG.md

@Carter12s (Collaborator Author)

Okay @ssnover I feel like I'm just summoning ghosts at this point.

If you have a chance to review the most recent test failure here: https://github.com/Carter12s/roslibrust/actions/runs/6459122424/job/17540392614?pr=140

Examine the current design of that test on 4c858f0

And help me come up with theories on HOW THE FUCK this test could still be timing out without printing anything.

Here is my current mental model:

  • The log line `test tests::verify_get_publications has been running for over 60 seconds` indicates to me that this test must be the problem, and that the overall test runner is healthy.
  • Every single async operation within that test is wrapped in a timeout()
  • The overall test has a watchdog configured on it. I've manually confirmed the watchdog will trip if anything in the body of the test is delayed
  • The test appears to run perfectly on my system and only fails in CI (ran it in a loop in bash, 100/100 times passed). Although I wasn't re-building between runs.
  • The test only sometimes fails in CI
  • tokio::test is defaulting to a single-threaded runtime. I'll try multi-threaded next, but that would mean that somehow some function we're calling within the test is blocking the only tokio thread, preventing the watchdog from firing? (See the sketch after this list.)
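
For reference, a minimal sketch of the test shape described above (illustrative only; `test_body` and the durations are placeholders, not the actual test code):

```rust
use std::time::Duration;

// The whole body is raced against a watchdog sleep, and each async step is
// additionally wrapped in its own timeout. Caveat from the last bullet: on the
// default single-threaded #[tokio::test] runtime, a *blocking* call inside the
// body starves the timer driver, so neither the per-step timeout nor the
// watchdog can ever fire.
#[tokio::test]
async fn watchdog_shape() {
    tokio::select! {
        _ = test_body() => {}
        _ = tokio::time::sleep(Duration::from_secs(60)) => {
            panic!("watchdog tripped: test body stalled");
        }
    }
}

async fn test_body() {
    tokio::time::timeout(Duration::from_secs(5), async {
        // an individual async operation under test would go here
    })
    .await
    .expect("step timed out");
}
```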

@Carter12s (Collaborator Author)

Found this issue: ZcashFoundation/zebra#1631

.to_socket_addrs() could be the blocking culprit.
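
For context, a sketch of why that call matters on a single-threaded runtime and two ways around it (hypothetical helper, not code from this PR):

```rust
use std::net::ToSocketAddrs;

// std's to_socket_addrs() does synchronous DNS resolution. On a single-threaded
// tokio runtime it blocks the only worker thread, so timers (including the test
// watchdog) cannot fire until it returns.
async fn resolve_without_blocking(host: String) -> std::io::Result<Vec<std::net::SocketAddr>> {
    // Option 1: tokio's async resolver.
    let addrs: Vec<_> = tokio::net::lookup_host(host.as_str()).await?.collect();

    // Option 2: push the blocking std call onto tokio's blocking thread pool.
    let _same_addrs = tokio::task::spawn_blocking(move || {
        host.as_str().to_socket_addrs().map(|iter| iter.collect::<Vec<_>>())
    })
    .await
    .expect("blocking task panicked")?;

    Ok(addrs)
}
```

Either way the host string would need to include a port (e.g. "localhost:11311") for the lookup to succeed at runtime.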

@Carter12s (Collaborator Author)

JESUS CHRIST HOW DOES IT STILL TIME OUT

@Carter12s (Collaborator Author)

Leaving a note here before I forget:

We are spawning a tokio task which never dies: it isn't killed on drop, panics don't propagate to it, and it is keeping the test process alive (we think this is the root cause).
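
For illustration, a sketch of that failure shape and one way to kill the task on drop (the names are hypothetical, not the actual types in this crate):

```rust
use std::time::Duration;
use tokio::task::JoinHandle;

// Once its JoinHandle is dropped, a spawned task is detached: it keeps running,
// a panic in the test does not propagate into it, and it can keep the process
// alive past the point where the test should have failed.
struct BackgroundActor {
    handle: JoinHandle<()>,
}

impl BackgroundActor {
    fn spawn() -> Self {
        let handle = tokio::spawn(async {
            loop {
                // stand-in for the real actor loop servicing the socket
                tokio::time::sleep(Duration::from_millis(100)).await;
            }
        });
        Self { handle }
    }
}

impl Drop for BackgroundActor {
    fn drop(&mut self) {
        // Abort the background task so it cannot outlive its owner.
        self.handle.abort();
    }
}
```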

@Carter12s (Collaborator Author)

Okay, I think I've finally, finally, actually traced down the root problem. I had to fall back to "log debugging" because I could neither get a debugger to reproduce the issue reliably, nor get all my mucking about with timeouts to work (still unclear why the timeouts aren't effective), but whatever, there is a root bug.

Inside of node.rs/registerPublisher:

            log::trace!("Created new publication for {topic:?}");
            let handle = channel.get_sender();
            self.publishers.insert(topic.clone(), channel);
            log::trace!("Inserted new publsiher into dashmap");

When the test verify_get_publications hangs, it reliably prints the first log message and not the second. The DashMap is what is causing the deadlock, and it makes sense now why the failure is probabilistic: the deadlock (I believe) only occurs if two entries end up in the same DashMap shard...
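
For illustration, a minimal sketch of the suspected failure mode (not the actual code in node.rs): DashMap's shards are guarded by RwLocks, so holding any Ref into the map while writing to the same shard self-deadlocks, even on a single thread.

```rust
use dashmap::DashMap;

fn shard_deadlock_sketch() {
    let map: DashMap<String, u32> = DashMap::new();
    map.insert("chatter".to_string(), 1);

    // `get` returns a Ref that keeps the shard read-locked for as long as it lives.
    let held = map.get("chatter").unwrap();

    // Same key, therefore same shard: this insert needs the shard's write lock
    // and blocks forever. The line after it is never reached.
    map.insert("chatter".to_string(), 2);

    println!("{}", *held);
}
```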

Will continue debugging and try to solve the root issue. Unclear what I'm ultimately going to do with this branch, as it has turned into a giant set of changes while I randomly poked at things to find the root cause.

@Carter12s requested a review from ssnover October 23, 2023 00:31
@Carter12s (Collaborator Author)

Okay @ssnover, I think this is finally ready to merge. DashMap was the root of the problem as far as I can tell.

@Carter12s mentioned this pull request Oct 23, 2023
@ssnover (Collaborator) left a comment

A lot of cruft built up while debugging this test I think.

@@ -30,6 +30,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Changed

+- Removed `find_and_generate_ros_messages_relative_to_manifest_dir!` this proc_macro was changing the current working directory of the compilation job resulting in a variety of strange compilation behaviors. Build.rs scripts are recommended for use cases requiring fine
@ssnover (Collaborator)

I'm not sure it was necessary to remove this macro entirely. It just didn't need to change the working directory; it only needed to prepend the provided paths with the manifest directory path (see the sketch below).
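
A minimal sketch of that alternative (illustrative, not the macro's actual implementation): resolve user-supplied relative paths against CARGO_MANIFEST_DIR instead of changing the working directory.

```rust
use std::path::{Path, PathBuf};

// cargo sets CARGO_MANIFEST_DIR for the crate being compiled, so a proc macro
// or build.rs can anchor relative paths to it without touching the working
// directory of the compilation job.
fn resolve_relative_to_manifest(user_path: &str) -> PathBuf {
    let manifest_dir =
        std::env::var("CARGO_MANIFEST_DIR").expect("cargo always sets CARGO_MANIFEST_DIR");
    let user_path = Path::new(user_path);
    if user_path.is_absolute() {
        user_path.to_path_buf()
    } else {
        Path::new(&manifest_dir).join(user_path)
    }
}
```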

@Carter12s (Collaborator Author)

I want to remove it because I think it is just a bad API (the method name alone is a joke). We could have fixed it, but I really think this macro came about from me not understanding when macros are expanded and what the working directory / cargo manifest dir are under different circumstances.

I think we'll ultimately figure out a new API for this in the future, for the same reason.

@@ -4,7 +4,7 @@

 #[cfg(feature = "ros1")]
 #[tokio::main]
-async fn main() -> Result<(), Box<dyn std::error::Error>> {
+async fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
@ssnover (Collaborator)

Why make the return of main have these bounds? I think it's just noise.

@Carter12s (Collaborator Author)

I think we should punt to #118

I ended up having to add this bound in a number of places to make it valid to return the errors from tasks.
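
A small sketch of why the bound keeps appearing (illustrative, using only std and tokio types): an error that crosses a tokio::spawn boundary has to be Send, and a plain Box<dyn Error> is not.

```rust
use std::error::Error;

async fn task_body() -> Result<(), Box<dyn Error + Send + Sync>> {
    // With a plain Box<dyn Error> here the future's output would not be Send,
    // and tokio::spawn requires the spawned future (and its output) to be Send.
    Ok(())
}

async fn caller() -> Result<(), Box<dyn Error + Send + Sync>> {
    // The first `?` converts the JoinError into the boxed error type,
    // the second `?` propagates the task's own error.
    tokio::spawn(task_body()).await??;
    Ok(())
}
```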

@@ -26,8 +28,10 @@ impl<T: RosMessageType> Publisher<T> {
     }
 }

-pub async fn publish(&self, data: &T) -> Result<(), Box<dyn std::error::Error>> {
+pub async fn publish(&self, data: &T) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
 let data = serde_rosmsg::to_vec(&data)?;
@ssnover (Collaborator)

If a RosLibRustError is being mapped here, just use that type instead of boxing it into something with dyn trait bounds.

@Carter12s (Collaborator Author)

Punt to #118

(Additional review threads on roslibrust/src/rosbridge/integration_tests.rs and roslibrust/tests/ros1_xmlrpc.rs were resolved.)
@ssnover (Collaborator) commented Oct 23, 2023

I don't see any changes related to DashMap, can you explain?

@@ -229,11 +243,11 @@ pub struct Node {
 // Receiver for requests to the Node actor
 node_msg_rx: mpsc::UnboundedReceiver<NodeMsg>,
 // Map of topic names to the publishing channels associated with the topic
-publishers: DashMap<String, Publication>,
+publishers: HashMap<String, Publication>,
@Carter12s (Collaborator Author)

@ssnover DashMap changes are here.

@Carter12s (Collaborator Author)

> I don't see any changes related to DashMap, can you explain?

I think you missed reviewing node.rs (you need to expand the diff). Essentially, I had originally used DashMap as the core data storage mechanism in the ros1::node to allow concurrent access. However, you changed the execution pattern so that the node is only ever accessed by the single "node task" receiving from the socket, so the concurrency protections of DashMap aren't needed (no one actually needs simultaneous mutable access to Node).

Somehow (and I really am not 100% sure how) DashMap and tokio were fighting with each other, and the DashMap was ending up in a deadlock even though it is only ever accessed by one task. The DashMap deadlock was (I believe) the root cause of the timeouts / watchdogs not working, because Drop for Node was deadlocking since the DashMap couldn't be destructed...
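
A sketch of the resulting pattern (names are illustrative, not the actual definitions in node.rs): one actor task owns a plain HashMap and is the only thing that ever touches it, so no concurrent map is required.

```rust
use std::collections::HashMap;
use tokio::sync::mpsc;

enum ActorMsg {
    RegisterPublisher { topic: String },
}

struct Publication; // stand-in for the real publication type

async fn node_actor(mut rx: mpsc::UnboundedReceiver<ActorMsg>) {
    // Owned exclusively by this task; no locking or sharding is needed because
    // all mutation happens sequentially inside this loop.
    let mut publishers: HashMap<String, Publication> = HashMap::new();
    while let Some(msg) = rx.recv().await {
        match msg {
            ActorMsg::RegisterPublisher { topic } => {
                publishers.insert(topic, Publication);
            }
        }
    }
}
```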

@Carter12s (Collaborator Author)

> A lot of cruft built up while debugging this test I think.

You don't even know man... This is after I cleaned up quite a bit...

@ssnover (Collaborator) commented Oct 23, 2023

Makes sense to remove the DashMap in that context, though I'm not convinced we know the root cause fully.

If there isn't a pressing need to put out a new release, can you use test_log throughout and remove some of the changes that were sanity checks? I think the error stuff can be punted, but I don't think we need to add new cruft unnecessarily.
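
For reference, a sketch of what that might look like with the test_log crate (assuming it as a dev-dependency; the test name and body are placeholders):

```rust
// test_log initializes the logger for each test, so per-test env_logger setup
// and ad-hoc debug prints added as sanity checks can be removed.
#[test_log::test(tokio::test)]
async fn verify_something() {
    log::debug!("captured by the test harness and shown on failure");
    // ... actual test body ...
}
```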

@Carter12s requested a review from ssnover October 23, 2023 23:06
@ssnover (Collaborator) left a comment

Thanks for humoring me! Nice job tracking down the issues!

@Carter12s merged commit 923a6a6 into master Oct 24, 2023
4 checks passed
@Carter12s deleted the improve-test-reliability branch May 21, 2024 21:15
Merging this pull request closed the issues "What is going on with doctest paths in Cargo" and "Hung Integration Tests in Workflow".