Trying to figure out which tests are flaky and fix, adding debug info… #140

Merged: 29 commits into master from improve-test-reliability on Oct 24, 2023

Conversation

@Carter12s (Collaborator) commented Oct 5, 2023

Description

Trying to make CI more robust

Fixes

Closes: #122
Closes: #141

Checklist

  • Update CHANGELOG.md

@Carter12s (Collaborator Author)

Okay @ssnover I feel like I'm just summoning ghosts at this point.

If you have a chance to review the most recent test failure here: https://github.com/Carter12s/roslibrust/actions/runs/6459122424/job/17540392614?pr=140

Examine the current design of that test on 4c858f0

And help me come up with theories on HOW THE FUCK this test could still be timing out without printing anything.

Here is my current mental model:

  • The log line `test tests::verify_get_publications has been running for over 60 seconds` indicates to me that this test must be the problem, and that the overall test runner is healthy.
  • Every single async operation within that test is wrapped in a timeout()
  • The overall test has a watchdog configured on it. I've manually confirmed the watchdog will trip if anything in the body of the test is delayed
  • The test appears to run perfectly on my system and only fails in CI (ran it in a loop in bash, 100/100 times passed). Although I wasn't re-building between runs.
  • The test only sometimes fails in CI
  • tokio::test is defaulting to a single-threaded runtime. I'll try multi-threaded next, but that would mean that somehow some function we're calling within the test is blocking the only tokio thread, preventing the watchdog from firing? (See the sketch after this list.)
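
For reference, a minimal sketch of the test shape described above (illustrative only; `test_body` and the durations are placeholders, not the actual test code):

```rust
use std::time::Duration;

// The whole body is raced against a watchdog sleep, and each async step is
// additionally wrapped in its own timeout. Caveat from the last bullet: on the
// default single-threaded #[tokio::test] runtime, a *blocking* call inside the
// body starves the timer driver, so neither the per-step timeout nor the
// watchdog can ever fire.
#[tokio::test]
async fn watchdog_shape() {
    tokio::select! {
        _ = test_body() => {}
        _ = tokio::time::sleep(Duration::from_secs(60)) => {
            panic!("watchdog tripped: test body stalled");
        }
    }
}

async fn test_body() {
    tokio::time::timeout(Duration::from_secs(5), async {
        // an individual async operation under test would go here
    })
    .await
    .expect("step timed out");
}
```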

@Carter12s (Collaborator Author)

Found this issue: ZcashFoundation/zebra#1631

.to_socket_addrs() could be the blocking culprit.
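
For context, a sketch of why that call matters on a single-threaded runtime and two ways around it (hypothetical helper, not code from this PR):

```rust
use std::net::ToSocketAddrs;

// std's to_socket_addrs() does synchronous DNS resolution. On a single-threaded
// tokio runtime it blocks the only worker thread, so timers (including the test
// watchdog) cannot fire until it returns.
async fn resolve_without_blocking(host: String) -> std::io::Result<Vec<std::net::SocketAddr>> {
    // Option 1: tokio's async resolver.
    let addrs: Vec<_> = tokio::net::lookup_host(host.as_str()).await?.collect();

    // Option 2: push the blocking std call onto tokio's blocking thread pool.
    let _same_addrs = tokio::task::spawn_blocking(move || {
        host.as_str().to_socket_addrs().map(|iter| iter.collect::<Vec<_>>())
    })
    .await
    .expect("blocking task panicked")?;

    Ok(addrs)
}
```

Either way the host string would need to include a port (e.g. "localhost:11311") for the lookup to succeed at runtime.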

@Carter12s (Collaborator Author)

JESUS CHRIST HOW DOES IT STILL TIME OUT

@Carter12s (Collaborator Author)

Leaving a note here before I forget:

We are spawning a tokio task which never dies: it isn't killed on drop, panics don't propagate to it, and it is keeping the test process alive (we think this is the root cause).
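
For illustration, a sketch of that failure shape and one way to kill the task on drop (the names are hypothetical, not the actual types in this crate):

```rust
use std::time::Duration;
use tokio::task::JoinHandle;

// Once its JoinHandle is dropped, a spawned task is detached: it keeps running,
// a panic in the test does not propagate into it, and it can keep the process
// alive past the point where the test should have failed.
struct BackgroundActor {
    handle: JoinHandle<()>,
}

impl BackgroundActor {
    fn spawn() -> Self {
        let handle = tokio::spawn(async {
            loop {
                // stand-in for the real actor loop servicing the socket
                tokio::time::sleep(Duration::from_millis(100)).await;
            }
        });
        Self { handle }
    }
}

impl Drop for BackgroundActor {
    fn drop(&mut self) {
        // Abort the background task so it cannot outlive its owner.
        self.handle.abort();
    }
}
```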

@Carter12s (Collaborator Author)

Okay, I think I've finally, finally, actually traced down the root problem. I had to fall back to "log debugging" because I could neither get a debugger to reproduce the issue reliably, nor get all my mucking about with timeouts to work (still unclear why the timeouts aren't effective), but whatever, there is a root bug.

Inside of node.rs/registerPublisher:

            log::trace!("Created new publication for {topic:?}");
            let handle = channel.get_sender();
            self.publishers.insert(topic.clone(), channel);
            log::trace!("Inserted new publsiher into dashmap");

When the test verify_get_publications hangs, it reliably prints the first log message and not the second. The DashMap is what is causing the deadlock, and it makes sense now why the failure is probabilistic: the deadlock (I believe) only occurs if two entries end up in the same DashMap shard...
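
For illustration, a minimal sketch of the suspected failure mode (not the actual code in node.rs): DashMap's shards are guarded by RwLocks, so holding any Ref into the map while writing to the same shard self-deadlocks, even on a single thread.

```rust
use dashmap::DashMap;

fn shard_deadlock_sketch() {
    let map: DashMap<String, u32> = DashMap::new();
    map.insert("chatter".to_string(), 1);

    // `get` returns a Ref that keeps the shard read-locked for as long as it lives.
    let held = map.get("chatter").unwrap();

    // Same key, therefore same shard: this insert needs the shard's write lock
    // and blocks forever. The line after it is never reached.
    map.insert("chatter".to_string(), 2);

    println!("{}", *held);
}
```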

Will continue debugging and try to solve the root issue. Unclear what I'm ultimately going to do with this branch, as it has turned into a giant set of changes while I randomly poked at things to find the root cause.

@Carter12s requested a review from ssnover October 23, 2023 00:31
@Carter12s (Collaborator Author)

Okay @ssnover, I think this is finally ready to merge. DashMap was the root of the problem as far as I can tell.

@Carter12s mentioned this pull request Oct 23, 2023
@ssnover (Collaborator) left a comment

A lot of cruft built up while debugging this test I think.

@@ -30,6 +30,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Changed

+- Removed `find_and_generate_ros_messages_relative_to_manifest_dir!` this proc_macro was changing the current working directory of the compilation job resulting in a variety of strange compilation behaviors. Build.rs scripts are recommended for use cases requiring fine
@ssnover (Collaborator)

I'm not sure it was necessary to remove this macro entirely. It just didn't need to change the working directory; it only needed to prepend the provided paths with the manifest directory path (see the sketch below).
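
A minimal sketch of that alternative (illustrative, not the macro's actual implementation): resolve user-supplied relative paths against CARGO_MANIFEST_DIR instead of changing the working directory.

```rust
use std::path::{Path, PathBuf};

// cargo sets CARGO_MANIFEST_DIR for the crate being compiled, so a proc macro
// or build.rs can anchor relative paths to it without touching the working
// directory of the compilation job.
fn resolve_relative_to_manifest(user_path: &str) -> PathBuf {
    let manifest_dir =
        std::env::var("CARGO_MANIFEST_DIR").expect("cargo always sets CARGO_MANIFEST_DIR");
    let user_path = Path::new(user_path);
    if user_path.is_absolute() {
        user_path.to_path_buf()
    } else {
        Path::new(&manifest_dir).join(user_path)
    }
}
```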

@Carter12s (Collaborator Author)

I want to remove it because I think it is just a bad API (the method name alone is a joke). We could have fixed it, but I really think this macro came about from me not understanding when macros are expanded and what the working directory / cargo manifest dir are under different circumstances.

I think we'll ultimately figure out a new API for this in the future, for the same reason.

@@ -4,7 +4,7 @@

 #[cfg(feature = "ros1")]
 #[tokio::main]
-async fn main() -> Result<(), Box<dyn std::error::Error>> {
+async fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
@ssnover (Collaborator)

Why make the return of main have these bounds? I think it's just noise.

@Carter12s (Collaborator Author)

I think we should punt to #118

I ended up having to add this bound in a number of places to make it valid to return the errors from tasks.
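
A small sketch of why the bound keeps appearing (illustrative, using only std and tokio types): an error that crosses a tokio::spawn boundary has to be Send, and a plain Box<dyn Error> is not.

```rust
use std::error::Error;

async fn task_body() -> Result<(), Box<dyn Error + Send + Sync>> {
    // With a plain Box<dyn Error> here the future's output would not be Send,
    // and tokio::spawn requires the spawned future (and its output) to be Send.
    Ok(())
}

async fn caller() -> Result<(), Box<dyn Error + Send + Sync>> {
    // The first `?` converts the JoinError into the boxed error type,
    // the second `?` propagates the task's own error.
    tokio::spawn(task_body()).await??;
    Ok(())
}
```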

@@ -26,8 +28,10 @@ impl<T: RosMessageType> Publisher<T> {
     }
 }

-pub async fn publish(&self, data: &T) -> Result<(), Box<dyn std::error::Error>> {
+pub async fn publish(&self, data: &T) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
 let data = serde_rosmsg::to_vec(&data)?;
@ssnover (Collaborator)

If a RosLibRustError is being mapped here, just use that type instead of boxing it into something with dyn trait bounds.

@Carter12s (Collaborator Author)

Punt to #118

(Additional review threads on roslibrust/src/rosbridge/integration_tests.rs and roslibrust/tests/ros1_xmlrpc.rs were resolved.)
@ssnover (Collaborator) commented Oct 23, 2023

I don't see any changes related to DashMap, can you explain?

@@ -229,11 +243,11 @@ pub struct Node {
 // Receiver for requests to the Node actor
 node_msg_rx: mpsc::UnboundedReceiver<NodeMsg>,
 // Map of topic names to the publishing channels associated with the topic
-publishers: DashMap<String, Publication>,
+publishers: HashMap<String, Publication>,
@Carter12s (Collaborator Author)

@ssnover DashMap changes are here.

@Carter12s (Collaborator Author)

> I don't see any changes related to DashMap, can you explain?

I think you missed reviewing node.rs (you need to expand the diff). Essentially, I had originally used DashMap as the core data storage mechanism in the ros1::node to allow concurrent access. However, you changed the execution pattern so that the node is only ever accessed by the single "node task" receiving from the socket, so the concurrency protections of DashMap aren't needed (no one actually needs simultaneous mutable access to Node).

Somehow (and I really am not 100% sure how) DashMap and tokio were fighting with each other, and the DashMap was ending up in a deadlock even though it is only ever accessed by one task. The DashMap deadlock was (I believe) the root cause of the timeouts / watchdogs not working, because Drop for Node was deadlocking since the DashMap couldn't be destructed...
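
A sketch of the resulting pattern (names are illustrative, not the actual definitions in node.rs): one actor task owns a plain HashMap and is the only thing that ever touches it, so no concurrent map is required.

```rust
use std::collections::HashMap;
use tokio::sync::mpsc;

enum ActorMsg {
    RegisterPublisher { topic: String },
}

struct Publication; // stand-in for the real publication type

async fn node_actor(mut rx: mpsc::UnboundedReceiver<ActorMsg>) {
    // Owned exclusively by this task; no locking or sharding is needed because
    // all mutation happens sequentially inside this loop.
    let mut publishers: HashMap<String, Publication> = HashMap::new();
    while let Some(msg) = rx.recv().await {
        match msg {
            ActorMsg::RegisterPublisher { topic } => {
                publishers.insert(topic, Publication);
            }
        }
    }
}
```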

@Carter12s (Collaborator Author)

> A lot of cruft built up while debugging this test I think.

You don't even know man... This is after I cleaned up quite a bit...

@ssnover (Collaborator) commented Oct 23, 2023

Makes sense to remove the DashMap in that context, though I'm not convinced we know the root cause fully.

If there isn't a pressing need to put out a new release, can you use test_log throughout and remove some of the changes that were sanity checks? I think the error stuff can be punted, but I don't think we need to add new cruft unnecessarily.
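
For reference, a sketch of what that might look like with the test_log crate (assuming it as a dev-dependency; the test name and body are placeholders):

```rust
// test_log initializes the logger for each test, so per-test env_logger setup
// and ad-hoc debug prints added as sanity checks can be removed.
#[test_log::test(tokio::test)]
async fn verify_something() {
    log::debug!("captured by the test harness and shown on failure");
    // ... actual test body ...
}
```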

@Carter12s requested a review from ssnover October 23, 2023 23:06
@ssnover (Collaborator) left a comment

Thanks for humoring me! Nice job tracking down the issues!

@Carter12s merged commit 923a6a6 into master Oct 24, 2023
4 checks passed
@Carter12s deleted the improve-test-reliability branch May 21, 2024 21:15
Merging this pull request closed the issues "What is going on with doctest paths in Cargo" and "Hung Integration Tests in Workflow".