Trying to figure out which tests are flaky and fix, adding debug info… #140
Conversation
…rmation to CI as some flakiness is not reproducing locally:
Okay @ssnover, I feel like I'm just summoning ghosts at this point. If you have a chance, review the most recent test failure here: https://github.com/Carter12s/roslibrust/actions/runs/6459122424/job/17540392614?pr=140, examine the current design of that test on 4c858f0, and help me come up with theories on HOW THE FUCK this test could still be timing out without printing anything. Here is my current mental model:
Found this issue: ZcashFoundation/zebra#1631. `.to_socket_addrs()` could be the blocking culprit.
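For context, `std::net::ToSocketAddrs` performs synchronous DNS resolution, so calling it on a tokio worker thread can stall the executor. A minimal sketch of two common workarounds (illustrative only, not necessarily what this branch does):

```rust
use std::net::{SocketAddr, ToSocketAddrs};

// 1) Push the blocking resolution onto tokio's blocking thread pool.
async fn resolve_blocking(host_and_port: String) -> std::io::Result<Vec<SocketAddr>> {
    tokio::task::spawn_blocking(move || {
        host_and_port.to_socket_addrs().map(|iter| iter.collect())
    })
    .await
    .expect("resolution task panicked")
}

// 2) Use tokio's async resolver entry point instead of std's.
async fn resolve_async(host_and_port: &str) -> std::io::Result<Vec<SocketAddr>> {
    Ok(tokio::net::lookup_host(host_and_port).await?.collect())
}
```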
JESUS CHRIST, HOW DOES IT STILL TIME OUT?
Leaving a note here before I forget: we are spawning a tokio task which never dies and isn't being killed on drop; panics aren't propagating to it, and it is keeping the test process alive (we think this is the root cause).
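A minimal sketch of the generic fix for that failure mode, assuming the task is owned by some handle type (the `TaskGuard` name here is made up, not a roslibrust type): abort the `JoinHandle` when the owner is dropped so a hung task cannot keep the test process alive.

```rust
use tokio::task::JoinHandle;

// Hypothetical guard: aborts its background task when dropped, so the task
// cannot outlive the value (or test) that spawned it.
struct TaskGuard {
    handle: JoinHandle<()>,
}

impl Drop for TaskGuard {
    fn drop(&mut self) {
        // Without this, a spawned task keeps running after its owner is gone.
        self.handle.abort();
    }
}

// Must be called from within a tokio runtime.
fn spawn_guarded() -> TaskGuard {
    let handle = tokio::spawn(async {
        // ... long-running actor loop ...
    });
    TaskGuard { handle }
}
```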
Okay, I think I've finally, actually traced down the root problem. I had to fall back to "log debugging" because I could neither get a debugger to behave reliably nor get all my mucking about with timeouts to work (still unclear why the timeouts aren't effective), but whatever, there is a root bug. Inside of node.rs / registerPublisher:

```rust
log::trace!("Created new publication for {topic:?}");
let handle = channel.get_sender();
self.publishers.insert(topic.clone(), channel);
log::trace!("Inserted new publisher into dashmap");
```

When the test verify_get_publications hangs, it reliably prints the first log message and not the second. The DashMap is causing a deadlock, and it makes sense now that the hang is probabilistic, since (I believe) the deadlock only occurs when two entries end up in the same DashMap shard... Will continue debugging and try to solve the root issue. Unclear what I'm ultimately going to do with this branch, as it has turned into a giant set of changes from me randomly poking at things to find the root issue.
…itching away from dashmap
Okay @ssnover, I think this is finally ready to merge. DashMap was the root of the problem as far as I can tell.
A lot of cruft built up while debugging this test, I think.
```diff
@@ -30,6 +30,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ### Changed

+- Removed `find_and_generate_ros_messages_relative_to_manifest_dir!` this proc_macro was changing the current working directory of the compilation job resulting in a variety of strange compilation behaviors. Build.rs scripts are recommended for use cases requiring fine
```
I'm not sure it was necessary to remove this macro entirely. It just didn't need to change the working directory; it just needed to prepend the provided paths in the macro with the manifest directory path.
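For illustration, a minimal build.rs sketch of the "prepend the manifest directory" approach (the paths and the generation step are hypothetical; this is not roslibrust's actual build script):

```rust
// build.rs (hypothetical): resolve user-provided relative paths against
// CARGO_MANIFEST_DIR instead of changing the process's working directory.
use std::path::PathBuf;

fn main() {
    let manifest_dir = PathBuf::from(std::env::var("CARGO_MANIFEST_DIR").unwrap());

    // Hypothetical message-definition directories, given relative to the crate root.
    let search_paths: Vec<PathBuf> = ["msgs/ros1_common", "msgs/test_msgs"]
        .iter()
        .map(|rel| manifest_dir.join(rel))
        .collect();

    for path in &search_paths {
        // Re-run the build script when the message definitions change.
        println!("cargo:rerun-if-changed={}", path.display());
    }
    // ... hand `search_paths` to whatever does the message generation ...
}
```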
I want to remove it because I do think it is just a bad API (the method name alone is a joke). We could have fixed it, but I really think this macro came about from me not understanding when macros are expanded and what the working directory / cargo manifest dir are under different circumstances.
I think we'll ultimately figure out a new API for this in the future, for the same reasons.
```diff
@@ -4,7 +4,7 @@

 #[cfg(feature = "ros1")]
 #[tokio::main]
-async fn main() -> Result<(), Box<dyn std::error::Error>> {
+async fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
```
Why make the return of `main` have these bounds? I think it's just noise.
I think we should punt this to #118.
I ended up having to add this bound in a number of places to make it valid to return errors from tasks.
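The reason the bound shows up: values returned from a `tokio::spawn`ed task must be `Send + 'static`, and a plain `Box<dyn Error>` is not `Send`. A minimal sketch of the pattern (not the actual roslibrust code):

```rust
use std::error::Error;

type DynError = Box<dyn Error + Send + Sync>;

async fn run_in_task() -> Result<(), DynError> {
    let handle = tokio::spawn(async move {
        // Any error produced inside the task must be Send to cross the
        // task boundary; `Box<dyn Error>` alone would not compile here.
        Err::<(), DynError>("something went wrong".into())
    });
    // JoinError converts into DynError via the blanket From impl.
    handle.await?
}
```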
```diff
@@ -26,8 +28,10 @@ impl<T: RosMessageType> Publisher<T> {
         }
     }

-    pub async fn publish(&self, data: &T) -> Result<(), Box<dyn std::error::Error>> {
+    pub async fn publish(&self, data: &T) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
         let data = serde_rosmsg::to_vec(&data)?;
```
If a `RosLibRustError` is being mapped here, just use that type instead of boxing it into something with dyn trait bounds.
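For illustration, the shape of that suggestion with a made-up concrete error type (`PublishError` is hypothetical and stands in for `RosLibRustError`, whose variants aren't shown in this diff):

```rust
use std::fmt;

// Hypothetical concrete error type; being a plain enum it is Send + Sync
// automatically, and callers can match on the variants.
#[derive(Debug)]
enum PublishError {
    Serialization(String),
    ChannelClosed,
}

impl fmt::Display for PublishError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PublishError::Serialization(msg) => write!(f, "serialization failed: {msg}"),
            PublishError::ChannelClosed => write!(f, "publisher channel closed"),
        }
    }
}

impl std::error::Error for PublishError {}

// Mirrors the serde_rosmsg call in the diff above, but maps the failure into
// the concrete type instead of boxing it.
fn encode<T: serde::Serialize>(data: &T) -> Result<Vec<u8>, PublishError> {
    serde_rosmsg::to_vec(data).map_err(|e| PublishError::Serialization(e.to_string()))
}
```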
Punt to #118
I don't see any changes related to DashMap.
```diff
@@ -229,11 +243,11 @@ pub struct Node {
     // Receiver for requests to the Node actor
     node_msg_rx: mpsc::UnboundedReceiver<NodeMsg>,
     // Map of topic names to the publishing channels associated with the topic
-    publishers: DashMap<String, Publication>,
+    publishers: HashMap<String, Publication>,
```
@ssnover The DashMap changes are here.
I think you missed reviewing node.rs (you need to expand the diff). Essentially, I had originally used DashMap as the core data storage mechanism in the ros1::node to allow concurrent access. However, you changed the execution pattern so that Node is only ever accessed by the single "node task" that is receiving from the socket, so the concurrency protections of DashMap aren't needed (no one actually needs simultaneous mutable access to Node). Somehow (and I really am not 100% sure how) DashMap and tokio were fighting with each other, and the DashMap was ending up in a deadlock even though it is only ever accessed by one task. The DashMap deadlocking was (I believe) the root cause of the timeouts / watchdogs not working, because Drop for Node was deadlocking since the DashMap couldn't be destructed.
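A simplified sketch of the actor shape being described, with made-up message and field names (the real `NodeMsg` / `Publication` types are not reproduced here): one task owns the map and everything else talks to it over a channel, so a plain `HashMap` is sufficient.

```rust
use std::collections::HashMap;
use tokio::sync::{mpsc, oneshot};

enum NodeMsg {
    RegisterPublisher { topic: String, reply: oneshot::Sender<()> },
}

struct Node {
    node_msg_rx: mpsc::UnboundedReceiver<NodeMsg>,
    publishers: HashMap<String, String>, // String stands in for Publication
}

impl Node {
    async fn run(mut self) {
        // Only this loop ever touches `publishers`, so no concurrent map
        // (and no shard locking) is needed.
        while let Some(msg) = self.node_msg_rx.recv().await {
            match msg {
                NodeMsg::RegisterPublisher { topic, reply } => {
                    self.publishers.entry(topic).or_insert_with(String::new);
                    let _ = reply.send(());
                }
            }
        }
    }
}
```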
You don't even know, man... This is after I cleaned up quite a bit...
Makes sense to remove the DashMap in that context, though I'm not convinced we fully know the root cause. If there isn't a pressing need to put out a new release, can you use `test_log` throughout and remove some of the changes that were sanity checks? I think the error stuff can be punted, but I don't think we need to add new cruft unnecessarily.
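For reference, using the `test-log` crate typically looks like the following (the test name is borrowed from the discussion above; which logging backend feature the crate is configured with is an assumption):

```rust
#[cfg(test)]
mod tests {
    // The attribute initializes logging before the test body runs and can
    // wrap other test macros such as #[tokio::test].
    #[test_log::test(tokio::test)]
    async fn verify_get_publications() {
        // log::trace!/log::debug! output is now visible when the test fails,
        // which is what made the DashMap hang observable in CI.
    }
}
```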
Thanks for humoring me! Nice job tracking down the issues!
Description
Trying to make CI more robust
Fixes
Closes: #122
Closes: #141
Checklist