iox-roudi throws POPO__CHUNK_LOCKING_ERROR when killing a process mid-publish #2304

Closed
hrudhansh opened this issue Jun 20, 2024 · 6 comments
Labels
needs info A bug report is waiting for more information

Comments

@hrudhansh

Required information

Operating system:
Ubuntu 24.04 LTS

Compiler version:
12.3.0

Eclipse iceoryx version:
b2cd72b

Observed result or behaviour:
Killing an application that is in the middle of a publish 'critical section' causes POPO__CHUNK_LOCKING_ERROR in iox-roudi.

Expected result or behaviour:
When the destructor is called, the publisher should be able to abort the publish, leave the 'critical section', and exit gracefully.

Conditions where it occurred / Performed steps:
To reproduce:

  1. Run a pub-sub pair with no delay between publishes.
  2. Register a SIGINT signal handler in your main, e.g. signal(SIGINT, SignalHandler);
  3. Press ctrl+c on the pub process; it stalls.
  4. Press ctrl+c again; the above error shows up in iox-roudi.

Additional helpful information

On my end, I ran gdb on the pub process with -exec handle SIGINT nostop and -exec handle SIGINT pass, set a breakpoint on the exit(sigint); call, and ran pkill -SIGINT publisher in a separate terminal. I noticed:

  1. In cases where it fails, the signal handler seems to be called in the middle of a publish 'critical section'.
  2. Once this happens, the main loop seems to just keep spinning.
  3. In cases where it fails, there is also a 'KeepAlive' background thread running.

So I assume this is what is happening:

Publish thread starts critical section > triggers an 'is_started' state change in background thread > sends ack back to publish > publish moves ahead > publish is interrupted > background thread is waiting for an 'is_ended' trigger > it never gets it so keeps waiting > publish thread also waiting for background thread to ack 'is_ended'

Also:

  • Publish was called in the main loop (NOT a separate thread)
  • Can be resolved by running your publish loop in a separate thread controlled by an atomic bool. This lets you finish the publish gracefully, join the thread and then exit, which avoids the issue altogether.
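
The workaround in the last bullet could look roughly like the sketch below. It uses only the C++ standard library; PublishWorker and its members are hypothetical names, and the counter increment is a stand-in for the real iceoryx publish call, which is not shown in this issue.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical wrapper: runs the publish loop in its own thread so that a
// stop request is only observed between two publishes, never mid-publish.
class PublishWorker {
  public:
    PublishWorker() : m_thread([this] { run(); }) {}
    ~PublishWorker() { stop(); }

    // Ask the loop to finish its current iteration, then wait for the thread.
    void stop() {
        m_keepRunning.store(false);
        if (m_thread.joinable()) {
            m_thread.join();
        }
    }

    std::uint64_t published() const { return m_published.load(); }

  private:
    void run() {
        while (m_keepRunning.load()) {
            // Stand-in for the real publish call (assumption, not iceoryx API).
            m_published.fetch_add(1);
        }
    }

    std::atomic<bool> m_keepRunning{true};
    std::atomic<std::uint64_t> m_published{0};
    std::thread m_thread; // declared last so the flags exist before run() starts
};
```

The main thread would call stop() (or let the destructor do it) once shutdown is requested and only then return from main, so the publisher is torn down after the last publish has completed.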
@hrudhansh
Author

This was originally posted in issue #2193.

@elBoberido
Member

@hrudhansh do you have a minimal example which triggers the problem? Ideally targeting the iceoryx main branch.

If you look at our examples, they also register a signal handler and have no problem with ctrl+c. They use the signal handler either implicitly, via iox::waitForTerminationRequest(); and while (!iox::hasTerminationRequested()), or explicitly, with iox::registerSignalHandler.

@elBoberido elBoberido added the needs info A bug report is waiting for more information label Jun 25, 2024
@hrudhansh
Author

@elBoberido You are correct! Adding "while (!iox::hasTerminationRequested())" seems to make the issue go away.

So the issue was essentially:

  • ctrl+c happens > iceoryx has internally signaled termination > publish is called after the termination signal > some kind of deadlock happens.

But this is great, I will potentially just add it in front of every publish call if the overhead isn't too high. Works every time now, thank you!

@elBoberido
Member

@hrudhansh you don't need to add it before every publish call. I guess you will have a loop where you publish or something similar. Just add it as part of the loop condition. Alternatively, if you are blocking in the main thread, it might also be sufficient to just have the iox::waitForTerminationRequest(); call there.

If you are able to post a minimal example of your code, I might be able to tell you the ideal solution for iceoryx. The important thing is to handle the shutdown in a way that lets all the destructors run.
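
As a rough illustration of putting the check into the loop condition, here is a sketch that emulates the pattern with the C++ standard library only. loopTerminationRequested() is a hypothetical stand-in for iox::hasTerminationRequested(), FakePublisher stands in for a real publisher, and the simulateCtrlCAfter parameter exists only so the sketch can trigger SIGINT on itself.

```cpp
#include <csignal>

// Hypothetical stand-in for the termination flag iceoryx keeps internally.
static volatile std::sig_atomic_t g_loopTerminationRequested = 0;

// The handler only sets a flag; it does not call exit().
extern "C" void requestLoopTermination(int) { g_loopTerminationRequested = 1; }

// Hypothetical stand-in for iox::hasTerminationRequested().
bool loopTerminationRequested() { return g_loopTerminationRequested != 0; }

// Stand-in publisher that records whether its destructor ran.
struct FakePublisher {
    explicit FakePublisher(bool& destroyed) : m_destroyed(destroyed) {}
    ~FakePublisher() { m_destroyed = true; }
    void publish() { ++published; }
    unsigned published{0};

  private:
    bool& m_destroyed;
};

// Publishes until termination is requested and returns the publish count.
unsigned publishUntilTermination(bool& publisherDestroyed, unsigned simulateCtrlCAfter) {
    g_loopTerminationRequested = 0;
    auto previousHandler = std::signal(SIGINT, requestLoopTermination);
    unsigned total = 0;
    {
        FakePublisher publisher{publisherDestroyed};
        while (!loopTerminationRequested()) { // checked once per iteration
            publisher.publish();
            if (publisher.published == simulateCtrlCAfter) {
                std::raise(SIGINT); // demo only: simulates ctrl+c
            }
        }
        total = publisher.published;
    } // publisher goes out of scope here, so its destructor runs
    std::signal(SIGINT, previousHandler);
    return total;
}
```

Because the handler only sets a flag, the publish that is in flight always completes, the loop exits at the next condition check, and the publisher is destroyed through normal scope exit.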

@hrudhansh
Author

So I am essentially making an opinionated wrapper library around Iceoryx for exactly our use-case. One of the "philosophies" of this library is having a very small footprint in our codebase. So ideally the flow is - bring in the header > instantiate > call publish... everything else is taken care of for you. So while I don't have a fixed minimal example, in this case I was just trying to push the boundaries by calling publish with no delays, and see how it holds up.
It holds well btw! I did not miss a single message on the sub side once the Options are set correctly.

But I see your point - better to optimize around the whole publish loop instead of every publish call.

@elBoberido
Member

This example might be interesting for you
https://github.com/eclipse-iceoryx/iceoryx/blob/main/iceoryx_examples/request_response/client_cxx_waitset.cpp

It shows that you basically just have to register a signal handler and then notify your event loops to stop the execution.
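
The general shape of that idea, independent of the linked example, might look like the sketch below: the signal handler only posts a semaphore, the main thread blocks on it and then tells the event loops to stop. This is not the code from the linked file. It assumes a POSIX system (sem_post is async-signal-safe there), and runEventLoops, notifyShutdown and the simulateCtrlC parameter are hypothetical names introduced for illustration.

```cpp
#include <atomic>
#include <csignal>
#include <semaphore.h>
#include <thread>
#include <vector>

static sem_t g_shutdownSemaphore;

// Handler does the minimum: wake up whoever waits for the shutdown request.
extern "C" void notifyShutdown(int) { sem_post(&g_shutdownSemaphore); }

// Starts loopCount event loops, blocks until SIGINT arrives, stops the loops
// and returns how many of them exited cleanly.
unsigned runEventLoops(unsigned loopCount, bool simulateCtrlC) {
    sem_init(&g_shutdownSemaphore, 0, 0);
    auto previousHandler = std::signal(SIGINT, notifyShutdown);

    std::atomic<bool> keepRunning{true};
    std::atomic<unsigned> stoppedLoops{0};
    std::vector<std::thread> loops;
    for (unsigned i = 0; i < loopCount; ++i) {
        loops.emplace_back([&] {
            while (keepRunning.load()) {
                std::this_thread::yield(); // stand-in for real event handling
            }
            ++stoppedLoops;
        });
    }

    if (simulateCtrlC) {
        std::raise(SIGINT); // demo only: simulates ctrl+c
    }

    // Block until the handler posts; retry if the wait itself is interrupted.
    while (sem_wait(&g_shutdownSemaphore) != 0) {
    }

    keepRunning.store(false); // notify all event loops to stop
    for (auto& loop : loops) {
        loop.join();
    }

    std::signal(SIGINT, previousHandler);
    sem_destroy(&g_shutdownSemaphore);
    return stoppedLoops.load();
}
```

With simulateCtrlC set to false the function blocks until a real SIGINT arrives, which is roughly the role the thread above attributes to iox::waitForTerminationRequest().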
