-
Notifications
You must be signed in to change notification settings - Fork 398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mutex owner died -> POPO__CHUNK_LOCKING_ERROR #2193
Comments
@niclar this is related to #325. For the history feature there is a mutex involved. RouDi locks the mutex and can insert the samples into the queue without the interference of the publisher. This mutex is only contested when subscribers are added or removed but nevertheless, the publisher holds the lock for a short amount of time when publishing. If the application is terminated during this time and RouDi tries to clean up the resources of the process, it finds the corrupt mutex and terminates. I once tried to fix the problem but unfortunately, the data guarded by the mutex might also be corrupted and more refactoring would have been necessary to fully fix the problem. Other things came up and the issue never really happened in our setup. This should actually only happen if either an application died or if RouDi assumed that an application died. But then you should see something like |
Looking at the logs again it looks like roudi incorrectly assumed 4 processes died prior. Don't know why it would assume this though. -Is there a starvation issue in the detection ? |
-Looks like that fix is/was (2023-11-01) already in the code running (2023-11-23) |
Oh, I just saw v2.0.x and thought it came from the release branch. Is the system time changed abruptly so that jumps can occur? Other scenarios that lead to this error might be when the threads which update the heartbeat value are not scheduled. A workaround might be to massively increase the |
These thread(s) are part of the publisher(s) & not roudi right ? -I reckon they might have been starved in this instance. Some deadline monotonic scheduling or similar would be nice to be able to jack in here.. |
It is a thread of the runtime. I think it should not be too difficult to add a |
This morning we encountered this fatal error again during (post) the morning restart routine of some iceoryx client processes. |
Do you have more information, e.g. was the load during that time quite high? |
cpu core utilization is low overall with roudi running on a dedicated core interrupt isolated. we received this anew the other day seemingly "spontanously" a few hours after a mid day restart of ~20 roudi client processes. |
Do you have some logs of spikes in the CPU utilization when the issue occurred? |
...we've added core utilization logging now for next time it happens... |
@elBoberido I looked at the issue and was wondering if we could remove the mutex completely if we sacrifice the history? When on connecting a subscriber to a publisher no history has to be delivered, there should be also no need for a mutex. @niclar Do you require the history when a subscriber connects? Another solution to get rid of the mutex would be to transfer the samples without a lock (if this is possible) but then it is possible that a user receives the same sample twice. But only when connecting to a new publisher (and this can maybe filtered out with sequence numbers?!) |
@elfenpiff, good news, we do not. All pub/sub are instantiated with |
It is not only the history. AFAIK, the mutex also guards the adding of the subscriber queue to the publisher. Furthermore, the |
This we can mitigate with turning monitoring off. But a better alternative would be to have a liveliness QoS that enforces some contracts like that a subscriber as to collect a sample latest after X seconds and the publisher must publisher every Y seconds a sample. But this would require some time-consuming refactoring.
Yes, you are right. But this could be solved by bringing loffli into the play and a lock-free optional where the atomic signals that the thing is set or not. This is also some major refactoring but I think it is solvable. |
Well, the monitoring was explicitly turned on as far as I know 😅 |
@niclar did you encounter this problem again? In the meantime I'm a bit hesitant on exposing the heartbeat triggering as public API. An alternative would be to increase the timeout for the detection of died processes. Currently it is about 1.5 seconds and the heartbeat should be sent every 300ms. If this happens again, can you check the RouDi log output for messages like |
As a matter of fact we did friday evening;
|
Okay, the deadline was missed only by a little bit. For some reasons the runtime thread which sends the heartbeat was not triggered for a long time. Is this on Windows or on Linux? As a quick workaround I would suggest to try to change the
It does not fix the problem but should make it much more unrealistic. You could increase the time even more. With e.g. The more I think about it, the more I come to the conclusion to just increase this timeout to eliminate this source for leaving a corrupt mutex behind. |
This is on linux, dedicated cores. -I applied the workaround and will get back if we encounter it again. Thanks! |
Still encountering the issue 2024-04-22 11:32:58.571 [Warn ]: Application X not responding (last response 3046 milliseconds ago) --> removing it |
This is weird, especially in combination with the high socket usage from #2268. Just to exclude other possibilities. You are using the plain upstream version from the commit mentioned in the description and not a custom branch with that commit cherry-picked? |
It might very well be related to the IPC queues yes. -Yes, the only difference being the attached "patches"; Build options; Roudi conf; iox::config::CmdLineArgs_t cmdLineArgs; |
We'll update to head in a day or two and see how that plays out... |
Those patches should not have any negative effect on the heartbeat. Do you run the iceoryx tests after the build? Since you changed some of the compile time constants it is strongly recommended to run the tests. I think we need to put some emphasis on this in the documentation. Oh, and could you increase the timeout to to |
It seems like HEAD might have solved the ipc queue and this cleanup issue , let us run it a while longer, to be sure |
1 minute timeout is set/compiled however. No idea why this happens..
|
Can you set it to |
Oh, since you are using the main branch, can you also run RouDi with |
Having the same issue here aswell, on commit f6dc4e1 (latest). @niclar @elBoberido any way to make it less frequent or solve it completely? |
@rbdm-qnt if the monitoring is turned off, then this will not happen but in the case the application crashes, the resources are only cleaned up when RouDi shuts down or when the application re-registers. Can you try the following changes in -- constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 5 * PROCESS_KEEP_ALIVE_INTERVAL;
++ constexpr units::Duration PROCESS_KEEP_ALIVE_TIMEOUT = 60_s; My guess is that this issue is related to jumps in the system time like mentioned further above. I was just able to trigger the issue by changing the system time with Please check if your system time jumps and if increasing the The proper solution would probably be to use a counter instead of a timestamp and then we could have a heuristic like when the counter was not incremented for 5 consecutive times RouDi checks for the liveliness, the application can be considered to be in a degraded state. |
@rbdm-qnt @PMSRGroup @niclar did you have a chance to check whether my theory with the jumps in system time is correct and the workaround works? |
@elBoberido Can the timestamp used by the heartbeat be the system boot time, so that it is not affected by the system time |
@passchaos are you also experiencing this issue? I just had a look at the code and we are already using
At least in one of three runs, RouDi assumes the application did not send the heartbeat. The workaround would be to increase the A proper fix would be to rework the heartbeat handling and I also have an idea how to do it but I'm currently quite busy with other topics. |
@elBoberido I have found the cause of the problem, and it is not roudi's problem, but the problem of the sender sending the keepalive signal The code https://github.com/eclipse-iceoryx/iceoryx/blob/main/iceoryx_hoofs/time/source/duration.cpp#L55 here use the CLOCK_REALTIME. On the other hand, older version, such as v2.0.6, also use CLOCK_REALTIME in the timer sources, see: https://github.com/eclipse-iceoryx/iceoryx/blob/v2.0.6/iceoryx_hoofs/source/posix_wrapper/timer.cpp#L243 and https://github.com/eclipse-iceoryx/iceoryx/blob/v2.0.6/iceoryx_hoofs/source/posix_wrapper/timer.cpp#L436 |
@passchaos the code in The heartbeat uses |
@elBoberido Well, I know this is the logic of the heartbeat, and each process involved in the communication sends a Keepalive signal to roudi, but because each process uses a timer to send a Keepalive signal, the timer uses a time source that is not monotonic (see my previous link). |
@passchaos ah, right. The thread is waits on a semaphore and that causes the issue. The problem is that Have you checked whether one of your cores runs at 100% after the change? I assume that the heartbeat thread does not sleep anymore and returns immediately from the |
@elBoberido Yeah, I have also noticed this problem, it is a sad thing |
This issue is showing for me 7/10 times upon stopping publisher process. How to recreate?
Hopefully this helps you recreate/debug this on your end. On my end, I ran gdb on the pub process with
How do I stop/interrupt the publisher thread gracefully? It seems like just letting the de-constructor call is not enough. Not advocating a quick fix, but I wonder if we can put an atomic bool somewhere in the publish thread that we can call to abruptly stop what it's doing, and tear down. Note: This test was done on commit: b2cd72b which is not latest. But the same issue persists on latest. |
@hrudhansh I think the root of your problem is a different one than from this issue. It just manifests in the same error code. Can you please create a separate issue for your problem? |
@niclar @PMSRGroup @rbdm-qnt @passchaos shall we schedule a community meeting to discuss the problem, solution approaches and how to distribute the work? |
Maybe I'm interested as well. I haven't enabled monitoring but I encounter probably the same error case when shutting down RouDi - likely because some process has crashed at some point while holding the mutex. |
yes please do, it's a serious bug that needs to be addressed |
What are folks thoughts on the idea of a short-term mitigation whereby this error becomes "non-fatal" and is treated as a leaked resource? Basically it means that this publisher port is in a corrupted state and may not be reused until the whole system (RouDI) is restarted. The default configuration has an allotment for 512 publisher ports overall so in practice this seems like a manageable scenario, so long as you aren't trying to keep restarting the failing process and repeatedly corrupting publisher ports. |
@gpalmer-latai I've thought about this and had a brief look at the code. Unfortunately, it is also quite some work since the locking failure is not exposed to RouDi @niclar and all interested parties: how about a meeting next week on Tuesday at 5 p.m. CEST? |
That works for me - same link as the regular meetup? |
@elBoberido meetup at Tuesday at 5 p.m. CEST works fine. |
Interested in the meeting as well, will possibly enlighten me a fix for #2304 |
Hi, we just encountered a fatal crash in roudi (first time on b184113 since it came out), roudi logs below.
Maybe an iceoryx "client" process died prior, but no iceoryx logging in them or the syslog indicates this. And this should not crash roudi of course..
-Do you maybe have any pointers as to why we encountered this and how it should be remedied ?
(We're running 21 iceoryx client processes on the box, pinned and isolated, ~1500 publishers, and roudi on highest prio)
Thanks
Required information
Operating system:
Ubuntu 22.04.2 LTS
Compiler version:
Clang 17.0.5
Eclipse iceoryx version:
b184113 (v2.0.x, 20231121)
Observed result or behaviour:
Edit by elBoberido:
Related issues
The text was updated successfully, but these errors were encountered: