Deadlock in shared mem mutex when apps are killed #325

marthtz · 2020-10-30T12:07:45Z

Required information

Operating system:
Ubuntu 18.04 LTS

Compiler version:
GCC 7.4.0

Observed result or behaviour:
Deadlock in shared mem mutex when apps are killed

Expected result or behaviour:
No deadlock.

Conditions where it occurred / Performed steps:
Kill an app while some system is running

Related issues

elBoberido · 2021-04-19T11:17:23Z

I've had an idea for solving this elegantly without a mutex, using an atomic and a semaphore instead.
Let's assume we have the following class:

class DualAccessTransactionTray
{
    private:
    class AccessGuard;

    public:
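    AccessGuard acquireRouDiAccess()
    {
        // unconditionally claim the token; the previous value tells who held it
        switch (m_accessToken.exchange(AccessToken::ROUDI, std::memory_order_acquire))
        {
        case AccessToken::NONE:
            // uncontended, access is granted immediately
            break;
        case AccessToken::ROUDI:
            // the token already belonged to RouDi, which must never happen here
            errorHandler(kBROKEN_INVARIANT);
            break;
        case AccessToken::APP:
            // the app holds access; block until its release posts the semaphore
            m_waitingLine.wait();
            break;
        }
        return AccessGuard{*this, AccessToken::ROUDI};
    }
    AccessGuard acquireAppAccess()
    {
        // mirror image of acquireRouDiAccess with the roles swapped
        switch (m_accessToken.exchange(AccessToken::APP, std::memory_order_acquire))
        {
        case AccessToken::NONE:
            break;
        case AccessToken::ROUDI:
            // RouDi holds access; block until its release posts the semaphore
            m_waitingLine.wait();
            break;
        case AccessToken::APP:
            errorHandler(kBROKEN_INVARIANT);
            break;
        }
        return AccessGuard{*this, AccessToken::APP};
    }
    // intended to be called on behalf of an app that terminated without releasing its access
    void cleanupAfterAppWalkedThePlank()
    {
        releaseAccess(AccessToken::APP);
        // just to be sure the memory is synchronized
        AccessToken expected = AccessToken::APP;
        m_accessToken.compare_exchange_strong(expected, AccessToken::NONE, std::memory_order_acq_rel);
    }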

    private:
    enum class AccessToken
    {
        NONE,
        ROUDI,
        APP
    };

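    void releaseAccess(AccessToken expected)
    {
        // try to hand the token back; this succeeds if nobody else claimed it in the meantime
        auto casSuccessful =
            m_accessToken.compare_exchange_strong(expected, AccessToken::NONE, std::memory_order_release);
        if (!casSuccessful)
        {
            // on failure, expected now contains the actual token value
            if (expected == AccessToken::NONE)
            {
                // access was released without being held
                errorHandler(kBROKEN_INVARIANT);
            }
            // the other party overwrote the token and is waiting; wake it up
            m_waitingLine.post();
        }
    }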

    class AccessGuard
    {
        public:
        AccessGuard(DualAccessTransactionTray& transactionTray, AccessToken accessToken)
            : m_transactionTray(transactionTray)
            , m_accessToken(accessToken)
        {
        }
        ~AccessGuard()
        {
            m_transactionTray.releaseAccess(m_accessToken);
        }

        private:
        DualAccessTransactionTray& m_transactionTray;
        AccessToken m_accessToken{AccessToken::NONE};
    };

    std::atomic<AccessToken> m_accessToken{AccessToken::NONE};
    posix::Semaphore m_waitingLine;
};

So basically a futex, but just for two threads. If the application terminates abnormally, the semaphore can be cleaned up, in contrast to the mutex, which cannot be unlocked from a thread other than the one that acquired the lock.
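To illustrate the point about cleanup, here is a minimal sketch using a plain POSIX unnamed semaphore. It is not iceoryx code: `canReleaseOnBehalfOfDeadHolder` is a hypothetical name, and the "dead app" is only simulated by a thread that takes the semaphore and ends without giving it back. In the real setup the semaphore would presumably live in shared memory and be process-shared; this sketch only uses threads.

```cpp
#include <cassert>
#include <semaphore.h>
#include <thread>

// Hypothetical demo (not iceoryx code): a "holder" thread takes the semaphore
// and ends without posting; a different thread then posts on its behalf.
bool canReleaseOnBehalfOfDeadHolder()
{
    sem_t sem;
    // pshared = 0: shared between threads of this process only (assumption for the demo)
    const int initResult = sem_init(&sem, 0, 1U);
    assert(initResult == 0);
    (void)initResult;

    // the holder acquires the semaphore and terminates without sem_post
    std::thread holder([&] { sem_wait(&sem); });
    holder.join();

    // the semaphore is still taken, so a non-blocking acquire fails
    const bool takenAfterHolderDied = (sem_trywait(&sem) == -1);

    // any thread may post, there is no notion of an owner
    sem_post(&sem);
    const bool acquiredAfterCleanup = (sem_trywait(&sem) == 0);

    sem_destroy(&sem);
    return takenAfterHolderDied && acquiredAfterCleanup;
}
```

Calling `canReleaseOnBehalfOfDeadHolder()` is expected to return `true` on a POSIX system: the cleanup thread can make the semaphore available again although it never acquired it.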

This would be an example of concurrent access from RouDi and the application:

                              DualAccessTransactionTray
RouDi (discovery loop)               AccessFlag            App (publisher thread)
                                        NONE
acquireRouDiAccess                      NONE
-> previously NONE                      ROUDI
-> continue                             ROUDI
                                        ROUDI              acquireAppAccess
                                        APP                -> previously ROUDI but expected NONE
                                        APP                -> m_waitingLine.wait
-> add/remove queues                    APP
releaseRouDiAccess                      APP
-> CAS fails                            APP
-> m_waitingLine.post                   APP
                                        APP                -> continue (maybe check if to be destroyed flag is set)
                                        APP                -> push to all queues
                                        APP                releaseAppAccess
                                        NONE               -> CAS succeeds
                                        NONE               -> nothing to be done
                                        NONE
acquireRouDiAccess                      NONE
-> previously NONE                      ROUDI
-> continue                             ROUDI
                                        ROUDI              acquireAppAccess
                                        APP                -> previously ROUDI but expected NONE
                                        APP                -> m_waitingLine.wait
-> add/remove queues                    APP
releaseRouDiAccess                      APP
-> CAS fails                            APP
-> m_waitingLine.post                   APP
                                        APP                -> continue (maybe check if to be destroyed flag is set)
acquireRouDiAccess                      APP
-> previously APP but expected NONE     ROUDI
-> m_waitingLine.wait                   ROUDI
                                        ROUDI
                                        ROUDI              -> push to all queues
                                        ROUDI              releaseAppAccess
                                        ROUDI              -> CAS fails
                                        ROUDI              -> m_waitingLine.post
-> continue                             ROUDI
-> add/remove queues                    ROUDI
releaseRouDiAccess                      ROUDI
-> CAS succeeds                         NONE
-> nothing to be done                   NONE
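The first hand-over in the trace (RouDi holds access, the app arrives and waits, RouDi releases and posts) can be replayed with standard C++ only. This is a condensed sketch, not the proposed class itself: `SimpleSemaphore` is a stand-in for `posix::Semaphore`, and `TwoPartyLock` and `replayHandoff` are hypothetical names. The replay is sequenced deliberately and does not cover the interleaving where one side re-acquires before the other side has consumed the post.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Stand-in for posix::Semaphore (assumption: a simple counting semaphore suffices)
class SimpleSemaphore
{
  public:
    void post()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            ++m_count;
        }
        m_condition.notify_one();
    }
    void wait()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_condition.wait(lock, [this] { return m_count > 0; });
        --m_count;
    }

  private:
    std::mutex m_mutex;
    std::condition_variable m_condition;
    int m_count{0};
};

enum class AccessToken { NONE, ROUDI, APP };

// Condensed version of the acquire/release logic, without guard and error handling
class TwoPartyLock
{
  public:
    void acquire(AccessToken self)
    {
        // claim the token; if the other party held it, wait for its post
        if (m_token.exchange(self, std::memory_order_acquire) != AccessToken::NONE)
        {
            m_waitingLine.wait();
        }
    }
    void release(AccessToken self)
    {
        auto expected = self;
        // the CAS fails when the other party overwrote the token and is waiting
        if (!m_token.compare_exchange_strong(expected, AccessToken::NONE, std::memory_order_release))
        {
            m_waitingLine.post();
        }
    }
    AccessToken current() const { return m_token.load(); }

  private:
    std::atomic<AccessToken> m_token{AccessToken::NONE};
    SimpleSemaphore m_waitingLine;
};

// Replays: RouDi acquires, app arrives and blocks, RouDi releases, app continues
std::vector<std::string> replayHandoff()
{
    TwoPartyLock lock;
    std::vector<std::string> log;

    lock.acquire(AccessToken::ROUDI);
    log.push_back("RouDi acquired");

    std::thread app([&] {
        lock.acquire(AccessToken::APP); // blocks until RouDi posts
        log.push_back("App acquired");
        lock.release(AccessToken::APP); // CAS succeeds, token back to NONE
    });

    // wait until the app has overwritten the token (ROUDI -> APP)
    while (lock.current() != AccessToken::APP)
    {
        std::this_thread::yield();
    }
    log.push_back("RouDi releases");
    lock.release(AccessToken::ROUDI); // CAS fails, so the semaphore is posted

    app.join();
    assert(lock.current() == AccessToken::NONE);
    log.push_back("App released");
    return log;
}
```

`replayHandoff()` returns the events in the order "RouDi acquired", "RouDi releases", "App acquired", "App released", matching the first block of the trace.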

@budrus @MatthiasKillat @elfenpiff what do you think?

budrus · 2021-04-19T11:55:10Z

@elBoberido So it is like a mutex that can be unlocked by someone else, once that someone is sure that the one holding the lock has walked the plank. But we could still have corrupted data structures which we may not be allowed to access. So we could either detect this and continue, with the consequence that chunks are lost, or we have to go for the extended UsedChunkList to get a more robust data structure.
That is the todo I added some time ago.
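For comparison, POSIX robust mutexes, which the commits referenced on this issue mention, offer a similar recovery path: the next locker is told that the previous owner died and can mark the mutex consistent again. Below is a minimal sketch assuming Linux/glibc; `lockAndDie` and `lockAfterOwnerDied` are hypothetical names, and the dead owner is simulated by a thread that exits while holding the lock. It says nothing about what iceoryx finally adopted, and it does not repair the protected data either.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Simulated "killed app": locks the mutex and ends without unlocking it
static void* lockAndDie(void* arg)
{
    pthread_mutex_lock(static_cast<pthread_mutex_t*>(arg));
    return nullptr;
}

// Returns the result of the first lock attempt after the owner died
int lockAfterOwnerDied()
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);

    pthread_mutex_t mutex;
    const int initResult = pthread_mutex_init(&mutex, &attr);
    assert(initResult == 0);
    (void)initResult;
    pthread_mutexattr_destroy(&attr);

    pthread_t owner;
    pthread_create(&owner, nullptr, lockAndDie, &mutex);
    pthread_join(owner, nullptr);

    // instead of blocking forever, the lock is granted with EOWNERDEAD
    const int result = pthread_mutex_lock(&mutex);
    if (result == EOWNERDEAD)
    {
        // the protected data may be inconsistent; after (hypothetically)
        // repairing it, the mutex is marked usable again
        pthread_mutex_consistent(&mutex);
    }
    pthread_mutex_unlock(&mutex);
    pthread_mutex_destroy(&mutex);
    return result;
}
```

On Linux, `lockAfterOwnerDied()` is expected to return `EOWNERDEAD`. The concern about corrupted data structures applies here as well: the mutex only reports that the owner died.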

elBoberido · 2021-04-19T12:04:52Z

@budrus Yes, the history vector might be corrupted. The simple approach would indeed be to leak chunks and print a warning if RouDi detects that the lock was held by the application. A more sophisticated solution would be a HistoryRingBuffer with the same mechanism as the UsedChunkList. In my view, extending the UsedChunkList makes no sense since it has a totally different API. I'd rather go for a small HistoryRingBuffer which does one thing, but does it well.
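The simple approach could be sketched roughly like this. It is only an illustration under assumptions: `appDiedWhileHoldingAccess` is a hypothetical helper, the token is passed in as a bare atomic instead of being a class member, and RouDi is assumed not to hold access itself while cleaning up.

```cpp
#include <atomic>
#include <cassert>
#include <iostream>

enum class AccessToken { NONE, ROUDI, APP };

// Hypothetical helper, called by RouDi after the app is known to be dead.
// Returns true if the token was still owned by the app, i.e. the protected
// data (e.g. the history) may be in an inconsistent state.
bool appDiedWhileHoldingAccess(std::atomic<AccessToken>& token)
{
    // assumption of this sketch: RouDi does not hold access during cleanup
    assert(token.load() != AccessToken::ROUDI);

    auto expected = AccessToken::APP;
    const bool appHeldAccess =
        token.compare_exchange_strong(expected, AccessToken::NONE, std::memory_order_acq_rel);
    if (appHeldAccess)
    {
        // simple approach: do not touch the possibly corrupted data, just warn
        std::cerr << "warning: app terminated while holding access; its chunks are leaked\n";
    }
    return appHeldAccess;
}
```

With a token left at `APP`, the first call returns `true`, prints the warning, and resets the token to `NONE`; a second call returns `false`.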

mossmaurice · 2021-08-27T09:32:49Z

@marthtz @sculpordwarf @bishibashiB @shankar-in
What is the status here? Will you re-open the PR to solve this bug?

cc @CarolinaGGB

shankar-in · 2021-08-27T13:00:45Z

@mossmaurice @marthtz is on vacation. He will be back in 2 weeks.

marthtz added the bug Something isn't working label Oct 30, 2020

marthtz self-assigned this Oct 30, 2020

marthtz pushed a commit to marthtz/iceoryx that referenced this issue Nov 2, 2020

iox-eclipse-iceoryx#325: Set mutex as robust to release the lock, in …

e74f14b

…case application fails to unlock mutex and terminate Signed-off-by: Sumedha Maharudrayya Mathad (RBEI/ESU4) <[email protected]>

marthtz added a commit to marthtz/iceoryx that referenced this issue Nov 2, 2020

iox-eclipse-iceoryx#325: Fix missing inc guard closures

251bf6f

Signed-off-by: Hintz Martin (CC-AD/ESW1) <[email protected]>

marthtz added a commit to marthtz/iceoryx that referenced this issue Nov 2, 2020

iox-eclipse-iceoryx#325: Use robust mutex in locking policy as well

e051de6

Signed-off-by: Hintz Martin (CC-AD/ESW1) <[email protected]>

marthtz added a commit to marthtz/iceoryx that referenced this issue Nov 2, 2020

iox-eclipse-iceoryx#325: Improve comment

8dbd601

Signed-off-by: Hintz Martin (CC-AD/ESW1) <[email protected]>

marthtz pushed a commit to marthtz/iceoryx that referenced this issue Nov 2, 2020

iox-eclipse-iceoryx#325: Set mutex as robust to release the lock, in …

b311626

…case application fails to unlock mutex and terminate Signed-off-by: Sumedha Maharudrayya Mathad (RBEI/ESU4) <[email protected]>

marthtz added a commit to marthtz/iceoryx that referenced this issue Nov 2, 2020

iox-eclipse-iceoryx#325: Fix missing inc guard closures

135770f

Signed-off-by: Hintz Martin (CC-AD/ESW1) <[email protected]>

marthtz added a commit to marthtz/iceoryx that referenced this issue Nov 2, 2020

iox-eclipse-iceoryx#325: Use robust mutex in locking policy as well

fb30e9e

Signed-off-by: Hintz Martin (CC-AD/ESW1) <[email protected]>

marthtz added a commit to marthtz/iceoryx that referenced this issue Nov 2, 2020

iox-eclipse-iceoryx#325: Improve comment

ecac6fd

Signed-off-by: Hintz Martin (CC-AD/ESW1) <[email protected]>

marthtz linked a pull request Nov 2, 2020 that will close this issue

Iox #325 deadlock in shared mem mutex #330

Closed

mossmaurice added this to the v1.x milestone Nov 4, 2020

elBoberido added a commit to ApexAI/iceoryx that referenced this issue Apr 20, 2021

iox-eclipse-iceoryx#325 introduce DualAccessTransactionTray

fdc5269

Signed-off-by: Mathias Kraus <[email protected]>

elBoberido mentioned this issue Apr 23, 2021

iox-#325 introduce DualAccessTransactionTray #748

Closed

19 tasks

elBoberido added a commit to ApexAI/iceoryx that referenced this issue May 5, 2021

iox-eclipse-iceoryx#325 make parameter const

d87c9a8

Signed-off-by: Mathias Kraus <[email protected]>

elBoberido added a commit to ApexAI/iceoryx that referenced this issue May 5, 2021

iox-eclipse-iceoryx#325 rework cleanup method

eccbe18

Signed-off-by: Mathias Kraus <[email protected]>

mossmaurice mentioned this issue Apr 27, 2022

why iceoryx didn't use robust mutex for roudi deadlock? #1042

Closed

elBoberido mentioned this issue Jul 26, 2022

potential deadlock in roudi Mon+Discover and IPC-msg-process thread #1546

Open

elBoberido mentioned this issue Mar 21, 2023

Roudi SEGFAULT when a subscriber dies #1740

Closed

elBoberido mentioned this issue Feb 14, 2024

mutex owner died -> POPO__CHUNK_LOCKING_ERROR #2193

Open

elBoberido mentioned this issue Apr 16, 2024

POPO__CHUNK_LOCKING_ERROR on iox-roudi #2260

Closed
