Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: fix scout deadlock #188

Merged
merged 1 commit into from
Apr 22, 2024

Conversation

wyfo
Copy link
Contributor

@wyfo wyfo commented Apr 18, 2024

Closes #180
Releasing GIL in Scout destructor prevents the deadlock.

Releasing GIL in Scout destructor prevents the deadlock.
@wyfo
Copy link
Contributor Author

wyfo commented Apr 18, 2024

@Mallets @YuanYuYuan

@wyfo
Copy link
Contributor Author

wyfo commented Apr 18, 2024

This fix brings the question of the hidden side effects of Drop implementations in Zenoh. Scout drop actually wait that the inner tasks is finished, by pausing the thread. This is quite a huge hidden thing I didn't expect.

@YuanYuYuan
Copy link
Contributor

YuanYuYuan commented Apr 19, 2024

Hi @wyfo , this is a nice fix by yielding the GIL. I try to visualize the fix. Please correct me if I'm wrong.

sequenceDiagram
    participant C as GIL
    participant A as Python
    participant B as Rust
    participant D as Rust

    box zenoh-python
      participant A
      participant B
      participant C
    end

    box zenoh-rust
      participant D
    end

    A ->> A: scout.stop
    A ->> C: Acquire lock
    C ->> A: Return lock
    activate C
    A ->> C: Release lock  <br> (This fixes the issue)
    deactivate C
    A ->> B: _Scout::drop
    B ->> D: ScoutInner::drop

    D ->> B: PyClosure::drop
    B ->> C: Acquire lock
    C ->> B: Return lock
    activate C
    B ->> A: Closure.drop
    A ->> B: Closure.drop <br> return
    B ->> C: Release lock
    deactivate C
    B ->> D: PyClosure::drop <br> return

    D ->> B: ScoutInner::drop <br> return
    B ->> A: _Scout::drop <br> return

Loading

And my another fix in zenoh rust looks like this.

sequenceDiagram
    participant C as GIL
    participant A as Python
    participant B as Rust
    participant D as Rust

    box zenoh-python
      participant A
      participant B
      participant C
    end

    box zenoh-rust
      participant D
    end

    A ->> A: scout.stop
    A ->> C: Acquire lock
    C ->> A: Return lock
    activate C
    A ->> B: _Scout::drop
    B ->> D: ScoutInner::drop
    D ->> B: ScoutInner::drop <br> return
    B ->> A: _Scout::drop <br> return
    A ->> C: Release lock
    deactivate C

    D ->> D: Swtich to another thread

    D ->> B: PyClosure::drop
    B ->> C: Acquire lock
    C ->> B: Return lock
    activate C
    B ->> A: Closure.drop
    A ->> B: Closure.drop <br> return
    B ->> C: Release lock
    deactivate C
    B ->> D: PyClosure::drop <br> return
Loading
  1. IINM, the Drop in zenoh rust is blocking isn't problematic. Termination in Rust just works fine. The deadlock is caused by another drop callback to Python and hence GIL is needed again.
  2. TBH, I prefer your solution to my previous fix. Since the step "switch to another thread" is not deterministic.
  3. Another fundamental solution is to separate the cleanup of two lands. Say, clean the callback on the Python side first, and drop the Scout on the Rust side. Therefore we can prevent this convoluted situation.

@wyfo
Copy link
Contributor Author

wyfo commented Apr 19, 2024

You graphs need one correction: the GIL is not acquired by scout.stop, it is already acquired because Python is executing.

With your solution, I suspect the Python destructor could in fact not even be executed, because the main thread could terminate before the execution of the spawned thread.

Another fundamental solution is to separate the cleanup of two lands.

The issue is that we don't control the cleanup, because it's done on zenoh size, with the Python object embedded, and we don't control zenoh objects. I mean, it could be possible by storing the Python callback into some `Arc<Mutex<Option>>, and manually extract the Python object to prevent zenoh to call the Python drop callback, but it's quite a lot of complexity when releasing the GIL is enough.

@YuanYuYuan
Copy link
Contributor

With your solution, I suspect the Python destructor could in fact not even be executed, because the main thread could terminate before the execution of the spawned thread.

Yes. That's what I said, "non-deterministic".

The issue is that we don't control the cleanup, because it's done on zenoh size, with the Python object embedded, and we don't control zenoh objects. I mean, it could be possible by storing the Python callback into some `Arc<Mutex>, and manually extract the Python object to prevent zenoh to call the Python drop callback, but it's quite a lot of complexity when releasing the GIL is enough.

I see. Then we might need to perform all the similar drop with this trick in zenoh-python.

@wyfo
Copy link
Contributor Author

wyfo commented Apr 19, 2024

I see. Then we might need to perform all the similar drop with this trick in zenoh-python.

You want me to update the drop of the other classes?

@YuanYuYuan
Copy link
Contributor

I see. Then we might need to perform all the similar drop with this trick in zenoh-python.

You want me to update the drop of the other classes?

Let's launch a discussion with the others first in the channel. I think others may want to learn more details about this problem.

To me, this PR has already been good to merge.

@Mallets Mallets merged commit 0ac4615 into eclipse-zenoh:main Apr 22, 2024
7 checks passed
@Mallets Mallets deleted the fix_scout_deadlock branch April 22, 2024 10:30
Mallets added a commit that referenced this pull request Aug 2, 2024
* chore: bump pyo3 to 0.21 (#175)

Roughly adapt the code to the new PyO3 API, with a few warning fixes.

* fix(examples): fix commented parts of z_put.py (#176)

* fix(examples): fix commented parts of z_put.py

* fix: apply PR reviews

* fix: apply PR reviews

* Correct the syntax of pyproject.toml (#181)

* feat: Automate Release (#165)

* build: Get Read the Docs release number from Cargo manifest

This avoids hardcoding the release number in the documentation build config,
making it easier to bump the version by only modifying the manifest.

* build: Require Python 3.11 in Read the Docs configuration

Python 3.11 is needed to access tomllib; useful for parsing the Cargo manifest file.

* feat: Create branch, bump version and tag in release workflow

* feat: Add publish-github job

* fix: Broken tag dependencies

* chore: Remove enforce-linking-issues workflow

* fix: Bump version in pyproject.toml

* chore: Upgrade artifact actions from v3 to v4

* fix: Typo in git-commit command

* fix: Support jq 1.6

ubuntu-22.04 runners use jq 1.6 which doesn't recognize a dot for `[]` value iterator.

See: jqlang/jq#1168.

* Revert "chore: Upgrade artifact actions from v3 to v4"

This reverts commit a535971.

* fix: Build wheels from release branch

* fix: Switch to pypa/gh-action-pypi-publish@release/v1

The older actions doesn't recognize the pyproject.toml metadata fields.

* Force sphinx version to 7.2.6 (#185)

* fix: Install maturin according to requirements-dev.txt (#186)

* fix: fix scout deadlock (#188)

Releasing GIL in Scout destructor prevents the deadlock.

* fix(uhlc): bump uhlc version (#173)

Signed-off-by: gabrik <[email protected]>
Co-authored-by: Luca Cominardi <[email protected]>

* feat(tracing): using zenoh-util function for log initialization (#172)

* feat(tracing): using zenoh-util function for log initialization

Signed-off-by: gabrik <[email protected]>

* chore: adding Cargo.lock

Signed-off-by: gabrik <[email protected]>

* feat(tracing): using zenoh main

Signed-off-by: gabrik <[email protected]>

* chore: using new try_init_log_from_env

Signed-off-by: gabrik <[email protected]>

* chore: sync Cargo.lock

Signed-off-by: gabrik <[email protected]>

---------

Signed-off-by: gabrik <[email protected]>
Co-authored-by: Luca Cominardi <[email protected]>

* feat: add Attachment to API (#189)

* feat: add Attachment to API

* fix: make attachment a CLI arg in z_put example

* Update zenoh/session.py

Co-authored-by: Luca Cominardi <[email protected]>

* Apply suggestions from code review

Co-authored-by: Luca Cominardi <[email protected]>

* test: add test for attachment

---------

Co-authored-by: Luca Cominardi <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@81217c7 from 2024-04-22 (#190)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* fix: Create virtual environment for armv6 build (#191)

* build: Sync  with eclipse-zenoh/zenoh@2fdddae from 2024-04-23 (#192)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@9a9832a from 2024-04-24 (#193)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@7c64d99 from 2024-04-26 (#194)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@ad58af6 from 2024-04-26 (#195)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@e8916bf from 2024-04-26 (#196)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@ea604b6 from 2024-04-29 (#197)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@371ca6b from 2024-04-30 (#198)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@7a47445 from 2024-05-03 (#200)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@f5195c0 from 2024-05-03 (#202)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@e53364f from 2024-05-04 (#203)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* test: add z_ping/z_pong examples (#205)

* fix: make attachment API more similar to dict API (#201)

* fix: make attachment API more similar to dict API

* fix: fix formatting

* fix: fix formatting

* fix: fix typing

* fix: uncomment tests 😅

* docs: add docstring for Attachment

* build: Sync  with eclipse-zenoh/zenoh@7e5d5e8 from 2024-05-07 (#206)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* fix: add `Encoding.with_suffix` instead of `Encoding.append` (#208)

* fix: add `Encoding.with_suffix` instead of `Encoding.append`

* feat: add `prefix`/`suffix` methods to `Encoding`

* build: Sync  with eclipse-zenoh/zenoh@b8dd01d from 2024-05-07 (#209)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* Support timeout for get in session.

Signed-off-by: ChenYing Kuo <[email protected]>

* Update PR based on the review.

Signed-off-by: ChenYing Kuo <[email protected]>

* Fix CI lint error.

Signed-off-by: ChenYing Kuo <[email protected]>

* Update: PyTypeError => PyValueError

Signed-off-by: ChenYing Kuo <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@45e05f0 from 2024-05-13 (#212)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@763a05f from 2024-05-14 (#213)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@75aa273 from 2024-05-15 (#214)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@25f06bd from 2024-05-21 (#215)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@3118d31 from 2024-05-28 (#216)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@d574654 from 2024-06-03 (#217)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* chore: Update artifacts action to v4 (#218)

artifacts actions v3 are deprecated

* fix: give unique name for artifacts (#220)

upload-artifacts/v4 has a breaking change from v3, which requires the
artifacts names to be unique. Fix #219

* build: Sync  with eclipse-zenoh/zenoh@c279982 from 2024-06-05 (#221)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@d8e66de from 2024-06-10 (#222)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@9d09742 from 2024-06-11 (#223)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* Fix markdown format in README. (#211)

Signed-off-by: ChenYing Kuo <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@ed6c636 from 2024-06-12 (#225)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@8160b01 from 2024-06-13 (#228)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* Enable releasing from any branch (#227)

* build: Sync  with eclipse-zenoh/zenoh@7adad94 from 2024-06-14 (#230)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@93f93d2 from 2024-06-17 (#231)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@2500e5a from 2024-06-20 (#236)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* fix: remove deprecated maturin arg (#241)

See PyO3/maturin#1620, `--universal2` has been
deprecated

* build: Sync  with eclipse-zenoh/zenoh@869ace6 from 2024-07-02 (#247)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@b93ca84 from 2024-07-03 (#248)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@b3e42ce from 2024-07-08 (#252)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@b3e42ce from 2024-07-08 (#253)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@b3e42ce from 2024-07-08 (#259)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@0a969cb from 2024-07-25 (#261)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@0a969cb from 2024-07-25 (#262)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@e587aa9 from 2024-07-26 (#265)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* build: Sync  with eclipse-zenoh/zenoh@2d88c7b from 2024-07-29 (#268)

Co-authored-by: eclipse-zenoh-bot <[email protected]>

* chore: remove useless dependencies

---------

Signed-off-by: gabrik <[email protected]>
Signed-off-by: ChenYing Kuo <[email protected]>
Co-authored-by: Yuyuan Yuan <[email protected]>
Co-authored-by: Mahmoud Mazouz <[email protected]>
Co-authored-by: oteffahi <[email protected]>
Co-authored-by: Gabriele Baldoni <[email protected]>
Co-authored-by: Luca Cominardi <[email protected]>
Co-authored-by: eclipse-zenoh-bot <[email protected]>
Co-authored-by: eclipse-zenoh-bot <[email protected]>
Co-authored-by: ChenYing Kuo <[email protected]>
Co-authored-by: kydos <[email protected]>
Co-authored-by: Diogo Matsubara <[email protected]>
Co-authored-by: Diogo Matsubara <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] z_scout fails to terminate
3 participants