snapshot(ticdc): fix ddl puller and ddl manager stuck caused by dead lock (#11886) #11896

Open
wants to merge 1 commit into
base: release-7.1

ti-chi-bot

This is an automated cherry-pick of #11886

What problem does this PR solve?

Issue Number: close #11884

What is changed and how it works?

Summary

This PR fixes a deadlock issue in the Snapshot implementation:

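Deadlock in Recursive Read Locking: Go’s sync.RWMutex does not reject a second RLock from a goroutine that already holds a read lock, but the sync package documentation prohibits recursive read locking because it can deadlock. If another goroutine calls Lock between the outer and the nested RLock, the nested RLock blocks behind the pending writer, so the outer read lock is never released and the writer can never proceed.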

This PR refactors lock usage patterns to avoid recursive read locking.
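
One common way to avoid recursive read locking is sketched below; it is an illustration under stated assumptions, not the actual diff of this PR. The Snapshot type, its tables field, and the methods TableName, TableNames, and tableNameLocked are hypothetical stand-ins. Exported methods acquire the lock exactly once, and shared logic lives in an unexported helper that expects the caller to already hold it.

```go
package main

import (
	"fmt"
	"sync"
)

// Snapshot is a hypothetical stand-in, not the TiCDC type.
type Snapshot struct {
	rwlock sync.RWMutex
	tables map[int64]string
}

// tableNameLocked does the lookup without touching the lock.
// The caller must already hold s.rwlock (read or write).
func (s *Snapshot) tableNameLocked(id int64) (string, bool) {
	name, ok := s.tables[id]
	return name, ok
}

// TableName takes the read lock once and delegates to the helper.
func (s *Snapshot) TableName(id int64) (string, bool) {
	s.rwlock.RLock()
	defer s.rwlock.RUnlock()
	return s.tableNameLocked(id)
}

// TableNames also takes the read lock once. It calls the lock-free
// helper rather than TableName, so RLock is never acquired recursively.
func (s *Snapshot) TableNames(ids []int64) []string {
	s.rwlock.RLock()
	defer s.rwlock.RUnlock()
	names := make([]string, 0, len(ids))
	for _, id := range ids {
		if name, ok := s.tableNameLocked(id); ok {
			names = append(names, name)
		}
	}
	return names
}

func main() {
	s := &Snapshot{tables: map[int64]string{1: "t1", 2: "t2"}}
	fmt.Println(s.TableNames([]int64{1, 2, 3}))
}
```

With this split, a pending writer can delay a reader, but it can never end up waiting on a read lock whose holder is itself stuck in a nested RLock.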


Root Causes of the Deadlocks

Recursive Read Lock Issue

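Recursive calls involving RWMutex.RLock() can result in deadlocks when a write lock is requested between the outer and the nested RLock. This behavior arises because, once a goroutine calls Lock, Go’s sync.RWMutex blocks all new RLock calls until the writer has acquired and released the lock, so that writers are not starved by a steady stream of readers.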

Here is an example that illustrates the problem:

func (s *Snapshot) Operation() {
    s.rwlock.RLock()
    defer s.rwlock.RUnlock()
    s.NestedOperation() // Second RLock
}

func (s *Snapshot) NestedOperation() {
    s.rwlock.RLock()
    defer s.rwlock.RUnlock()
    // Perform some operations
}

If another goroutine calls Lock after Operation has acquired its read lock but before NestedOperation acquires its own, the following chain of events occurs:

  1. The writer's Lock waits for the read lock held by Operation to be released and, in the meantime, blocks new readers, including the recursive RLock in NestedOperation.
  2. NestedOperation cannot complete until its RLock is granted.
  3. The first RLock in Operation cannot be released until NestedOperation returns.
  4. Deadlock occurs: the writer waits for the first RLock, the recursive RLock waits for the writer, and the first RLock waits for the recursive RLock.
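
The chain above can be observed without actually hanging a program. The sketch below is a minimal illustration, assuming Go 1.18 or later for RWMutex.TryRLock; the function name nestedReadWouldBlock is hypothetical. It holds an outer read lock, lets a writer queue up behind it, and then uses TryRLock to probe whether a nested read lock could still be acquired.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nestedReadWouldBlock reports whether a second (nested) read lock is
// refused while the outer read lock is held and a writer is waiting.
func nestedReadWouldBlock() bool {
	var mu sync.RWMutex

	mu.RLock() // outer read lock, as taken in Operation

	writerDone := make(chan struct{})
	go func() {
		defer close(writerDone)
		mu.Lock() // blocks until the outer read lock is released
		mu.Unlock()
	}()

	blocked := false
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if mu.TryRLock() {
			// The writer has not queued yet; undo the probe and retry.
			mu.RUnlock()
			time.Sleep(time.Millisecond)
			continue
		}
		// A pending writer makes new read locks fail. A blocking RLock
		// at this point (as in NestedOperation) would wait forever.
		blocked = true
		break
	}

	mu.RUnlock() // release the outer lock so the writer can finish
	<-writerDone
	return blocked
}

func main() {
	fmt.Println("nested RLock would block:", nestedReadWouldBlock())
}
```

TryRLock is used here only as a probe for demonstration; replacing it with a plain RLock at the same point reproduces the hang described in the list.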

Check List

Tests

  • Unit test
  • Integration test

Questions

Will it cause performance regression or break compatibility?

No.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

None.

@ti-chi-bot ti-chi-bot added lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. type/cherry-pick-for-release-7.1 This PR is cherry-picked to release-7.1 from a source PR. labels Dec 17, 2024

ti-chi-bot bot commented Dec 17, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lichunzhu for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


ti-chi-bot bot commented Dec 17, 2024

This cherry pick PR is for a release branch and has not yet been approved by triage owners.
Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick:

  1. It must first be approved by the approvers.
  2. AFTER it has been approved by approvers, please wait for the cherry-pick merging approval from triage owners.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 17, 2024

ti-chi-bot bot commented Dec 19, 2024

@ti-chi-bot: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
jenkins-ticdc/verify d41c731 link true /test verify
pull-cdc-integration-mysql-test d41c731 link true /test cdc-integration-mysql-test
pull-cdc-integration-kafka-test d41c731 link true /test cdc-integration-kafka-test
pull-cdc-integration-storage-test d41c731 link true /test cdc-integration-storage-test

Full PR test history. Your PR dashboard.


Labels
do-not-merge/cherry-pick-not-approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. type/cherry-pick-for-release-7.1 This PR is cherry-picked to release-7.1 from a source PR.