Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX[MQB]: race in admin domain remove #567

Merged
merged 1 commit into from
Jan 16, 2025

Conversation

emelialei88
Copy link
Collaborator

@emelialei88 emelialei88 commented Jan 13, 2025

  1. Check queue open status in cluster thread to prevent race when checking if the queue is actively used
  2. Consolidate purge and GC since they're both called in cluster thread
  3. Decide whether or not to call d_teardownCb based on function pointer being nullptr or not, since d_state can be rewritten when shutdown is called after DOMAINS REMOVE
  4. Remove e_REMOVING and e_REMOVED since these states are not necessary
    when we check if d_teardownRemoveCb is assigned to decide whether to call it
  5. Change e_PREREMOVE to e_REMOVING
  6. Replace the use of e_POSTREMOVE to e_STOPPED since there's a chance the state of a domain could be changed to e_POSTREMOVE after e_STOPPING in a late unregisterQueue
  7. Explicitly invalidate a SharedResource of DomainManage. Commented in the code for the reason

@emelialei88 emelialei88 force-pushed the admin-domain-remove branch 2 times, most recently from 8222790 to 4bac73e Compare January 13, 2025 17:25
@emelialei88 emelialei88 marked this pull request as ready for review January 13, 2025 18:37
@emelialei88 emelialei88 requested a review from a team as a code owner January 13, 2025 18:37
Copy link
Collaborator

@dorjesinpo dorjesinpo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, may need few tweaks

src/groups/mqb/mqbblp/mqbblp_clusterqueuehelper.cpp Outdated Show resolved Hide resolved
src/groups/mqb/mqbblp/mqbblp_domain.cpp Show resolved Hide resolved
d_state = e_PREREMOVE;
d_teardownRemoveCb = bsl::nullptr_t();
}

void Domain::removeDomainComplete()
{
bslmt::LockGuard<bslmt::Mutex> guard(&d_mutex); // LOCK
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be no need for removeDomainComplete. If the Domain knows, it is being removed and it calls tearDownCb in teardownRemove / d_teardownRemoveCb in unregisterQueue, it can make the state transition to e_POSTREMOVE.

In any case, changing Domain state (including d_teardownRemoveCb) should be done under the lock.

Copy link
Collaborator Author

@emelialei88 emelialei88 Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I did it outside of teardownRemove / d_teardownRemoveCb in unregisterQueue was because I hoped the second pass to only be able to go through until the first pass is fully finished, and I didn't want the second pass to be issued between these two lines. I'm not fully sure if this concern is valid tho.

And for locking - I thought the state is an AtomicInt so we probably don't need to protect with another mutex?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the concern about removing domain before mqbblp::Queue? Cause, the queue keeps raw pointer to the domain?

@emelialei88 emelialei88 force-pushed the admin-domain-remove branch 3 times, most recently from f2b7804 to ad9642c Compare January 15, 2025 16:03

// 5. Mark DOMAIN REMOVED to accecpt the second pass

// 3. Mark DOMAIN REMOVED to accecpt the second pass
bmqu::SharedResource<DomainManager> self(this);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Careful with bmqu::SharedResource. Calling self.acquire() followed by destructing/invalidating without releasing the shared_ptr will result in a deadlock.

@emelialei88 emelialei88 force-pushed the admin-domain-remove branch 2 times, most recently from a974d15 to 5897540 Compare January 15, 2025 20:02
@emelialei88 emelialei88 force-pushed the admin-domain-remove branch 2 times, most recently from 99fff14 to a6792a2 Compare January 15, 2025 21:31
Copy link
Collaborator

@dorjesinpo dorjesinpo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. One small comment

dorjesinpo
dorjesinpo previously approved these changes Jan 16, 2025
Copy link
Collaborator

@dorjesinpo dorjesinpo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you

1. Check queue open status in cluster thread to prevent race when checking
if the queue is actively used
2. Consolidate purge and GC since they're both called in cluster thread
3. Decide whether or not to call d_teardownCb based on function pointer
being nullptr or not, since d_state can be rewritten when shutdown is
called after DOMAINS REMOVE
4. Remove e_REMOVING and e_REMOVED since these states are not necessary
when we check if d_teardownRemoveCb is assigned to decide whether to call it
5. Change e_PREREMOVE to e_REMOVING
6. Replace the use of e_POSTREMOVE to e_STOPPED since there's a chance
the state of a domain could be changed to e_POSTREMOVE after e_STOPPING
in a late unregisterQueue
7. Explicitly invalidate a SharedResource of DomainManage. Commented in the
code for the reason

Signed-off-by: Emelia Lei <[email protected]>
@emelialei88 emelialei88 merged commit acbb52d into bloomberg:main Jan 16, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants