Update all hf examples to have dist.barrier #1139

Merged
merged 1 commit into from
Aug 21, 2024

Conversation

muellerzr
Contributor

Without dist.barrier(), all of the HF examples hang: in these small examples we destroy the process group (pg) before all communication has completed. This PR adds dist.barrier() just before dist.destroy_process_group() to fix this.
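For reference, the teardown pattern described above can be sketched roughly as follows. This is a minimal illustration, not the exact diff from this PR: the helper names `teardown` and `run_example`, the `gloo` backend, and the environment defaults are assumptions chosen so the sketch can run on CPU. The `dist` parameter exists only so the call order can be checked without PyTorch installed.

```python
import os


def teardown(dist=None):
    """Synchronize all ranks, then destroy the process group.

    `dist` defaults to torch.distributed; passing it in is a hypothetical
    convenience for testing, not something the HF examples do.
    """
    if dist is None:
        import torch.distributed as dist  # lazy import for real runs
    if dist.is_initialized():
        dist.barrier()                # wait until every rank reaches this point
        dist.destroy_process_group()  # only now tear down the process group


def run_example():
    """Hypothetical end-to-end usage; launch one process per rank (e.g. torchrun)."""
    import torch
    import torch.distributed as dist

    # Assumed defaults so a single process can also run this on its own.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))

    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    try:
        t = torch.ones(1)
        dist.all_reduce(t)  # stand-in for the example's real communication
    finally:
        teardown(dist)
```

Calling `run_example()` from a script entry point under a distributed launcher would exercise the barrier-then-destroy order on every rank.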

Contributor

kwen2501 left a comment

LGTM. Thanks!

Pasting our triage here for future reference:
If the batch is small and there is a gap between when the two ranks are launched, rank 0 can finish sending the activations before rank 1 has bootstrapped.
(In NCCL, sending a small message does not require a handshake between the two ranks.)

kwen2501 merged commit 1bcb2bf into pytorch:main Aug 21, 2024
1 of 2 checks passed
3 participants