[feat] Use a different timeout for data loading and checkpoints #122

Open
jlamypoirier opened this issue Jan 19, 2025 · 0 comments · May be fixed by #129
Labels
enhancement New feature or request Priority

Comments

@jlamypoirier
Collaborator

🧐 Problem Description

Distributed timeouts for dataset building, checkpoint saving, and tensor computations need to be set independently. These tasks have vastly different durations, and a one-size-fits-all timeout isn’t cutting it.

💡 Proposed Solution

We can't do this directly: PyTorch/NCCL allows only one timeout per process group, and hitting that timeout kills the process. Alternatives:

  • Store-based barrier (https://github.com/pytorch/pytorch/blame/main/torch/distributed/distributed_c10d.py#L931). Good and simple alternative but not sure about speed and robustness.
  • Use a different process group with a different timeout: does what we want with a small overhead.
  • Use a separate Gloo process group, which supports monitored_barrier and so does what we want. This also works, but I'm not sure about speed and overhead, and I think torch may not always be compiled with Gloo.
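
A rough sketch of the second and third options, not the actual implementation: a second process group with its own, longer timeout, used only to synchronize after slow operations. The timeout values and the helper names (create_slow_op_group, wait_for_slow_op) are made up for illustration; the torch.distributed calls (new_group, monitored_barrier, barrier, is_gloo_available, get_backend) are the public ones.

```python
from datetime import timedelta

import torch.distributed as dist

# Hypothetical values, chosen only for illustration.
COMPUTE_TIMEOUT = timedelta(seconds=60)  # short timeout for the main ops
SLOW_OP_TIMEOUT = timedelta(hours=2)     # dataset building, checkpoint saving


def create_slow_op_group(timeout: timedelta = SLOW_OP_TIMEOUT):
    """Create a second group over all ranks with its own timeout.

    Must be called by every rank, after dist.init_process_group.
    """
    # Prefer Gloo so that monitored_barrier is available. With backend=None,
    # new_group falls back to the backend of the default group.
    backend = "gloo" if dist.is_gloo_available() else None
    return dist.new_group(backend=backend, timeout=timeout)


def wait_for_slow_op(group, timeout: timedelta = SLOW_OP_TIMEOUT) -> None:
    """Block until every rank in the group has finished its slow operation."""
    if dist.get_backend(group) == "gloo":
        # monitored_barrier is only implemented for Gloo and takes an
        # explicit timeout for this call.
        dist.monitored_barrier(group=group, timeout=timeout)
    else:
        # Plain barrier, bounded by the timeout the group was created with.
        dist.barrier(group=group)
```

The intended usage would be to call dist.init_process_group with the short COMPUTE_TIMEOUT, create the extra group once at startup, and call wait_for_slow_op(group) right after each rank finishes building its dataset or saving its checkpoint, so the main group never has to wait on the slow work.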

🔄 Alternatives Considered

Increasing the timeout works, but it means waiting longer to detect job failures, which wastes resources.

📈 Potential Benefits

A lower distributed timeout for the main ops, so we can notice failures faster.

@jlamypoirier jlamypoirier added the enhancement New feature or request label Jan 19, 2025
@jlamypoirier jlamypoirier linked a pull request Jan 24, 2025 that will close this issue
8 tasks