🧐 Problem Description
Distributed timeouts for dataset building, checkpoint saving, and tensor computations need to be set independently. These tasks have vastly different durations, and a one-size-fits-all timeout isn't cutting it.
💡 Proposed Solution
We can't do this directly: PyTorch/NCCL only allows one timeout per process group, and hitting that timeout kills the process. Alternatives:
Use a separate process group with its own timeout: does what we want, with a small overhead.
Use a separate Gloo process group, which supports monitored_barrier and also does what we want. Speed and overhead are unclear, though, and torch may not always be compiled with Gloo support. A rough sketch of both options follows.
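A minimal sketch of both approaches, assuming a standard multi-rank setup with CUDA available; the group names, timeout values, and synchronization points below are illustrative, not taken from any existing code.

```python
import datetime
import torch
import torch.distributed as dist

# Default group: short timeout so ordinary collectives fail fast.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(minutes=10))

# Option 1: a second NCCL group with a long timeout, used only around slow,
# infrequent steps such as dataset building or checkpoint saving.
slow_group = dist.new_group(backend="nccl", timeout=datetime.timedelta(hours=3))

# Option 2: a Gloo group, which supports monitored_barrier and a per-call
# timeout (requires torch to be built with Gloo support).
gloo_group = dist.new_group(backend="gloo")

# Fast path: regular tensor computations go through the default group
# and are subject to its short timeout.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)

# Slow path: synchronize around checkpoint saving with the long-timeout group...
dist.barrier(group=slow_group)
# ...or use a monitored barrier on the Gloo group, which additionally reports
# which rank failed to reach the barrier within the timeout.
dist.monitored_barrier(group=gloo_group, timeout=datetime.timedelta(hours=3))
```

Note that dist.new_group must be called by all ranks in the job, so the extra groups would be created once at startup and reused.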
🔄 Alternatives Considered
Increasing the single global timeout works, but it means we wait longer to notice job failures and waste resources in the meantime.
📈 Potential Benefits
A lower distributed timeout for the main ops, so we notice failures faster.