🧐 Problem Description
Distributed timeouts for dataset building, checkpoint saving, and tensor computations need to be set independently. These tasks have vastly different durations, and a one-size-fits-all timeout isn't cutting it.
💡 Proposed Solution
We can't do this directly: PyTorch/NCCL only allows one timeout per process group, and hitting that timeout kills the process. Alternatives:
Use a separate process group with its own timeout: does what we want, with a small overhead.
Use a separate Gloo process group, which supports monitored_barrier and also does what we want. Speed and overhead are unclear, though, and torch may not always be compiled with Gloo support. A rough sketch of both options follows.
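A minimal sketch of both approaches, assuming a standard multi-rank setup with CUDA available; the group names, timeout values, and synchronization points below are illustrative, not taken from any existing code.

```python
import datetime
import torch
import torch.distributed as dist

# Default group: short timeout so ordinary collectives fail fast.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(minutes=10))

# Option 1: a second NCCL group with a long timeout, used only around slow,
# infrequent steps such as dataset building or checkpoint saving.
slow_group = dist.new_group(backend="nccl", timeout=datetime.timedelta(hours=3))

# Option 2: a Gloo group, which supports monitored_barrier and a per-call
# timeout (requires torch to be built with Gloo support).
gloo_group = dist.new_group(backend="gloo")

# Fast path: regular tensor computations go through the default group
# and are subject to its short timeout.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)

# Slow path: synchronize around checkpoint saving with the long-timeout group...
dist.barrier(group=slow_group)
# ...or use a monitored barrier on the Gloo group, which additionally reports
# which rank failed to reach the barrier within the timeout.
dist.monitored_barrier(group=gloo_group, timeout=datetime.timedelta(hours=3))
```

Note that dist.new_group must be called by all ranks in the job, so the extra groups would be created once at startup and reused.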
🔄 Alternatives Considered
Increasing the single global timeout works, but it means we wait longer to notice job failures and waste resources in the meantime.
📈 Potential Benefits
A lower distributed timeout for the main ops, so we notice failures faster.