Hi, when I train models using Tutel, I find that each step of multi-node training takes much longer than single-node training (with n nodes, roughly n times the step time of one node). As a result, multi-node training takes even more wall-clock time than single-node training to finish one epoch.
Any debugging suggestions for this issue?
Thanks!
In a modestly equipped distributed environment (e.g., Ethernet with low bus bandwidth), cross-node All2All is expected to show a significant drop in bandwidth utilization compared with single-node training, where communication runs entirely over NVLink, unless you have high-end InfiniBand. Issue #160 discusses in detail what bus bandwidth (busbw) is required to achieve a given training throughput.
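As a rough illustration of why Ethernet becomes the bottleneck, here is a back-of-envelope sketch of the per-GPU bus bandwidth an MoE model's All2Alls would need to fit inside a given step-time budget. All the numbers and the 4-calls-per-layer count (dispatch + combine in forward, and their gradients in backward) are illustrative assumptions, not measurements from this issue, and none of this is a Tutel API:

```python
# Estimate the per-GPU All2All bandwidth needed to hit a step-time budget.
# All parameter values below are illustrative assumptions.

def all2all_busbw_required(tokens_per_gpu, model_dim, dtype_bytes,
                           num_moe_layers, step_time_budget_s,
                           a2a_calls_per_layer=4):
    # Each All2All moves roughly tokens_per_gpu * model_dim * dtype_bytes
    # bytes in and out of every GPU.
    bytes_per_call = tokens_per_gpu * model_dim * dtype_bytes
    total_bytes = bytes_per_call * a2a_calls_per_layer * num_moe_layers
    return total_bytes / step_time_budget_s  # bytes/s per GPU

# Example: 16k tokens/GPU, model_dim 2048, fp16 (2 bytes), 12 MoE layers,
# and we want all the All2Alls to fit within 0.1 s of each step:
req = all2all_busbw_required(16384, 2048, 2, 12, 0.1)
print(f"~{req / 1e9:.1f} GB/s per GPU")  # ~32 GB/s, well above typical Ethernet
```

Under these assumptions the requirement already exceeds common 10-100 Gb/s Ethernet links, which is consistent with the n-fold slowdown you observed.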
The good news is that even though you see a throughput drop when first scaling to multiple nodes, further increasing the node count no longer degrades it significantly.
In addition, for some scenarios you can set `--parallel_type=adaptive:0`, which skips the All2All during training; then check whether the step time improves.
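A simple way to compare the two settings is to time the same number of steps under each. Below is a minimal timing sketch in plain PyTorch (not a Tutel API); `step_fn` is a hypothetical placeholder for your existing forward/backward/optimizer step:

```python
import time
import torch

def time_steps(step_fn, warmup=5, iters=20):
    """Return the average wall-clock time per call of step_fn."""
    for _ in range(warmup):      # let kernels compile and caches settle
        step_fn()
    torch.cuda.synchronize()     # drain pending GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    torch.cuda.synchronize()     # make sure all timed work has finished
    return (time.perf_counter() - start) / iters

# Usage (train_one_step is hypothetical -- plug in your own step):
# avg = time_steps(lambda: train_one_step(model, batch, optimizer))
# print(f"avg step time: {avg * 1e3:.1f} ms")
```

Running this once with the default parallel type and once with `--parallel_type=adaptive:0` gives a fair per-step comparison, since the warmup iterations exclude one-time startup costs.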