Hi, when I train models using Tutel, I find that each step of multi-node training takes much longer than single-node training (with n nodes, roughly n times the step time of one node). As a result, multi-node training takes even more wall-clock time than single-node training to finish one epoch.
Any debugging suggestions for this issue?
Thanks!
In a modestly equipped distributed environment (e.g., Ethernet with low bus bandwidth), cross-node All2All is expected to show a significant drop in bandwidth utilization compared with single-node training, where communication runs entirely over NVLink, unless you have high-end InfiniBand. Issue #160 discusses in detail what bus bandwidth (busbw) is required to achieve a given training throughput.
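As a rough illustration of why Ethernet becomes the bottleneck, here is a back-of-envelope sketch of the per-GPU bus bandwidth an MoE model's All2Alls would need to fit inside a given step-time budget. All the numbers and the 4-calls-per-layer count (dispatch + combine in forward, and their gradients in backward) are illustrative assumptions, not measurements from this issue, and none of this is a Tutel API:

```python
# Estimate the per-GPU All2All bandwidth needed to hit a step-time budget.
# All parameter values below are illustrative assumptions.

def all2all_busbw_required(tokens_per_gpu, model_dim, dtype_bytes,
                           num_moe_layers, step_time_budget_s,
                           a2a_calls_per_layer=4):
    # Each All2All moves roughly tokens_per_gpu * model_dim * dtype_bytes
    # bytes in and out of every GPU.
    bytes_per_call = tokens_per_gpu * model_dim * dtype_bytes
    total_bytes = bytes_per_call * a2a_calls_per_layer * num_moe_layers
    return total_bytes / step_time_budget_s  # bytes/s per GPU

# Example: 16k tokens/GPU, model_dim 2048, fp16 (2 bytes), 12 MoE layers,
# and we want all the All2Alls to fit within 0.1 s of each step:
req = all2all_busbw_required(16384, 2048, 2, 12, 0.1)
print(f"~{req / 1e9:.1f} GB/s per GPU")  # ~32 GB/s, well above typical Ethernet
```

Under these assumptions the requirement already exceeds common 10-100 Gb/s Ethernet links, which is consistent with the n-fold slowdown you observed.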
The good news is that even though you see a throughput drop when first scaling to multiple nodes, further increasing the node count no longer degrades it significantly.
In addition, for some scenarios you can set `--parallel_type=adaptive:0`, which skips the All2All during training; then check whether the step time improves.
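A simple way to compare the two settings is to time the same number of steps under each. Below is a minimal timing sketch in plain PyTorch (not a Tutel API); `step_fn` is a hypothetical placeholder for your existing forward/backward/optimizer step:

```python
import time
import torch

def time_steps(step_fn, warmup=5, iters=20):
    """Return the average wall-clock time per call of step_fn."""
    for _ in range(warmup):      # let kernels compile and caches settle
        step_fn()
    torch.cuda.synchronize()     # drain pending GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        step_fn()
    torch.cuda.synchronize()     # make sure all timed work has finished
    return (time.perf_counter() - start) / iters

# Usage (train_one_step is hypothetical -- plug in your own step):
# avg = time_steps(lambda: train_one_step(model, batch, optimizer))
# print(f"avg step time: {avg * 1e3:.1f} ms")
```

Running this once with the default parallel type and once with `--parallel_type=adaptive:0` gives a fair per-step comparison, since the warmup iterations exclude one-time startup costs.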