Synthesize on multiple-machine topology #46

DictXiong · 2022-12-30T06:44:29Z

Hi. I'm trying to synthesize Pareto-optimal allreduce algorithms for several DGX-A100 machines that are connected through a single switch. I couldn't find this setup among the predefined topologies, and I don't know how to write a proper topology file for it. Could you please help me with this?

For example, suppose there are three DGX-A100 machines, each with 8 A100 GPUs and 4 NICs at 200 Gbps each. All 12 NICs are connected to one switch. What's the best practice for synthesizing allreduce algorithms on this topology?

Thanks.

saeedmaleki · 2023-05-21T18:29:37Z

Sorry for the late reply! For synthesizing multi-node algorithms, please use TACCL. However, if you wish to write your algorithm manually, you could start from one of the many multi-node A100 algorithms. For example, this one.
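
Whichever route is taken, it can help to write the topology from the question down explicitly first. The following is a minimal sketch in plain Python, not the TACCL or SCCL topology file format. The names `Link`, `build_links`, `gpu_rank`, and `node_injection_gbps` are hypothetical. Only the NIC-to-switch links are modeled, because the thread does not describe the intra-node links or the GPU-to-NIC mapping.

```python
from dataclasses import dataclass

# Numbers taken from the question: 3 machines, 8 GPUs and 4 NICs each, 200 Gbps per NIC.
NUM_NODES = 3
GPUS_PER_NODE = 8
NICS_PER_NODE = 4
NIC_GBPS = 200


@dataclass(frozen=True)
class Link:
    """A directed link with a bandwidth in Gbps (hypothetical helper type)."""
    src: str
    dst: str
    gbps: int


def build_links():
    """Return the directed NIC<->switch links for the whole cluster.

    Assumption: every NIC has one full-duplex link to the single switch.
    Intra-node GPU links are deliberately left out.
    """
    links = []
    for node in range(NUM_NODES):
        for nic in range(NICS_PER_NODE):
            name = f"node{node}/nic{nic}"
            links.append(Link(name, "switch", NIC_GBPS))  # uplink
            links.append(Link("switch", name, NIC_GBPS))  # downlink
    return links


def gpu_rank(node, local_gpu):
    """Map (node, local GPU index) to a global rank, assuming node-major order."""
    return node * GPUS_PER_NODE + local_gpu


def node_injection_gbps(links, node):
    """Total bandwidth a node can send into the switch across all its NICs."""
    prefix = f"node{node}/"
    return sum(link.gbps for link in links if link.src.startswith(prefix))


links = build_links()
print(len(links))                     # 24 directed links (12 NICs, both directions)
print(node_injection_gbps(links, 0))  # 800 Gbps per node
print(gpu_rank(2, 7))                 # 23, the last of 24 ranks
```

Under these assumptions each machine can inject at most 800 Gbps into the switch, which is the figure any inter-node step of an allreduce algorithm on this cluster has to fit within.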
