Synthesize on multiple-machine topology #46

Open
DictXiong opened this issue Dec 30, 2022 · 1 comment

@DictXiong

Hi. I'm trying to synthesize some Pareto-optimal allreduce algorithms for several DGX-A100 machines that are connected via a single switch. I couldn't find this among the pre-defined topologies, and I don't know how to write a proper topology file. Could you please help me with this?

For example, there are three DGX-A100 machines, each with 8 A100 GPUs and 4 200 Gbps NICs. All 12 NICs are connected to one switch. What's the best practice for synthesizing allreduce algorithms on this topology?
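To make the setup concrete, here is a rough sketch of that topology as plain Python data. It is not the topology file format of any synthesizer; the names (`Nic`, `build_nics`, `global_rank`, `nic_for_gpu`, `node_injection_gbps`) are hypothetical, and the even two-GPUs-per-NIC mapping is an assumption, since the actual GPU-to-NIC affinity depends on the machine's PCIe layout.

```python
from dataclasses import dataclass

# Figures taken from the description above.
NUM_NODES = 3
GPUS_PER_NODE = 8
NICS_PER_NODE = 4
NIC_GBPS = 200


@dataclass(frozen=True)
class Nic:
    """One NIC attached to the single shared switch (hypothetical type)."""
    node: int
    index: int
    gbps: int


def build_nics():
    """Enumerate every NIC across all nodes."""
    return [
        Nic(node, index, NIC_GBPS)
        for node in range(NUM_NODES)
        for index in range(NICS_PER_NODE)
    ]


def global_rank(node, local_gpu):
    """Map (node, local GPU index) to a flat rank, node-major."""
    return node * GPUS_PER_NODE + local_gpu


def nic_for_gpu(local_gpu):
    """ASSUMPTION: GPUs share NICs evenly, two consecutive GPUs per NIC."""
    return local_gpu // (GPUS_PER_NODE // NICS_PER_NODE)


def node_injection_gbps():
    """Aggregate bandwidth one node can push into the switch."""
    return NICS_PER_NODE * NIC_GBPS


if __name__ == "__main__":
    print(len(build_nics()), "NICs on the switch")
    print(node_injection_gbps(), "Gbps injection bandwidth per node")
```

Under these assumptions each node can inject at most 4 × 200 = 800 Gbps into the switch, which is the inter-node budget any synthesized allreduce has to work within.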

Thanks.

@saeedmaleki
Contributor

Sorry for the late reply! To synthesize multi-node algorithms, please use TACCL. However, if you wish to write your algorithm manually, you could start from one of the many multi-node A100 algorithms. For example, this one.
