
Clarifications on All_Gather Handling in Profiling Communication Operators #46

Open
zhhangBian opened this issue Nov 25, 2024 · 2 comments

Comments

@zhhangBian

Hello Vidur,

Thank you for sharing your work. While reading the code and documentation, I encountered some questions related to the Profiling Communication Operators mentioned in the paper.

In the paper, it is noted that there are three collective operations: all_reduce, all_gather, and send_recv. However, in the simulated device data under data/compute, simulation parameters appear to be provided only for all_reduce and send_recv; there are none for all_gather.

After reviewing the relevant code in vidur/profiling, it appears that all_gather is treated as device-independent, and thus its parameters are not explicitly profiled. However, isn’t all_gather typically device-dependent? If so, could you clarify why it is treated as device-independent here?

Additionally, in vidur/profiling/collectives/main.py, the --collective argument only supports choices=["all_reduce", "send_recv"]. Could you explain the rationale behind excluding all_gather as an option here?
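
To make the question concrete, this is roughly the shape of the option I am referring to (a hedged sketch only, not a copy of the actual file):

```python
import argparse

# Sketch of the --collective option in vidur/profiling/collectives/main.py
# (illustrative only; the real definition may differ). The two choices
# below are the ones the script accepts today; an "all_gather" entry is absent.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--collective",
    type=str,
    choices=["all_reduce", "send_recv"],  # "all_gather" is not an accepted choice
    help="Which collective operation to profile",
)
args = parser.parse_args()
```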

These are the points that confused me while going through the code. I would greatly appreciate any clarification, or corrections if I have misunderstood any part of your work.

Thank you in advance for your time and insights!

@AgrawalAmey
Contributor

Hi @zhhangBian, we originally included all_gather to represent some parallel strategies we were experimenting with. However, as of today we only use all_reduce and send/recv operations, which are sufficient to represent tensor and pipeline parallelism.
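
For concreteness, here is a minimal torch.distributed sketch of the standard Megatron-style layout this refers to (illustrative only, not Vidur's profiling code; the function names, shards, and ranks are made up):

```python
import torch
import torch.distributed as dist

def tensor_parallel_mlp(x, w1_col_shard, w2_row_shard, tp_group):
    # Column-parallel first linear: each rank holds a column slice of W1,
    # so the partial activations need no communication yet.
    h = torch.matmul(x, w1_col_shard)
    # Row-parallel second linear: each rank holds a row slice of W2 and
    # produces a partial sum of the full output.
    y = torch.matmul(h, w2_row_shard)
    # One all_reduce sums the partial outputs; in this pattern it is the
    # only tensor-parallel collective needed, and no all_gather is required.
    dist.all_reduce(y, op=dist.ReduceOp.SUM, group=tp_group)
    return y

def send_to_next_stage(activations, next_stage_rank):
    # Pipeline parallelism only moves activations between adjacent stages,
    # which is a point-to-point send, not a collective.
    dist.send(activations, dst=next_stage_rank)

def recv_from_prev_stage(shape, dtype, prev_stage_rank):
    buf = torch.empty(shape, dtype=dtype)
    dist.recv(buf, src=prev_stage_rank)
    return buf
```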

@zhhangBian
Author

@AgrawalAmey Thank you for your response!

From my understanding, partitioning along both the row and the column dimensions would involve an all-reduce, while partitioning along only the row or only the column dimension would require an all-gather operation.
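
To make my understanding concrete, here is a tiny numpy check I put together (purely illustrative, not based on Vidur's code):

```python
import numpy as np

# Toy check: with W1 split by columns and W2 split by rows across two
# "ranks", summing the partial outputs (all_reduce) recovers the full
# result, whereas a single column-split linear would need its partial
# outputs concatenated (all_gather) instead.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((16, 8))

h0, h1 = x @ w1[:, :8], x @ w1[:, 8:]        # column-parallel first linear
y0, y1 = h0 @ w2[:8, :], h1 @ w2[8:, :]      # row-parallel second linear

assert np.allclose(y0 + y1, x @ w1 @ w2)                      # sum -> all_reduce suffices
assert np.allclose(np.concatenate([h0, h1], axis=1), x @ w1)  # concat -> all_gather needed
```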

However, I’m still curious: could you elaborate on what you mean by “only use all_reduce and send/recv operations, which are sufficient to represent tensor and pipeline parallelism”? Could you also explain the underlying principle and how this is implemented?

Thank you so much for your help!
