[Distributed][refactor] Add base class for device-specific communicator #11324
base: main
class CommunicatorABC(ABC):
    """
    CommunicatorBase ABC
    """

    @abstractmethod
    def all_reduce(self, input_: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError

    def gather(self,
               input_: torch.Tensor,
               dst: int = 0,
               dim: int = -1) -> Optional[torch.Tensor]:
        raise NotImplementedError

    def all_gather(self, input_: torch.Tensor, dim: int = -1) -> torch.Tensor:
        raise NotImplementedError


class CommunicatorBase(CommunicatorABC):
why are these two classes instead of one?
The abstract class just provides a unified interface and indicates the object type, as in the following:
vllm/vllm/distributed/parallel_state.py
Line 229 in 6de2b98
self.communicator: CommunicatorABC = CommunicatorBase(
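To illustrate the point about a unified interface, here is a minimal, torch-free sketch. The names `ToyCommunicatorABC`, `ToyCommunicatorBase`, and `ToyGroup` are hypothetical stand-ins, and plain Python lists replace `torch.Tensor`; this is not the code in this PR, only the shape of the idea.

```python
from abc import ABC, abstractmethod
from typing import List


class ToyCommunicatorABC(ABC):
    """Interface only: states what every communicator must offer."""

    @abstractmethod
    def all_reduce(self, input_: List[float]) -> List[float]:
        raise NotImplementedError


class ToyCommunicatorBase(ToyCommunicatorABC):
    """Default implementation (assumption: a single-rank group,
    so the reduction is just a copy of the input)."""

    def all_reduce(self, input_: List[float]) -> List[float]:
        return list(input_)


class ToyGroup:
    def __init__(self) -> None:
        # Annotated with the interface type, built from a concrete class.
        self.communicator: ToyCommunicatorABC = ToyCommunicatorBase()


group = ToyGroup()
result = group.communicator.all_reduce([1.0, 2.0])
```

The annotation tells readers and type checkers that any subclass of the interface may be assigned here, while the abstract class itself cannot be instantiated by mistake.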
@youkaichao What do you think about keeping both CommunicatorABC and CommunicatorBase here? Please let me know if you have any suggestions.
Signed-off-by: Mengqing Cao <[email protected]>
part of #11162
This PR provides a base class CommunicatorBase for device-specific communicators (HpuCommunicator, TpuCommunicator and XpuCommunicator), avoiding the cumbersome dispatch in each communication operator of GroupCoordinator, e.g., https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py#L342-L353
In this PR, the communication-related classes are organized as shown in the following figure. This allows new backends to implement their own communicators and have them dispatched dynamically through the platform.
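The dispatch idea can be sketched roughly as follows. Everything here is hypothetical and simplified: the `Sketch*` class names, the `PLATFORM_COMMUNICATORS` table, and the use of Python lists instead of tensors are assumptions made for illustration, not the actual vLLM code.

```python
from typing import Dict, List, Type


class SketchCommunicatorBase:
    """Hypothetical base: default all_reduce for a single-rank group."""

    def all_reduce(self, input_: List[float]) -> List[float]:
        return list(input_)


class SketchDeviceCommunicator(SketchCommunicatorBase):
    """Hypothetical device-specific subclass. Doubling each value stands in
    for a sum over two ranks that hold identical data."""

    def all_reduce(self, input_: List[float]) -> List[float]:
        return [2 * x for x in input_]


# Hypothetical platform table: each platform names its communicator class.
PLATFORM_COMMUNICATORS: Dict[str, Type[SketchCommunicatorBase]] = {
    "default": SketchCommunicatorBase,
    "device": SketchDeviceCommunicator,
}


class SketchGroupCoordinator:
    def __init__(self, platform: str) -> None:
        # One lookup at construction replaces an if/elif chain per operator.
        self.communicator = PLATFORM_COMMUNICATORS[platform]()

    def all_reduce(self, input_: List[float]) -> List[float]:
        # The coordinator delegates; it no longer needs to know the device.
        return self.communicator.all_reduce(input_)


default_out = SketchGroupCoordinator("default").all_reduce([1.0, 2.0])
device_out = SketchGroupCoordinator("device").all_reduce([1.0, 2.0])
```

With this layout, supporting a new backend means adding one subclass and one table entry, without touching the coordinator's operators.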