[RFC]: Hardware pluggable #11162
For more ephemeral conversations, please join the vLLM Slack and the #sig-extensible-hardware channel for discussion.
Motivation.
Currently, vLLM supports many hardware backends (cpu, cuda, hpu, neuron, openvino, rocm, tpu, xpu). Other backends (Ascend, IBM Spyre) are also eager to be integrated into vLLM.
But as the number of vLLM backends grows, we have encountered some problems:
To solve these problems, a good solution is to make hardware pluggable. There are some benefits:
Proposed Change.
There are two related earlier RFCs: #7131 and #9268.
#7131 (Done) added a generic plugin system to vLLM.
#9268 (In progress) tries to make the backend code modular and decoupled.
These two RFCs make pluggable hardware easier to implement.
Pluggable
Since #7131, vLLM supports out-of-tree plugins, so developers can integrate their own code into vLLM easily, as below.
But which objects can be pluggable is not fully defined or supported. Currently, only models support this mechanism, based on the ModelRegistry feature. The out-of-tree code would look like:
So, back to this RFC: the hardware plugin can be done in the same way, as below.
Usage (the same as before; the only change is to install a new plugin package):
Refactor
So what should the Platform object look like? Currently, the backend-related objects are executor, worker, model_runner, attention, custom ops and device_communicator. Take attention for example: the Platform class should provide an API like get_attention_cls to initialize the attention backend.
Let’s take a look at them one by one.
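A minimal sketch of such a Platform interface, assuming each getter returns a "module:Class" string that the core resolves lazily. Only `get_attention_cls` is named in this RFC; `get_worker_cls`, `resolve_qualname`, and `XXPlatform` are assumptions for illustration, and stdlib classes stand in for real backend classes so the snippet runs on its own.

```python
import importlib
from abc import ABC, abstractmethod


def resolve_qualname(qualname: str):
    """Import 'package.module:ClassName' lazily and return the class."""
    module_name, _, class_name = qualname.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


class Platform(ABC):
    """Hypothetical base interface for a hardware backend."""

    device_name: str = "unknown"

    @abstractmethod
    def get_attention_cls(self) -> str:
        """Return the 'module:Class' path of the attention backend."""

    @abstractmethod
    def get_worker_cls(self) -> str:
        """Return the 'module:Class' path of the worker (assumed method)."""


class XXPlatform(Platform):
    device_name = "xx"

    def get_attention_cls(self) -> str:
        # Stand-in class; a real plugin would point at its own backend.
        return "collections:OrderedDict"

    def get_worker_cls(self) -> str:
        return "collections:Counter"


platform = XXPlatform()
attention_cls = resolve_qualname(platform.get_attention_cls())
print(attention_cls.__name__)  # OrderedDict
```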
Executor
Now, both in the V1 Engine and as a community goal, we want the executor to be backend agnostic. An out-of-tree backend doesn’t need to implement an XXXExecutor anymore. See: [core] platform agnostic executor via collective_rpc #11256
Worker, ModelRunner, AttentionBackend
All of these objects should be implemented in the out-of-tree backend. Once the XXXPlatform is registered, XXXWorker, XXXModelRunner, and XXXAttentionBackend should be registered as well.
Communicator
The Communicator is handled the same way as Worker, ModelRunner, and AttentionBackend. The problem in vLLM now is that there is no base interface for communicators, so we should implement the base class in vLLM first. See: [Distributed][refactor] Add base class for device-specific communicator #11324
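To illustrate what a base interface for communicators could look like, here is a small sketch. The class name `DeviceCommunicatorBase` and the `all_reduce` signature over plain Python lists are assumptions for illustration; the actual base class is defined in #11324 and operates on device tensors.

```python
from abc import ABC, abstractmethod
from typing import List


class DeviceCommunicatorBase(ABC):
    """Hypothetical base class that every backend communicator extends."""

    def __init__(self, rank: int, world_size: int) -> None:
        self.rank = rank
        self.world_size = world_size

    @abstractmethod
    def all_reduce(self, values: List[float]) -> List[float]:
        """Return the element-wise sum of `values` across all ranks."""


class SingleProcessCommunicator(DeviceCommunicatorBase):
    """Trivial case: with one rank, the sum across ranks is the input."""

    def __init__(self) -> None:
        super().__init__(rank=0, world_size=1)

    def all_reduce(self, values: List[float]) -> List[float]:
        return list(values)


comm = SingleProcessCommunicator()
print(comm.all_reduce([1.0, 2.0]))  # [1.0, 2.0]
```

With a shared base class, the core can call `all_reduce` without knowing which device library sits underneath.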
Custom OP
This case is a little complex. Currently, users need to build vLLM from source to support a different backend’s custom ops, using the VLLM_TARGET_DEVICE env. The vLLM package on PyPI is CUDA based. IMO, there are some ways to support out-of-tree custom ops:
1. A _C.so replace mechanism: once vllm-xx-backend-plugin is installed, the _C.so can be replaced with the target device implementation.
2. Load _C.so from the plugin package dynamically.
3. Let multiple .so files exist at the same time; vLLM loads the target one according to the backend’s choice.
Not sure which is better, or whether there is any other way. This needs more discussion.
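As one possible reading of the last option (several libraries coexisting, with the backend choosing one), the selection step could look like the sketch below. The `_C_<backend>.so` naming scheme, the paths, and `select_ops_library` are hypothetical; nothing here reflects how vLLM actually names or loads its extension.

```python
from pathlib import Path
from typing import Iterable


def select_ops_library(backend: str, candidates: Iterable[str]) -> str:
    """Pick the library whose file name is '_C_<backend>.so' (assumed naming)."""
    wanted = f"_C_{backend}.so"
    for path in candidates:
        if Path(path).name == wanted:
            return path
    raise FileNotFoundError(f"no custom-op library for backend {backend!r}")


# Hypothetical install layout: one in-tree library, one shipped by a plugin.
libs = ["vllm/_C_cuda.so", "vllm_xx_plugin/_C_xx.so"]
print(select_ops_library("xx", libs))  # vllm_xx_plugin/_C_xx.so
```

The chosen path could then be handed to a loader such as `torch.ops.load_library`.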
Overall, after the refactor, what the out-of-tree plugin needs to do is implement its own Worker, ModelRunner, AttentionBackend, and Communicator, then provide its Platform to include these objects, and register it with vLLM.
Feedback Period.
Everyday
CC List.
@simon-mo @youkaichao @DarkLight1337 @tlrmchlsmth and other maintainers who are interested.
Any Other Things.
Once the backend plugin is supported, some other things need to be considered as well. For example, how do we make sure the backend runs well? How do we let users know the hardware support matrix? Is CI/CD a mandatory requirement? How do we cooperate on releases? And so on. Here I’d like to start with some topics.
CI/CD
vLLM now uses Buildkite to run unit tests and functional tests. I notice that Buildkite supports self-hosted agents, which makes it possible to integrate different hardware for testing. Hardware contributors can donate hardware resources to the community for CI testing.
V1 Engine
V1 now has its own Executor, Worker, Model Runner and Attention. The backend plugin feature needs to be compatible with V1. I’m not sure about the roadmap for V1. If V1 will be the default engine soon, the better way is to do the refactor work on V1 directly. Otherwise, working on V0 and then migrating to V1 should be good.
Plugin location
Once the backend plugin is supported, the repo can be located anywhere. TBH, the vLLM community may not care where. But for the long-term health of the vLLM ecosystem, it is best to have a specification for backend access and maintenance. The backend can be maintained in vllm-project, but there are necessary requirements:
For this kind of backend, we can call it official support. If the community wants to move in-tree hardware code out-of-tree, or a new backend is added, it can follow this rule.