[RFC]: Hardware pluggable #11162

Open
1 task done
wangxiyuan opened this issue Dec 13, 2024 · 2 comments

wangxiyuan commented Dec 13, 2024

Motivation.

Currently, vLLM supports many hardware backends (cpu, cuda, hpu, neuron, openvino, rocm, tpu, xpu). Other backends, such as Ascend and IBM Spyre, are also eager to be integrated into vLLM.

But as the number of vLLM backends keeps growing, we have encountered some problems:

  • Each backend has its own executor, worker, runner, and attention implementation. This makes the code complex, and backend-specific code is scattered throughout the codebase.
  • It is not easy for the community to keep every backend working. For example, each one needs full CI coverage, continuous contribution from maintainers, and so on.
  • New features are hard to add to vLLM as well, because the set of backend cases they must handle is complex.

To solve these problems, a good solution is to make hardware backends pluggable. This has several benefits:

  • Decoupling the backends makes the code cleaner and easier to maintain.
  • Developers can focus on generic features instead of being bogged down by the many backend variants.
  • Each backend can evolve on its own to ensure availability and timely integration.

Proposed Change.

There are two related earlier RFCs: #7131 and #9268.

#7131 (Done) added a generic plugin system to vLLM.
#9268 (In progress) tries to make the backend code modular and decoupled.

These two RFCs make pluggable hardware easier to implement.

Pluggable

Since #7131, vLLM supports out-of-tree plugins, so developers can easily integrate their own code into vLLM, as shown below.

image

However, which objects can be plugged in is not fully defined or supported. Currently, only models support this mechanism, based on the ModelRegistry feature. The out-of-tree code looks like this:

from vllm import ModelRegistry

def register():
    from .my_opt import MyOPTForCausalLM

    if "MyOPTForCausalLM" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model("MyOPTForCausalLM", MyOPTForCausalLM)
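For context, a plugin system of this kind typically finds such `register` functions through Python entry points declared by the installed plugin package. The following is only a minimal sketch of that discovery step, not vLLM's actual loader: `load_plugins` is a hypothetical helper, and the entry point group name is passed in as a parameter rather than assumed.

```python
from importlib.metadata import entry_points


def load_plugins(group: str) -> list:
    """Call every entry point registered under `group` and return their names.

    Hypothetical helper for illustration; each entry point is expected to
    resolve to a zero-argument callable such as the `register` function above.
    """
    eps = entry_points()
    # Python 3.10+ exposes .select(); older versions return a dict of lists.
    if hasattr(eps, "select"):
        selected = eps.select(group=group)
    else:
        selected = eps.get(group, [])

    loaded = []
    for ep in selected:
        register = ep.load()  # import the module and fetch the callable
        register()            # let the plugin register itself
        loaded.append(ep.name)
    return loaded
```

With this approach, installing a plugin package is enough to make its `register` function run at startup; no change to the core package is needed.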

Back to this RFC: a hardware plugin can work in the same way, as follows.

  1. First, vLLM needs to manage a backend list and provide a register API for out-of-tree code.
  2. The out-of-tree backend plugin calls the register API to register the new backend with vLLM.
  3. Finally, users can use the new backend the same way as before.
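The three steps above can be sketched as a small registry. This is an illustration under assumptions, not an existing vLLM API: `PlatformRegistry`, `register_platform`, `get_supported_platforms`, and `AscendPlatform` are all hypothetical names chosen to mirror the ModelRegistry example.

```python
from typing import Callable, Dict, List


class PlatformRegistry:
    """Hypothetical backend list with a register API for out-of-tree code."""

    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}

    def register_platform(self, name: str, factory: Callable[[], object]) -> None:
        # Step 2: an out-of-tree plugin calls this to add its backend.
        if name in self._factories:
            raise ValueError(f"platform {name!r} is already registered")
        self._factories[name] = factory

    def get_supported_platforms(self) -> List[str]:
        # Step 1: the core package manages the list of known backends.
        return sorted(self._factories)

    def resolve(self, name: str) -> object:
        # Step 3: the core package instantiates the backend on demand.
        return self._factories[name]()


class AscendPlatform:
    """Stand-in for a platform class shipped by a plugin package."""

    device_name = "ascend"


registry = PlatformRegistry()
if "ascend" not in registry.get_supported_platforms():
    registry.register_platform("ascend", AscendPlatform)
```

Registering a factory rather than an instance keeps plugin import cheap: nothing device-specific is constructed until the backend is actually selected.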

image

Usage (the same as before; the only change is installing a new plugin package):

pip install vllm
pip install vllm-ascend-plugin
# Inference will run on the Ascend NPU automatically.
from vllm import LLM, SamplingParams

# Sample prompts.
prompts = ["Hello, my name is",]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="facebook/opt-125m")
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Refactor

So what should the Platform object look like? Currently, the backend-related objects are executor, worker, model_runner, attention, custom ops, and device_communicator. Taking attention as an example, the Platform class should provide an API such as get_attention_cls to initialize the attention backend.
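One possible shape for such a Platform class is sketched below. Only the name `get_attention_cls` comes from this RFC; returning a `"module:Class"` string, the `resolve_cls` helper, and `DemoPlatform` are assumptions made for illustration, and a standard-library class stands in for a real attention backend so the snippet runs anywhere.

```python
import importlib
from abc import ABC, abstractmethod


class Platform(ABC):
    """Hypothetical base class describing one hardware backend."""

    device_name: str = "unknown"

    @abstractmethod
    def get_attention_cls(self) -> str:
        """Return the attention backend as a 'module:ClassName' string."""


def resolve_cls(qualname: str) -> type:
    """Import the module lazily and return the named class."""
    module_name, _, cls_name = qualname.partition(":")
    return getattr(importlib.import_module(module_name), cls_name)


class DemoPlatform(Platform):
    """Stand-in for an out-of-tree platform."""

    device_name = "demo"

    def get_attention_cls(self) -> str:
        # A real plugin would point at its own attention backend class;
        # a stdlib class is used here purely as a placeholder.
        return "collections:OrderedDict"


attention_cls = resolve_cls(DemoPlatform().get_attention_cls())
```

Returning a qualified name instead of the class itself lets the core package defer importing device-specific modules until the backend is actually used.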

image

Let’s take a look at them one by one.

  1. Executor
    Both the V1 Engine design and the community goal call for a backend-agnostic executor. An out-of-tree backend doesn’t need to implement XXXExecutor anymore. See: [core] platform agnostic executor via collective_rpc #11256

  2. Worker, ModelRunner, AttentionBackend
    All of these objects should be implemented in the out-of-tree backend. Once the XXXPlatform is registered, the XXXWorker, XXXModelRunner, and XXXAttentionBackend should be registered as well.

  3. Communicator
    The Communicator is handled the same way as Worker, ModelRunner, and AttentionBackend. The problem in vLLM now is that there is no base interface for communicators, so we should implement the base class in vLLM first. See: [Distributed][refactor] Add base class for device-specific communicator #11324

  4. Custom OP
    This case is a little more complex. Currently, users need to build vLLM from source with the VLLM_TARGET_DEVICE environment variable to get custom ops for a different backend. The vLLM package on PyPI is CUDA based. IMO, there are several ways to support out-of-tree custom ops:

    1. Support a _C.so replacement mechanism: once vllm-xx-backend-plugin is installed, _C.so is replaced with the target device's implementation.
    2. Similar to 1, but vLLM ships a minimal package without _C.so and loads it from the plugin package dynamically.
    3. Support dynamic loading, i.e. many .so files exist at the same time and vLLM loads the target one according to the backend's choice.

    I'm not sure which is better or whether there is another way. This needs more discussion.
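To make the third option more concrete, here is a rough sketch of choosing one library out of several by backend name. Everything in it is an assumption for discussion: the `_C_<backend>.so` naming scheme and both helper functions are hypothetical, and the loader only works if the file is a real CPython extension whose init symbol matches the module name.

```python
import importlib.util
from pathlib import Path
from types import ModuleType


def select_custom_op_library(backend: str, search_dir: Path) -> Path:
    """Pick the custom-op library for `backend` from `search_dir`.

    Assumes a hypothetical naming scheme of `_C_<backend>.so`.
    """
    candidate = Path(search_dir) / f"_C_{backend}.so"
    if not candidate.is_file():
        raise FileNotFoundError(f"no custom-op library for backend {backend!r}")
    return candidate


def load_custom_op_library(backend: str, search_dir: Path) -> ModuleType:
    """Import the selected library as a Python extension module."""
    path = select_custom_op_library(backend, search_dir)
    module_name = f"_C_{backend}"
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot build an import spec for {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the extension's init function
    return module
```

The first two options differ mainly in packaging (who ships the file), while this one moves the decision to import time, so the selection logic would have to live in the core package.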

Overall, after the refactor, an out-of-tree plugin needs to implement its own Worker, ModelRunner, AttentionBackend, and Communicator, provide a Platform that includes these objects, and then register it with vLLM.
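Of these pieces, the Communicator is the one that still lacks a base interface. A minimal sketch of what such an interface might look like follows; the class names, the `world_size` attribute, and the single `all_reduce` method are assumptions for illustration and are not taken from #11324.

```python
from abc import ABC, abstractmethod
from typing import List


class CommunicatorBase(ABC):
    """Hypothetical base interface that each backend would subclass."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size

    @abstractmethod
    def all_reduce(self, values: List[float]) -> List[float]:
        """Sum `values` element-wise across all ranks."""


class SingleProcessCommunicator(CommunicatorBase):
    """Trivial implementation: with one rank, all_reduce is the identity."""

    def all_reduce(self, values: List[float]) -> List[float]:
        if self.world_size != 1:
            raise NotImplementedError("only world_size == 1 is supported here")
        return list(values)


comm = SingleProcessCommunicator(world_size=1)
reduced = comm.all_reduce([1.0, 2.0])
```

A device-specific plugin would replace the trivial subclass with one backed by its own collective communication library, while the core package codes only against the base class.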

Feedback Period.

Everyday

CC List.

@simon-mo @youkaichao @DarkLight1337 @tlrmchlsmth and other maintainers who are interested.

Any Other Things.

Once backend plugins are supported, some other things need to be considered as well. For example, how do we make sure the backend runs well? How do users learn the hardware support matrix? Is CI/CD a mandatory requirement? How do we coordinate with releases? Here I’d like to start with some topics.

CI/CD

vLLM now uses Buildkite to run unit tests and functional tests. I notice that Buildkite supports self-hosted agents, which makes it possible to integrate different hardware for testing. Hardware contributors can donate hardware resources to the community for CI testing.

V1 Engine

V1 now has its own Executor, Worker, Model Runner, and Attention. The backend plugin feature needs to be compatible with V1. I’m not sure about the roadmap for V1. If V1 will become the default engine soon, the better way is to do the refactor work on V1 directly. Otherwise, working on V0 and then migrating to V1 should be fine.

Plugin location

Once backend plugins are supported, the plugin repo can be located anywhere. TBH, the vLLM community may not care where it lives. But for the long-term health of the vLLM ecosystem, it is best to have a specification for backend access and maintenance. A backend can be maintained in vllm-project, but there are necessary requirements:

  • Hardware CI/CD is required.
  • Backend developers must ensure continuous contribution.
  • Keep the release cycle the same as vLLM's to make sure the backend can be used at all times.

We can call this kind of backend officially supported. If the community wants to move in-tree hardware code out of tree, or a new backend is added, it can follow this rule.


simon-mo commented Jan 6, 2025

For more ephemeral conversations, please join the vLLM Slack and the #sig-extensible-hardware channel for discussion!
