[RFC]: Support encode only models by Workflow Defined Engine #8453
Benchmarks: tested on 1× RTX 4090, with throughput on the x-axis and latency on the y-axis, so lower right is better. WDE is significantly faster than HF across batch sizes. With the double buffer, I/O and computation can be overlapped, which is slightly faster, but GPU memory usage almost doubles.
Benchmarks of different attention implementations:
Tested on 1× RTX 4090, with throughput on the x-axis and latency on the y-axis, so lower right is better. The Flash Attention backend is the fastest, no surprise at all. When using FLASH_ATTN, bf16 and fp16 run at almost the same speed.
I am doing the final code cleanup. Please list the issues related to prefill-only models and I will resolve as many as possible. The PR is almost 10,000 lines of code, and it is not going to support multimodal LLMs.
This is good work, and I really need it.
Motivation.
As vLLM supports more and more models and features, they require different attention backends, schedulers, executors, and input/output processors. These modules are becoming increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results.
Take support for encode-only models as an example.
Although encode-only models are much simpler than decoder models, the two are very different.
The simplest way to support encode-only models is to implement different modules for different model architectures and load the required modules on demand.
I call this architecture Workflow Defined Engine, or WDE for short.
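A minimal sketch of what such a workflow might look like, assuming modules are named by import strings and resolved lazily. The `Workflow` class, the `wde.prefill_only.*` module paths, and `lazy_load` are illustrative assumptions, not the actual classes from PR #8452:

```python
from dataclasses import dataclass
from importlib import import_module


def lazy_load(path: str):
    """Resolve a "package.module:ClassName" string to the class object on demand."""
    module_name, _, class_name = path.partition(":")
    return getattr(import_module(module_name), class_name)


@dataclass(frozen=True)
class Workflow:
    # Each engine module is named by an import string and only loaded when used.
    scheduler: str
    attn_backend: str
    input_processor: str
    output_processor: str


# Hypothetical workflow for prefill-only (encode-only) models.
PrefillOnlyWorkflow = Workflow(
    scheduler="wde.prefill_only.scheduler:PrefillOnlyScheduler",
    attn_backend="wde.prefill_only.attention:FlashAttentionBackend",
    input_processor="wde.prefill_only.processor:TokenizerInputProcessor",
    output_processor="wde.prefill_only.processor:LastHiddenStatesProcessor",
)
# scheduler_cls = lazy_load(PrefillOnlyWorkflow.scheduler)  # resolved only when needed
```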
Terminology.
The scope of this discussion is slightly larger than encode-only models; it roughly covers three categories:
What these three usages have in common is that there is only a prefill stage. To make the terminology more precise, "prefill only" is used below.
You can think of "prefill only" as a fancier way of writing "encode only".
To add a bit more: natural language processing (NLP) can be divided into natural language understanding (NLU) and natural language generation (NLG). The prefill-only models discussed here are NLU models. NLU, as the name suggests, does not generate new tokens.
Proposed Change.
SUMMARY:
The main thread executes scheduling and all CPU processing, while the GPU thread only executes h2d copies, model execution, and d2h copies; a sketch of this split is shown below.
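A minimal sketch of this CPU/GPU split, assuming a simple queue hand-off between a scheduling thread and a GPU thread. The function names and the `maxsize=2` queue (loosely standing in for the double buffer) are illustrative assumptions, not the actual WDE code:

```python
import queue
import threading


def cpu_loop(batches, work_q: queue.Queue):
    """Main thread: scheduling and all CPU-side processing (tokenize, batch)."""
    for batch in batches:        # one scheduling step per iteration
        work_q.put(batch)        # hand the prepared batch to the GPU thread
    work_q.put(None)             # sentinel: no more work


def gpu_loop(work_q: queue.Queue, results: list):
    """GPU thread: only h2d copies, model execution, and d2h copies."""
    while True:
        batch = work_q.get()
        if batch is None:
            break
        # h2d: move inputs to the device; run the model; d2h: copy outputs back.
        results.append([len(text) for text in batch])  # stand-in for real outputs


work_q: queue.Queue = queue.Queue(maxsize=2)  # maxsize=2 loosely mimics double buffering
results: list = []
gpu_thread = threading.Thread(target=gpu_loop, args=(work_q, results))
gpu_thread.start()
cpu_loop([["hello", "world"], ["prefill", "only"]], work_q)
gpu_thread.join()
print(results)
```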
You can always use the workflow to load new modules at the highest level to support new functionality; see the sketch below.
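For example, building on the hypothetical `Workflow` sketch above, supporting a new capability could amount to declaring a new workflow that points at a different module; the `PooledEmbeddingProcessor` path here is purely illustrative:

```python
from dataclasses import replace

# Reuse the Workflow sketch above; swap only the output processor to add a
# hypothetical new capability (e.g. returning pooled embeddings).
PooledEmbeddingWorkflow = replace(
    PrefillOnlyWorkflow,
    output_processor="wde.prefill_only.processor:PooledEmbeddingProcessor",
)
```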
Feedback Period.
No response
CC List.
No response
Any Other Things.
PTAL #8452
Supported models:
- last_hidden_states, is there an interface for that? If not, what code should be modified to realize it? #853
- [Feature]: Does VLLM only support MistralModel Architecture for embedding? #7915
- [Feature]: Add embeddings api for Llama #6947

Features supported and tested:
WIP:
Features that are not yet implemented but are relatively important:
For small models, data parallelism is more efficient; a sketch of the idea is below.
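A minimal sketch of that idea, assuming each replica holds a full copy of a small model and request shards run concurrently. `EncoderReplica` and `encode_data_parallel` are hypothetical helpers, not vLLM or WDE APIs:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


class EncoderReplica:
    """Hypothetical stand-in for one full copy of a small encode-only model."""

    def __init__(self, device: str):
        self.device = device  # e.g. "cuda:0"

    def encode(self, batch: List[str]) -> List[List[float]]:
        # Placeholder for the real forward pass on this replica's device.
        return [[float(len(text))] for text in batch]


def encode_data_parallel(requests: List[str], replicas: List[EncoderReplica]):
    """Split requests across replicas and run the shards concurrently."""
    n = len(replicas)
    shards = [requests[i::n] for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        shard_results = list(pool.map(lambda pair: pair[0].encode(pair[1]),
                                      zip(replicas, shards)))
    # Note: a real engine would restore the original request order here.
    return [emb for shard in shard_results for emb in shard]


print(encode_data_parallel(["a", "bb", "ccc"],
                           [EncoderReplica("cuda:0"), EncoderReplica("cuda:1")]))
```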
Anyway, I hope vLLM can support prefill-only models as soon as possible.