
[RFC]: Support encode only models by Workflow Defined Engine #8453

Closed
1 task done
noooop opened this issue Sep 13, 2024 · 5 comments · May be fixed by #8964

Comments


noooop commented Sep 13, 2024

Motivation.

As vllm supports more and more models and features, they require different attention implementations, schedulers, executors, and input/output processors. These modules are becoming increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results.

Take support for encode-only models as an example.

Although encode-only models are much simpler than decode-only models, the two are very different.

The simplest way to support encode-only models is to implement different modules for models of different architectures and load the required modules on demand.

I call this architecture Workflow Defined Engine, or WDE for short.

Terminology.

The scope of discussion is slightly larger than encode-only models, and is roughly divided into three categories:

  • Encode-only models (bidirectional Transformers, causal=False). Often fine-tuned as retrievers, rerankers, etc.
  • Decode-only models (masked multi-head attention, causal=True). There are two interesting uses:
    • Output the last hidden states as a feature extractor.
    • Decode-only retriever (I don't know of a better name), e.g. e5-mistral-7b (the only embedding model currently supported by vllm).
    • Whether it has been fine-tuned or not, there is almost no difference in the code.
  • Enable bidirectional. LLM2Vec proposes a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder.
    • Therefore, we need to support an enable_bidirectional flag (set manually or read automatically from the HF config) to enable bidirectional attention; see the attention sketch at the end of this section.

What the above three usages have in common is that there is only a prefill stage. To make the terminology more precise, "prefill only" is used below.

You can think of "prefill only" as a fancier way of saying "encode only".

To add a bit more context: natural language processing (NLP) can be divided into natural language understanding (NLU) and natural language generation (NLG). The prefill-only models discussed here are NLU models; as the name suggests, NLU does not generate new tokens.
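
To make the causal vs. bidirectional distinction concrete, here is a minimal PyTorch sketch (illustrative only, not vllm/WDE code): the only difference between decode-only attention and encode-only / prefill-only attention is whether a causal mask is applied.

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch=1, heads=2, seq_len=4, head_dim=8.
q = torch.randn(1, 2, 4, 8)
k = torch.randn(1, 2, 4, 8)
v = torch.randn(1, 2, 4, 8)

# Decode-only models: masked (causal) attention; each token only
# attends to itself and earlier positions.
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Encode-only / prefill-only models: bidirectional attention; every
# token attends to every other token.
bidirectional_out = F.scaled_dot_product_attention(q, k, v, is_causal=False)

# An enable_bidirectional flag (as proposed above for LLM2Vec-style
# models) would simply select between these two code paths.
```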

Proposed Change.

SUMMARY:

  1. Prefill-only models require simpler attention implementations (prefill only, no KV cache, ...).
  2. Prefill-only models require a simpler scheduler (no KV cache, no preemption, ...).
  3. To support asynchronous scheduling, the model_input_builder needs to be separated from the runner.
    The main thread executes scheduling and all CPU processing, and the GPU thread only executes H2D transfers, model execution, and D2H transfers.
  4. With WDE, there is no need for one module to be compatible with all features.
    You can always use a workflow to load new modules at the highest level to support new features (a sketch of this idea follows the list).
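
To illustrate point 4, a workflow could be little more than a named set of module paths that the engine resolves at startup. The class paths below are hypothetical placeholders, not the actual interfaces in #8964; they only sketch the idea.

```python
from dataclasses import dataclass
from importlib import import_module


@dataclass
class Workflow:
    # Fully qualified class paths; all of these names are hypothetical.
    scheduler: str = "wde.prefill_only.Scheduler"
    attn_backend: str = "wde.prefill_only.FlashAttentionBackend"
    model_input_builder: str = "wde.prefill_only.ModelInputBuilder"
    executor: str = "wde.prefill_only.GPUExecutor"


def load_class(path: str):
    """Resolve 'package.module.ClassName' into the class object."""
    module_name, _, class_name = path.rpartition(".")
    return getattr(import_module(module_name), class_name)


def build_engine(workflow: Workflow):
    # Each architecture family ships its own workflow, so a new model
    # type plugs in new modules instead of patching shared ones.
    scheduler_cls = load_class(workflow.scheduler)
    attn_backend_cls = load_class(workflow.attn_backend)
    builder_cls = load_class(workflow.model_input_builder)
    executor_cls = load_class(workflow.executor)
    # Per point 3, the builder runs on the main (CPU) thread while the
    # executor owns the GPU work (H2D, model execution, D2H).
    return scheduler_cls, attn_backend_cls, builder_cls, executor_cls
```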

Feedback Period.

No response

CC List.

No response

Any Other Things.

PTAL #8452

Supported models:

Features supported and tested:

  • WDE core
  • Attention backends for prefill-only models
    • Flash Attention Backend
    • Torch SDPA Backend
    • XFormers Backend
    • FlashInfer Backend (because prefill-only models do not involve a KV cache, using the FlashInfer backend with prefill-only models actually uses the FLASH_ATTN backend)
    • Torch naive backend (as a control group)
  • Asynchronous scheduling for prefill-only models (simple_execute_loop and double_buffer_execute_loop)
  • Output last hidden states (see the pooling sketch after this list)
  • Enable bidirectional
  • Data parallelism (not fully tested)
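
For context on "output last hidden states": a retriever or reranker typically derives a fixed-size embedding from those hidden states with a pooling step on top. The snippet below is a generic masked mean-pooling example, not part of this PR (some models instead use CLS-token or last-token pooling).

```python
import torch


def mean_pool(last_hidden_states: torch.Tensor,
              attention_mask: torch.Tensor) -> torch.Tensor:
    """Masked mean pooling over the sequence dimension.

    last_hidden_states: [batch, seq_len, hidden]
    attention_mask:     [batch, seq_len], 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_states.dtype)
    summed = (last_hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    embeddings = summed / counts
    # L2-normalize so that dot product equals cosine similarity.
    return torch.nn.functional.normalize(embeddings, p=2, dim=-1)
```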

WIP:

  • Limit GPU memory usage via gpu_memory_utilization to avoid OOM

Features that have not been implemented yet, but are relatively important:

  • Integrate WDE into the vllm entrypoints
  • More attention backend support (I only have a CUDA device):
    • ROCM_FLASH
    • OPENVINO
    • PALLAS
    • IPEX
  • Support distributed executors.
    For small models, data parallelism is more efficient.
    • Tensor parallelism (tensor parallelism is coupled with other parts; can we decouple it?)
    • Pipeline parallelism
  • Limit GPU memory usage via gpu_memory_utilization to avoid OOM
  • Support quantized models
  • Support LoRA
  • Maybe more

Anyway, I hope vllm can support prefill-only models as soon as possible.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
noooop added the RFC label Sep 13, 2024

noooop commented Sep 13, 2024

benchmarks:

for bge-m3
for xlm-roberta

Tested on 1x RTX 4090.

Throughput on the x-axis, latency on the y-axis; lower right is better.

(benchmark plots: xlm-roberta-base, xlm-roberta-large, bge-m3)

WDE is significantly faster than HF across various batch sizes.

(profiler traces: simple_execute_loop and double_buffer_execute_loop)

With the double buffer, IO and computation can be overlapped, which is slightly faster, but GPU memory usage is almost doubled.
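
For readers unfamiliar with the pattern, a double-buffered execute loop overlaps the H2D copy of the next batch with the computation of the current one on separate CUDA streams, which is why two batches of GPU buffers are alive at once. The sketch below is illustrative only, not the implementation in #8964.

```python
import torch


def double_buffer_execute_loop(model, cpu_batches):
    """Overlap the H2D copy of batch i+1 with the computation of batch i.

    Assumes `cpu_batches` yields dicts of pinned-memory CPU tensors so
    the non_blocking copies are truly asynchronous.
    """
    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()
    prev_inputs, prev_ready = None, None
    outputs = []

    for cpu_batch in cpu_batches:
        # Stage the next batch's H2D copy on the copy stream.
        with torch.cuda.stream(copy_stream):
            gpu_inputs = {name: t.to("cuda", non_blocking=True)
                          for name, t in cpu_batch.items()}
            ready = torch.cuda.Event()
            ready.record(copy_stream)

        # Run the previously staged batch on the compute stream.
        if prev_inputs is not None:
            with torch.cuda.stream(compute_stream):
                compute_stream.wait_event(prev_ready)
                outputs.append(model(**prev_inputs))

        prev_inputs, prev_ready = gpu_inputs, ready

    # Flush the last staged batch.
    if prev_inputs is not None:
        with torch.cuda.stream(compute_stream):
            compute_stream.wait_event(prev_ready)
            outputs.append(model(**prev_inputs))

    torch.cuda.synchronize()
    return outputs
```

A production implementation would also have to manage cross-stream memory lifetimes (e.g. Tensor.record_stream) and bound the number of in-flight batches, which this sketch omits.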

noooop changed the title from "[RFC]: Support encode only models (xlm-roberta、bge-m3...) by Workflow Defined Engine" to "[RFC]: Support encode only models by Workflow Defined Engine" on Sep 18, 2024

noooop commented Sep 20, 2024

Benchmarks of different attention implementations:

FlashInfer Backend (because encode-only models do not involve a KV cache, using the FlashInfer backend with encode-only models actually uses the FLASH_ATTN backend)

code

Tested on 1x RTX 4090.

Throughput on the x-axis, latency on the y-axis; lower right is better.

(benchmark plots: fp32, fp16, bf16)

The Flash Attention Backend is the fastest, no surprise at all.

(plot: FLASH_ATTN)

When using FLASH_ATTN, bf16 and fp16 are almost the same speed


noooop commented Sep 26, 2024

@DarkLight1337

I am doing the final code cleanup. Please sort out the issues related to prefill-only models; I will address them as much as possible.

Note that this is almost 10,000 lines of code, and this PR is not going to support multimodal LLMs.

@liweiqing1997

This is great work, and I really need this feature.


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Jan 10, 2025
noooop closed this as completed Jan 10, 2025