
[RFC]: Support encode only models by Workflow Defined Engine #8453

Closed
1 task done
noooop opened this issue Sep 13, 2024 · 5 comments · May be fixed by #8964

Comments


noooop commented Sep 13, 2024

Motivation.

As vllm supports more and more models and features, they require different attention implementations, schedulers, executors, and input/output processors. These modules are becoming increasingly complex, and sometimes new features must be compromised for compatibility, ultimately leading to suboptimal results.

Take support for encode-only models as an example.

Although encode-only models are much simpler than decode-only models, the two are very different.

The simplest way to support encode-only models is to implement different modules for models of different architectures and load the required modules on demand.

I call this architecture Workflow Defined Engine, or WDE for short.

Terminology.

The scope of discussion is slightly larger than encode-only models, and is roughly divided into three categories:

  • Encode-only models (bidirectional Transformers, causal=False). Often fine-tuned as retrievers, rerankers, etc.
  • Decode-only models (masked multi-head attention, causal=True). There are two interesting uses:
    • Output the last hidden states as a feature extractor.
    • Decode-only retriever (I don't know of a better name), e.g. e5-mistral-7b (the only embedding model currently supported by vllm).
    • Whether it has been fine-tuned or not, there is almost no difference in the code.
  • Enable bidirectional. LLM2Vec proposes a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder.
    • Therefore, we need to support an enable_bidirectional flag (set manually or read automatically from the HF config) to enable bidirectional attention; see the attention sketch at the end of this section.

What the above three usages have in common is that there is only a prefill stage. To make the terminology more precise, "prefill only" is used below.

You can think of "prefill only" as a fancier way of saying "encode only".

To add a bit more context: natural language processing (NLP) can be divided into natural language understanding (NLU) and natural language generation (NLG). The prefill-only models discussed here are NLU models; as the name suggests, NLU does not generate new tokens.
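
To make the causal vs. bidirectional distinction concrete, here is a minimal PyTorch sketch (illustrative only, not vllm/WDE code): the only difference between decode-only attention and encode-only / prefill-only attention is whether a causal mask is applied.

```python
import torch
import torch.nn.functional as F

# Toy shapes: batch=1, heads=2, seq_len=4, head_dim=8.
q = torch.randn(1, 2, 4, 8)
k = torch.randn(1, 2, 4, 8)
v = torch.randn(1, 2, 4, 8)

# Decode-only models: masked (causal) attention; each token only
# attends to itself and earlier positions.
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Encode-only / prefill-only models: bidirectional attention; every
# token attends to every other token.
bidirectional_out = F.scaled_dot_product_attention(q, k, v, is_causal=False)

# An enable_bidirectional flag (as proposed above for LLM2Vec-style
# models) would simply select between these two code paths.
```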

Proposed Change.

SUMMARY:

  1. Prefill-only models require simpler attention implementations (prefill only, no KV cache, ...).
  2. Prefill-only models require a simpler scheduler (no KV cache, no preemption, ...).
  3. To support asynchronous scheduling, the model_input_builder needs to be separated from the runner.
    The main thread executes scheduling and all CPU processing, and the GPU thread only executes H2D transfers, model execution, and D2H transfers.
  4. With WDE, there is no need for one module to be compatible with all features.
    You can always use a workflow to load new modules at the highest level to support new features (a sketch of this idea follows the list).
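
To illustrate point 4, a workflow could be little more than a named set of module paths that the engine resolves at startup. The class paths below are hypothetical placeholders, not the actual interfaces in #8964; they only sketch the idea.

```python
from dataclasses import dataclass
from importlib import import_module


@dataclass
class Workflow:
    # Fully qualified class paths; all of these names are hypothetical.
    scheduler: str = "wde.prefill_only.Scheduler"
    attn_backend: str = "wde.prefill_only.FlashAttentionBackend"
    model_input_builder: str = "wde.prefill_only.ModelInputBuilder"
    executor: str = "wde.prefill_only.GPUExecutor"


def load_class(path: str):
    """Resolve 'package.module.ClassName' into the class object."""
    module_name, _, class_name = path.rpartition(".")
    return getattr(import_module(module_name), class_name)


def build_engine(workflow: Workflow):
    # Each architecture family ships its own workflow, so a new model
    # type plugs in new modules instead of patching shared ones.
    scheduler_cls = load_class(workflow.scheduler)
    attn_backend_cls = load_class(workflow.attn_backend)
    builder_cls = load_class(workflow.model_input_builder)
    executor_cls = load_class(workflow.executor)
    # Per point 3, the builder runs on the main (CPU) thread while the
    # executor owns the GPU work (H2D, model execution, D2H).
    return scheduler_cls, attn_backend_cls, builder_cls, executor_cls
```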

Feedback Period.

No response

CC List.

No response

Any Other Things.

PTAL #8452

Supported models:

Features supported and tested:

  • WDE core
  • Attention backends for prefill-only models
    • Flash Attention Backend
    • Torch SDPA Backend
    • XFormers Backend
    • FlashInfer Backend (because prefill-only models do not involve a KV cache, using the FlashInfer backend with prefill-only models actually uses the FLASH_ATTN backend)
    • Torch naive backend (as a control group)
  • Asynchronous scheduling for prefill-only models (simple_execute_loop and double_buffer_execute_loop)
  • Output last hidden states (see the pooling sketch after this list)
  • Enable bidirectional
  • Data parallelism (not fully tested)
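
For context on "output last hidden states": a retriever or reranker typically derives a fixed-size embedding from those hidden states with a pooling step on top. The snippet below is a generic masked mean-pooling example, not part of this PR (some models instead use CLS-token or last-token pooling).

```python
import torch


def mean_pool(last_hidden_states: torch.Tensor,
              attention_mask: torch.Tensor) -> torch.Tensor:
    """Masked mean pooling over the sequence dimension.

    last_hidden_states: [batch, seq_len, hidden]
    attention_mask:     [batch, seq_len], 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_states.dtype)
    summed = (last_hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    embeddings = summed / counts
    # L2-normalize so that dot product equals cosine similarity.
    return torch.nn.functional.normalize(embeddings, p=2, dim=-1)
```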

WIP:

  • Limit GPU memory usage via gpu_memory_utilization to avoid OOM

Features that have not been implemented yet, but are relatively important:

  • Integrate WDE into the vllm entrypoints
  • More attention backend support (I only have a CUDA device):
    • ROCM_FLASH
    • OPENVINO
    • PALLAS
    • IPEX
  • Support distributed executors.
    For small models, data parallelism is more efficient.
    • Tensor parallelism (tensor parallelism is coupled with other parts; can we decouple it?)
    • Pipeline parallelism
  • Limit GPU memory usage via gpu_memory_utilization to avoid OOM
  • Support quantized models
  • Support LoRA
  • Maybe more

Anyway, I hope vllm can support prefill-only models as soon as possible.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
noooop added the RFC label Sep 13, 2024

noooop commented Sep 13, 2024

benchmarks:

for bge-m3
for xlm-roberta

Tested on 1x RTX 4090.

Throughput on the x-axis, latency on the y-axis; lower right is better.

(benchmark plots: xlm-roberta-base, xlm-roberta-large, bge-m3)

WDE is significantly faster than HF across various batch sizes.

(profiler traces: simple_execute_loop and double_buffer_execute_loop)

With the double buffer, IO and computation can be overlapped, which is slightly faster, but GPU memory usage is almost doubled.
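
For readers unfamiliar with the pattern, a double-buffered execute loop overlaps the H2D copy of the next batch with the computation of the current one on separate CUDA streams, which is why two batches of GPU buffers are alive at once. The sketch below is illustrative only, not the implementation in #8964.

```python
import torch


def double_buffer_execute_loop(model, cpu_batches):
    """Overlap the H2D copy of batch i+1 with the computation of batch i.

    Assumes `cpu_batches` yields dicts of pinned-memory CPU tensors so
    the non_blocking copies are truly asynchronous.
    """
    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()
    prev_inputs, prev_ready = None, None
    outputs = []

    for cpu_batch in cpu_batches:
        # Stage the next batch's H2D copy on the copy stream.
        with torch.cuda.stream(copy_stream):
            gpu_inputs = {name: t.to("cuda", non_blocking=True)
                          for name, t in cpu_batch.items()}
            ready = torch.cuda.Event()
            ready.record(copy_stream)

        # Run the previously staged batch on the compute stream.
        if prev_inputs is not None:
            with torch.cuda.stream(compute_stream):
                compute_stream.wait_event(prev_ready)
                outputs.append(model(**prev_inputs))

        prev_inputs, prev_ready = gpu_inputs, ready

    # Flush the last staged batch.
    if prev_inputs is not None:
        with torch.cuda.stream(compute_stream):
            compute_stream.wait_event(prev_ready)
            outputs.append(model(**prev_inputs))

    torch.cuda.synchronize()
    return outputs
```

A production implementation would also have to manage cross-stream memory lifetimes (e.g. Tensor.record_stream) and bound the number of in-flight batches, which this sketch omits.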

noooop changed the title from "[RFC]: Support encode only models (xlm-roberta、bge-m3...) by Workflow Defined Engine" to "[RFC]: Support encode only models by Workflow Defined Engine" on Sep 18, 2024

noooop commented Sep 20, 2024

Benchmarks of different attention implementations:

FlashInfer Backend (because encode-only models do not involve a KV cache, using the FlashInfer backend with encode-only models actually uses the FLASH_ATTN backend)

code

Tested on 1x RTX 4090.

Throughput on the x-axis, latency on the y-axis; lower right is better.

(benchmark plots: fp32, fp16, bf16)

The Flash Attention Backend is the fastest, no surprise at all.

(plot: FLASH_ATTN)

When using FLASH_ATTN, bf16 and fp16 are almost the same speed


noooop commented Sep 26, 2024

@DarkLight1337

I am doing the final code cleanup. Please sort out the issues related to prefill-only models; I will address them as much as possible.

Note that this is almost 10,000 lines of code, and this PR is not going to support multimodal LLMs.

@liweiqing1997

This is great work, and I really need this feature.


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label Jan 10, 2025
noooop closed this as completed Jan 10, 2025