[Core] Changes to support 0.2.0 flashinfer #11314

Open · wants to merge 2 commits into main from flashinfer-0.2-changes

Conversation

@pavanimajety (Contributor) commented on Dec 19, 2024

Datatype and wrapper changes for 0.2.0 flashinfer

Related: #11194
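
For context, a rough, illustrative sketch of what the 0.2.0 wrapper interface looks like from the caller's side: the begin_forward()/forward() pair is replaced by plan()/run(), and the query/KV data types are supplied at plan time. This is not the code in this PR; the toy shapes, dummy page-table values, and the q_data_type/kv_data_type keyword names are assumptions to verify against the flashinfer 0.2.0 docs.

import torch
import flashinfer

# Toy dimensions for illustration only.
num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
batch_size, pages_per_req = 2, 4

# The 0.2.0 wrappers still take a workspace buffer and a KV layout string.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")

# Dummy paged-KV metadata: each request owns pages_per_req full pages.
kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda") * pages_per_req
kv_indices = torch.arange(batch_size * pages_per_req, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

# 0.2.0: plan() replaces begin_forward() and takes the dtypes up front.
wrapper.plan(
    kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    q_data_type=torch.float16,
    kv_data_type=torch.float16,
)

q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
kv_cache = torch.randn(
    batch_size * pages_per_req, 2, page_size, num_kv_heads, head_dim,
    dtype=torch.float16, device="cuda",
)

# 0.2.0: run() replaces forward().
out = wrapper.run(q, kv_cache)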

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default; only the fastcheck CI runs, covering a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@pavanimajety marked this pull request as draft on December 19, 2024 01:00
@pavanimajety force-pushed the flashinfer-0.2-changes branch from c1e4b21 to 5439e7d on December 19, 2024 02:29
@pavanimajety marked this pull request as ready for review on December 19, 2024 02:37
@JaheimLee commented:

I found that flashinfer 0.2.0 uses more memory on rank 0 when tp > 1. I built it from source in AOT mode. Is that normal?

@pavanimajety (Contributor, Author) commented:

@JaheimLee It seems we have a fix; we'll update to Flashinfer 0.2.0.post1. Thanks!

@JaheimLee commented:

> @JaheimLee It seems we have a fix; we'll update to Flashinfer 0.2.0.post1. Thanks!

I still have this problem, and I also got another error:

INFO 12-25 22:54:21 config.py:478] This model supports multiple tasks: {'generate', 'score', 'classify', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 12-25 22:54:22 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-25 22:54:22 config.py:925] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 12-25 22:54:22 config.py:1216] Defaulting to use mp for distributed inference
WARNING 12-25 22:54:22 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 12-25 22:54:22 config.py:604] Async output processing is not supported on the current platform type cuda.
INFO 12-25 22:54:22 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ', speculative_config=None, tokenizer='/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ, num_scheduler_steps=8, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, 
WARNING 12-25 22:54:22 multiproc_worker_utils.py:280] CUDA was previously initialized. We must use the `spawn` multiprocessing start method. Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing for more information.
WARNING 12-25 22:54:22 multiproc_worker_utils.py:312] Reducing Torch parallelism from 28 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 12-25 22:54:22 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 12-25 22:54:22 selector.py:155] Using Flashinfer backend.
WARNING 12-25 22:54:22 registry.py:262] `mm_limits` has already been set for model=/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ, and will be overwritten by the new values.
INFO 12-25 22:54:37 config.py:478] This model supports multiple tasks: {'generate', 'classify', 'score', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 12-25 22:54:38 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-25 22:54:38 config.py:925] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 12-25 22:54:38 config.py:1216] Defaulting to use mp for distributed inference
WARNING 12-25 22:54:38 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 12-25 22:54:38 config.py:604] Async output processing is not supported on the current platform type cuda.
INFO 12-25 22:54:38 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ', speculative_config=None, tokenizer='/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ, num_scheduler_steps=8, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 131, in _main
    prepare(preparation_data)
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 246, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 287, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/data/lijinghui/uv_projects/LLM/test.py", line 19, in <module>
    llm = LLM(
          ^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/utils.py", line 990, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 230, in __init__
    self.llm_engine = self.engine_class.from_engine_args(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 532, in from_engine_args
    engine = cls(
             ^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 288, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
    super().__init__(*args, **kwargs)
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 36, in __init__
    self._init_executor()
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/executor/multiproc_gpu_executor.py", line 58, in _init_executor
    worker = ProcessWorkerWrapper(
             ^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/executor/multiproc_worker_utils.py", line 167, in __init__
    self.process.start()
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/context.py", line 289, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 164, in get_preparation_data
    _check_not_importing_main()
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 140, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html
        
Exception ignored in: <function LLM.__del__ at 0x7f1a6c5fc180>
Traceback (most recent call last):
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 236, in __del__
    if self.llm_engine and hasattr(self.llm_engine, "shutdown"):
       ^^^^^^^^^^^^^^^
AttributeError: 'LLM' object has no attribute 'llm_engine'
ERROR 12-25 22:54:39 multiproc_worker_utils.py:123] Worker VllmWorkerProcess pid 2759640 died, exit code: 1
INFO 12-25 22:54:39 multiproc_worker_utils.py:127] Killing local vLLM worker processes

Here is my code:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
os.environ["VLLM_USE_FLASHINFER_SAMPLER"] = "1"
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "Qwen2.5-72B-Instruct-AWQ"
model_path = os.path.join("/data/pretrained_models", model_name)
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Use the default decoding hyperparameters recommended for Qwen2.5-Instruct;
# max_tokens caps the maximum generation length.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

# Pass the model name or path; GPTQ and AWQ quantized models are supported.
llm = LLM(
    model=model_path,
    gpu_memory_utilization=0.97,
    tensor_parallel_size=2,
    kv_cache_dtype="fp8",
    enforce_eager=True,
    enable_prefix_caching=True,
    num_scheduler_steps=8,
)

# Prepare your prompts
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# generate outputs
outputs = llm.generate([text], sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
