[Core] Changes to support 0.2.0 flashinfer #11314

Open · wants to merge 2 commits into main from flashinfer-0.2-changes

Conversation

@pavanimajety (Contributor) commented on Dec 19, 2024

Datatype and wrapper changes for 0.2.0 flashinfer

Related: #11194
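
For context, a rough, illustrative sketch of what the 0.2.0 wrapper interface looks like from the caller's side: the begin_forward()/forward() pair is replaced by plan()/run(), and the query/KV data types are supplied at plan time. This is not the code in this PR; the toy shapes, dummy page-table values, and the q_data_type/kv_data_type keyword names are assumptions to verify against the flashinfer 0.2.0 docs.

import torch
import flashinfer

# Toy dimensions for illustration only.
num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
batch_size, pages_per_req = 2, 4

# The 0.2.0 wrappers still take a workspace buffer and a KV layout string.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")

# Dummy paged-KV metadata: each request owns pages_per_req full pages.
kv_indptr = torch.arange(batch_size + 1, dtype=torch.int32, device="cuda") * pages_per_req
kv_indices = torch.arange(batch_size * pages_per_req, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda")

# 0.2.0: plan() replaces begin_forward() and takes the dtypes up front.
wrapper.plan(
    kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    q_data_type=torch.float16,
    kv_data_type=torch.float16,
)

q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
kv_cache = torch.randn(
    batch_size * pages_per_req, 2, page_size, num_kv_heads, head_dim,
    dtype=torch.float16, device="cuda",
)

# 0.2.0: run() replaces forward().
out = wrapper.run(q, kv_cache)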

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default; only the fastcheck CI runs, covering a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@pavanimajety marked this pull request as draft on December 19, 2024 01:00
@pavanimajety force-pushed the flashinfer-0.2-changes branch from c1e4b21 to 5439e7d on December 19, 2024 02:29
@pavanimajety marked this pull request as ready for review on December 19, 2024 02:37
@JaheimLee commented:

I found that flashinfer 0.2.0 uses more memory on rank 0 when tp > 1. I built it from source in AOT mode. Is that normal?

@pavanimajety (Contributor, Author) commented:

@JaheimLee It seems we have a fix; we'll update to Flashinfer 0.2.0.post1. Thanks!

@JaheimLee commented:

> @JaheimLee It seems we have a fix; we'll update to Flashinfer 0.2.0.post1. Thanks!

I still have this problem, and I also got another error:

INFO 12-25 22:54:21 config.py:478] This model supports multiple tasks: {'generate', 'score', 'classify', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 12-25 22:54:22 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-25 22:54:22 config.py:925] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 12-25 22:54:22 config.py:1216] Defaulting to use mp for distributed inference
WARNING 12-25 22:54:22 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 12-25 22:54:22 config.py:604] Async output processing is not supported on the current platform type cuda.
INFO 12-25 22:54:22 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ', speculative_config=None, tokenizer='/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ, num_scheduler_steps=8, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, 
WARNING 12-25 22:54:22 multiproc_worker_utils.py:280] CUDA was previously initialized. We must use the `spawn` multiprocessing start method. Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing for more information.
WARNING 12-25 22:54:22 multiproc_worker_utils.py:312] Reducing Torch parallelism from 28 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 12-25 22:54:22 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 12-25 22:54:22 selector.py:155] Using Flashinfer backend.
WARNING 12-25 22:54:22 registry.py:262] `mm_limits` has already been set for model=/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ, and will be overwritten by the new values.
INFO 12-25 22:54:37 config.py:478] This model supports multiple tasks: {'generate', 'classify', 'score', 'reward', 'embed'}. Defaulting to 'generate'.
INFO 12-25 22:54:38 awq_marlin.py:109] The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
INFO 12-25 22:54:38 config.py:925] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
INFO 12-25 22:54:38 config.py:1216] Defaulting to use mp for distributed inference
WARNING 12-25 22:54:38 cuda.py:98] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
WARNING 12-25 22:54:38 config.py:604] Async output processing is not supported on the current platform type cuda.
INFO 12-25 22:54:38 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ', speculative_config=None, tokenizer='/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq_marlin, enforce_eager=True, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/pretrained_models/Qwen2.5-72B-Instruct-AWQ, num_scheduler_steps=8, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 131, in _main
    prepare(preparation_data)
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 246, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 287, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/data/lijinghui/uv_projects/LLM/test.py", line 19, in <module>
    llm = LLM(
          ^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/utils.py", line 990, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 230, in __init__
    self.llm_engine = self.engine_class.from_engine_args(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 532, in from_engine_args
    engine = cls(
             ^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 288, in __init__
    self.model_executor = executor_class(vllm_config=vllm_config, )
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
    super().__init__(*args, **kwargs)
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 36, in __init__
    self._init_executor()
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/executor/multiproc_gpu_executor.py", line 58, in _init_executor
    worker = ProcessWorkerWrapper(
             ^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/executor/multiproc_worker_utils.py", line 167, in __init__
    self.process.start()
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/context.py", line 289, in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/popen_spawn_posix.py", line 42, in _launch
    prep_data = spawn.get_preparation_data(process_obj._name)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 164, in get_preparation_data
    _check_not_importing_main()
  File "/home/mosh/.local/share/uv/python/cpython-3.12.7-linux-x86_64-gnu/lib/python3.12/multiprocessing/spawn.py", line 140, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

        To fix this issue, refer to the "Safe importing of main module"
        section in https://docs.python.org/3/library/multiprocessing.html
        
Exception ignored in: <function LLM.__del__ at 0x7f1a6c5fc180>
Traceback (most recent call last):
  File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 236, in __del__
    if self.llm_engine and hasattr(self.llm_engine, "shutdown"):
       ^^^^^^^^^^^^^^^
AttributeError: 'LLM' object has no attribute 'llm_engine'
ERROR 12-25 22:54:39 multiproc_worker_utils.py:123] Worker VllmWorkerProcess pid 2759640 died, exit code: 1
INFO 12-25 22:54:39 multiproc_worker_utils.py:127] Killing local vLLM worker processes

Here is my code:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
os.environ["VLLM_USE_FLASHINFER_SAMPLER"] = "1"
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "Qwen2.5-72B-Instruct-AWQ"
model_path = os.path.join("/data/pretrained_models", model_name)
# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Use the default decoding hyperparameters recommended for Qwen2.5-Instruct;
# max_tokens caps the maximum generation length.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=512)

# Pass the model name or path; GPTQ and AWQ quantized models are supported.
llm = LLM(
    model=model_path,
    gpu_memory_utilization=0.97,
    tensor_parallel_size=2,
    kv_cache_dtype="fp8",
    enforce_eager=True,
    enable_prefix_caching=True,
    num_scheduler_steps=8,
)

# Prepare your prompts
prompt = "Tell me something about large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# generate outputs
outputs = llm.generate([text], sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
