Pull from head #1

sroy745 · 2024-05-29T03:39:36Z

FILL IN THE PR DESCRIPTION HERE

FIX #xxxx (link existing issues this PR will resolve)

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE

PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title must be prefixed appropriately to indicate the type of change. Please use one of the following:

[Bugfix] for bug fixes.
[CI/Build] for build or continuous integration improvements.
[Doc] for documentation fixes and improvements.
[Model] for adding a new model or improving an existing model. Model name should appear in the title.
[Frontend] for changes to the vLLM frontend (e.g., OpenAI API server, LLM class).
[Kernel] for changes affecting CUDA kernels or other compute kernels.
[Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler).
[Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
[Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
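As an illustration of the prefix rules above, here is a minimal sketch of a title check. The function names and the `ALLOWED` set are hypothetical and are not part of vLLM's tooling; the set simply mirrors the prefixes listed in this checklist, and the rule that `[Hardware]` must be followed by a vendor tag is read from the `[Hardware][Vendor]` entry.

```python
import re

# Prefixes copied from the checklist above (hypothetical helper, not vLLM tooling).
ALLOWED = {"Bugfix", "CI/Build", "Doc", "Model", "Frontend",
           "Kernel", "Core", "Hardware", "Misc"}

_TAG = re.compile(r"\[([^\[\]]+)\]")


def leading_tags(title):
    """Return the bracketed tags found back-to-back at the start of a title."""
    tags = []
    pos = 0
    while True:
        match = _TAG.match(title, pos)
        if match is None:
            break
        tags.append(match.group(1))
        pos = match.end()
    return tags


def has_valid_prefix(title):
    """Check that every leading tag is allowed; [Hardware] needs a vendor tag."""
    tags = leading_tags(title)
    if not tags:
        return False
    i = 0
    while i < len(tags):
        if tags[i] not in ALLOWED:
            return False
        if tags[i] == "Hardware":
            # The tag right after [Hardware] is treated as the vendor name.
            if i + 1 >= len(tags):
                return False
            i += 2
        else:
            i += 1
    return True
```

Under these assumptions, a title such as `[Hardware][AMD] ...` or `[Kernel][Core] ...` passes, while an untagged title or a tag outside the list does not.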

Code Quality

The PR needs to meet the following code quality standards:

We adhere to the Google Python style guide and Google C++ style guide.
Pass all linter checks. Please use format.sh to format your code.
The code needs to be well-documented so that future contributors can easily understand it.
Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
Please add documentation to docs/source/ if the PR modifies the user-facing behavior of vLLM. It helps vLLM users understand and use the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag the PR with rfc-required and might not review it.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!

Co-authored-by: Dash Desai <[email protected]> Co-authored-by: Aurick Qiao <[email protected]> Co-authored-by: Aurick Qiao <[email protected]> Co-authored-by: Aurick Qiao <[email protected]> Co-authored-by: Cody Yu <[email protected]>

This PR improves the FP8 performance of linear layers, which had been lacking before (#4118 (comment) and #4118 (comment)). We noticed that cuBLASLt can find a better algorithm if the first dimension of the matrix is greater than 16, so this PR enlarges matrices appropriately during quantization. This improves FP8 performance and removes the performance regression vs. FP16, in many cases exceeding FP16 performance.

Here are benchmarks on Llama 3 70B (ITL numbers for 1000 input and 50 output tokens at fixed qps and at TP 4); all FP8 measurements are for dynamic quantization:

qps = 1: 24 ms (FP8, this PR), 32 ms (FP8, previous main), 26 ms (FP16)
qps = 2: 26 ms (FP8, this PR), 34 ms (FP8, previous main), 28 ms (FP16)
qps = 4: 33 ms (FP8, this PR), 44 ms (FP8, previous main), 36 ms (FP16)
qps = 6: 46 ms (FP8, this PR), 56 ms (FP8, previous main), 54 ms (FP16)
qps = 8: 85 ms (FP8, this PR), 85 ms (FP8, previous main), 138 ms (FP16)
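The idea of enlarging a matrix so its first dimension exceeds 16 can be sketched in plain Python. This is only an illustration under stated assumptions, not the PR's implementation: the minimum row count of 17, the zero padding, and the helper names `pad_rows`, `matmul`, and `padded_matmul` are all hypothetical, and a list-based multiply stands in for the GPU GEMM. The point it shows is that padding rows and slicing the output afterwards leaves the result unchanged.

```python
MIN_ROWS = 17  # assumption: the smallest row count that is "greater than 16"


def pad_rows(matrix, min_rows=MIN_ROWS):
    """Append zero rows until the matrix has at least min_rows rows."""
    n_cols = len(matrix[0])
    missing = max(0, min_rows - len(matrix))
    return matrix + [[0.0] * n_cols for _ in range(missing)]


def matmul(a, b):
    """Plain-Python stand-in for the GEMM call."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def padded_matmul(a, b, min_rows=MIN_ROWS):
    """Pad a, multiply, then drop the output rows that came from padding."""
    out = matmul(pad_rows(a, min_rows), b)
    return out[:len(a)]
```

Because each output row depends only on the matching input row, the padded rows contribute nothing to the rows that are kept.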

…env (#4737) Storing an exception frame is extremely prone to circular references because the frame holds references to objects. When tensorizer is not installed, the llm instance leaks because the error frame references various modules, which creates a circular reference. I also found that spec decoding has a circular reference issue, and I solved it using weakref.proxy.
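A minimal sketch of the weakref.proxy part of this fix, using only the standard library. The `Engine` and `Worker` classes are hypothetical stand-ins, not vLLM classes, and the check relies on CPython's reference counting. It shows that a strong back-reference keeps the owner alive until the cycle collector runs, while a `weakref.proxy` back-reference lets it be freed immediately.

```python
import gc
import weakref


class Worker:
    """Hypothetical child object that needs a handle to its owner."""

    def __init__(self, owner, use_weakref):
        # A strong back-reference creates an owner <-> worker cycle;
        # weakref.proxy reaches the owner without keeping it alive.
        self.owner = weakref.proxy(owner) if use_weakref else owner


class Engine:
    """Hypothetical owner, standing in for a long-lived engine object."""

    def __init__(self, use_weakref):
        self.worker = Worker(self, use_weakref)


def freed_without_gc(use_weakref):
    """Return True if the Engine is freed by reference counting alone."""
    gc.disable()  # keep the cycle collector out of the experiment
    try:
        engine = Engine(use_weakref)
        probe = weakref.ref(engine)  # observes the engine without owning it
        del engine
        return probe() is None
    finally:
        gc.enable()
        gc.collect()  # clean up the cycle left by the strong-reference case
```

Calling `freed_without_gc(True)` and `freed_without_gc(False)` contrasts the two cases; the strong-reference variant stays alive until `gc.collect()` runs.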

youkaichao and others added 30 commits May 8, 2024 13:14

[CI/Test] fix swap test for multi gpu (#4689)

230c4b3

[Misc] Use vllm-flash-attn instead of flash-attn (#4686)

89579a2

[Dynamic Spec Decoding] Auto-disable by the running queue size (#4592)

f942efb

Co-authored-by: Cade Daniel <[email protected]>

[Speculative decoding] [Bugfix] Fix overallocation in ngram + spec lo…

8b9241b

…gprobs (#4672)

[Bugfix] Fine-tune gptq_marlin configs to be more similar to marlin (#…

e288df0

…4626)

[Frontend] add tok/s speed metric to llm class when using tqdm (#4400)

16bc0a0

Co-authored-by: Michael Goin <[email protected]>

[Frontend] Move async logic outside of constructor (#4674)

f12b20d

[Misc] Remove unnecessary ModelRunner imports (#4703)

190bc83

[Misc] Set block size at initialization & Fix test_model_runner (#4705)

0ee535b

[ROCm] Add support for Punica kernels on AMD GPUs (#3140)

ff5abcd

Co-authored-by: miloice <[email protected]>

[Bugfix] Fix CLI arguments in OpenAI server docs (#4709)

a3c1245

[Bugfix] Update grafana.json (#4711)

cea6443

[Bugfix] Add logs for all model dtype casting (#4717)

be0c518

[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support (#4535)

c833101

[Core][Distributed] refactor pynccl (#4591)

208b71b

[Core][Distributed] refactor pynccl to hold multiple communicators (#4591)

[Misc] Keep only one implementation of the create_dummy_prompt functi…

e965d46

…on. (#4716)

chunked-prefill-doc-syntax (#4603)

51d4094

Fix the docs: https://docs.vllm.ai/en/latest/models/performance.html Co-authored-by: sang <[email protected]>

[Core]fix type annotation for swap_blocks (#4726)

64b77df

[Misc] Apply a couple g++ cleanups (#4719)

dac6a3f

[Bugfix] Fix CLI arguments in OpenAI server docs (#4729)

706588a

[Speculative decoding] CUDA graph support (#4295)

2e7796f

Co-authored-by: Cade Daniel <[email protected]>

[CI] Nits for bad initialization of SeqGroup in testing (#4748)

fcc2994

[Core][Test] fix function name typo in custom allreduce (#4750)

4e12131

[Model][Misc] Add e5-mistral-7b-instruct and Embedding API (#3734)

e254497

[Model] Add support for IBM Granite Code models (#4636)

6eaccb7

[CI/Build] Tweak Marlin Nondeterminism Issues (#4713)

a709e87

[CORE] Improvement in ranks code (#4718)

a7be4d0

rkooo567 and others added 29 commits May 22, 2024 09:02

[misc] remove comments that were supposed to be removed (#4977)

c74c913

[Kernel] Fixup for CUTLASS kernels in CUDA graphs (#4954)

8674f98

Pass the CUDA stream into the CUTLASS GEMMs, to avoid future issues with CUDA graphs

[Misc] Load FP8 kv-cache scaling factors from checkpoints (#4893)

a3a73ab

The 2nd PR for #4532. This PR supports loading FP8 kv-cache scaling factors from a FP8 checkpoint (with .kv_scale parameter).

[Model] LoRA gptbigcode implementation (#3949)

97b0300

[Core] Eliminate parallel worker per-step task scheduling overhead (#…

eb6d3c2

…4894)

[Minor] Fix small typo in llama.py: QKVParallelLinear -> Quantization…

a36de68

…Config (#4991)

[Misc] Take user preference in attention selector (#4960)

ee3eea0

Marlin 24 prefill performance improvement (about 25% better on averag…

6066253

…e) (#4983)

[Bugfix] Update Dockerfile.cpu to fix NameError: name 'vllm_ops' is n…

2ba80be

…ot defined (#5009)

[Core][1/N] Support send/recv in PyNCCL Groups (#4988)

5eda2ea

Signed-off-by: Muralidhar Andoorveedu <[email protected]>

[Kernel] Initial Activation Quantization Support (#4525)

a124232

Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Varun Sundar Rabindranath <[email protected]>

[Core]: Option To Use Prompt Token Ids Inside Logits Processor (#4985)

e3470f8

Co-authored-by: Elisei Smirnov <[email protected]>

[Doc] add ccache guide in doc (#5012)

6a50f4c

Co-authored-by: Michael Goin <[email protected]>

[Bugfix] Fix Mistral v0.3 Weight Loading (#5005)

9197709

Co-authored-by: Cody Yu <[email protected]>

[Core][Bugfix]: fix prefix caching for blockv2 (#4764)

e64fde4

Co-authored-by: Lei Wen <[email protected]>

[Kernel][Backend][Model] Blocksparse flash attention kernel and Phi-3…

8e192ff

…-Small model (#4799) Co-authored-by: beagleski <[email protected]> Co-authored-by: bapatra <[email protected]> Co-authored-by: Barun Patra <[email protected]> Co-authored-by: Michael Goin <[email protected]>

[Misc] add logging level env var (#5045)

325c119

[Dynamic Spec Decoding] Minor fix for disabling speculative decoding (#…

d5a1697

…5000)

[Misc] Make Serving Benchmark More User-friendly (#5044)

f17a1a8

[Bugfix / Core] Prefix Caching Guards (merged with main) (#4846)

1102bef

Co-authored-by: rsnm2 <[email protected]> Co-authored-by: Robert Shaw <[email protected]>

[Core] Allow AQLM on Pascal (#5058)

fbdb7b3

[Model] Add support for falcon-11B (#5069)

890aa93

[Core] Sliding window for block manager v2 (#4545)

d4f3985

Co-authored-by: Ruth Evans <[email protected]>

[BugFix] Fix Embedding Models with TP>1 (#5075)

9ba4155

[Kernel][ROCm][AMD] Add fused_moe Triton configs for MI300X (#4951)

dd8de11

This PR adds Triton kernel configs for the MoE kernel for MI300X

[Docs] Add Dropbox as sponsors (#5089)

290f4ad

[Core] Consolidate prompt arguments to LLM engines (#4328)

5ae5ed1

Co-authored-by: Roger Wang <[email protected]>

[Bugfix] Remove the last EOS token unless explicitly specified (#5077)

dfba529

[Misc] add gpu_memory_utilization arg (#5079)

616e600

Signed-off-by: pandyamarut <[email protected]>

sroy745 merged commit 5650b95 into sroy745:main May 29, 2024
