Upstream merge 24 12 09 #314

Merged
merged 81 commits on Dec 9, 2024

Commits (81)
b45f0d7
[Misc][LoRA] Move the implementation of lora bias to punica.py (#10829)
jeejeelee Dec 2, 2024
519cc6c
[Misc][XPU] Avoid torch compile for XPU platform (#10747)
yma11 Dec 2, 2024
9b14d97
Fix openvino on GPU (#10793)
janimo Dec 2, 2024
4c05edb
[Model] Add TP and BNB quantization support to LlavaMultiModalProject…
Isotr0py Dec 2, 2024
4433195
[Bugfix] Prevent benchmark_throughput.py from using duplicated random…
mgoin Dec 3, 2024
d746268
[Model] support bitsandbytes quantization with minicpm model (#10842)
zixuanzhang226 Dec 3, 2024
a4cf256
[Bugfix] Fix QKVParallelLinearWithShardedLora bias bug (#10844)
jeejeelee Dec 3, 2024
21fe7b4
[core][distributed] add pynccl broadcast (#10843)
youkaichao Dec 3, 2024
dc5ce86
[torch.compile] remove compilation_context and simplify code (#10838)
youkaichao Dec 3, 2024
ef51831
[Doc] Add github links for source code references (#10672)
russellb Dec 3, 2024
3257d44
[Misc] Remove deprecated names (#10817)
DarkLight1337 Dec 3, 2024
9323a31
[Core][Performance] Add XGrammar support for guided decoding and set …
aarnphm Dec 3, 2024
f6084f6
[Speculative Decoding] Move indices to device before filtering output…
zhengy001 Dec 3, 2024
3bc94ca
[V1] VLM - Run the mm_mapper preprocessor in the frontend process (#1…
alexm-neuralmagic Dec 3, 2024
2f2cdc7
[MISC][XPU] quick fix for XPU CI (#10859)
yma11 Dec 3, 2024
7090c27
[Bugfix] Only require XGrammar on x86 (#10865)
mgoin Dec 3, 2024
7c32b68
[Frontend] correctly record prefill and decode time metrics (#10853)
tomeras91 Dec 3, 2024
a061fe6
[Build][Bugfix] Using the correct type hint (#10866)
gshtras Dec 3, 2024
381ac93
[Benchmark] Benchmark structured output with datasets (#10557)
xuechendi Dec 4, 2024
d2bd88b
[CI/Build] Replace mean with torch.all in test_pynccl.py (#10876)
tlrmchlsmth Dec 4, 2024
b5b647b
Drop ROCm load format check (#10767)
wangxiyuan Dec 4, 2024
fa2dea6
[ci/build] Change queue name for Release jobs (#10875)
khluu Dec 4, 2024
c9ca4fc
[ci/build] Job to build and push release image (#10877)
khluu Dec 4, 2024
8db957e
[bugfix] fixed parameter "n" when set parameter "best_of" > 1 (#10854)
o2363286 Dec 4, 2024
c92acb9
[ci/build] Update vLLM postmerge ECR repo (#10887)
khluu Dec 4, 2024
01d079f
[LoRA] Change lora_tokenizers capacity (#10796)
xyang16 Dec 4, 2024
10398b4
[Model] Consolidate ViTs attention implementation without mask (#10893)
Isotr0py Dec 4, 2024
82eb5ea
Benchmark serving structured output (#10880)
xuechendi Dec 4, 2024
e4c34c2
[CI/Build] improve python-only dev setup (#9621)
dtrifiro Dec 4, 2024
2a56e12
[V1] Fix when max_model_len is not divisible by block_size (#10903)
WoosukKwon Dec 5, 2024
7883c2b
[benchmark] Make H100 benchmark optional (#10908)
khluu Dec 5, 2024
8d370e9
[Bugfix] Fallback to outlines for complex json schemas (#10899)
mgoin Dec 5, 2024
aa39a8e
[Doc] Create a new "Usage" section (#10827)
DarkLight1337 Dec 5, 2024
1f958a7
[Bugfix] Fix BNB loader target_modules (#10720)
jeejeelee Dec 5, 2024
39c89e7
[Misc] Update llama 3.2 template to support system prompt with images…
tjohnson31415 Dec 5, 2024
571da8f
[Misc][LoRA] Clean up the function interface of Punica (#10917)
jeejeelee Dec 5, 2024
998eeaf
[CI/Build] Bump test transformers version (#10106)
Isotr0py Dec 5, 2024
a430652
[Misc][Gaudi] Avoid torch.compile and enable lazy collectives (#10897)
kzawora-intel Dec 5, 2024
9743d64
[ci][build] add tests for python only compilation (#10915)
youkaichao Dec 5, 2024
db87eb6
[torch.compile] use size tuning for specific sizes (#10933)
youkaichao Dec 6, 2024
b031a45
[torch.compile] add logging for compilation time (#10941)
youkaichao Dec 6, 2024
222f5b0
[CI/Build] Fix broken multimodal test (#10950)
DarkLight1337 Dec 6, 2024
a1887f2
[torch.compile] fix deprecated code (#10948)
youkaichao Dec 6, 2024
8b59631
[Core] Support Lark grammars for XGrammar (#10870)
mgoin Dec 6, 2024
7406274
[Doc] add KubeAI to serving integrations (#10837)
samos123 Dec 6, 2024
c05cfb6
[misc] fix typo (#10960)
youkaichao Dec 6, 2024
dcdc3fa
[ci] fix broken tests (#10956)
youkaichao Dec 6, 2024
69d357b
[Core] Cleanup startup logging a bit (#10961)
russellb Dec 7, 2024
acf092d
[Bugfix] Fix test-pipeline.yaml (#10973)
jeejeelee Dec 7, 2024
955fa95
[3/N] Support and implement merged input processor for LLaVA model (#…
DarkLight1337 Dec 7, 2024
f13cf9a
[Build] Fix for the Wswitch-bool clang warning (#10060)
gshtras Dec 7, 2024
b26b4cd
[Misc][LoRA] Refactor and clean MergedQKVParallelLinearWithLora imple…
Isotr0py Dec 7, 2024
bf0e382
[Model] Composite weight loading for multimodal Qwen2 (#10944)
DarkLight1337 Dec 7, 2024
1c768fe
[Doc] Explicitly state that InternVL 2.5 is supported (#10978)
DarkLight1337 Dec 7, 2024
39e227c
[Model] Update multi-modal processor to support Mantis(LLaVA) model (…
DarkLight1337 Dec 7, 2024
c889d58
[Doc] Explicitly state that PP isn't compatible with speculative deco…
DarkLight1337 Dec 7, 2024
78029b3
[BugFix][Kernel]: fix illegal memory access in causal_conv1d when con…
xffxff Dec 7, 2024
1b62745
[core][executor] simplify instance id (#10976)
youkaichao Dec 7, 2024
7be15d9
[core][misc] remove use_dummy driver for _run_workers (#10920)
youkaichao Dec 7, 2024
fd57d2b
[torch.compile] allow candidate compile sizes (#10984)
youkaichao Dec 8, 2024
a11f326
[V1] Initial support of multimodal models for V1 re-arch (#10699)
ywang96 Dec 8, 2024
43b05fa
[torch.compile][misc] fix comments (#10993)
youkaichao Dec 8, 2024
46004e8
[misc] clean up and unify logging (#10999)
youkaichao Dec 9, 2024
af7c4a9
[Doc][V1] Add V1 support column for multimodal models (#10998)
ywang96 Dec 9, 2024
d1c2e15
[torch.compile] add dynamo time tracking (#11005)
youkaichao Dec 9, 2024
c690357
[V1] Fix Detokenizer loading in `AsyncLLM` (#10997)
ywang96 Dec 9, 2024
e691b26
[Core] Require xgrammar >= 0.1.6 (#11021)
russellb Dec 9, 2024
aea2fc3
[Platform] Move `async output` check to platform (#10768)
wangxiyuan Dec 9, 2024
25b79d9
[V1] Input Batch Relocation (#10962)
varun-sundar-rabindranath Dec 9, 2024
edc4fa3
[ci/build] Recompile CI dependencies list with Python 3.12 (#11013)
khluu Dec 9, 2024
3b61cb4
[V1] Further reduce CPU overheads in flash-attn (#10989)
WoosukKwon Dec 9, 2024
ca87149
[Misc][LoRA] Abstract PunicaWrapper (#10955)
jeejeelee Dec 9, 2024
a811dd6
[Model] merged input processor for Phi-3-Vision models (#10977)
Isotr0py Dec 9, 2024
cbcbdb1
[Bugfix][Hardware][Gaudi] Bump vllm_hpu_extension version (#11028)
kzawora-intel Dec 9, 2024
1a2f8fb
[v1] fix use compile sizes (#11000)
youkaichao Dec 9, 2024
9c6459e
[Neuron] Upgrade neuron to 2.20.2 (#11016)
xendo Dec 9, 2024
b63ba84
[ROCm][bugfix] speculative decoding worker class (#11035)
gshtras Dec 9, 2024
7c61516
Merge remote-tracking branch 'upstream/main' into develop
gshtras Dec 9, 2024
401a541
format
gshtras Dec 9, 2024
c9f5c24
Merge remote-tracking branch 'origin/main' into upstream_merge_24_12_09
gshtras Dec 9, 2024
c324ea8
Merge remote-tracking branch 'upstream/main' into upstream_merge_24_1…
gshtras Dec 9, 2024
11 changes: 8 additions & 3 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -21,7 +21,7 @@ steps:
         podSpec:
           priorityClassName: perf-benchmark
           containers:
-          - image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
+          - image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
             command:
             - bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
             resources:
@@ -51,7 +51,7 @@ steps:
       queue: H200
     plugins:
     - docker#v5.12.0:
-        image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
+        image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
         command:
         - bash
         - .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
@@ -65,13 +65,18 @@ steps:
         - VLLM_USAGE_SOURCE
         - HF_TOKEN

+  - block: "Run H100 Benchmark"
+    key: block-h100
+    depends_on: ~
+
   - label: "H100"
     # skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
     agents:
       queue: H100
+    depends_on: block-h100
     plugins:
     - docker#v5.12.0:
-        image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
+        image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
         command:
         - bash
         - .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
4 changes: 2 additions & 2 deletions .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
@@ -1,6 +1,6 @@
 #!/bin/sh
-TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-test-repo:pull" | jq -r .token)
-URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-test-repo/manifests/$BUILDKITE_COMMIT"
+TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-postmerge-repo:pull" | jq -r .token)
+URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-postmerge-repo/manifests/$BUILDKITE_COMMIT"

 TIMEOUT_SECONDS=10
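Note: wait-for-image.sh gates the nightly benchmark pipeline on the commit-tagged image existing in the public ECR repository; the hunk above only retargets it from vllm-ci-test-repo to the new vllm-ci-postmerge-repo. For illustration, a minimal Python sketch of the same token-then-poll pattern (the polling loop itself is outside this hunk, so its exact shape here is an assumption):

import json
import os
import time
import urllib.error
import urllib.request

REPO = "q9t5s3a7/vllm-ci-postmerge-repo"
TAG = os.environ.get("BUILDKITE_COMMIT", "latest")
TIMEOUT_SECONDS = 10  # same budget as in the script

# Anonymous pull token for the public ECR registry
token_url = ("https://public.ecr.aws/token?service=public.ecr.aws"
             f"&scope=repository:{REPO}:pull")
with urllib.request.urlopen(token_url) as resp:
    token = json.load(resp)["token"]

# Poll the manifest endpoint until the commit-tagged image appears
manifest_url = f"https://public.ecr.aws/v2/{REPO}/manifests/{TAG}"
deadline = time.time() + TIMEOUT_SECONDS
while time.time() < deadline:
    request = urllib.request.Request(
        manifest_url, headers={"Authorization": f"Bearer {token}"})
    try:
        with urllib.request.urlopen(request):
            break  # manifest found: the image has been pushed
    except urllib.error.HTTPError:
        time.sleep(2)  # not there yet; retry until the deadline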
17 changes: 15 additions & 2 deletions .buildkite/release-pipeline.yaml
@@ -1,7 +1,7 @@
 steps:
 - label: "Build wheel - CUDA 12.1"
   agents:
-    queue: cpu_queue
+    queue: cpu_queue_postmerge
   commands:
   - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag vllm-ci:build-image --target build --progress plain ."
   - "mkdir artifacts"
@@ -18,11 +18,24 @@ steps:
 - label: "Build wheel - CUDA 11.8"
   # depends_on: block-build-cu118-wheel
   agents:
-    queue: cpu_queue
+    queue: cpu_queue_postmerge
   commands:
   - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=11.8.0 --tag vllm-ci:build-image --target build --progress plain ."
   - "mkdir artifacts"
   - "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
   - "bash .buildkite/upload-wheels.sh"
   env:
     DOCKER_BUILDKIT: "1"
+
+- block: "Build release image"
+  depends_on: ~
+  key: block-release-image-build
+
+- label: "Build release image"
+  depends_on: block-release-image-build
+  agents:
+    queue: cpu_queue_postmerge
+  commands:
+  - "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
+  - "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain ."
+  - "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"
7 changes: 5 additions & 2 deletions .buildkite/run-xpu-test.sh
@@ -12,5 +12,8 @@ remove_docker_container() { docker rm -f xpu-test || true; }
 trap remove_docker_container EXIT
 remove_docker_container

-# Run the image and launch offline inference
-docker run --network host --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test python3 examples/offline_inference.py
+# Run the image and test offline inference/tensor parallel
+docker run --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test sh -c '
+  python3 examples/offline_inference.py
+  python3 examples/offline_inference_cli.py -tp 2
+'
16 changes: 12 additions & 4 deletions .buildkite/test-pipeline.yaml
@@ -50,9 +50,9 @@ steps:
   - tests/multimodal
   - tests/test_utils
   - tests/worker
-  - tests/test_lazy_torch_compile.py
+  - tests/standalone_tests/lazy_torch_compile.py
   commands:
-  - python3 test_lazy_torch_compile.py
+  - python3 standalone_tests/lazy_torch_compile.py
   - pytest -v -s mq_llm_engine # MQLLMEngine
   - pytest -v -s async_engine # AsyncLLMEngine
   - NUM_SCHEDULER_STEPS=4 pytest -v -s async_engine/test_async_llm_engine.py
@@ -61,6 +61,13 @@ steps:
   - pytest -v -s test_utils.py # Utils
   - pytest -v -s worker # Worker

+- label: Python-only Installation Test
+  source_file_dependencies:
+  - tests/standalone_tests/python_only_compile.sh
+  - setup.py
+  commands:
+  - bash standalone_tests/python_only_compile.sh
+
 - label: Basic Correctness Test # 30min
   #mirror_hardwares: [amd]
   fast_check: true
@@ -230,7 +237,7 @@ steps:
   source_file_dependencies:
   - vllm/lora
   - tests/lora
-  command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore lora/test_long_context.py lora/test_chatglm3_tp.py lora/test_llama_tp.py
+  command: pytest -v -s lora --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --ignore=lora/test_long_context.py --ignore=lora/test_chatglm3_tp.py --ignore=lora/test_llama_tp.py
   parallelism: 4

 - label: "PyTorch Fullgraph Smoke Test" # 9min
@@ -355,6 +362,7 @@ steps:
   - tests/models/embedding/vision_language
   - tests/models/encoder_decoder/vision_language
   commands:
+  - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
   - pytest -v -s models/decoder_only/audio_language -m 'core_model or quant_model'
   - pytest -v -s --ignore models/decoder_only/vision_language/test_phi3v.py models/decoder_only/vision_language -m 'core_model or quant_model'
   - pytest -v -s models/embedding/vision_language -m core_model
@@ -370,6 +378,7 @@ steps:
   - tests/models/embedding/vision_language
   - tests/models/encoder_decoder/vision_language
   commands:
+  - pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
   - pytest -v -s models/decoder_only/audio_language -m 'not core_model and not quant_model'
   # HACK - run phi3v tests separately to sidestep this transformers bug
   # https://github.com/huggingface/transformers/issues/34307
@@ -481,7 +490,6 @@ steps:

 - label: LoRA TP Test (Distributed)
   num_gpus: 4
-  soft_fail: true
   source_file_dependencies:
   - vllm/lora
   - tests/lora
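Note on the LoRA Test command fix above: pytest's --ignore option consumes a single path per occurrence, so in the old command the two trailing space-separated files were collected as extra test targets rather than skipped; the fixed command repeats --ignore= once per path. A programmatic illustration via pytest.main, using the same paths:

import pytest

# Each skipped file needs its own --ignore=...; space-separated paths after
# a single --ignore would be treated as additional test targets instead.
ignored = [
    "lora/test_long_context.py",
    "lora/test_chatglm3_tp.py",
    "lora/test_llama_tp.py",
]
args = ["-v", "-s", "lora"] + [f"--ignore={p}" for p in ignored]
pytest.main(args)  # equivalent in spirit to the fixed CI invocation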
3 changes: 2 additions & 1 deletion Dockerfile.neuron
@@ -1,5 +1,6 @@
 # default base image
-ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04"
+# https://gallery.ecr.aws/neuron/pytorch-inference-neuronx
+ARG BASE_IMAGE="public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.2-ubuntu20.04"

 FROM $BASE_IMAGE

6 changes: 6 additions & 0 deletions benchmarks/backend_request_func.py
@@ -24,6 +24,7 @@ class RequestFuncInput:
     model: str
     best_of: int = 1
     logprobs: Optional[int] = None
+    extra_body: Optional[dict] = None
     multi_modal_content: Optional[dict] = None
     ignore_eos: bool = False

@@ -36,6 +37,7 @@ class RequestFuncOutput:
     ttft: float = 0.0  # Time to first token
     itl: List[float] = field(
         default_factory=list)  # List of inter-token latencies
+    tpot: float = 0.0  # avg next-token latencies
     prompt_len: int = 0
     error: str = ""

@@ -242,6 +244,8 @@ async def async_request_openai_completions(
         "stream": True,
         "ignore_eos": request_func_input.ignore_eos,
     }
+    if request_func_input.extra_body:
+        payload.update(request_func_input.extra_body)
     headers = {
         "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
     }
@@ -336,6 +340,8 @@ async def async_request_openai_chat_completions(
         "stream": True,
         "ignore_eos": request_func_input.ignore_eos,
     }
+    if request_func_input.extra_body:
+        payload.update(request_func_input.extra_body)
     headers = {
         "Content-Type": "application/json",
         "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}",
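Note: the new extra_body field lets benchmark callers attach backend-specific request fields without changing any function signatures (e.g. guided-decoding parameters for the structured-output benchmarks merged above), and tpot records the average inter-token latency. A self-contained sketch of both patterns; fields not visible in the hunks above (such as prompt) are illustrative only:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RequestFuncInput:  # trimmed to the fields relevant here
    model: str
    prompt: str = ""  # assumed field, not part of this diff
    best_of: int = 1
    logprobs: Optional[int] = None
    extra_body: Optional[dict] = None  # new: extra payload entries
    ignore_eos: bool = False

def build_payload(req: RequestFuncInput) -> dict:
    payload = {
        "model": req.model,
        "prompt": req.prompt,
        "best_of": req.best_of,
        "stream": True,
        "ignore_eos": req.ignore_eos,
    }
    if req.extra_body:  # mirrors the two hunks above: extra_body
        payload.update(req.extra_body)  # extends/overrides the defaults
    return payload

def mean_tpot(itl: List[float]) -> float:
    # tpot as annotated above: the average of the inter-token latencies
    return sum(itl) / len(itl) if itl else 0.0

# e.g. steering guided decoding without a signature change (illustrative):
req = RequestFuncInput(model="m", prompt="p",
                       extra_body={"guided_json": {"type": "object"}})
assert build_payload(req)["guided_json"] == {"type": "object"}
assert mean_tpot([0.25, 0.75]) == 0.5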