
[V1] PR 1/2 for v1 sample and prompt logprobs support #9880

Open · wants to merge 304 commits into base: main

Conversation

@afeldman-nm (Contributor) commented Oct 31, 2024

This PR adds support for sample logprobs and prompt logprobs to vLLM v1.

New behavior:

  • During model execution, the model runner computes a sample-logprobs tensor (if the user requests >0 logprobs) and a prompt-logprobs tensor (if the user requests >0 prompt_logprobs) and transfers them to CPU.
  • Scheduler.update_from_output() pythonizes the sample-logprobs and prompt-logprobs tensors into lists of dicts. This method ensures that each sample-logprob dict has the appropriate number of keys based on how many sample logprobs the user requested (and likewise for prompt logprobs).
  • Each "logprob" (whether sample or prompt) consists of a token's log-probability, rank, and optional detokenized string representation. Prior to this PR, the detokenizer only operated on the generated tokens; with this PR, the detokenizer is also responsible for detokenizing the string representations of the tokens associated with the sample logprobs. (To be consistent with the behavior of v0, prompt logprobs are not detokenized.) A minimal sketch of the pythonized structure follows this list.
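To illustrate the pythonization step described above, here is a minimal sketch (an assumption-laden illustration, not this PR's actual code; the Logprob container and function name are hypothetical) of turning per-step top-k tensors into the list-of-dicts form:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

import torch


@dataclass
class Logprob:
    logprob: float                        # log-probability of this token
    rank: int                             # 1-based rank of this token in the distribution
    decoded_token: Optional[str] = None   # filled in later by the detokenizer


def pythonize_sample_logprobs(
    topk_logprobs: torch.Tensor,   # [num_generated_tokens, num_logprobs], already on CPU
    topk_token_ids: torch.Tensor,  # [num_generated_tokens, num_logprobs]
) -> List[Dict[int, Logprob]]:
    """Turn per-step top-k logprob tensors into one {token_id: Logprob} dict per step."""
    result: List[Dict[int, Logprob]] = []
    for step_logprobs, step_token_ids in zip(topk_logprobs.tolist(), topk_token_ids.tolist()):
        result.append({
            token_id: Logprob(logprob=lp, rank=rank + 1)
            for rank, (token_id, lp) in enumerate(zip(step_token_ids, step_logprobs))
        })
    return result
```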

PR no. 1 (this PR) adds the infrastructure for logprobs support, with limited unit tests.

PR no. 2 (next PR) ports the test_completion.py logprobs tests to v1 and (if necessary) tweaks the infrastructure to make them pass.

@afeldman-nm afeldman-nm marked this pull request as draft October 31, 2024 13:34

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which executes a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀


mergify bot commented Nov 6, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @afeldman-nm please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


mergify bot commented Nov 13, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @afeldman-nm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@afeldman-nm afeldman-nm marked this pull request as ready for review November 18, 2024 16:00
@afeldman-nm afeldman-nm changed the title from [WIP] Complete v1 logprobs support to [WIP] Complete v1 sample and prompt logprobs support on Nov 18, 2024
@alexm-neuralmagic (Collaborator)

Thanks @afeldman-nm! Did you have a chance to measure the performance penalty of enabling logprobs?

```
@@ -91,25 +102,34 @@ def from_new_request(
    prompt_token_ids=request.prompt_token_ids,
    tokenizer=tokenizer,
    stop_buffer_length=stop_buffer_length,
)
logprobs=[] if do_logprobs else None,
```
Collaborator

Instead of making logprobs: Optional[List[SampleLogprobs]] and using None to handle the case where there are no logprobs, you could instead make logprobs: List[SampleLogprobs] and treat an empty list as no logprobs

This will simplify the code in the detokenizer since you can then avoid the None checking

@afeldman-nm (Contributor, Author) commented Nov 26, 2024

In principle I see the value of this. However, I attempted to emulate the behavior expected by the v0 engine unit tests (since there is no reason for the interface spec to change between v0 and v1).

At the link below, I highlighted two lines from a v0 logprobs unit test. The test implicitly configures the number of logprobs as None (by leaving logprobs unspecified in SamplingParams), and it expects the request output's results_logprobs_none[i].outputs[0].logprobs to be None:

```python
for i in range(len(results_logprobs_none)):
    assert results_logprobs_none[i].outputs[0].logprobs is None
    assert results_logprobs_none[i].outputs[0].cumulative_logprob is None
```

This is why I used slightly more complex logic to keep the request output's logprobs as None when the user does not request logprobs.
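As a minimal sketch of this None-preserving behavior (illustrative only; the helper name is hypothetical, not the PR's actual code):

```python
from typing import Dict, List, Optional


def finalize_output_logprobs(
    requested_num_logprobs: Optional[int],
    collected_logprobs: List[Dict[int, float]],
) -> Optional[List[Dict[int, float]]]:
    # Match v0 semantics: when the user never requested logprobs
    # (SamplingParams.logprobs is None), the request output's logprobs
    # field must also be None rather than an empty list.
    if requested_num_logprobs is None:
        return None
    return collected_logprobs


# Mirrors the v0 unit test above: no logprobs requested -> output logprobs is None.
assert finalize_output_logprobs(None, []) is None
# Logprobs requested -> the collected list is passed through unchanged.
assert finalize_output_logprobs(5, [{42: -0.1}]) == [{42: -0.1}]
```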

@comaniac (Collaborator)

Just to confirm: is this PR ready for review? If so, please remove "WIP" from the title to avoid confusion.

@afeldman-nm afeldman-nm changed the title from [WIP] Complete v1 sample and prompt logprobs support to Complete v1 sample and prompt logprobs support on Nov 19, 2024
@afeldman-nm (Contributor, Author) commented Nov 21, 2024

Initial benchmark results

The vLLM server with the V1 engine was launched with the following command:

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.2-3B --trust-remote-code --max-model-len 4096

The vLLM server with the V0 engine was launched with the following command:

VLLM_USE_V1=0 vllm serve meta-llama/Llama-3.2-3B --trust-remote-code --max-model-len 4096

Comparing V1 engine to V0 engine with logprobs=None on main branch

The serving benchmark client was launched with the following command:

python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B --dataset-path ../sharegpt.json

Both runs used the main branch; the V1 or V0 engine was selected using VLLM_USE_V1=1 or VLLM_USE_V1=0, respectively.

| Metric | Main Branch V0 (logprobs=None) | Main Branch V1 (logprobs=None) |
|---|---|---|
| Successful requests | 659 | 660 |
| Benchmark duration (s) | 22.24 | 23.71 |
| Total input tokens | 85032 | 84700 |
| Total generated tokens | 131399 | 131550 |
| Request throughput (req/s) | 29.63 | 27.84 |
| Output token throughput (tok/s) | 5908.23 | 5548.69 |
| Total token throughput (tok/s) | 9731.61 | 9121.28 |
| Mean TTFT (ms) | 8791.65 | 7430.40 |
| Median TTFT (ms) | 6851.74 | 7223.75 |
| P99 TTFT (ms) | 16865.68 | 11917.53 |
| Mean TPOT (ms) | 49.30 | 60.85 |
| Median TPOT (ms) | 41.68 | 56.23 |
| P99 TPOT (ms) | 211.19 | 141.15 |
| Mean ITL (ms) | 32.44 | 43.55 |
| Median ITL (ms) | 20.71 | 21.85 |
| P99 ITL (ms) | 344.67 | 430.56 |

Observations (V1 relative to V0)

  • 6% lower total token throughput
  • 30% lower P99 TTFT
  • 33% lower P99 TPOT
  • 25% higher P99 ITL

Comparing v1_logprobs branch to main with logprobs=None (V1 engine)

The serving benchmark client was launched with the following command:

python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B --dataset-path ../sharegpt.json

The benchmark results were compared between the v1_logprobs and main branches; VLLM_USE_V1=1 was used to enable the vLLM V1 engine in both cases.

| Metric | main | v1_logprobs |
|---|---|---|
| Successful requests | 660 | 660 |
| Benchmark duration (s) | 23.71 | 22.30 |
| Total input tokens | 84700 | 84700 |
| Total generated tokens | 131550 | 131550 |
| Request throughput (req/s) | 27.84 | 29.60 |
| Output token throughput (tok/s) | 5548.69 | 5900.08 |
| Total token throughput (tok/s) | 9121.28 | 9698.92 |
| Mean TTFT (ms) | 7430.40 | 6026.84 |
| Median TTFT (ms) | 7223.75 | 5885.70 |
| P99 TTFT (ms) | 11917.53 | 10310.81 |
| Mean TPOT (ms) | 60.85 | 58.68 |
| Median TPOT (ms) | 56.23 | 55.20 |
| P99 TPOT (ms) | 141.15 | 187.53 |
| Mean ITL (ms) | 43.55 | 41.98 |
| Median ITL (ms) | 21.85 | 22.46 |
| P99 ITL (ms) | 430.56 | 392.53 |

Observations

It appears that when logprobs=None, performance on most metrics is similar or better with the v1_logprobs branch than with the main branch. The exception is P99 TPOT, which is about 30%-40% worse; however, this may be run-to-run variation.

Comparing logprobs=5 to logprobs=None with the v1_logprobs branch

Utilizing the v1_logprobs branch, two different scenarios were benchmarked:

  • logprobs=None: python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B --dataset-path ../sharegpt.json
  • logprobs=5: python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B --dataset-path ../sharegpt.json --logprobs 5

| Metric | logprobs=None | logprobs=5 |
|---|---|---|
| Successful requests | 660 | 662 |
| Benchmark duration (s) | 22.30 | 26.94 |
| Total input tokens | 84700 | 84124 |
| Total generated tokens | 131550 | 132169 |
| Request throughput (req/s) | 29.60 | 24.57 |
| Output token throughput (tok/s) | 5900.08 | 4905.59 |
| Total token throughput (tok/s) | 9698.92 | 8027.94 |
| Mean TTFT (ms) | 6026.84 | 6687.43 |
| Median TTFT (ms) | 5885.70 | 5451.56 |
| P99 TTFT (ms) | 10310.81 | 17033.30 |
| Mean TPOT (ms) | 58.68 | 230.62 |
| Median TPOT (ms) | 55.20 | 92.15 |
| P99 TPOT (ms) | 187.53 | 1325.77 |
| Mean ITL (ms) | 41.98 | 70.19 |
| Median ITL (ms) | 22.46 | 23.57 |
| P99 ITL (ms) | 392.53 | 1159.93 |

Observations

With logprobs=5 on the v1_logprobs branch, compared to logprobs=None on the same branch:

  • ~17% decrease in all throughput metrics
  • 65% higher P99 TTFT
  • 7x higher P99 TPOT
  • 3x higher P99 ITL

Comparing the v1_logprobs branch with V1 engine to the main branch with V0 engine, logprobs=5

Both scenarios used the same benchmark launch command:

python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B --dataset-path ../sharegpt.json --logprobs 5

Since the V1 engine on the main branch does not support logprobs, the V0 engine's logprobs support was used as a baseline. The V1 and V0 engines were selected using VLLM_USE_V1=1 and VLLM_USE_V1=0, respectively.

| Metric | Main Branch V0 | v1_logprobs Branch V1 |
|---|---|---|
| Successful requests | 659 | 662 |
| Benchmark duration (s) | 36.93 | 26.94 |
| Total input tokens | 85032 | 84124 |
| Total generated tokens | 131412 | 132169 |
| Request throughput (req/s) | 17.85 | 24.57 |
| Output token throughput (tok/s) | 3558.73 | 4905.59 |
| Total token throughput (tok/s) | 5861.46 | 8027.94 |
| Mean TTFT (ms) | 9922.62 | 6687.43 |
| Median TTFT (ms) | 6618.40 | 5451.56 |
| P99 TTFT (ms) | 23072.55 | 17033.30 |
| Mean TPOT (ms) | 103.29 | 230.62 |
| Median TPOT (ms) | 105.09 | 92.15 |
| P99 TPOT (ms) | 267.94 | 1325.77 |
| Mean ITL (ms) | 75.31 | 70.19 |
| Median ITL (ms) | 42.60 | 23.57 |
| P99 ITL (ms) | 644.76 | 1159.93 |

Observations (v1_logprobs V1 relative to main V0)

  • 37% higher throughput
  • 26% lower P99 TTFT
  • 5x higher P99 TPOT
  • 80% higher P99 ITL

Overall analysis of benchmark results

  • It appears that the addition of logprobs support in the v1 engine has not significantly degraded performance in the logprobs=None scenario
  • Enabling logprobs=5 in the v1 engine degrades throughput by about 17% and increases TTFT/TPOT/ITL significantly
  • Across most metrics, the V0 and V1 engines with logprobs=None seem to be within 30% of each other. However, with logprobs=5, V1 ITL is 80% higher than V0 and V1 TPOT is 5x higher than V0

Next steps

The default behavior of the V1 engine is that if logprobs are enabled for any request in the batch, then logprobs are computed for all requests; if the maximum number of logprobs requested in the batch is 5, then 5 logprobs are computed for every request in the batch. I hypothesize that computing logprobs only for the requests that require them, and computing only the required number, could reduce the performance difference between V1 and V0 when logprobs > 0.
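As a rough sketch of this idea (an assumption-laden illustration, not this PR's implementation; the function name, shapes, and per-request bookkeeping are hypothetical), logprobs could be computed per request using each request's own top-k:

```python
from typing import Dict, List, Optional

import torch


def per_request_topk_logprobs(
    logits: torch.Tensor,               # [num_reqs, vocab_size] logits for this step
    num_logprobs: List[Optional[int]],  # per-request logprobs setting (None = not requested)
) -> List[Optional[Dict[int, float]]]:
    """Compute top-k logprobs only for the requests that asked for them."""
    out: List[Optional[Dict[int, float]]] = []
    for row, k in zip(logits, num_logprobs):
        if k is None or k <= 0:
            out.append(None)  # no logprobs requested: skip the softmax and top-k entirely
            continue
        row_logprobs = torch.log_softmax(row, dim=-1)
        top_vals, top_ids = row_logprobs.topk(k)
        out.append(dict(zip(top_ids.tolist(), top_vals.tolist())))
    return out
```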

@robertgshaw2-neuralmagic (Collaborator) commented Nov 21, 2024


I don't think you have CUDAGraphs enabled in your V1 benchmarks, so I would not compare against V0 without that turned on, as it makes a big difference.

```python
if sampling_params.prompt_logprobs:
    self.prompt_logprob_reqs.add(req_id)
    # TODO(rob): handle prefix caching and recomputation.
```
Collaborator

@afeldman-nm - this will be broken. We should do this as an immediate follow-up.
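For context on why prefix caching is a problem here, a tiny illustration (hypothetical helper, not vLLM code): with a prefix-cache hit, only the uncached tail of the prompt is run through the model, so logits (and therefore prompt logprobs) are unavailable for the cached positions.

```python
def prompt_positions_with_logits(prompt_len: int, num_cached_tokens: int) -> range:
    # With a prefix-cache hit, the forward pass covers only the uncached tail,
    # so prompt logprobs for positions [0, num_cached_tokens) cannot be computed
    # from this step's logits alone.
    return range(num_cached_tokens, prompt_len)


assert list(prompt_positions_with_logits(8, 0)) == list(range(8))  # no cache hit: all positions covered
assert list(prompt_positions_with_logits(8, 5)) == [5, 6, 7]       # cached prefix skipped
```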

@mergify mergify bot removed the needs-rebase label Jan 3, 2025
Labels: ci/build, frontend, ready