
[V1] PR 1/2 for v1 sample and prompt logprobs support #9880

Open · wants to merge 304 commits into base: main

Conversation

@afeldman-nm (Contributor) commented Oct 31, 2024

This PR adds support for sample logprobs and prompt logprobs to vLLM v1.

New behavior:

  • During model execution, the model runner computes a sample-logprobs tensor (if the user requests >0 logprobs) and a prompt-logprobs tensor (if the user requests >0 prompt_logprobs) and transfers them to CPU.
  • Scheduler.update_from_output() pythonizes the sample-logprobs and prompt-logprobs tensors into lists of dicts. This method ensures that each sample-logprob dict has the appropriate number of keys based on how many sample logprobs the user requested (and likewise for prompt logprobs).
  • Each "logprob" (whether sample or prompt) consists of a token's log-probability, rank, and optional detokenized string representation. Prior to this PR, the detokenizer only operated on the generated tokens; with this PR, the detokenizer is also responsible for detokenizing the string representations of the tokens associated with the sample logprobs. (To be consistent with the behavior of v0, prompt logprobs are not detokenized.) A minimal sketch of the pythonized structure follows this list.
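To illustrate the pythonization step described above, here is a minimal sketch (an assumption-laden illustration, not this PR's actual code; the Logprob container and function name are hypothetical) of turning per-step top-k tensors into the list-of-dicts form:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

import torch


@dataclass
class Logprob:
    logprob: float                        # log-probability of this token
    rank: int                             # 1-based rank of this token in the distribution
    decoded_token: Optional[str] = None   # filled in later by the detokenizer


def pythonize_sample_logprobs(
    topk_logprobs: torch.Tensor,   # [num_generated_tokens, num_logprobs], already on CPU
    topk_token_ids: torch.Tensor,  # [num_generated_tokens, num_logprobs]
) -> List[Dict[int, Logprob]]:
    """Turn per-step top-k logprob tensors into one {token_id: Logprob} dict per step."""
    result: List[Dict[int, Logprob]] = []
    for step_logprobs, step_token_ids in zip(topk_logprobs.tolist(), topk_token_ids.tolist()):
        result.append({
            token_id: Logprob(logprob=lp, rank=rank + 1)
            for rank, (token_id, lp) in enumerate(zip(step_token_ids, step_logprobs))
        })
    return result
```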

PR no. 1 (this PR) adds the infrastructure for logprobs support, with limited unit tests.

PR no. 2 (next PR) ports the test_completion.py logprobs tests to v1 and (if necessary) tweaks the infrastructure to make them pass.

@afeldman-nm afeldman-nm marked this pull request as draft October 31, 2024 13:34

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which executes a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add ready label to the PR
  • Enable auto-merge.

🚀


mergify bot commented Nov 6, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. @afeldman-nm please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


mergify bot commented Nov 13, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @afeldman-nm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@afeldman-nm afeldman-nm marked this pull request as ready for review November 18, 2024 16:00
@afeldman-nm afeldman-nm changed the title from [WIP] Complete v1 logprobs support to [WIP] Complete v1 sample and prompt logprobs support on Nov 18, 2024
@alexm-neuralmagic (Collaborator)

Thanks @afeldman-nm! Did you have a chance to measure the performance penalty of enabling logprobs?

```
@@ -91,25 +102,34 @@ def from_new_request(
    prompt_token_ids=request.prompt_token_ids,
    tokenizer=tokenizer,
    stop_buffer_length=stop_buffer_length,
)
logprobs=[] if do_logprobs else None,
```
Collaborator

Instead of making logprobs: Optional[List[SampleLogprobs]] and using None to handle the case where there are no logprobs, you could instead make logprobs: List[SampleLogprobs] and treat an empty list as no logprobs

This will simplify the code in the detokenizer since you can then avoid the None checking

@afeldman-nm (Contributor, Author) commented Nov 26, 2024

In principle I see the value of this. However, I attempted to emulate the behavior expected by the v0 engine unit tests (since there is no reason for the interface spec to change between v0 and v1).

At the link below, I highlighted two lines from a v0 logprobs unit test. The test implicitly configures the number of logprobs as None (by leaving logprobs unspecified in SamplingParams), and it expects the request output's results_logprobs_none[i].outputs[0].logprobs to be None:

```python
for i in range(len(results_logprobs_none)):
    assert results_logprobs_none[i].outputs[0].logprobs is None
    assert results_logprobs_none[i].outputs[0].cumulative_logprob is None
```

This is why I used slightly more complex logic to keep the request output's logprobs as None when the user does not request logprobs.
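As a minimal sketch of this None-preserving behavior (illustrative only; the helper name is hypothetical, not the PR's actual code):

```python
from typing import Dict, List, Optional


def finalize_output_logprobs(
    requested_num_logprobs: Optional[int],
    collected_logprobs: List[Dict[int, float]],
) -> Optional[List[Dict[int, float]]]:
    # Match v0 semantics: when the user never requested logprobs
    # (SamplingParams.logprobs is None), the request output's logprobs
    # field must also be None rather than an empty list.
    if requested_num_logprobs is None:
        return None
    return collected_logprobs


# Mirrors the v0 unit test above: no logprobs requested -> output logprobs is None.
assert finalize_output_logprobs(None, []) is None
# Logprobs requested -> the collected list is passed through unchanged.
assert finalize_output_logprobs(5, [{42: -0.1}]) == [{42: -0.1}]
```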

@comaniac (Collaborator)

Just to confirm: is this PR ready for review? If so, please remove "WIP" from the title to avoid confusion.

@afeldman-nm afeldman-nm changed the title from [WIP] Complete v1 sample and prompt logprobs support to Complete v1 sample and prompt logprobs support on Nov 19, 2024
@afeldman-nm (Contributor, Author) commented Nov 21, 2024

Initial benchmark results

The vLLM server with the V1 engine was launched with the following command:

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.2-3B --trust-remote-code --max-model-len 4096

The vLLM server with the V0 engine was launched with the following command:

VLLM_USE_V1=0 vllm serve meta-llama/Llama-3.2-3B --trust-remote-code --max-model-len 4096

Comparing V1 engine to V0 engine with logprobs=None on main branch

The serving benchmark client was launched with the following command:

python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B --dataset-path ../sharegpt.json

Both runs used the main branch; the V1 or V0 engine was selected using VLLM_USE_V1=1 or VLLM_USE_V1=0, respectively.

| Metric | Main Branch V0 (logprobs=None) | Main Branch V1 (logprobs=None) |
|---|---|---|
| Successful requests | 659 | 660 |
| Benchmark duration (s) | 22.24 | 23.71 |
| Total input tokens | 85032 | 84700 |
| Total generated tokens | 131399 | 131550 |
| Request throughput (req/s) | 29.63 | 27.84 |
| Output token throughput (tok/s) | 5908.23 | 5548.69 |
| Total token throughput (tok/s) | 9731.61 | 9121.28 |
| Mean TTFT (ms) | 8791.65 | 7430.40 |
| Median TTFT (ms) | 6851.74 | 7223.75 |
| P99 TTFT (ms) | 16865.68 | 11917.53 |
| Mean TPOT (ms) | 49.30 | 60.85 |
| Median TPOT (ms) | 41.68 | 56.23 |
| P99 TPOT (ms) | 211.19 | 141.15 |
| Mean ITL (ms) | 32.44 | 43.55 |
| Median ITL (ms) | 20.71 | 21.85 |
| P99 ITL (ms) | 344.67 | 430.56 |

Observations (V1 relative to V0)

  • 6% lower total token throughput
  • 30% lower P99 TTFT
  • 33% lower P99 TPOT
  • 25% higher P99 ITL

Comparing v1_logprobs branch to main with logprobs=None (V1 engine)

The serving benchmark client was launched with the following command:

python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B --dataset-path ../sharegpt.json

The benchmark results were compared between the v1_logprobs and main branches; VLLM_USE_V1=1 was used to enable the vLLM V1 engine in both cases.

| Metric | main | v1_logprobs |
|---|---|---|
| Successful requests | 660 | 660 |
| Benchmark duration (s) | 23.71 | 22.30 |
| Total input tokens | 84700 | 84700 |
| Total generated tokens | 131550 | 131550 |
| Request throughput (req/s) | 27.84 | 29.60 |
| Output token throughput (tok/s) | 5548.69 | 5900.08 |
| Total token throughput (tok/s) | 9121.28 | 9698.92 |
| Mean TTFT (ms) | 7430.40 | 6026.84 |
| Median TTFT (ms) | 7223.75 | 5885.70 |
| P99 TTFT (ms) | 11917.53 | 10310.81 |
| Mean TPOT (ms) | 60.85 | 58.68 |
| Median TPOT (ms) | 56.23 | 55.20 |
| P99 TPOT (ms) | 141.15 | 187.53 |
| Mean ITL (ms) | 43.55 | 41.98 |
| Median ITL (ms) | 21.85 | 22.46 |
| P99 ITL (ms) | 430.56 | 392.53 |

Observations

It appears that when logprobs=None, performance on most metrics is similar or better with the v1_logprobs branch than with the main branch. The exception is P99 TPOT, which is about 30%-40% worse; however, this may be run-to-run variation.

Comparing logprobs=5 to logprobs=None with the v1_logprobs branch

Utilizing the v1_logprobs branch, two different scenarios were benchmarked:

  • logprobs=None: python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B --dataset-path ../sharegpt.json
  • logprobs=5: python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B --dataset-path ../sharegpt.json --logprobs 5

| Metric | logprobs=None | logprobs=5 |
|---|---|---|
| Successful requests | 660 | 662 |
| Benchmark duration (s) | 22.30 | 26.94 |
| Total input tokens | 84700 | 84124 |
| Total generated tokens | 131550 | 132169 |
| Request throughput (req/s) | 29.60 | 24.57 |
| Output token throughput (tok/s) | 5900.08 | 4905.59 |
| Total token throughput (tok/s) | 9698.92 | 8027.94 |
| Mean TTFT (ms) | 6026.84 | 6687.43 |
| Median TTFT (ms) | 5885.70 | 5451.56 |
| P99 TTFT (ms) | 10310.81 | 17033.30 |
| Mean TPOT (ms) | 58.68 | 230.62 |
| Median TPOT (ms) | 55.20 | 92.15 |
| P99 TPOT (ms) | 187.53 | 1325.77 |
| Mean ITL (ms) | 41.98 | 70.19 |
| Median ITL (ms) | 22.46 | 23.57 |
| P99 ITL (ms) | 392.53 | 1159.93 |

Observations

With logprobs=5 on the v1_logprobs branch, compared to logprobs=None on the same branch:

  • ~17% decrease in all throughput metrics
  • 65% higher P99 TTFT
  • 7x higher P99 TPOT
  • 3x higher P99 ITL

Comparing the v1_logprobs branch with V1 engine to the main branch with V0 engine, logprobs=5

Both scenarios used the same benchmark launch command:

python benchmarks/benchmark_serving.py --model meta-llama/Llama-3.2-3B --dataset-path ../sharegpt.json --logprobs 5

Since the V1 engine on the main branch does not support logprobs, the V0 engine's logprobs support was used as a baseline. The V1 and V0 engines were selected using VLLM_USE_V1=1 and VLLM_USE_V1=0, respectively.

| Metric | Main Branch V0 | v1_logprobs Branch V1 |
|---|---|---|
| Successful requests | 659 | 662 |
| Benchmark duration (s) | 36.93 | 26.94 |
| Total input tokens | 85032 | 84124 |
| Total generated tokens | 131412 | 132169 |
| Request throughput (req/s) | 17.85 | 24.57 |
| Output token throughput (tok/s) | 3558.73 | 4905.59 |
| Total token throughput (tok/s) | 5861.46 | 8027.94 |
| Mean TTFT (ms) | 9922.62 | 6687.43 |
| Median TTFT (ms) | 6618.40 | 5451.56 |
| P99 TTFT (ms) | 23072.55 | 17033.30 |
| Mean TPOT (ms) | 103.29 | 230.62 |
| Median TPOT (ms) | 105.09 | 92.15 |
| P99 TPOT (ms) | 267.94 | 1325.77 |
| Mean ITL (ms) | 75.31 | 70.19 |
| Median ITL (ms) | 42.60 | 23.57 |
| P99 ITL (ms) | 644.76 | 1159.93 |

Observations (v1_logprobs V1 relative to main V0)

  • 37% higher throughput
  • 26% lower P99 TTFT
  • 5x higher P99 TPOT
  • 80% higher P99 ITL

Overall analysis of benchmark results

  • It appears that the addition of logprobs support in the v1 engine has not significantly degraded performance in the logprobs=None scenario
  • Enabling logprobs=5 in the v1 engine degrades throughput by about 17% and increases TTFT/TPOT/ITL significantly
  • Across most metrics, the V0 and V1 engines with logprobs=None seem to be within 30% of each other. However, with logprobs=5, V1 ITL is 80% higher than V0 and V1 TPOT is 5x higher than V0

Next steps

The default behavior of the V1 engine is that if logprobs are enabled for any request in the batch, then logprobs are computed for all requests; if the maximum number of logprobs requested in the batch is 5, then 5 logprobs are computed for every request in the batch. I hypothesize that computing logprobs only for the requests that require them, and computing only the required number, could reduce the performance difference between V1 and V0 when logprobs > 0.
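As a rough sketch of this idea (an assumption-laden illustration, not this PR's implementation; the function name, shapes, and per-request bookkeeping are hypothetical), logprobs could be computed per request using each request's own top-k:

```python
from typing import Dict, List, Optional

import torch


def per_request_topk_logprobs(
    logits: torch.Tensor,               # [num_reqs, vocab_size] logits for this step
    num_logprobs: List[Optional[int]],  # per-request logprobs setting (None = not requested)
) -> List[Optional[Dict[int, float]]]:
    """Compute top-k logprobs only for the requests that asked for them."""
    out: List[Optional[Dict[int, float]]] = []
    for row, k in zip(logits, num_logprobs):
        if k is None or k <= 0:
            out.append(None)  # no logprobs requested: skip the softmax and top-k entirely
            continue
        row_logprobs = torch.log_softmax(row, dim=-1)
        top_vals, top_ids = row_logprobs.topk(k)
        out.append(dict(zip(top_ids.tolist(), top_vals.tolist())))
    return out
```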

@robertgshaw2-neuralmagic (Collaborator) commented Nov 21, 2024


I don't think you have CUDAGraphs enabled in your V1 benchmarks, so I would not compare against V0 without that turned on, as it makes a big difference.

```python
if sampling_params.prompt_logprobs:
    self.prompt_logprob_reqs.add(req_id)
    # TODO(rob): handle prefix caching and recomputation.
```
Collaborator

@afeldman-nm - this will be broken. We should do this as an immediate follow-up.
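For context on why prefix caching is a problem here, a tiny illustration (hypothetical helper, not vLLM code): with a prefix-cache hit, only the uncached tail of the prompt is run through the model, so logits (and therefore prompt logprobs) are unavailable for the cached positions.

```python
def prompt_positions_with_logits(prompt_len: int, num_cached_tokens: int) -> range:
    # With a prefix-cache hit, the forward pass covers only the uncached tail,
    # so prompt logprobs for positions [0, num_cached_tokens) cannot be computed
    # from this step's logits alone.
    return range(num_cached_tokens, prompt_len)


assert list(prompt_positions_with_logits(8, 0)) == list(range(8))  # no cache hit: all positions covered
assert list(prompt_positions_with_logits(8, 5)) == [5, 6, 7]       # cached prefix skipped
```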

@mergify mergify bot removed the needs-rebase label Jan 3, 2025
Labels: ci/build, frontend, ready