[V1] PR 1/2 for v1 sample and prompt logprobs support #9880

Open

wants to merge 321 commits into base: main

Commits (321):
cec0443
refactor
abf149 Nov 26, 2024
7315781
attempted sample_metadata fix; sample logprobs work, prompt logprobs …
abf149 Nov 26, 2024
2cee231
cleaned up sampling metadata
abf149 Nov 26, 2024
cc1e43a
[Hardware][NVIDIA] Add non-NVML CUDA mode for Jetson (#9735)
conroy-cheers Nov 26, 2024
07f9e89
[Bugfix] Fix using `-O[0,3]` with LLM entrypoint (#10677)
mgoin Nov 26, 2024
27e4923
small change
abf149 Nov 26, 2024
1ccef6c
partially re-enabled detokenize cases in test
abf149 Nov 26, 2024
a293451
deferring support for detokenization feature to subsequent SamplingPa…
abf149 Nov 26, 2024
86d0259
[Bugfix] Check bnb_4bit_quant_storage for bitsandbytes (#10642)
mgoin Nov 26, 2024
1f6d7d2
[V1] Refactor model executable interface for multimodal models (#10570)
ywang96 Nov 26, 2024
95dd578
tweak tolerance; fast check
afeldman-nm Nov 29, 2024
dd8ea8b
Remove hard-dependencies of Speculative decode to CUDA workers (#10587)
xuechendi Nov 27, 2024
d414464
[V1] Update interface for idefics3 (#10680)
ywang96 Nov 27, 2024
0f196ac
[Bugfix][SpecDecode] apply sampling parameters to target probabilitie…
jeongin601 Nov 27, 2024
429d17e
[bugfix] fix the default value of llm_int8_threshold in BitsAndBytesC…
yansh97 Nov 27, 2024
89c4f78
[Hardware][Gaudi]add get_name method for HPUAttentionBackend (#10667)
jikunshang Nov 27, 2024
a809ee1
[Misc]Further reduce BNB static variable (#10597)
jeejeelee Nov 27, 2024
57485ba
[Kernel] Remove if-else with identical branches in marlin 2:4 (#10687)
tlrmchlsmth Nov 27, 2024
e255262
[Model] Support telechat2 (#10311)
shunxing12345 Nov 27, 2024
fcc7172
[Bugfix][Hardware][CPU] Fix intel-omp version to avoid segfault (#10700)
bigPYJ1151 Nov 27, 2024
9cc018a
[V1] Update interface for mistral-format Pixtral (#10703)
ywang96 Nov 27, 2024
d65fc83
[ci] fix slow tests (#10698)
youkaichao Nov 27, 2024
046dfc4
[torch.compile] fix shape specialization (#10722)
youkaichao Nov 27, 2024
9bf5c8d
[Bugfix] Fix GGUF inference with FP16 unquantized checkpoint (#10675)
Isotr0py Nov 27, 2024
4e53851
[Bugfix][Mamba] Fix Multistep on Mamba-like models (#10705)
mzusman Nov 27, 2024
8239c6f
[Bugfix] Ignore `lm_head` when loading embedding models (#10719)
DarkLight1337 Nov 27, 2024
5a3a0eb
[Frontend] don't block event loop in tokenization (preprocess) in Ope…
tomeras91 Nov 27, 2024
b22e27c
[misc] upgrade filelock version (#10731)
youkaichao Nov 28, 2024
b5864e2
[Model] support bitsandbytes quantization with minicpm3 model (#10682)
zixuanzhang226 Nov 28, 2024
b9cabc9
[Doc] Update model in arch_overview.rst to match comment (#10701)
spacewander Nov 28, 2024
d61d661
[Bug][CLI] Allow users to disable prefix caching explicitly (#10724)
rickyyx Nov 28, 2024
39f4494
[V1] Do not allocate beyond the max_model_len (#10730)
WoosukKwon Nov 28, 2024
dcdf2f3
[Kernel] Update vllm-flash-attn version (#10736)
WoosukKwon Nov 28, 2024
ea6ed6b
[TPU] Update requirements-tpu (#10726)
richardsliu Nov 28, 2024
ac0b495
[Model] Added GLM-4 series hf format model support vllm==0.6.4 (#10561)
sixsixcoder Nov 28, 2024
1362dac
[Kernel] Update vllm-flash-attn version to reduce CPU overheads (#10742)
WoosukKwon Nov 28, 2024
bc6637c
[V1] Optimize the CPU overheads in FlashAttention custom op (#10733)
WoosukKwon Nov 28, 2024
3733796
[Model] Add Internlm2 LoRA support (#5064)
Isotr0py Nov 28, 2024
170a30c
[Model] Clean up MiniCPMV (#10751)
DarkLight1337 Nov 29, 2024
8d83244
[Misc] typo find in sampling_metadata.py (#10740)
noooop Nov 29, 2024
d8499c0
[Bugfix] Fix Idefics3 bug (#10778)
jeejeelee Nov 29, 2024
3c8ced2
[platform] Add verify_quantization in platform. (#10757)
wangxiyuan Nov 29, 2024
5146352
[Bugfix] Fix OpenVino/Neuron `driver_worker` init (#10779)
NickLucche Nov 30, 2024
d95da87
[Model] Refactor Molmo weights loading to use AutoWeightsLoader (#10771)
Isotr0py Nov 30, 2024
7831672
[Interleaved ATTN] Support for Mistral-8B (#10591)
patrickvonplaten Nov 30, 2024
a877540
[doc] format fix (#10789)
wangxiyuan Nov 30, 2024
cbf1489
[Model] Replace embedding models with pooling adapter (#10769)
DarkLight1337 Dec 1, 2024
db1ca39
[Misc] Improve type annotations for `support_torch_compile` (#10763)
DarkLight1337 Dec 1, 2024
d198e8f
[Misc] Rename embedding classes to pooling (#10801)
DarkLight1337 Dec 1, 2024
cf04e11
[doc] add warning about comparing hf and vllm outputs (#10805)
youkaichao Dec 1, 2024
b58062b
[Misc] Adding `MMMU-Pro` vision dataset to serving benchmark (#10804)
ywang96 Dec 1, 2024
bcdb5b8
removed fast tests from pipeline
afeldman-nm Dec 2, 2024
88f7f57
[Core] Implement disagg prefill by StatelessProcessGroup (#10502)
KuntaiDu Dec 2, 2024
02eb179
[Model] Add BNB support to Llava and Pixtral-HF (#10795)
Isotr0py Dec 2, 2024
8d5035d
[core] Avoid metrics log noise when idle - include speculative decodi…
cduk Dec 2, 2024
ab21a28
[Kernel] Use `out` arg in flash_attn_varlen_func (#10811)
WoosukKwon Dec 2, 2024
6643bf2
Fill TorchSDPAAttentionMetadata seq_lens_field for prefill (#10799)
maxdebayser Dec 2, 2024
9464931
[misc] remove xverse modeling file (#10814)
youkaichao Dec 2, 2024
777bb76
[doc]Update config docstring (#10732)
wangxiyuan Dec 2, 2024
221ee79
[Model]: add some tests for aria model (#10770)
xffxff Dec 2, 2024
39cd324
Update vllm/outputs.py
afeldman-nm Dec 2, 2024
5757476
small fixes
afeldman-nm Dec 2, 2024
3d1373c
moved output processing commands into processor
afeldman-nm Dec 2, 2024
05f39a9
[CI/Build] Update `mistral_common` version for tests and docs (#10825)
DarkLight1337 Dec 2, 2024
74274c2
added explanatory comment to EngineCore.update_from_output()
afeldman-nm Dec 2, 2024
c9a7b3f
[misc] use out argument for flash attention (#10822)
youkaichao Dec 2, 2024
7ea421d
Merge branch 'afeldman-nm/v1_logprobs' of https://github.com/neuralma…
afeldman-nm Dec 2, 2024
f22facd
constructing dummy logprobs
afeldman-nm Dec 2, 2024
b16dd79
dummy logprobs with decodes
afeldman-nm Dec 2, 2024
0054ece
passing some detokenizer tests
afeldman-nm Dec 2, 2024
59853d5
fixing error during debug
afeldman-nm Dec 2, 2024
193e60c
existing detokenizer test checks are unbroken; need to add logprobs c…
afeldman-nm Dec 2, 2024
a078f89
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 2, 2024
15f9825
merge
afeldman-nm Dec 3, 2024
26b165e
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 4, 2024
30ea722
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 4, 2024
4fefd62
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 4, 2024
603f2b5
model runner returns logprobs as np arrays
afeldman-nm Dec 4, 2024
ac602d8
new request types
afeldman-nm Dec 4, 2024
2a9ef8c
first pass at only using numpy in engine core
afeldman-nm Dec 4, 2024
2fe9147
tested removal of pythonization from engine core
afeldman-nm Dec 4, 2024
1283010
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 4, 2024
fee1e8e
Merge branch 'v1_logprobs' into move_pyth
afeldman-nm Dec 4, 2024
a46a8e5
wip detokenizer updates
afeldman-nm Dec 4, 2024
0c04576
wip
afeldman-nm Dec 5, 2024
0f04d6e
wip
afeldman-nm Dec 5, 2024
c6831ca
first pass at pythonization moved out of engine
afeldman-nm Dec 5, 2024
238bc46
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 5, 2024
86b18aa
Merge branch 'v1_logprobs' into move_pyth
afeldman-nm Dec 5, 2024
ae7e10c
incremental/non-incremental detokenized text comparison
afeldman-nm Dec 5, 2024
3cffca3
implemented the sample logprobs N+1 scenario in the front end
afeldman-nm Dec 5, 2024
73e4c12
fixed prompt logprob count bug
afeldman-nm Dec 5, 2024
5b49d36
passing one test!
afeldman-nm Dec 5, 2024
66fe6bc
Merge branch 'main' into v1_logprobs
afeldman-nm Dec 5, 2024
0cf2c79
successfully failing cumulative logprobs test
afeldman-nm Dec 5, 2024
49e0b33
cumulative logprob works
afeldman-nm Dec 5, 2024
6558b37
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 6, 2024
5d36dcc
Merge branch 'v1_logprobs_merge' into v1_logprobs
afeldman-nm Dec 6, 2024
e8bd247
wip
afeldman-nm Dec 6, 2024
9f39817
progress toward detok stop token test
afeldman-nm Dec 7, 2024
867bb71
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 7, 2024
58bcc5a
detokenizer stop tokens test passing; some slight engine fixes for th…
afeldman-nm Dec 7, 2024
696401e
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 7, 2024
d8361d3
Merge branch 'main' into v1_logprobs
afeldman-nm Dec 7, 2024
85e58c9
Merge branch 'v1_logprobs_merge' into v1_logprobs
afeldman-nm Dec 7, 2024
6320868
refactored detokenizer
afeldman-nm Dec 7, 2024
54abd99
wip
afeldman-nm Dec 7, 2024
7852bb2
incremental detokenization test now also checks logprobs
afeldman-nm Dec 7, 2024
8d82049
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 7, 2024
f6d4329
woosuk code structure suggestion
afeldman-nm Dec 7, 2024
aa15b75
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 7, 2024
a4eb6bc
detokenizer tests refactor
afeldman-nm Dec 7, 2024
06185d0
refactor
afeldman-nm Dec 7, 2024
90ed53d
refactoring
afeldman-nm Dec 7, 2024
48f4671
refactor
afeldman-nm Dec 7, 2024
7121739
refactoring to make logprobs var names clearer, touched a lot of file…
afeldman-nm Dec 7, 2024
cef5ddb
Merge branch 'main' into v1_logprobs
afeldman-nm Dec 7, 2024
bed24db
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 7, 2024
5ce8128
move
afeldman-nm Dec 7, 2024
14c7e56
merge
afeldman-nm Dec 9, 2024
bdd0abf
removed VLLM_USE_V1 checks
afeldman-nm Dec 9, 2024
1fc981e
revert logprobs name changes
afeldman-nm Dec 9, 2024
dc63ac1
removing some unnecessary changes'
afeldman-nm Dec 9, 2024
4f30408
removed fast checks
afeldman-nm Dec 9, 2024
d8e9885
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 10, 2024
77488cb
wip test_completion
afeldman-nm Dec 12, 2024
f1a689c
toward completion tests
afeldman-nm Dec 12, 2024
e962aa7
serialization fix
afeldman-nm Dec 12, 2024
05f982f
tried merge, not quite working
afeldman-nm Dec 16, 2024
b22c5e7
formatted vllm/v1/engine/core.py
afeldman-nm Dec 16, 2024
5bc7039
wip merge
afeldman-nm Dec 16, 2024
4d53751
formatting
afeldman-nm Dec 16, 2024
ba3967f
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 16, 2024
fc8340d
woops, didn't pull in latest v1_logprobs changes
afeldman-nm Dec 17, 2024
e084ad0
merge
afeldman-nm Dec 17, 2024
697fc15
cleanup
afeldman-nm Dec 17, 2024
f61d822
remove calling max_logprobs from engine
afeldman-nm Dec 17, 2024
b77c1af
remove change in hpu
afeldman-nm Dec 17, 2024
20b8af1
Merge branch 'main' into v1_logprobs
afeldman-nm Dec 17, 2024
3193659
merge
afeldman-nm Dec 18, 2024
f0c1ba7
deferring v1 test_completion.py to later PR
afeldman-nm Dec 18, 2024
a9df520
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Dec 18, 2024
efce9ca
merge
afeldman-nm Dec 19, 2024
15654c4
simplify changes to scheduler
robertgshaw2-neuralmagic Dec 29, 2024
5857f87
small assert
robertgshaw2-neuralmagic Dec 29, 2024
a49d4c1
nit
robertgshaw2-neuralmagic Dec 29, 2024
dc7d27c
revert moving update from output file
robertgshaw2-neuralmagic Dec 29, 2024
72eed99
updated
robertgshaw2-neuralmagic Jan 1, 2025
7d6eb22
stahs
robertgshaw2-neuralmagic Jan 1, 2025
7c4c231
stash
robertgshaw2-neuralmagic Jan 1, 2025
eab5ceb
updated
robertgshaw2-neuralmagic Jan 1, 2025
970e030
format
robertgshaw2-neuralmagic Jan 2, 2025
54d6f17
single compute_logits, consider switching to N compute_logits
robertgshaw2-neuralmagic Jan 2, 2025
8d4723e
format
robertgshaw2-neuralmagic Jan 2, 2025
fc20e5e
revert update from output changes
robertgshaw2-neuralmagic Jan 2, 2025
fca2dae
update partial reqs to be a list
robertgshaw2-neuralmagic Jan 2, 2025
317ee1e
update
robertgshaw2-neuralmagic Jan 2, 2025
74fc264
updated
robertgshaw2-neuralmagic Jan 2, 2025
db999da
remove unrelated changes
robertgshaw2-neuralmagic Jan 2, 2025
9b430d8
updated
robertgshaw2-neuralmagic Jan 2, 2025
d470e23
nit
robertgshaw2-neuralmagic Jan 2, 2025
ecaa68a
update ModelRunnerOutput
robertgshaw2-neuralmagic Jan 2, 2025
c32b6eb
updated
robertgshaw2-neuralmagic Jan 2, 2025
09d7592
updated
robertgshaw2-neuralmagic Jan 2, 2025
f092bef
cleanup
robertgshaw2-neuralmagic Jan 2, 2025
555861e
remove spurious change
robertgshaw2-neuralmagic Jan 2, 2025
5b7d629
updated
robertgshaw2-neuralmagic Jan 2, 2025
2694b75
less spurious changes
robertgshaw2-neuralmagic Jan 2, 2025
3d651fc
updated
robertgshaw2-neuralmagic Jan 2, 2025
cbe8275
updated to include the sampled logprob
robertgshaw2-neuralmagic Jan 2, 2025
531eeb7
fix logprobs
robertgshaw2-neuralmagic Jan 2, 2025
c4ed7ba
add utility class
robertgshaw2-neuralmagic Jan 2, 2025
a7cb691
format
robertgshaw2-neuralmagic Jan 2, 2025
d001a05
remove cruft
robertgshaw2-neuralmagic Jan 2, 2025
3a257b8
update comment
robertgshaw2-neuralmagic Jan 2, 2025
bd38a24
nit
robertgshaw2-neuralmagic Jan 2, 2025
531c007
stash
robertgshaw2-neuralmagic Jan 2, 2025
0497bf9
update
robertgshaw2-neuralmagic Jan 2, 2025
25041f6
stash
robertgshaw2-neuralmagic Jan 2, 2025
062d0a7
stash
robertgshaw2-neuralmagic Jan 2, 2025
94d9b38
updated
robertgshaw2-neuralmagic Jan 2, 2025
f2cdb61
updated
robertgshaw2-neuralmagic Jan 2, 2025
3c4d9c1
updated
robertgshaw2-neuralmagic Jan 2, 2025
1a36c3b
updated
robertgshaw2-neuralmagic Jan 2, 2025
9e9ec2b
cleanup diff
robertgshaw2-neuralmagic Jan 2, 2025
b99d9cd
clean up diff
robertgshaw2-neuralmagic Jan 2, 2025
2f85118
clean up diff
robertgshaw2-neuralmagic Jan 2, 2025
cb8c87c
more clean
robertgshaw2-neuralmagic Jan 2, 2025
983f2a7
stash
robertgshaw2-neuralmagic Jan 2, 2025
16a8caa
passing mypy
robertgshaw2-neuralmagic Jan 2, 2025
868e653
updated
robertgshaw2-neuralmagic Jan 2, 2025
62b8360
update
robertgshaw2-neuralmagic Jan 2, 2025
7fe4d85
update
robertgshaw2-neuralmagic Jan 2, 2025
92a27aa
updated
robertgshaw2-neuralmagic Jan 2, 2025
e279409
update indexing
robertgshaw2-neuralmagic Jan 2, 2025
bc3942c
reduce changeg
robertgshaw2-neuralmagic Jan 2, 2025
b5647c3
reduce cruft
robertgshaw2-neuralmagic Jan 2, 2025
0db5db0
reduce cruft
robertgshaw2-neuralmagic Jan 2, 2025
ff7d7d2
updated
robertgshaw2-neuralmagic Jan 2, 2025
8aa8baa
update comment
robertgshaw2-neuralmagic Jan 2, 2025
527228d
format
robertgshaw2-neuralmagic Jan 2, 2025
f5d0b57
reduce length of comments
robertgshaw2-neuralmagic Jan 2, 2025
711ff13
updated
robertgshaw2-neuralmagic Jan 2, 2025
3a99615
reduce assets
robertgshaw2-neuralmagic Jan 2, 2025
6bb6d34
updated
robertgshaw2-neuralmagic Jan 2, 2025
d73010d
updated
robertgshaw2-neuralmagic Jan 2, 2025
b8f40df
updated
robertgshaw2-neuralmagic Jan 2, 2025
e806678
clean
robertgshaw2-neuralmagic Jan 2, 2025
afef932
reduce cruft
robertgshaw2-neuralmagic Jan 2, 2025
71580ae
revert crruft
robertgshaw2-neuralmagic Jan 2, 2025
1d52a37
updated
robertgshaw2-neuralmagic Jan 3, 2025
c8eef87
cleanup
robertgshaw2-neuralmagic Jan 3, 2025
b501aed
updated
robertgshaw2-neuralmagic Jan 3, 2025
ac070f8
updated
robertgshaw2-neuralmagic Jan 3, 2025
9a28ddf
updated
robertgshaw2-neuralmagic Jan 3, 2025
d1a956d
update comment
robertgshaw2-neuralmagic Jan 3, 2025
5fd0060
updated
robertgshaw2-neuralmagic Jan 3, 2025
433b93c
merge
robertgshaw2-neuralmagic Jan 3, 2025
0d2f7c8
stash
robertgshaw2-neuralmagic Jan 3, 2025
06b9aba
cleanup
robertgshaw2-neuralmagic Jan 3, 2025
035e2c2
updated
robertgshaw2-neuralmagic Jan 3, 2025
17e41c8
remove
robertgshaw2-neuralmagic Jan 3, 2025
2cb4832
finish cleaning sampler.py
robertgshaw2-neuralmagic Jan 3, 2025
92595a4
updated
robertgshaw2-neuralmagic Jan 3, 2025
c82fc85
updated comment
robertgshaw2-neuralmagic Jan 3, 2025
c3c4f9c
passing mypy!
robertgshaw2-neuralmagic Jan 3, 2025
fec3d15
comment
robertgshaw2-neuralmagic Jan 3, 2025
d002d67
todo -> fixme
robertgshaw2-neuralmagic Jan 3, 2025
3157e8b
updated
robertgshaw2-neuralmagic Jan 3, 2025
60125e3
fixed sampler bug
afeldman-nm Jan 4, 2025
5908cb1
fixed some sampler bugs
afeldman-nm Jan 5, 2025
c5f9565
merge
afeldman-nm Jan 5, 2025
fc52031
wip fixing detokenizer test
afeldman-nm Jan 5, 2025
7dc2756
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Jan 6, 2025
6e57de4
wip
afeldman-nm Jan 6, 2025
599aae8
temporary hack to use pickling
afeldman-nm Jan 6, 2025
2aa1007
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Jan 6, 2025
ae1e1b7
wip detokenizer test
afeldman-nm Jan 6, 2025
ae00145
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Jan 6, 2025
a1c5b2e
fix: logprobs not being wrapped in an array
afeldman-nm Jan 6, 2025
7288370
sample logprobs work
afeldman-nm Jan 6, 2025
85e57d9
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Jan 6, 2025
0e90ccb
detokenizer test passing for sample logprobs
afeldman-nm Jan 6, 2025
c2f48fb
detokenizer tests passing
afeldman-nm Jan 6, 2025
7993d08
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Jan 6, 2025
13177d4
prompt logprobs with chunked prefill!
afeldman-nm Jan 6, 2025
05536f5
cleanup
afeldman-nm Jan 6, 2025
fa64529
Merge branch 'main' into v1_logprobs_merge
afeldman-nm Jan 6, 2025
0d17df8
light refactor
afeldman-nm Jan 6, 2025
f707191
torch serialization with msgpack via enc_/ext_hooksgit status!
afeldman-nm Jan 6, 2025
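
The file diff below adds logprobs coverage to the V1 detokenizer tests. For orientation, here is a minimal usage sketch of the feature this PR series targets; it assumes the V1 engine will expose the same SamplingParams fields (logprobs, prompt_logprobs) and RequestOutput attributes as the existing V0 entrypoint, and it is not part of this PR's diff.

# Illustrative only: request top-k logprobs for sampled tokens and prompt
# tokens, then read them back from the RequestOutput. Field names follow
# the existing vLLM API; whether V1 returns them identically is an
# assumption at this stage of the PR series.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3")
params = SamplingParams(
    max_tokens=16,
    logprobs=5,          # top-5 logprobs per sampled token
    prompt_logprobs=7,   # top-7 logprobs per prompt token
)

for request_output in llm.generate(["Hello, my name is"], params):
    # Prompt logprobs: one dict per prompt token (None for the first token).
    print(request_output.prompt_logprobs)
    for completion in request_output.outputs:
        # Sample logprobs: one {token_id: Logprob} dict per generated token.
        print(completion.logprobs)
        print(completion.cumulative_logprob)

The tests below feed dummy versions of these logprobs through a MockEngineCore and check that the Detokenizer surfaces them on the resulting RequestOutputs.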
146 changes: 129 additions & 17 deletions tests/v1/engine/test_detokenizer.py
@@ -1,12 +1,21 @@
from typing import List
from typing import List, Optional, Tuple

import pytest
import torch
from transformers import AutoTokenizer

from tests.v1.engine.utils import (generate_dummy_prompt_logprobs,
generate_dummy_sample_logprobs,
validate_requests_logprobs)
from vllm.sampling_params import RequestOutputKind, SamplingParams
from vllm.v1.engine import EngineCoreOutput, EngineCoreRequest
from vllm.v1.engine.detokenizer import Detokenizer

# Number of sample logprobs to request when testing sample logprobs
NUM_SAMPLE_LOGPROBS = 5
# Number of prompt logprobs to request when testing prompt logprobs
NUM_PROMPT_LOGPROBS = 7
# Use Mistral instruct tokenizer
TOKENIZER_NAME = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME)

@@ -20,15 +29,34 @@

FULL_TOKENS = [tokenizer(text).input_ids for text in FULL_STRINGS]
PROMPT_LEN = 5

# Tokenize prompts under test & create dummy generated tokens
PROMPT_TOKENS = [
tokenizer(text).input_ids[:PROMPT_LEN] for text in FULL_STRINGS
]
GENERATION_TOKENS = [
tokenizer(text).input_ids[PROMPT_LEN:] for text in FULL_STRINGS
]

# Generate dummy prompt logprobs & sample logprobs for initializing
# the mock engine
PROMPT_LOGPROBS: List[Tuple[torch.Tensor, torch.Tensor]] = [
generate_dummy_prompt_logprobs(prompt_tokens_list=tokens_list,
num_logprobs=NUM_PROMPT_LOGPROBS,
tokenizer=tokenizer)
for tokens_list in PROMPT_TOKENS
]
GENERATION_LOGPROBS = [
generate_dummy_sample_logprobs(sampled_tokens_list=tokens_list,
num_logprobs=NUM_SAMPLE_LOGPROBS,
tokenizer=tokenizer)
for tokens_list in GENERATION_TOKENS
]

PROMPT_STRINGS = [
tokenizer.decode(prompt_tokens, skip_special_tokens=True)
for prompt_tokens in PROMPT_TOKENS
tokenizer.decode(prompt_tokens,
skip_special_tokens=True,
tokenizer=tokenizer) for prompt_tokens in PROMPT_TOKENS
]
PROMPT_STRINGS_LEN = [len(prompt_string) for prompt_string in PROMPT_STRINGS]
GENERATION_STRINGS = [
@@ -40,34 +68,91 @@
class MockEngineCore:
"""Mock outputs form premade tokens lists."""

def __init__(self, tokens_list: List[List[int]]):
self.tokens_list = tokens_list
def __init__(
self,
generated_tokens_list: List[List[int]],
prompt_tokens_list: List[List[int]],
generated_logprobs_raw: Optional[List[List[Tuple[torch.Tensor,
torch.Tensor]]]],
prompt_logprobs_raw: Optional[List[Tuple[torch.Tensor, torch.Tensor]]],
) -> None:
self.generated_tokens_list = generated_tokens_list
self.prompt_tokens_list = prompt_tokens_list
self.current_idx = 0
self.generated_logprobs_raw = generated_logprobs_raw
self.do_logprobs = generated_logprobs_raw is not None
self.prompt_logprobs_raw = prompt_logprobs_raw
self.do_prompt_logprobs = prompt_logprobs_raw is not None

def get_outputs(self) -> List[EngineCoreOutput]:
do_logprobs = self.do_logprobs
do_prompt_logprobs = self.do_prompt_logprobs
token_idx = self.current_idx
self.current_idx += 1

outputs = []
for req_idx, token_ids in enumerate(self.tokens_list):
if len(token_ids) > token_idx:
output = EngineCoreOutput(request_id=f"request-{req_idx}",
new_token_ids=[token_ids[token_idx]],
finished=False)
if token_idx == len(token_ids) - 1:
for req_idx, generated_token_ids in enumerate(
self.generated_tokens_list):
if len(generated_token_ids) > token_idx:
if do_logprobs:
assert self.generated_logprobs_raw is not None
(logprobs, logprobs_token_ids) = (
self.generated_logprobs_raw[req_idx][token_idx])
logprobs = [logprobs]
logprobs_token_ids = [logprobs_token_ids]
else:
logprobs = None
logprobs_token_ids = None
if do_prompt_logprobs:
if self.current_idx == 0:
assert self.prompt_logprobs_raw is not None
prompt_logprobs = self.prompt_logprobs_raw[req_idx][0]
prompt_logprobs_token_ids = self.prompt_logprobs_raw[
req_idx][1]
else:
(prompt_logprobs,
prompt_logprobs_token_ids) = (torch.empty(0, 0),
torch.empty(0, 0))
else:
(prompt_logprobs, prompt_logprobs_token_ids) = (None, None)
output = EngineCoreOutput(
request_id=f"request-{req_idx}",
new_token_ids=[generated_token_ids[token_idx]],
finished=False,
logprobs=logprobs,
logprobs_token_ids=logprobs_token_ids,
prompt_logprobs=prompt_logprobs,
prompt_logprobs_token_ids=prompt_logprobs_token_ids,
)
if token_idx == len(generated_token_ids) - 1:
output.finished = True
output.finish_reason = "stopped"
outputs.append(output)

self.current_idx += 1
return outputs


@pytest.mark.parametrize(
"request_output_kind",
[RequestOutputKind.DELTA, RequestOutputKind.FINAL_ONLY])
def test_incremental_detokenization(request_output_kind: RequestOutputKind):
@pytest.mark.parametrize("logprobs,prompt_logprobs",
[(None, None), (NUM_SAMPLE_LOGPROBS, None),
(None, NUM_PROMPT_LOGPROBS),
(NUM_SAMPLE_LOGPROBS, NUM_PROMPT_LOGPROBS)])
def test_incremental_detokenization(
request_output_kind: RequestOutputKind,
logprobs: Optional[int],
prompt_logprobs: Optional[int],
) -> None:
do_generated_logprobs = logprobs is not None
do_prompt_logprobs = prompt_logprobs is not None
detokenizer = Detokenizer(TOKENIZER_NAME)
engine_core = MockEngineCore(GENERATION_TOKENS)
engine_core = MockEngineCore(
generated_tokens_list=GENERATION_TOKENS,
prompt_tokens_list=PROMPT_TOKENS,
generated_logprobs_raw=GENERATION_LOGPROBS
if do_generated_logprobs else None,
prompt_logprobs_raw=PROMPT_LOGPROBS if do_prompt_logprobs else None)

# Make N requests.
requests = [
@@ -85,7 +170,9 @@ def test_incremental_detokenization(request_output_kind: RequestOutputKind):
spaces_between_special_tokens=False,
output_kind=request_output_kind,
stop=[],
include_stop_str_in_output=False))
include_stop_str_in_output=False,
logprobs=logprobs,
prompt_logprobs=prompt_logprobs))
for idx, (
prompt,
prompt_tokens) in enumerate(zip(PROMPT_STRINGS, PROMPT_TOKENS))
@@ -107,6 +194,9 @@ def test_incremental_detokenization(request_output_kind: RequestOutputKind):
request_outputs, requests_to_abort = detokenizer.step(outputs)
assert len(requests_to_abort) == 0

# Validate logprob detokenization
validate_requests_logprobs(requests, request_outputs, tokenizer)

# Update tracking.
for request_output in request_outputs:
request_id = request_output.request_id
@@ -133,9 +223,24 @@ def test_incremental_detokenization(request_output_kind: RequestOutputKind):


@pytest.mark.parametrize("include_stop_str_in_output", [True, False])
def test_stop_string(include_stop_str_in_output: bool):
@pytest.mark.parametrize("logprobs,prompt_logprobs",
[(None, None), (NUM_SAMPLE_LOGPROBS, None),
(None, NUM_PROMPT_LOGPROBS),
(NUM_SAMPLE_LOGPROBS, NUM_PROMPT_LOGPROBS)])
def test_stop_string(
include_stop_str_in_output: bool,
logprobs: Optional[int],
prompt_logprobs: Optional[int],
) -> None:
do_generated_logprobs = logprobs is not None
do_prompt_logprobs = prompt_logprobs is not None
detokenizer = Detokenizer(TOKENIZER_NAME)
engine_core = MockEngineCore(GENERATION_TOKENS)
engine_core = MockEngineCore(
generated_tokens_list=GENERATION_TOKENS,
prompt_tokens_list=PROMPT_TOKENS,
generated_logprobs_raw=GENERATION_LOGPROBS
if do_generated_logprobs else None,
prompt_logprobs_raw=PROMPT_LOGPROBS if do_prompt_logprobs else None)

# Make N requests.
requests = [
@@ -155,6 +260,8 @@ def test_stop_string(include_stop_str_in_output: bool):
output_kind=RequestOutputKind.DELTA,
stop=STOP_STRINGS,
include_stop_str_in_output=include_stop_str_in_output,
logprobs=logprobs,
prompt_logprobs=prompt_logprobs,
)) for idx, (
prompt,
prompt_tokens) in enumerate(zip(PROMPT_STRINGS, PROMPT_TOKENS))
@@ -166,6 +273,7 @@ def test_stop_string(include_stop_str_in_output: bool):

gen_strings = {}
aborted = []
i = 0
while True:
# Mock output from the EngineCore.
outputs = engine_core.get_outputs()
@@ -179,6 +287,9 @@ def test_stop_string(include_stop_str_in_output: bool):
assert request_output.request_id not in aborted
aborted.extend(requests_to_abort)

# Validate logprob detokenization
validate_requests_logprobs(requests, request_outputs, tokenizer)

# Update tracking.
for request_output in request_outputs:
if request_output.finished:
@@ -190,6 +301,7 @@ def test_stop_string(include_stop_str_in_output: bool):
gen_strings[request_id] = new_text
else:
gen_strings[request_id] += new_text
i += 1

# Confirmed tracked values matches what we expected.
for idx, (ref_gen_str,
(remaining diff lines collapsed)
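
The test imports generate_dummy_prompt_logprobs, generate_dummy_sample_logprobs, and validate_requests_logprobs from tests/v1/engine/utils, which is not shown in this diff. The sketch below is a guess at the shape of the first two helpers, inferred only from the call sites and type hints above; the actual implementations in the PR may differ.

# Hypothetical sketch of the tests/v1/engine/utils helpers, inferred from
# their call sites and type hints in test_detokenizer.py. Not the PR's
# actual implementation.
from typing import List, Tuple

import torch


def generate_dummy_sample_logprobs(
        sampled_tokens_list: List[int], num_logprobs: int,
        tokenizer) -> List[Tuple[torch.Tensor, torch.Tensor]]:
    """Return one (logprob values, token ids) pair per generated token."""
    dummy = []
    for _ in sampled_tokens_list:
        token_ids = torch.randint(0, tokenizer.vocab_size, (num_logprobs, ))
        logprobs = -torch.rand(num_logprobs)  # dummy logprobs in (-1, 0]
        dummy.append((logprobs, token_ids))
    return dummy


def generate_dummy_prompt_logprobs(
        prompt_tokens_list: List[int], num_logprobs: int,
        tokenizer) -> Tuple[torch.Tensor, torch.Tensor]:
    """Return (num_prompt_tokens x num_logprobs) logprob and token id tensors."""
    num_tokens = len(prompt_tokens_list)
    token_ids = torch.randint(0, tokenizer.vocab_size,
                              (num_tokens, num_logprobs))
    logprobs = -torch.rand(num_tokens, num_logprobs)
    return (logprobs, token_ids)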