[Model] Whisper model implementation #11280
@@ -0,0 +1,59 @@
import time

from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

# Create a Whisper encoder/decoder model instance
llm = LLM(
    model="openai/whisper-large-v3",
    max_model_len=448,
    max_num_seqs=400,
    limit_mm_per_prompt={"audio": 1},
    kv_cache_dtype="fp8",
)

prompts = [
    {
        "prompt": "<|startoftranscript|>",
        "multi_modal_data": {
            "audio": AudioAsset("mary_had_lamb").audio_and_sample_rate,
        },
    },
    {  # Test explicit encoder/decoder prompt
        "encoder_prompt": {
            "prompt": "",
            "multi_modal_data": {
                "audio": AudioAsset("winning_call").audio_and_sample_rate,
            },
        },
        "decoder_prompt": "<|startoftranscript|>",
    }
] * 1024

# Create a sampling params object.
sampling_params = SamplingParams(
    temperature=0,
    top_p=1.0,
    max_tokens=200,
)

start = time.time()

# Generate output tokens from the prompts. The output is a list of
# RequestOutput objects that contain the prompt, generated
# text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    encoder_prompt = output.encoder_prompt
    generated_text = output.outputs[0].text
    print(f"Encoder prompt: {encoder_prompt!r}, "
          f"Decoder prompt: {prompt!r}, "
          f"Generated text: {generated_text!r}")

duration = time.time() - start

print("Duration:", duration)
print("RPS:", len(prompts) / duration)
@@ -0,0 +1,136 @@
"""Compare the outputs of HF and vLLM for Whisper models using greedy sampling.

Run `pytest tests/models/encoder_decoder/audio/test_whisper.py`.
"""
from typing import Optional

import pytest

from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

from ....utils import fork_new_process_for_each_test, multi_gpu_test

PROMPTS = [
    {
        "prompt":
        "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>",
        "multi_modal_data": {
            "audio": AudioAsset("mary_had_lamb").audio_and_sample_rate,
        },
    },
    {  # Test explicit encoder/decoder prompt
        "encoder_prompt": {
            "prompt": "",
            "multi_modal_data": {
                "audio": AudioAsset("winning_call").audio_and_sample_rate,
            },
        },
        "decoder_prompt":
        "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>",
    }
]

EXPECTED = {
    "openai/whisper-tiny": [
        " He has birth words I spoke in the original corner of that. And a"
        " little piece of black coat poetry. Mary had a little sandwich,"
        " sweet, with white and snow. And everyone had it very went the last"
        " would sure to go.",
        " >> And the old one, fit John the way to Edgar Martinez. >> One more"
        " to line down the field line for our base camp. Here comes joy. Here"
        " is June and the third base. They're going to wave him in. The throw"
        " to the plate will be late. The Mariners are going to play for the"
        " American League Championship. I don't believe it. It just continues"
        " by all five."
    ],
    "openai/whisper-small": [
        " The first words I spoke in the original pornograph. A little piece"
        " of practical poetry. Mary had a little lamb, its fleece was quite a"
        " slow, and everywhere that Mary went the lamb was sure to go.",
        " And the old one pitch on the way to Edgar Martinez one month. Here"
        " comes joy. Here is Junior to third base. They're gonna wave him"
        " in. The throw to the plate will be late. The Mariners are going to"
        " play for the American League Championship. I don't believe it. It"
        " just continues. My, oh my."
    ],
    "openai/whisper-medium": [
        " The first words I spoke in the original phonograph, a little piece"
        " of practical poetry. Mary had a little lamb, its fleece was quite as"
        " slow, and everywhere that Mary went the lamb was sure to go.",
        " And the 0-1 pitch on the way to Edgar Martinez swung on the line"
        " down the left field line for Obeyshev. Here comes Joy. Here is"
        " Jorgen at third base. They're going to wave him in. The throw to the"
        " plate will be late. The Mariners are going to play for the American"
        " League Championship. I don't believe it. It just continues. My, oh"
        " my."
    ],
    "openai/whisper-large-v3": [
        " The first words I spoke in the original phonograph, a little piece"
        " of practical poetry. Mary had a little lamb, its feet were quite as"
        " slow, and everywhere that Mary went, the lamb was sure to go.",
        " And the 0-1 pitch on the way to Edgar Martinez. Swung on the line."
        " Now the left field line for a base hit. Here comes Joy. Here is"
        " Junior to third base. They're going to wave him in. The throw to the"
        " plate will be late. The Mariners are going to play for the American"
        " League Championship. I don't believe it. It just continues. My, oh,"
        " my."
    ],
    "openai/whisper-large-v3-turbo": [
        " The first words I spoke in the original phonograph, a little piece"
        " of practical poetry. Mary had a little lamb, its streets were quite"
        " as slow, and everywhere that Mary went the lamb was sure to go.",
        " And the 0-1 pitch on the way to Edgar Martinez. Swung on the line"
        " down the left field line for a base hit. Here comes Joy. Here is"
        " Junior to third base. They're going to wave him in. The throw to the"
        " plate will be late. The Mariners are going to play for the American"
        " League Championship. I don't believe it. It just continues. My, oh,"
        " my."
    ]
}


def run_test(
    model: str,
    *,
    tensor_parallel_size: int,
    distributed_executor_backend: Optional[str] = None,
) -> None:
    prompt_list = PROMPTS * 10
    expected_list = EXPECTED[model] * 10

    llm = LLM(
        model=model,
        tensor_parallel_size=tensor_parallel_size,
        distributed_executor_backend=distributed_executor_backend,
    )

    sampling_params = SamplingParams(
        temperature=0,
        top_p=1.0,
        max_tokens=200,
    )

    outputs = llm.generate(prompt_list, sampling_params)

    for output, expected in zip(outputs, expected_list):
        print(output.outputs[0].text)
        assert output.outputs[0].text == expected


@fork_new_process_for_each_test
@pytest.mark.core_model
@pytest.mark.parametrize(
    "model", ["openai/whisper-small", "openai/whisper-large-v3-turbo"])
def test_models(model) -> None:
    run_test(model, tensor_parallel_size=1)


@multi_gpu_test(num_gpus=2)
@pytest.mark.core_model
@pytest.mark.parametrize("model", ["openai/whisper-large-v3-turbo"])
@pytest.mark.parametrize("distributed_executor_backend", ["ray", "mp"])
def test_models_distributed(model, distributed_executor_backend) -> None:
    run_test(model,
             tensor_parallel_size=2,
             distributed_executor_backend=distributed_executor_backend)
@@ -184,10 +184,16 @@ def _tokenize_prompt(
         corresponding token IDs.
         """
         tokenizer = self.get_tokenizer_group()

+        add_special_tokens = None
+        if self.model_config.hf_config.model_type == "whisper":
+            # For Whisper, special tokens should be provided by the user based
+            # on the task and language of their request. Also needed to avoid
+            # appending an EOS token to the prompt which disrupts generation.
+            add_special_tokens = False
         return tokenizer.encode(request_id=request_id,
                                 prompt=prompt,
-                                lora_request=lora_request)
+                                lora_request=lora_request,
+                                add_special_tokens=add_special_tokens)

     async def _tokenize_prompt_async(
         self,
@@ -197,10 +203,17 @@ async def _tokenize_prompt_async(
     ) -> List[int]:
         """Async version of :meth:`_tokenize_prompt`."""
         tokenizer = self.get_tokenizer_group()

-        return await tokenizer.encode_async(request_id=request_id,
-                                            prompt=prompt,
-                                            lora_request=lora_request)
+        add_special_tokens = None
+        if self.model_config.hf_config.model_type == "whisper":
+            # For Whisper, special tokens should be provided by the user based
+            # on the task and language of their request. Also needed to avoid
+            # appending an EOS token to the prompt which disrupts generation.
+            add_special_tokens = False
+        return await tokenizer.encode_async(
+            request_id=request_id,
+            prompt=prompt,
+            lora_request=lora_request,
+            add_special_tokens=add_special_tokens)

     def _can_process_multimodal(self) -> bool:
         model_config = self.model_config
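To illustrate what the add_special_tokens=False path in the two hunks above means for Whisper, here is a small sketch that calls the Hugging Face tokenizer directly; it is not code from this PR, and it assumes the openai/whisper-large-v3 checkpoint.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai/whisper-large-v3")
prompt = "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>"

# With add_special_tokens=False the tokenizer keeps the user-supplied control
# tokens as-is and does not prepend its own prefix tokens or append
# <|endoftext|>, which is what the change above relies on.
ids = tok(prompt, add_special_tokens=False).input_ids
print(tok.decode(ids))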
@@ -439,8 +452,15 @@ def _build_enc_dec_llm_inputs(
             assert_never(encoder_inputs)  # type: ignore[arg-type]

         if decoder_inputs is None:
-            dec_token_ids = self._prepare_decoder_input_ids_for_generation(
-                None)
+            if self.model_config.hf_config.model_type == "whisper":
+                # For Whisper models, the text prompt should go to the decoder.
+                # If no explicit encoder/decoder inputs, then copy the prompt
+                # from the encoder to the decoder. The encoder tokens are later
+                # overridden by the audio features.
+                dec_token_ids = encoder_inputs["prompt_token_ids"].copy()
+            else:
+                dec_token_ids = self._prepare_decoder_input_ids_for_generation(
+                    None)
Comment on lines +455 to +463

Is there a way to determine this without model type information?

I am not sure about generalizing this from a single example. In the long term it may be better to allow the model definition to specify exactly the mapping between input fields and where they go (e.g. encoder/decoder).

Agreed with @aurickq, long-term it is probably best to either (1) have the model definition specify whether to map the input text prompt to the encoder, or (2) add a default behavior only for multi-modal models with cross-attention, wherein the text prompt is always routed to the decoder and the non-text modality is always mapped to the encoder. (I worked on adding encoder/decoder cross-attention support to v0.)
             decoder_inputs = token_inputs(dec_token_ids)
         elif (decoder_inputs["type"] == "token"
               or decoder_inputs["type"] == "multimodal"):
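To make option (1) from the thread above concrete, the model definition could expose a class-level flag that the preprocessor consults instead of hard-coding a model_type check. This is only a sketch: the attribute name text_prompt_goes_to_decoder and the helper below are hypothetical and not part of vLLM.

from typing import Callable, List


class WhisperForConditionalGeneration:
    # Hypothetical attribute: the model definition declares that text prompts
    # belong to the decoder, while the encoder sequence is filled from the
    # audio features.
    text_prompt_goes_to_decoder = True


def resolve_decoder_token_ids(
    model_cls: type,
    encoder_prompt_token_ids: List[int],
    default_decoder_ids: Callable[[], List[int]],
) -> List[int]:
    # Route based on the model definition rather than hf_config.model_type.
    if getattr(model_cls, "text_prompt_goes_to_decoder", False):
        return list(encoder_prompt_token_ids)
    return default_decoder_ids()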
Can we remove the EOS token in the input processor? But maybe we can improve this later via the merged multi-modal processor, since I assume HF can handle this automatically.
Seems that <|startoftranscript|> is also a special token which will be added by the tokenizer by default.
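(The snippet from the original comment was not preserved; a minimal sketch that reproduces the behavior, assuming the openai/whisper-small checkpoint and the tokenizer's default settings, would be:)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/whisper-small")
# add_special_tokens defaults to True, so the tokenizer adds its own prefix
# tokens and an EOS token around the user-provided prompt.
ids = tokenizer("<|startoftranscript|>").input_ids
print(tokenizer.decode(ids))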
It outputs '<|startoftranscript|><|notimestamps|><|startoftranscript|><|endoftext|>', which means that we need to remove a "<|startoftranscript|>" to get the original prompt back as well. Perhaps we need to construct user prompts with non-special tokens.
Yes, there are other special tokens that are added as well. There are also some other challenges:

1. <|notimestamps|> is 50363 for whisper-small but 50364 for whisper-large-v3. This makes it awkward to try to detect these special tokens from inside the input processor.
2. There are two cases to handle:
   a. The user passed in text (prompt), which is tokenized in vLLM; here we need to strip special tokens.
   b. The user passed in tokens (prompt_token_ids), which should not be stripped.

Basically, I'm not sure how to reliably remove special tokens after the prompt has already been encoded. It seems error-prone and brittle to third-party behavior in Hugging Face. I understand that having the tokenizer not add special tokens in the first place involves more changes inside vLLM, but IMO it's the most reliable solution and the most robust to future changes in vLLM and Hugging Face.
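A quick way to see the differing IDs mentioned above is to query the tokenizers directly; this is just a verification sketch, with the IDs as reported in the comment.

from transformers import AutoTokenizer

# The same control token maps to different IDs across Whisper checkpoints,
# which is why detecting special tokens by ID inside the input processor is
# brittle (reported above: 50363 for whisper-small, 50364 for whisper-large-v3).
for name in ("openai/whisper-small", "openai/whisper-large-v3"):
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.convert_tokens_to_ids("<|notimestamps|>"))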