
Performance Regression in Whisper models when timestamp generation is enabled #1783

Open
MahmoudAshraf97 opened this issue Sep 18, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@MahmoudAshraf97
Contributor

Hello,
Several reports mention that WER improves greatly when `<|notimestamps|>` is added to the initial prompt in Whisper decoding, i.e. when timestamp generation is disabled. I tested this using This and This. You can check mobiusml/faster-whisper#18 (comment) for an example of the decoding difference using the same encoder output.
There are several other reports, including but not limited to:
There are several other reports including but not limited to:
SYSTRAN/faster-whisper#1010
SYSTRAN/faster-whisper#985

Also, generation with timestamps has a lower tokens/s throughput, and the slowdown grows as the batch size increases.
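To quantify that claim, here is a minimal, model-agnostic harness sketch. The `tokens_per_second` and `dummy_generate` names are my own (not part of CT2 or faster-whisper); any batched generate callable returning one token-id sequence per prompt can be plugged in, e.g. a wrapper around `model.model.generate` with and without `<|notimestamps|>`, to see how the gap widens with batch size:

```python
import time

def tokens_per_second(generate_fn, prompt, batch_sizes, repeats=3):
    """Measure decoding throughput (generated tokens / wall time) per batch size.

    generate_fn(prompts) must return a list of token-id sequences,
    one per prompt. Any model wrapper can be plugged in here.
    """
    results = {}
    for bs in batch_sizes:
        prompts = [prompt] * bs
        start = time.perf_counter()
        total_tokens = 0
        for _ in range(repeats):
            sequences = generate_fn(prompts)
            total_tokens += sum(len(seq) for seq in sequences)
        elapsed = time.perf_counter() - start
        results[bs] = total_tokens / elapsed
    return results

# Dummy stand-in so the harness itself runs without a model:
def dummy_generate(prompts):
    return [[0] * 10 for _ in prompts]

print(tokens_per_second(dummy_generate, [50258], batch_sizes=[1, 4, 8]))
```

Running it once per prompt variant (with and without the no-timestamps token) at each batch size gives directly comparable tokens/s numbers.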

On a side note, we have several PRs waiting for @trungkienbkhn's review, but he seems to be out of office; it would be great if one of his colleagues could share any information on when he might return.

@minhthuc2502 minhthuc2502 added the enhancement New feature or request label Sep 20, 2024
@x86Gr

x86Gr commented Sep 23, 2024

FYI, when using `without_timestamps=True` on faster-whisper 1.0.3 I get faster decoding, but with a lot of skipped sentences.

@MahmoudAshraf97
Contributor Author

I tested this completely independently from faster-whisper to make sure it's purely related to CT2. FW v1.0.3 uses a batch size of 1, so you shouldn't notice slowdowns related to this option, as they only start to appear at larger batch sizes; the missing sentences are probably caused by something else.

@ozancaglayan
Contributor

What's the verdict on this? Can you provide a WAV file and outputs which manifest this issue?

@MahmoudAshraf97
Contributor Author

This is the code to reproduce the problem with this file.
faster-whisper commit 203dddb047fd2c3ed2a520fe1416467a527e0f37 is used; it should be irrelevant here, but is mentioned for complete reproducibility.

```python
from faster_whisper import WhisperModel, decode_audio
from faster_whisper.vad import VadOptions, get_speech_timestamps, collect_chunks, merge_segments
from faster_whisper.transcribe import pad_or_trim, get_ctranslate2_storage
import torch

audio = decode_audio("tests/data/physicsworks.wav")

# Segment the audio into <=30 s chunks with VAD instead of Whisper's
# sequential algorithm, so both decoding runs see identical encoder inputs.
vad_options = VadOptions(min_silence_duration_ms=160, max_speech_duration_s=30)
vad_chunks = get_speech_timestamps(audio, vad_options=vad_options)
clip_timestamps = merge_segments(vad_chunks, vad_options)
audio_chunks, chunk_metadata = collect_chunks(audio, clip_timestamps)

model = WhisperModel("large-v2")

features = torch.stack(
    [
        pad_or_trim(
            model.feature_extractor(chunk)[
                ...,
                : chunk.shape[0] // model.feature_extractor.hop_length,
            ]
        )
        for chunk in audio_chunks
    ]
)

prompt_text = "<|startoftranscript|><|en|><|transcribe|>"
no_ts_token_text = "<|notimestamps|>"
prompt_tokens = model.hf_tokenizer.encode(prompt_text, add_special_tokens=False).ids
no_ts_token = model.hf_tokenizer.encode(no_ts_token_text, add_special_tokens=False).ids
eot_token = model.hf_tokenizer.encode("<|endoftext|>", add_special_tokens=False).ids[0]

# Same encoder output, two decoding runs: with and without timestamps.
encoder_output = model.encode(features)
generation_results_with_ts = model.model.generate(encoder_output, [prompt_tokens] * len(features))
generation_results_without_ts = model.model.generate(encoder_output, [prompt_tokens + no_ts_token] * len(features))

for result in generation_results_with_ts:
    # Special tokens (including timestamps) have ids >= eot_token.
    tokens = [token for token in result.sequences_ids[0] if token < eot_token]
    print(model.hf_tokenizer.decode(tokens))

#  Now I want to return to the conservation of mechanical energy. I have here a pendulum. I have an object that weighs 15 kilograms
#  that will be converted to kinetic energy. If I would let it swing from one meter height and you would be there and it would hit you, you'd be dead. 150 joules is enough to kill you.
#  You let it go, you swing it, thereby converting gravitational potential energy into kinetic energy and that way you can demolish a building. You just let it hit and it breaks a building
#  I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height
#  and it swings, then when it reaches here it could not be higher. There is a conversion from gravitational potential energy to kinetic energy back to gravitational potential energy
#  For 100%, I may not trust myself. I'm going to release this object and I hope I will be able to do it at zero speed so that when it comes back, it may touch my chin
#  I will close my eyes. I don't want to see this. So please be very quiet. I almost didn't sleep all night. Three, two, one, zero.


for result in generation_results_without_ts:
    print(model.hf_tokenizer.decode(result.sequences_ids[0]))

#  Now I want to return to the conservation of mechanical energy. I have here a pendulum. I have an object that weighs 15 kilograms and I can lift it up one meter, which I have done now. That means I've done work. Mgh is the work I have done, believe me. I've increased the potential energy of this object. 15 times 10 is about 150 joules.
#  that will be converted to kinetic energy. If I would let it swing from one meter height and you would be there and it would hit you, you'd be dead. 150 joules is enough to kill you. They use these devices called a wrecker ball. They use them to demolish buildings. You lift up a very heavy object, even heavier than this
#  You let it go, you swing it, thereby converting gravitational potential energy into kinetic energy and that way you can demolish a building. You just let it hit... and it breaks a building. And that's the whole idea of wrecking. So you're using, then, the conversion of gravitational potential energy to kinetic energy.
#  I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height then that bulb can never come back to a point where the height is any larger.
#  and it swings, then when it reaches here, it could not be higher. There is a conversion from gravitational potential energy to kinetic energy, back to gravitational potential energy, and it will come to a stop here. And when it swings back, it should not be able to reach any higher, provided that I do not give this object an initial speed when I stand here.
#  For 100%, I may not trust myself. I'm going to release this object and I hope I will be able to do it at zero speed, so that when it comes back it may touch my chin, but it may not crush my chin. I want you to be extremely quiet, because this is no joke. If I don't succeed in giving it zero speed, then this will be my last lecture.
#  I will close my eyes. I don't want to see this. So please be very quiet. I almost didn't sleep all night. Three, two, one, zero. Physics works, and I'm still alive.


import timeit

# Time each decoding variant and normalize by the number of generated tokens.
time_ts = timeit.timeit(
    "model.model.generate(encoder_output, [prompt_tokens] * len(features))",
    globals=globals(),
    number=10,
)
num_tokens_ts = sum(
    len(result.sequences_ids[0]) for result in generation_results_with_ts
)
time_no_ts = timeit.timeit(
    "model.model.generate(encoder_output, [prompt_tokens + no_ts_token] * len(features))",
    globals=globals(),
    number=10,
)
num_tokens_no_ts = sum(
    len(result.sequences_ids[0]) for result in generation_results_without_ts
)

print(f"Speed with timestamps: {(num_tokens_ts / time_ts) * 10:.2f} tokens/s")
# Speed with timestamps: 196.96 tokens/s

print(f"Speed without timestamps: {(num_tokens_no_ts / time_no_ts) * 10:.2f} tokens/s")
# Speed without timestamps: 302.72 tokens/s
```

VAD is used to segment the audio into 30 s segments. For accurate reproduction we cannot use Whisper's sequential algorithm, because its chunking relies directly on the generated timestamps; if we disabled them, the encoded segments would differ between the two runs.

@ozancaglayan
Contributor

Thanks. What do you mean by the sequential algorithm of Whisper?

@MahmoudAshraf97
Copy link
Contributor Author

The sequential long-form transcription algorithm: it uses the last generated timestamp token to shift the window forward; if no timestamps were generated, it advances the current window by 30 s.
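Roughly, that seek logic can be sketched as follows. This is a simplified stand-in, not faster-whisper's actual code; `next_seek` is my own name, and the constants mirror the multilingual tokenizer (timestamp tokens occupy a contiguous range starting at `<|0.00|>`, each step worth 20 ms), which is an assumption to verify against the vocab in use:

```python
# Illustrative sketch of Whisper's sequential long-form seek update.
TIMESTAMP_BEGIN = 50364   # assumed id of <|0.00|> in the multilingual vocab
SECONDS_PER_TS = 0.02     # each timestamp token step corresponds to 20 ms
WINDOW_S = 30.0

def next_seek(seek_s, generated_tokens):
    """Advance the decoding window after one 30 s segment."""
    timestamps = [t for t in generated_tokens if t >= TIMESTAMP_BEGIN]
    if timestamps:
        # Shift to the last predicted timestamp inside this window.
        return seek_s + (timestamps[-1] - TIMESTAMP_BEGIN) * SECONDS_PER_TS
    # No timestamps generated (e.g. <|notimestamps|>): skip a full window.
    return seek_s + WINDOW_S

print(next_seek(0.0, [50257]))        # no timestamp tokens -> advance 30 s
print(next_seek(0.0, [50364 + 500]))  # last timestamp at 10.0 s -> seek 10.0
```

This is why disabling timestamps changes the chunking itself, not just the decoded text: without timestamp tokens the window always jumps a full 30 s.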
