
Performance Regression in Whisper models when timestamp generation is enabled #1783

Open
MahmoudAshraf97 opened this issue Sep 18, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@MahmoudAshraf97
Contributor

Hello,
Several reports mention that WER improves greatly when `<|notimestamps|>` is added to the initial prompt in Whisper decoding, i.e. when timestamp generation is disabled. I tested this using This and This. You can check mobiusml/faster-whisper#18 (comment) for an example of the decoding difference using the same encoder output.
There are several other reports, including but not limited to:
There are several other reports including but not limited to:
SYSTRAN/faster-whisper#1010
SYSTRAN/faster-whisper#985

Also, generation with timestamps has a lower tokens/s throughput, and the slowdown grows as the batch size increases.
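To quantify that claim, here is a minimal, model-agnostic harness sketch. The `tokens_per_second` and `dummy_generate` names are my own (not part of CT2 or faster-whisper); any batched generate callable returning one token-id sequence per prompt can be plugged in, e.g. a wrapper around `model.model.generate` with and without `<|notimestamps|>`, to see how the gap widens with batch size:

```python
import time

def tokens_per_second(generate_fn, prompt, batch_sizes, repeats=3):
    """Measure decoding throughput (generated tokens / wall time) per batch size.

    generate_fn(prompts) must return a list of token-id sequences,
    one per prompt. Any model wrapper can be plugged in here.
    """
    results = {}
    for bs in batch_sizes:
        prompts = [prompt] * bs
        start = time.perf_counter()
        total_tokens = 0
        for _ in range(repeats):
            sequences = generate_fn(prompts)
            total_tokens += sum(len(seq) for seq in sequences)
        elapsed = time.perf_counter() - start
        results[bs] = total_tokens / elapsed
    return results

# Dummy stand-in so the harness itself runs without a model:
def dummy_generate(prompts):
    return [[0] * 10 for _ in prompts]

print(tokens_per_second(dummy_generate, [50258], batch_sizes=[1, 4, 8]))
```

Running it once per prompt variant (with and without the no-timestamps token) at each batch size gives directly comparable tokens/s numbers.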

On a side note, we have several PRs waiting for @trungkienbkhn's review, but he seems to be out of office; it would be great if one of his colleagues could share any information on when he might return.

@minhthuc2502 minhthuc2502 added the enhancement New feature or request label Sep 20, 2024
@x86Gr

x86Gr commented Sep 23, 2024

FYI, when using `without_timestamps=True` on faster-whisper 1.0.3 I get faster decoding, but with a lot of skipped sentences.

@MahmoudAshraf97
Contributor Author

I tested this completely independently from faster-whisper to make sure it's purely related to CT2. FW v1.0.3 uses a batch size of 1, so you shouldn't notice slowdowns related to this option, as they only start to appear at larger batch sizes; the missing sentences are probably caused by something else.

@ozancaglayan
Contributor

What's the verdict on this? Can you provide a WAV file and outputs which manifest this issue?

@MahmoudAshraf97
Contributor Author

This is the code to reproduce the problem with this file.
faster-whisper commit 203dddb047fd2c3ed2a520fe1416467a527e0f37 is used; it should be irrelevant here, but is mentioned for complete reproducibility.

```python
from faster_whisper import WhisperModel, decode_audio
from faster_whisper.vad import VadOptions, get_speech_timestamps, collect_chunks, merge_segments
from faster_whisper.transcribe import pad_or_trim, get_ctranslate2_storage
import torch

audio = decode_audio("tests/data/physicsworks.wav")

# Segment the audio into <=30 s chunks with VAD instead of Whisper's
# sequential algorithm, so both decoding runs see identical encoder inputs.
vad_options = VadOptions(min_silence_duration_ms=160, max_speech_duration_s=30)
vad_chunks = get_speech_timestamps(audio, vad_options=vad_options)
clip_timestamps = merge_segments(vad_chunks, vad_options)
audio_chunks, chunk_metadata = collect_chunks(audio, clip_timestamps)

model = WhisperModel("large-v2")

features = torch.stack(
    [
        pad_or_trim(
            model.feature_extractor(chunk)[
                ...,
                : chunk.shape[0] // model.feature_extractor.hop_length,
            ]
        )
        for chunk in audio_chunks
    ]
)

prompt_text = "<|startoftranscript|><|en|><|transcribe|>"
no_ts_token_text = "<|notimestamps|>"
prompt_tokens = model.hf_tokenizer.encode(prompt_text, add_special_tokens=False).ids
no_ts_token = model.hf_tokenizer.encode(no_ts_token_text, add_special_tokens=False).ids
eot_token = model.hf_tokenizer.encode("<|endoftext|>", add_special_tokens=False).ids[0]

# Same encoder output, two decoding runs: with and without timestamps.
encoder_output = model.encode(features)
generation_results_with_ts = model.model.generate(encoder_output, [prompt_tokens] * len(features))
generation_results_without_ts = model.model.generate(encoder_output, [prompt_tokens + no_ts_token] * len(features))

for result in generation_results_with_ts:
    # Special tokens (including timestamps) have ids >= eot_token.
    tokens = [token for token in result.sequences_ids[0] if token < eot_token]
    print(model.hf_tokenizer.decode(tokens))

#  Now I want to return to the conservation of mechanical energy. I have here a pendulum. I have an object that weighs 15 kilograms
#  that will be converted to kinetic energy. If I would let it swing from one meter height and you would be there and it would hit you, you'd be dead. 150 joules is enough to kill you.
#  You let it go, you swing it, thereby converting gravitational potential energy into kinetic energy and that way you can demolish a building. You just let it hit and it breaks a building
#  I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height
#  and it swings, then when it reaches here it could not be higher. There is a conversion from gravitational potential energy to kinetic energy back to gravitational potential energy
#  For 100%, I may not trust myself. I'm going to release this object and I hope I will be able to do it at zero speed so that when it comes back, it may touch my chin
#  I will close my eyes. I don't want to see this. So please be very quiet. I almost didn't sleep all night. Three, two, one, zero.


for result in generation_results_without_ts:
    print(model.hf_tokenizer.decode(result.sequences_ids[0]))

#  Now I want to return to the conservation of mechanical energy. I have here a pendulum. I have an object that weighs 15 kilograms and I can lift it up one meter, which I have done now. That means I've done work. Mgh is the work I have done, believe me. I've increased the potential energy of this object. 15 times 10 is about 150 joules.
#  that will be converted to kinetic energy. If I would let it swing from one meter height and you would be there and it would hit you, you'd be dead. 150 joules is enough to kill you. They use these devices called a wrecker ball. They use them to demolish buildings. You lift up a very heavy object, even heavier than this
#  You let it go, you swing it, thereby converting gravitational potential energy into kinetic energy and that way you can demolish a building. You just let it hit... and it breaks a building. And that's the whole idea of wrecking. So you're using, then, the conversion of gravitational potential energy to kinetic energy.
#  I am such a strong believer of the conservation of mechanical energy that I am willing to put my life on the line. If I release that bulb from a certain height then that bulb can never come back to a point where the height is any larger.
#  and it swings, then when it reaches here, it could not be higher. There is a conversion from gravitational potential energy to kinetic energy, back to gravitational potential energy, and it will come to a stop here. And when it swings back, it should not be able to reach any higher, provided that I do not give this object an initial speed when I stand here.
#  For 100%, I may not trust myself. I'm going to release this object and I hope I will be able to do it at zero speed, so that when it comes back it may touch my chin, but it may not crush my chin. I want you to be extremely quiet, because this is no joke. If I don't succeed in giving it zero speed, then this will be my last lecture.
#  I will close my eyes. I don't want to see this. So please be very quiet. I almost didn't sleep all night. Three, two, one, zero. Physics works, and I'm still alive.


import timeit

# Time each decoding variant and normalize by the number of generated tokens.
time_ts = timeit.timeit(
    "model.model.generate(encoder_output, [prompt_tokens] * len(features))",
    globals=globals(),
    number=10,
)
num_tokens_ts = sum(
    len(result.sequences_ids[0]) for result in generation_results_with_ts
)
time_no_ts = timeit.timeit(
    "model.model.generate(encoder_output, [prompt_tokens + no_ts_token] * len(features))",
    globals=globals(),
    number=10,
)
num_tokens_no_ts = sum(
    len(result.sequences_ids[0]) for result in generation_results_without_ts
)

print(f"Speed with timestamps: {(num_tokens_ts / time_ts) * 10:.2f} tokens/s")
# Speed with timestamps: 196.96 tokens/s

print(f"Speed without timestamps: {(num_tokens_no_ts / time_no_ts) * 10:.2f} tokens/s")
# Speed without timestamps: 302.72 tokens/s
```

VAD is used to segment the audio into 30 s segments. For accurate reproduction we cannot use Whisper's sequential algorithm, because its chunking relies directly on the generated timestamps; if we disabled them, the encoded segments would differ between the two runs.

@ozancaglayan
Contributor

Thanks. What do you mean by the sequential algorithm of Whisper?

@MahmoudAshraf97
Copy link
Contributor Author

The sequential long-form transcription algorithm: it uses the last generated timestamp token to shift the window forward; if no timestamps were generated, it advances the current window by 30 s.
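Roughly, that seek logic can be sketched as follows. This is a simplified stand-in, not faster-whisper's actual code; `next_seek` is my own name, and the constants mirror the multilingual tokenizer (timestamp tokens occupy a contiguous range starting at `<|0.00|>`, each step worth 20 ms), which is an assumption to verify against the vocab in use:

```python
# Illustrative sketch of Whisper's sequential long-form seek update.
TIMESTAMP_BEGIN = 50364   # assumed id of <|0.00|> in the multilingual vocab
SECONDS_PER_TS = 0.02     # each timestamp token step corresponds to 20 ms
WINDOW_S = 30.0

def next_seek(seek_s, generated_tokens):
    """Advance the decoding window after one 30 s segment."""
    timestamps = [t for t in generated_tokens if t >= TIMESTAMP_BEGIN]
    if timestamps:
        # Shift to the last predicted timestamp inside this window.
        return seek_s + (timestamps[-1] - TIMESTAMP_BEGIN) * SECONDS_PER_TS
    # No timestamps generated (e.g. <|notimestamps|>): skip a full window.
    return seek_s + WINDOW_S

print(next_seek(0.0, [50257]))        # no timestamp tokens -> advance 30 s
print(next_seek(0.0, [50364 + 500]))  # last timestamp at 10.0 s -> seek 10.0
```

This is why disabling timestamps changes the chunking itself, not just the decoded text: without timestamp tokens the window always jumps a full 30 s.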
