
Speech not detected by silero vad #1084

Closed
thewh1teagle opened this issue Jul 7, 2024 · 3 comments · Fixed by #1099

Comments

@thewh1teagle
Contributor

Hey
First of all, thanks for this great library! I like it a lot.

I created Rust bindings, and while creating the bindings for voice activity detection, I noticed that sometimes it doesn't detect speech although it's there, loud and clear.

So I checked the original silero-vad repository and compared it with sherpa-onnx on the same audio file.
It turned out that the speech goes undetected only with sherpa-onnx; with torch it is detected.

Reproduce:

  1. Download the audio file:
wget https://github.com/thewh1teagle/sherpa-rs/raw/main/samples/motivation.wav
  2. Test silero VAD with sherpa-onnx on the file:
main.py
# wget https://github.com/snakers4/silero-vad/raw/master/files/silero_vad.onnx
# wget https://github.com/thewh1teagle/sherpa-rs/raw/main/samples/motivation.wav
# pip3 install soundfile numpy sherpa_onnx
# python3 main.py

from pathlib import Path
from typing import Tuple

import numpy as np
import sherpa_onnx
import soundfile as sf 


def load_audio(filename: str) -> Tuple[np.ndarray, int]:
    data, sample_rate = sf.read(
        filename,
        always_2d=True,
        dtype="float32",
    )
    data = data[:, 0]  # use only the first channel
    samples = np.ascontiguousarray(data)
    
    # Add 1 second of silence padding to the end of the samples
    padding_samples = int(sample_rate * 1)
    samples = np.concatenate((samples, np.zeros(padding_samples, dtype=samples.dtype)))
    
    return samples, sample_rate

def main():
    samples, sample_rate = load_audio("motivation.wav")
    config = sherpa_onnx.VadModelConfig()
    config.silero_vad.model = "silero_vad.onnx"
    config.sample_rate = sample_rate

    window_size = config.silero_vad.window_size

    vad = sherpa_onnx.VoiceActivityDetector(config, buffer_size_in_seconds=3)
    while len(samples) > window_size:
        vad.accept_waveform(samples[:window_size])
        samples = samples[window_size:]
        if vad.is_speech_detected():
            while not vad.empty():
                start_sec = vad.front.start / sample_rate
                duration_sec = len(vad.front.samples) / sample_rate
                print(f"start={start_sec}s duration={duration_sec}s")
                vad.pop()

main()

Output:

$ python main.py
start=0.926s duration=1.678s
start=3.774s duration=2.414s
start=7.518s duration=1.998s
  3. Test it again with torch:
main.py
# wget https://github.com/thewh1teagle/sherpa-rs/raw/main/samples/motivation.wav
# pip install torch torchaudio
# python3 main.py

import torch 
torch.set_num_threads(1)

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, _, read_audio, _, _) = utils

wav = read_audio('motivation.wav')
sample_rate = 16000

def convert_samples_to_seconds(timestamps, sample_rate):
    return [{'start': ts['start'] / sample_rate, 'end': ts['end'] / sample_rate} for ts in timestamps]

speech_timestamps = get_speech_timestamps(wav, model)
readable_timestamps = convert_samples_to_seconds(speech_timestamps, sample_rate)
for timestamp in readable_timestamps:
    print(f"start={timestamp['start']} end={timestamp['end']}")

Output:

$ python main.py
start=0.738 end=2.43
start=3.81 end=6.174
start=7.554 end=9.15
start=12.77 end=20.0

Expected behavior:
The sherpa-onnx VAD result should include the segment from 12.77s to 20.0s.

Actual behavior:
That segment is missing, although it is detected when running the model with torch.

@thewh1teagle
Contributor Author

thewh1teagle commented Jul 9, 2024

I solved the issue.
There were three different problems:

  1. I had to pad the samples with zeros at the end so the last speech segment would be detected.
  2. I had to change the loop logic so it also processes the remaining samples at the end (the loop in the examples only consumes full windows).
  3. I had to increase the buffer_size_in_seconds parameter; at 3.0 the buffer overflowed and results were lost, so I increased it to 60.0.

By the way, it might be better to log that results are lost when the buffer overflows.
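Fixes 1 and 2 above can be sketched as small windowing helpers. This is a minimal sketch: `pad_tail` and `iter_windows` are hypothetical names I'm using for illustration, and the sherpa-onnx calls are shown only as comments since they need the model file.

```python
import numpy as np

def pad_tail(samples: np.ndarray, sample_rate: int, seconds: float = 1.0) -> np.ndarray:
    """Fix 1: append `seconds` of silence so trailing speech gets flushed out."""
    padding = np.zeros(int(sample_rate * seconds), dtype=samples.dtype)
    return np.concatenate((samples, padding))

def iter_windows(samples: np.ndarray, window_size: int):
    """Fix 2: yield every window, zero-padding the final partial one instead of
    dropping it as the original `while len(samples) > window_size` loop did."""
    for start in range(0, len(samples), window_size):
        chunk = samples[start:start + window_size]
        if len(chunk) < window_size:
            chunk = np.pad(chunk, (0, window_size - len(chunk)))
        yield chunk

# Fix 3: pass a larger buffer when constructing the detector, e.g.
# vad = sherpa_onnx.VoiceActivityDetector(config, buffer_size_in_seconds=60)
# for window in iter_windows(pad_tail(samples, sample_rate), window_size):
#     vad.accept_waveform(window)
```

With these helpers the feed loop consumes the entire recording, including the tail that the original example silently skipped.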

@csukuangfj
Collaborator

By the way, the overflow log is just for your information. We will increase the buffer size internally.

@csukuangfj
Collaborator

Fixed in #1099
