Speaker diarization is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker; in other words, it answers the question "who spoke when?". It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker's true identity. This makes speaker diarization an essential feature for enriching a speech recognition transcript with speaker labels.
In this tutorial we consider how to build a speaker diarization pipeline using pyannote.audio and OpenVINO. pyannote.audio is an open-source toolkit written in Python for speaker diarization. Based on the PyTorch deep learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. You can find more information about the pyannote pretrained models in the model card, repository, and paper.
This tutorial uses the pyannote/speaker-diarization pipeline and demonstrates how to integrate OpenVINO for running model inference inside the pipeline.
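As a rough sketch of this integration pattern, the idea is to convert the pipeline's PyTorch segmentation model to OpenVINO and redirect the pipeline's inference call to the compiled model. Note that `pipeline._segmentation`, the dummy input shape, and the `infer_segmentation` patch point below are assumptions based on pyannote.audio internals and may differ between versions:

```python
import numpy as np
import torch
import openvino as ov
from pyannote.audio import Pipeline

# Load the pretrained diarization pipeline. The model is gated on the
# Hugging Face Hub, so an access token may be required
# (Pipeline.from_pretrained accepts a `use_auth_token` argument).
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")

# Convert the pipeline's PyTorch segmentation model to OpenVINO.
# NOTE: `_segmentation.model` is an internal attribute and may change
# between pyannote.audio versions.
pt_model = pipeline._segmentation.model

# Dummy input of shape (batch, channel, samples); 10 s of 16 kHz audio
# is assumed here.
example_input = torch.zeros((1, 1, 160000))
ov_model = ov.convert_model(pt_model, example_input=example_input)

# Compile the converted model for a target device (CPU in this sketch).
compiled_model = ov.Core().compile_model(ov_model, "CPU")

def infer_segmentation(chunks: torch.Tensor) -> np.ndarray:
    """Run segmentation inference with OpenVINO instead of PyTorch."""
    return compiled_model(chunks.numpy())[0]

# Hypothetical patch point: redirect the pipeline's segmentation
# inference to the OpenVINO compiled model.
pipeline._segmentation.infer = infer_segmentation
```

After this patch, the pipeline is used exactly as before; only the backend executing the segmentation model changes.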
The pipeline accepts an input audio file and produces a mapping between timestamps and the speaker to whom each speech fragment belongs. Below is an example of the output time map produced by the demo.
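For illustration, such a time map can be printed by iterating over the diarization result; the audio file path below is a placeholder:

```python
from pyannote.audio import Pipeline

# Load the pipeline (see the sketch above) and apply it to an audio file.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("example_audio.wav")  # the file name is illustrative

# Each track is a (segment, track_id, speaker_label) triple.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker={speaker}")
```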
If you have not installed all required dependencies, follow the Installation Guide.