Integrate realtime speech to speech translation (S2ST) using seamless-streaming models #2992

qdrop17 · 2024-03-23T10:01:24Z


Hey folks!

I recently experimented with https://huggingface.co/facebook/seamless-streaming. You can run it locally or check it out on a Hugging Face space: https://huggingface.co/spaces/facebook/seamless-streaming

This enables real-time translation directly from audio to audio with minimal latency, since it skips the usual chain of speech-to-text conversion, text translation, and speech synthesis.

I thought this project could be well-suited for integrating this feature into the effect pipeline as it's essentially an effect, right? 😀

Imagine...

You could watch foreign language YouTube videos, and your system would translate them locally in real-time.
You could call someone who speaks another language, and your voice would be translated in real-time to the language of the person you're calling.

Moreover, the system requirements aren't unrealistically high: an Nvidia GPU with 8GB of VRAM is sufficient to run the medium model.

If any maintainers are reading this: How could we approach creating a small proof-of-concept? My expertise lies more in DevOps topics and automation rather than software engineering. Yet, I'm willing to give it a try, of course.

Best regards,

qdrop

wwmm · 2024-03-23T21:07:49Z

Maintainer

> Moreover, the system requirements aren't unrealistically high: an Nvidia GPU with 8GB of VRAM is sufficient to run the medium model.

But does it work with other GPUs? Having something that only works with Nvidia is far from ideal.

> If any maintainers are reading this: How could we approach creating a small proof-of-concept? My expertise lies more in DevOps topics and automation rather than software engineering. Yet, I'm willing to give it a try, of course.

As far as I could see, they only provide Python examples. So my first impression is that making this work in a C++ app will be painful... And does it work without any kind of internet access? Considering Facebook is involved, I have my doubts that this is fully offline.

In order to be put in the pipeline, this package has to have a C++ or C API that allows it to receive the audio buffer, its size, and the sampling rate every time PipeWire calls the processing callback.
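To make that requirement concrete, here is a minimal sketch of the kind of glue code such an API would need on the application side. It assumes the host hands over a float buffer, a frame count, and a sampling rate on every callback, and that the translation backend wants fixed-size chunks. `ChunkAccumulator`, `ChunkHandler`, and `process` are hypothetical names made up for illustration; none of them come from PipeWire, EasyEffects, or seamless-streaming.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical adapter (not a PipeWire, EasyEffects, or seamless-streaming API).
// It collects the variable-size buffers delivered by a processing callback
// and emits fixed-size chunks to a handler.
class ChunkAccumulator {
 public:
  using ChunkHandler =
      std::function<void(const std::vector<float>& chunk, unsigned sample_rate)>;

  ChunkAccumulator(std::size_t chunk_frames, ChunkHandler on_chunk)
      : chunk_frames_(chunk_frames), on_chunk_(std::move(on_chunk)) {
    assert(chunk_frames_ > 0);
    pending_.reserve(chunk_frames_);
  }

  // Meant to be called once per processing cycle with whatever buffer size
  // and sampling rate the host reports for that cycle.
  void process(const float* samples, std::size_t n_frames, unsigned sample_rate) {
    if (sample_rate != sample_rate_) {
      // Frames captured at the previous rate cannot be mixed with new ones.
      pending_.clear();
      sample_rate_ = sample_rate;
    }
    for (std::size_t i = 0; i < n_frames; ++i) {
      pending_.push_back(samples[i]);
      if (pending_.size() == chunk_frames_) {
        on_chunk_(pending_, sample_rate_);
        pending_.clear();
      }
    }
  }

  // Number of frames waiting for the current chunk to fill up.
  std::size_t pending_frames() const { return pending_.size(); }

 private:
  std::size_t chunk_frames_;
  ChunkHandler on_chunk_;
  std::vector<float> pending_;
  unsigned sample_rate_ = 0;
};
```

In a real plugin the handler should only pass the chunk on to a worker thread, because model inference is far too slow to run inside a realtime audio callback.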

1 reply

qdrop17 Mar 24, 2024
Author

Alright, thank you very much for your response!

> But does it work with other GPU? Having something that only works with nvidia is far from ideal.

It's a weights file that you download and run locally (or wherever you have the necessary hardware). You can run it without a GPU, but that's probably not fast enough for real-time audio translation. Nvidia has an edge with framework support (CUDA), but AMD and Intel are catching up. However, these advanced capabilities require specific hardware, similar to other large language models (LLMs) that can be easily downloaded and run locally, such as https://github.com/ollama/ollama or https://github.com/lmstudio-ai.

> And does it work without any kind of internet access. Considering facebook is involved I have my doubts this is fully offline.

Yes, these models are downloaded and executed locally, functioning seamlessly even when run in an internet-disconnected Docker container. While I believe Meta engages in some truly unethical practices, the world is nuanced—there are also many talented engineers there who contribute significantly to open-source projects.

I think I will continue with some local experiments outside of EasyEffects - for now, I was only able to translate a given file. Next, I will try to tap into the PipeWire system and attempt to translate an audio stream in real-time.

Once those capabilities are validated and the results are satisfactory, I will provide an update here.
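One preprocessing step such a PipeWire experiment will probably need is converting the captured stream into the format the model expects. The sketch below assumes interleaved stereo float input at 48 kHz and a model that wants mono audio at 16 kHz; both are assumptions that should be checked against the model card. It averages each group of frames, which is a crude low-pass plus decimation and not a proper resampler. `downmix_and_decimate` is a hypothetical helper name.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: downmix interleaved stereo to mono, then reduce the
// sampling rate by an integer factor by averaging each group of `factor`
// frames (e.g. factor 3 turns 48 kHz into 16 kHz).
// Trailing frames that do not fill a whole group are dropped.
std::vector<float> downmix_and_decimate(const std::vector<float>& interleaved_stereo,
                                        std::size_t factor) {
  std::vector<float> out;
  if (factor == 0) {
    return out;  // nothing sensible to do with a zero factor
  }
  const std::size_t n_frames = interleaved_stereo.size() / 2;
  out.reserve(n_frames / factor);
  for (std::size_t start = 0; start + factor <= n_frames; start += factor) {
    float sum = 0.0f;
    for (std::size_t k = 0; k < factor; ++k) {
      const std::size_t frame = start + k;
      // Mono sample = mean of the left and right channels.
      sum += 0.5f * (interleaved_stereo[2 * frame] + interleaved_stereo[2 * frame + 1]);
    }
    out.push_back(sum / static_cast<float>(factor));
  }
  return out;
}
```

For anything beyond a first test, a dedicated resampling library would give noticeably better audio quality than this averaging approach.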
