Diego Cardozo edited this page Jun 1, 2020 · 18 revisions


Current state overview

The speech stack currently has two parts: speech to text and text to speech.

Speech to Text Stack

The speech-to-text stack consists of four components, each of which is a ROS node that communicates through topics.

  1. devices/AudioCapturer [python]: A node that captures audio using PyAudio and publishes it to the topic rawAudioChunk.
  2. devices/InputAudio [c++]: A node that takes the audio chunks and, using RNNoise, checks for a voice, removes the noise, and publishes the result to the topic UsefulAudio.
  3. action_selectors/hear [python]: A node that checks for an internet connection to decide whether to call the offline or the online engine; this choice can be overridden.
    • Offline engine: it runs inside this node. It calls DeepSpeech2 with the content of the topic UsefulAudio, converts it to text, and publishes the text to the topic RawInput.
    • Online engine: this node resamples the audio from UsefulAudio to 16 kHz and publishes it to a new topic called UsefulAudio16kHZ.
  4. action_selectors/azureSpeechText [c++]: A node that takes the audio published on the topic UsefulAudio16kHZ, sends it to the Azure SpeechToText API, receives the text, and publishes it to the topic RawInput.
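
The decision made by the hear node (online vs. offline engine, with an override) and the 16 kHz resampling step can be sketched roughly as follows. This is only an illustration and not the code in the repository: the function names, the `force` parameter, the connectivity probe (a TCP connection to a public DNS server), and the linear-interpolation resampler are all assumptions, and the ROS publishing and subscribing are left out. The source sample rate of UsefulAudio is not stated on this page, so it is passed in as a parameter.

```python
import socket
from array import array


def has_internet(host="8.8.8.8", port=53, timeout=1.0):
    """Hypothetical probe: True if a TCP connection succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_engine(force=None, check=has_internet):
    """Return 'online' or 'offline'.

    `force` (an assumed override mechanism) wins over the connectivity check.
    """
    if force in ("online", "offline"):
        return force
    return "online" if check() else "offline"


def resample_linear(samples, src_rate, dst_rate=16000):
    """Resample 16-bit mono samples by linear interpolation.

    No anti-aliasing filter is applied, so this is only a rough sketch;
    a real node would likely use a proper resampling library.
    """
    if src_rate == dst_rate or len(samples) == 0:
        return array("h", samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = array("h")
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(int(round(samples[left] * (1 - frac) + samples[right] * frac)))
    return out
```

Passing the connectivity check as an argument keeps the selection logic testable without a network, for example `choose_engine(check=lambda: False)` selects the offline engine.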

Text to Speech Stack

The text-to-speech stack is currently transitioning from the ROS package audio_commons.


  • Retrain LM: KenLM is used to shrink the language model (LM) and adapt it to our use case. A "fine-tuning" step with KenLM's lmplz, filter, and build_binary tools generates a new, adapted LM from phrases specific to the competition.
  • Others: An internal dataset, collected through a website, has been created to fine-tune the speech model.
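
As a rough sketch of the KenLM step, the three tools could be chained as shown here. The file names, the n-gram order, and the helper names are hypothetical, and the exact flags should be checked against the installed KenLM build; the assumption is that lmplz estimates an ARPA model from the phrase corpus, filter keeps only the n-grams covered by a vocabulary read from standard input, and build_binary converts the result to KenLM's binary format.

```python
import subprocess


def kenlm_commands(corpus, vocab, order=3, workdir="."):
    """Build the command list for an lmplz -> filter -> build_binary run.

    All paths and the default order are illustrative assumptions.
    Each step is a dict with the argv list and an optional stdin file.
    """
    raw_arpa = f"{workdir}/lm.arpa"
    filtered = f"{workdir}/lm_filtered.arpa"
    binary = f"{workdir}/lm.binary"
    return [
        # Estimate an n-gram model from the competition phrases.
        {"argv": ["lmplz", "-o", str(order), "--text", corpus, "--arpa", raw_arpa],
         "stdin": None},
        # Keep only n-grams whose words appear in the vocabulary (read from stdin).
        {"argv": ["filter", "single", f"model:{raw_arpa}", filtered],
         "stdin": vocab},
        # Convert the filtered ARPA file into KenLM's binary format.
        {"argv": ["build_binary", filtered, binary],
         "stdin": None},
    ]


def run_all(steps):
    """Run each step in order, stopping at the first failure."""
    for step in steps:
        if step["stdin"] is None:
            subprocess.run(step["argv"], check=True)
        else:
            with open(step["stdin"], "rb") as handle:
                subprocess.run(step["argv"], stdin=handle, check=True)
```

Calling `run_all(kenlm_commands("phrases.txt", "vocab.txt"))` would execute the three steps, provided the KenLM binaries are on the PATH and both input files exist.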

Working on

Work is under way to update DeepSpeech to the latest releases and to use a newer PaddlePaddle version with it. The use of TFLite to build DP2 for faster inference is also under review. Other planned items are speaker localization and hot-word detection.

Installation Requirements

Check this wiki page.

More Documents

Coming soon.