
Speech

Current state overview

At the moment, the Speech stack has two parts: speech to text and text to speech.

Speech to Text Stack

The speech-to-text stack consists of 5 components, each a ROS node that communicates through topics.

  1. devices/AudioCapturer [python]: A node that captures audio using PyAudio and publishes it to the topic rawAudioChunk (see the first sketch after this list).
  2. devices/InputAudio [c++]: A node that takes the audio chunks and, using RNNoise, detects voice activity, removes the noise, and publishes the result to the topic UsefulAudio.
  3. action_selectors/hear [python]: This node receives the STT requests. It checks whether there is an internet connection to decide between the online and the offline engine; this choice can be overridden with the FORCE_ENGINE parameter (see the second sketch after this list).
    • Online engine: implemented in the AzureSpeechToText node. The hear node resamples the UsefulAudio audio to 16 kHz and publishes it to the topic UsefulAudioAzure to relay it to that node.
    • Offline engine: implemented in the DeepSpeech node. The hear node redirects the UsefulAudio audio to the topic UsefulAudioDeepSpeech to relay it to that node.
  4. action_selectors/AzureSpeechToText [c++]: A node that takes the audio published on the topic UsefulAudioAzure, sends it to the Azure Speech-to-Text API, receives the transcribed text, and publishes it to the topic RawInput.
  5. action_selectors/DeepSpeech [python]: A node that takes the audio published on the topic UsefulAudioDeepSpeech, runs it through DeepSpeech2 to convert it to text, and publishes the result to the topic RawInput.
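
A minimal sketch of what an AudioCapturer-style publisher could look like. The chunk size, sample rate, and the use of audio_common_msgs/AudioData are assumptions; the actual message type and parameters in the repository may differ.

```python
#!/usr/bin/env python
# Sketch of an AudioCapturer-style node: capture microphone audio with PyAudio
# and publish raw chunks to the rawAudioChunk topic.
import pyaudio
import rospy
from audio_common_msgs.msg import AudioData  # assumed message type

CHUNK = 1024   # frames per buffer (assumed)
RATE = 16000   # sample rate in Hz (assumed)

def main():
    rospy.init_node("AudioCapturer")
    pub = rospy.Publisher("rawAudioChunk", AudioData, queue_size=10)

    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)

    while not rospy.is_shutdown():
        # Read one chunk from the microphone and publish it as raw bytes.
        data = stream.read(CHUNK, exception_on_overflow=False)
        pub.publish(AudioData(data=data))

    stream.stop_stream()
    stream.close()
    pa.terminate()

if __name__ == "__main__":
    main()
```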
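A second sketch, of the engine-selection logic described in step 3. The connectivity check, the FORCE_ENGINE values ("online"/"offline"/"none"), and the message type are assumptions, and the 16 kHz resampling done by the real node before relaying to Azure is omitted here.

```python
#!/usr/bin/env python
# Sketch of the hear node's online/offline engine selection: relay UsefulAudio
# either to the Azure node or to the DeepSpeech node.
import socket
import rospy
from audio_common_msgs.msg import AudioData  # assumed message type

def is_online(host="8.8.8.8", port=53, timeout=2):
    """Cheap connectivity check: try to open a TCP connection to a public DNS server."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

def callback(msg, publishers):
    azure_pub, deepspeech_pub = publishers
    force = rospy.get_param("~FORCE_ENGINE", "none")  # assumed parameter values
    if force == "online" or (force == "none" and is_online()):
        # Online path: the real node also resamples to 16 kHz before relaying.
        azure_pub.publish(msg)
    else:
        # Offline path: relay to the DeepSpeech node.
        deepspeech_pub.publish(msg)

def main():
    rospy.init_node("hear")
    azure_pub = rospy.Publisher("UsefulAudioAzure", AudioData, queue_size=10)
    deepspeech_pub = rospy.Publisher("UsefulAudioDeepSpeech", AudioData, queue_size=10)
    rospy.Subscriber("UsefulAudio", AudioData, callback,
                     callback_args=(azure_pub, deepspeech_pub))
    rospy.spin()

if __name__ == "__main__":
    main()
```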

Launch File

roslaunch src/action_selectors/launch/speech_to_text.launch

Text to Speech Stack

The text-to-speech stack is currently in transition from the ROS package audio_commons.

Miscellaneous

  • Retrain LM: To reduce the language model (LM) and adapt it to our use case, kenlm is used. With kenlm's lmplz, filter, and build_binary tools, a "fine-tuning" is done to generate a new LM adapted to the specific phrases of the competition (see the sketch after this list). Check it here.
  • Others: An internal dataset, collected through a website, has been created to fine-tune the speech model.
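
As an illustration of the LM retraining step above, a minimal sketch of the lmplz/build_binary part of the pipeline, driven from Python. The corpus file name, the n-gram order, and the binaries being on PATH are assumptions, and the filter step (restricting the LM to a target vocabulary) is omitted.

```python
#!/usr/bin/env python3
# Sketch of adapting an LM with kenlm: estimate an n-gram model from a corpus
# of competition-specific phrases, then convert it to kenlm's binary format.
import subprocess

CORPUS = "corpus.txt"        # one sentence per line (assumed file name)
ARPA = "adapted_lm.arpa"
BINARY = "adapted_lm.binary"

# 1. Estimate an n-gram LM (order 3 chosen arbitrarily) from the corpus.
with open(CORPUS) as fin, open(ARPA, "w") as fout:
    subprocess.run(["lmplz", "-o", "3"], stdin=fin, stdout=fout, check=True)

# 2. Convert the ARPA file to kenlm's binary format for faster loading at runtime.
subprocess.run(["build_binary", ARPA, BINARY], check=True)
```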

Working on

There is ongoing work to update DeepSpeech to the latest release and use a newer PaddlePaddle version for it, as well as to review the use of TFLite to implement DS2 and make inference faster. Other lines of work are speaker localization and hot-word detection.

Installation Requirements

Check this wiki page.

Documents

  • A review of the speech-related technologies we have used and currently use, here.