Josecisneros001 edited this page Jun 18, 2021 · 18 revisions

Speech

Current state overview

At this moment the Speech stack is composed of two main components:

  • Speech To Text
  • Text To Speech

Speech to Text Stack

Consists of five components, each of which is a ROS node that communicates over topics.

  1. AudioCapturer

    devices/AudioCapturer [python]: It is a node that captures the audio using PyAudio and publishes it to the topic rawAudioChunk.

  2. GetUsefulAudio

    There are two options to get the useful audio:

    devices/InputAudio [c++]: A node that takes the chunks of audio and, using RNNoise, checks for a voice, cuts the audio, removes the noise, and publishes the result to the topic UsefulAudio. The RNNoise approach fails after running for a while, and it is affected by very long silences.

    devices/UsefulAudio [python]: A node that takes the chunks of audio and, using webrtcvad, checks for a voice, cuts the audio, and publishes it to the topic UsefulAudio. The webrtcvad approach was built as an alternative: it does not remove silence, but it accurately extracts the pieces of audio in which someone is speaking, and it performs very well.

  3. Engine Selector

    action_selectors/hear [python]: This node receives the STT requests. It checks whether there is an internet connection to decide between the online and the offline engine; this can be overridden with the FORCE_ENGINE parameter.

    • Online engine: it lives in the AzureSpeechToText node. To feed it, this node resamples the audio from UsefulAudio to 16 kHz and publishes it to a new topic called UsefulAudioAzure, which relays it to that node.
    • Offline engine: it lives in the DeepSpeech node. To feed it, this node redirects the audio from UsefulAudio to a new topic called UsefulAudioDeepSpeech, which relays it to that node.
  4. Azure Engine

    action_selectors/AzureSpeechToText [c++]: A node that takes the audio published in the topic UsefulAudioAzure, sends it to the Azure SpeechToText API, receives the text, and publishes it to the topic RawInput.

  5. DeepSpeech2 Engine

    action_selectors/DeepSpeech [python]: A node that takes the audio published in the topic UsefulAudioDeepSpeech and calls DeepSpeech2, converts it to text, and publishes it to the topic RawInput.
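
The engine selection described in step 3 can be sketched roughly as follows. This is a minimal illustration, not the code of the hear node: the topic names come from the description above, while the accepted FORCE_ENGINE values ("online", "offline"), the connectivity probe (a TCP connection to 8.8.8.8:53), and all function names are assumptions made for the example.

```python
import socket

# Topic names are taken from the description above.
# The keys "online"/"offline" are hypothetical FORCE_ENGINE values.
ENGINE_TOPICS = {
    "online": "UsefulAudioAzure",
    "offline": "UsefulAudioDeepSpeech",
}


def has_internet(host="8.8.8.8", port=53, timeout=1.0):
    """Hypothetical probe: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def select_engine(force_engine=None, connectivity_check=has_internet):
    """Return 'online' or 'offline'; a recognised force_engine wins over the probe."""
    if force_engine in ENGINE_TOPICS:
        return force_engine
    return "online" if connectivity_check() else "offline"


def target_topic(force_engine=None, connectivity_check=has_internet):
    """Return the topic on which the useful audio should be republished."""
    return ENGINE_TOPICS[select_engine(force_engine, connectivity_check)]
```

Passing the connectivity check as an argument keeps the decision testable without a network; in a real node the returned topic name would be handed to the matching ROS publisher.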

Text To Speech Stack

Consists of one component, a ROS node that communicates over topics.

  1. Say

    devices/say [python]: A node that speaks through the speakers whatever is published on the robot_text topic. It also has a topic, inputAudioActive, that notifies other nodes that the robot is talking. It uses Google's gTTS engine as the online option and pyttsx3 as the offline option.
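
One way the online/offline choice in the say node could work is sketched below. It is an assumption-laden illustration, not the node's code: the page does not say how the node picks an engine, so the "try online, fall back to offline on any error" policy is hypothetical, as are the function names, the output path, and the `set_talking` callback (how its value maps onto inputAudioActive is not specified here). Playback of the generated gTTS file is left out.

```python
def speak_online(text, path="/tmp/say.mp3"):
    """Synthesize text with gTTS into an MP3 file (playback is omitted)."""
    from gtts import gTTS  # imported lazily so the sketch loads without gTTS
    gTTS(text=text, lang="en").save(path)
    return path


def speak_offline(text):
    """Speak text directly through the speakers with pyttsx3."""
    import pyttsx3  # imported lazily so the sketch loads without pyttsx3
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()


def say(text, set_talking, online=speak_online, offline=speak_offline):
    """Speak text, preferring the online engine; report which engine was used.

    set_talking is a hypothetical callback standing in for the publisher
    that tells other nodes the robot is talking.
    """
    set_talking(True)
    try:
        try:
            online(text)
            return "online"
        except Exception:
            # Assumed policy: any online failure falls back to the offline engine.
            offline(text)
            return "offline"
    finally:
        set_talking(False)
```

Wrapping the call in `try`/`finally` means the "talking" flag is always cleared, even if both engines fail.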

Launch File

TMR 2021 version

roslaunch src/action_selectors/launch/conversation_speech.launch

Miscellaneous

  • Retrain LM: kenlm is used to shrink the LM and adapt it to our use case. Its lmplz, filter, and build_binary tools are used to "fine-tune" a new, adapted LM on phrases specific to the competition. Check it here.
  • Others: An internal dataset, collected through a website, has been created to fine-tune the speech model.
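
The three kenlm steps named in the Retrain LM item could be chained along these lines. This is only a sketch under assumptions: the file names, the n-gram order, and the helper names are hypothetical, and the actual retraining script may use different options. The commands are built as argument lists first, so they can be inspected before anything is run.

```python
import os
import subprocess


def kenlm_steps(kenlm_bin, corpus, vocab, out_dir, order=3):
    """Build the lmplz -> filter -> build_binary commands (all paths hypothetical)."""
    arpa = os.path.join(out_dir, "lm.arpa")
    filtered = os.path.join(out_dir, "lm_filtered.arpa")
    binary = os.path.join(out_dir, "lm.binary")
    return [
        # 1. Estimate an n-gram model from the competition phrases.
        #    Very small corpora may additionally need --discount_fallback.
        {"cmd": [os.path.join(kenlm_bin, "lmplz"), "--order", str(order),
                 "--text", corpus, "--arpa", arpa],
         "stdin": None},
        # 2. Keep only n-grams whose words appear in the vocabulary,
        #    which filter reads from standard input in this mode.
        {"cmd": [os.path.join(kenlm_bin, "filter"), "single",
                 "model:" + arpa, filtered],
         "stdin": vocab},
        # 3. Convert the filtered ARPA file to kenlm's binary format.
        {"cmd": [os.path.join(kenlm_bin, "build_binary"), filtered, binary],
         "stdin": None},
    ]


def run_steps(steps):
    """Run each command in order, feeding the vocabulary file where needed."""
    for step in steps:
        if step["stdin"] is None:
            subprocess.run(step["cmd"], check=True)
        else:
            with open(step["stdin"], "rb") as vocab_file:
                subprocess.run(step["cmd"], stdin=vocab_file, check=True)
```

For example, `run_steps(kenlm_steps("/path/to/kenlm/build/bin", "phrases.txt", "vocab.txt", "out"))` would write `out/lm.binary`, assuming the kenlm binaries, both input files, and the `out` directory exist.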

Installation Requirements

Check this wiki page.

Documents

  • A review of the speech-related technologies we have used and use here.