
Speech

Current state overview

At the moment, the Speech stack has two parts: speech to text and text to speech.

Speech to Text Stack

The speech-to-text stack consists of 5 components, each a ROS node that communicates through topics.

  1. devices/AudioCapturer [python]: A node that captures audio using PyAudio and publishes it to the topic rawAudioChunk (see the first sketch after this list).
  2. devices/InputAudio [c++]: A node that takes the audio chunks and, using RNNoise, detects voice activity, removes the noise, and publishes the result to the topic UsefulAudio.
  3. action_selectors/hear [python]: This node receives the STT requests. It checks whether there is an internet connection to decide between the online and the offline engine; this choice can be overridden with the FORCE_ENGINE parameter (see the second sketch after this list).
    • Online engine: implemented in the AzureSpeechToText node. The hear node resamples the UsefulAudio audio to 16 kHz and publishes it to the topic UsefulAudioAzure to relay it to that node.
    • Offline engine: implemented in the DeepSpeech node. The hear node redirects the UsefulAudio audio to the topic UsefulAudioDeepSpeech to relay it to that node.
  4. action_selectors/AzureSpeechToText [c++]: A node that takes the audio published on the topic UsefulAudioAzure, sends it to the Azure Speech-to-Text API, receives the transcribed text, and publishes it to the topic RawInput.
  5. action_selectors/DeepSpeech [python]: A node that takes the audio published on the topic UsefulAudioDeepSpeech, runs it through DeepSpeech2 to convert it to text, and publishes the result to the topic RawInput.
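
A minimal sketch of what an AudioCapturer-style publisher could look like. The chunk size, sample rate, and the use of audio_common_msgs/AudioData are assumptions; the actual message type and parameters in the repository may differ.

```python
#!/usr/bin/env python
# Sketch of an AudioCapturer-style node: capture microphone audio with PyAudio
# and publish raw chunks to the rawAudioChunk topic.
import pyaudio
import rospy
from audio_common_msgs.msg import AudioData  # assumed message type

CHUNK = 1024   # frames per buffer (assumed)
RATE = 16000   # sample rate in Hz (assumed)

def main():
    rospy.init_node("AudioCapturer")
    pub = rospy.Publisher("rawAudioChunk", AudioData, queue_size=10)

    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)

    while not rospy.is_shutdown():
        # Read one chunk from the microphone and publish it as raw bytes.
        data = stream.read(CHUNK, exception_on_overflow=False)
        pub.publish(AudioData(data=data))

    stream.stop_stream()
    stream.close()
    pa.terminate()

if __name__ == "__main__":
    main()
```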
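A second sketch, of the engine-selection logic described in step 3. The connectivity check, the FORCE_ENGINE values ("online"/"offline"/"none"), and the message type are assumptions, and the 16 kHz resampling done by the real node before relaying to Azure is omitted here.

```python
#!/usr/bin/env python
# Sketch of the hear node's online/offline engine selection: relay UsefulAudio
# either to the Azure node or to the DeepSpeech node.
import socket
import rospy
from audio_common_msgs.msg import AudioData  # assumed message type

def is_online(host="8.8.8.8", port=53, timeout=2):
    """Cheap connectivity check: try to open a TCP connection to a public DNS server."""
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

def callback(msg, publishers):
    azure_pub, deepspeech_pub = publishers
    force = rospy.get_param("~FORCE_ENGINE", "none")  # assumed parameter values
    if force == "online" or (force == "none" and is_online()):
        # Online path: the real node also resamples to 16 kHz before relaying.
        azure_pub.publish(msg)
    else:
        # Offline path: relay to the DeepSpeech node.
        deepspeech_pub.publish(msg)

def main():
    rospy.init_node("hear")
    azure_pub = rospy.Publisher("UsefulAudioAzure", AudioData, queue_size=10)
    deepspeech_pub = rospy.Publisher("UsefulAudioDeepSpeech", AudioData, queue_size=10)
    rospy.Subscriber("UsefulAudio", AudioData, callback,
                     callback_args=(azure_pub, deepspeech_pub))
    rospy.spin()

if __name__ == "__main__":
    main()
```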

Launch File

roslaunch src/action_selectors/launch/speech_to_text.launch

Text to Speech Stack

The text-to-speech stack is currently in transition from the ROS package audio_commons.

Miscellaneous

  • Retrain LM: To reduce the language model (LM) and adapt it to our use case, kenlm is used. With kenlm's lmplz, filter, and build_binary tools, a "fine-tuning" is done to generate a new LM adapted to the specific phrases of the competition (see the sketch after this list). Check it here.
  • Others: An internal dataset, collected through a website, has been created to fine-tune the speech model.
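
As an illustration of the LM retraining step above, a minimal sketch of the lmplz/build_binary part of the pipeline, driven from Python. The corpus file name, the n-gram order, and the binaries being on PATH are assumptions, and the filter step (restricting the LM to a target vocabulary) is omitted.

```python
#!/usr/bin/env python3
# Sketch of adapting an LM with kenlm: estimate an n-gram model from a corpus
# of competition-specific phrases, then convert it to kenlm's binary format.
import subprocess

CORPUS = "corpus.txt"        # one sentence per line (assumed file name)
ARPA = "adapted_lm.arpa"
BINARY = "adapted_lm.binary"

# 1. Estimate an n-gram LM (order 3 chosen arbitrarily) from the corpus.
with open(CORPUS) as fin, open(ARPA, "w") as fout:
    subprocess.run(["lmplz", "-o", "3"], stdin=fin, stdout=fout, check=True)

# 2. Convert the ARPA file to kenlm's binary format for faster loading at runtime.
subprocess.run(["build_binary", ARPA, BINARY], check=True)
```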

Working on

There is ongoing work to update DeepSpeech to the latest release and use a newer PaddlePaddle version for it, as well as to review the use of TFLite to implement DS2 and make inference faster. Other lines of work are speaker localization and hot-word detection.

Installation Requirements

Check this wiki page.

Documents

  • A review of the speech-related technologies we have used and currently use, here.