Speech
At this moment the Speech stack is composed of two main components:
- Speech To Text
- Text To Speech
Speech To Text consists of five components, each of which is a ROS node that communicates over topics.
-
AudioCapturer
devices/AudioCapturer [python]: A node that captures audio using PyAudio and publishes it to the topic rawAudioChunk.
-
GetUsefulAudio
There are two options to get the useful audio:
devices/InputAudio [c++]: A node that takes the audio chunks and, using RNNoise, checks for a voice, cuts the audio, removes the noise, and publishes the result to the topic UsefulAudio. The RNNoise approach fails after running for a while; very long silences affect it.
devices/UsefulAudio [python]: A node that takes the audio chunks and, using webrtcvad, checks for a voice, cuts the audio, and publishes the result to the topic UsefulAudio. The webrtcvad approach was built as an alternative: it does not remove silence, but it reliably extracts the pieces of audio in which someone speaks, and it performs very well.
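The voice-based cutting described above can be sketched as follows. This is a minimal illustration and not the node's actual code: the function names, the 16 kHz sample rate, the 30 ms frame size, and the silence limit are assumptions, and a simple energy threshold stands in for the voice classifier so that the sketch has no dependencies.

```python
import struct
from typing import Callable, Iterable, List

SAMPLE_RATE = 16000  # Hz (assumed; not stated for this node)
FRAME_MS = 30        # assumed frame length in milliseconds
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM


def energy_is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Stand-in classifier: mean absolute amplitude above a threshold."""
    count = len(frame) // 2
    if count == 0:
        return False
    samples = struct.unpack("<%dh" % count, frame[: count * 2])
    return sum(abs(s) for s in samples) / count > threshold


def collect_utterances(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    max_silent_frames: int = 10,
) -> List[bytes]:
    """Group frames into utterances.

    An utterance starts at the first voiced frame and is closed once
    max_silent_frames consecutive unvoiced frames have been seen.
    Silence before an utterance is skipped; pauses inside and after
    the speech are kept as captured.
    """
    utterances: List[bytes] = []
    current: List[bytes] = []
    silent_run = 0
    for frame in frames:
        if is_speech(frame):
            current.append(frame)
            silent_run = 0
        elif current:
            current.append(frame)
            silent_run += 1
            if silent_run >= max_silent_frames:
                utterances.append(b"".join(current))
                current, silent_run = [], 0
    if current:  # flush an utterance still open at the end of the stream
        utterances.append(b"".join(current))
    return utterances
```

To use webrtcvad instead of the energy threshold, the classifier could be `lambda f: vad.is_speech(f, SAMPLE_RATE)` with `vad = webrtcvad.Vad(2)`; each resulting utterance would then be published to UsefulAudio.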
-
Engine Selector
action_selectors/hear [python]: This node receives the STT requests. It checks whether there is an internet connection to decide between the online and the offline engine; the choice can be overridden with the FORCE_ENGINE parameter.
- Online engine: implemented in the AzureSpeechToText node. The hear node takes the audio from UsefulAudio, resamples it to 16 kHz, and publishes it to a new topic, UsefulAudioAzure, to relay it to that node.
- Offline engine: implemented in the DeepSpeech node. The hear node redirects the audio from UsefulAudio to a new topic, UsefulAudioDeepSpeech, to relay it to that node.
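The selection logic described above might look roughly like the sketch below. It is an illustration under assumptions, not the hear node's code: the accepted FORCE_ENGINE values ("online" and "offline"), the function names, and the connectivity probe (a TCP connection to a public DNS server) are all hypothetical. Only the topic names come from the description above.

```python
import socket
from typing import Optional

ONLINE_ENGINE = "online"    # AzureSpeechToText node
OFFLINE_ENGINE = "offline"  # DeepSpeech node

# Topic each engine listens on, as described above.
ENGINE_TOPICS = {
    ONLINE_ENGINE: "UsefulAudioAzure",
    OFFLINE_ENGINE: "UsefulAudioDeepSpeech",
}


def has_internet(host: str = "8.8.8.8", port: int = 53,
                 timeout: float = 2.0) -> bool:
    """Assumed probe: try to open a TCP connection to a public DNS server."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def select_engine(force_engine: Optional[str], online: bool) -> str:
    """Return the engine to use.

    A recognized FORCE_ENGINE value wins; otherwise the connectivity
    check decides. Unrecognized values are ignored (an assumption).
    """
    if force_engine:
        value = force_engine.strip().lower()
        if value in ENGINE_TOPICS:
            return value
    return ONLINE_ENGINE if online else OFFLINE_ENGINE


def output_topic(force_engine: Optional[str], online: bool) -> str:
    """Topic to which the audio from UsefulAudio should be relayed."""
    return ENGINE_TOPICS[select_engine(force_engine, online)]
```

In a node, this could be called as `output_topic(force, has_internet())` for each request; keeping the decision in a pure function makes the override behavior easy to test without a network.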
-
Azure Engine
action_selectors/AzureSpeechToText [c++]: A node that takes the audio published in the topic UsefulAudioAzure, sends it to the Azure SpeechToText API, receives the text, and publishes it to the topic RawInput.
-
DeepSpeech2 Engine
action_selectors/DeepSpeech [python]: A node that takes the audio published in the topic UsefulAudioDeepSpeech, calls DeepSpeech2 to convert it to text, and publishes the text to the topic RawInput.
Text To Speech consists of one component, a ROS node with topics.
-
Say
devices/say [python]: A node that says through the speakers whatever is published on the robot_text topic. It also has a topic, inputAudioActive, that notifies other nodes that the robot is talking. It uses Google's gTTS engine as the online option and pyttsx3 as the offline option.
roslaunch src/action_selectors/launch/conversation_speech.launch
- Retrain LM: To reduce and adapt the LM to our case, kenlm is used. With kenlm's lmplz, filter, and build_binary tools, a fine-tuning step generates a new, adapted LM with specific phrases of the competition. Check it here.
- Others: An internal dataset has been created, using a website, to fine-tune the speech model.
Check this wiki page.
- A review of the speech-related technologies we have used, and still use, is available here.
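The LM retraining step listed above chains three kenlm tools. A minimal sketch of how the commands could be assembled is shown below; the file names, the n-gram order, and the helper names are hypothetical, and the exact options used by the team are not given here. It assumes the kenlm binaries are on PATH and that the filter vocabulary is supplied on standard input.

```python
import subprocess
from typing import List


def lmplz_cmd(corpus: str, arpa: str, order: int = 3) -> List[str]:
    """Estimate an n-gram LM from a text corpus (order is an assumption)."""
    return ["lmplz", "-o", str(order), "--text", corpus, "--arpa", arpa]


def filter_cmd(arpa_in: str, arpa_out: str) -> List[str]:
    """Restrict the LM to a vocabulary that is read from stdin."""
    return ["filter", "single", "model:" + arpa_in, arpa_out]


def build_binary_cmd(arpa: str, binary: str) -> List[str]:
    """Convert the ARPA file to kenlm's binary format."""
    return ["build_binary", arpa, binary]


def retrain_lm(corpus: str, vocab: str, binary: str) -> None:
    """Run the three steps in order; intermediate file names are hypothetical."""
    raw_arpa = "lm_raw.arpa"
    filtered_arpa = "lm_filtered.arpa"
    subprocess.run(lmplz_cmd(corpus, raw_arpa), check=True)
    with open(vocab, "rb") as vocab_file:
        subprocess.run(filter_cmd(raw_arpa, filtered_arpa),
                       stdin=vocab_file, check=True)
    subprocess.run(build_binary_cmd(filtered_arpa, binary), check=True)
```

A call such as `retrain_lm("competition_phrases.txt", "vocab.txt", "lm.binary")` would produce the adapted binary LM, with both input files being placeholders for the competition-specific data.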