Speech
At the moment the Speech stack has two parts: speech to text (STT) and text to speech (TTS).
The speech to text pipeline consists of five components, each of which is a ROS node communicating through topics:
- devices/AudioCapturer [python]: A node that captures audio using PyAudio and publishes it to the topic rawAudioChunk (a minimal sketch of this node appears after this list).
- devices/InputAudio [c++]: A node that takes the audio chunks and, using RNNoise, detects voice activity, removes the noise, and publishes the result to the topic UsefulAudio.
- action_selectors/hear [python]: This node receives the STT requests. It checks whether there is an internet connection to decide between the online and offline engines; this can be overridden with the FORCE_ENGINE parameter (a sketch of this engine-selection logic appears after the launch command below).
  - Online engine: the AzureSpeechToText node. For this path, hear resamples the UsefulAudio audio to 16 kHz and publishes it to a new topic called UsefulAudioAzure to relay it to that node.
  - Offline engine: the DeepSpeech node. For this path, hear redirects the UsefulAudio audio to a new topic called UsefulAudioDeepSpeech to relay it to that node.
- action_selectors/AzureSpeechToText [c++]: A node that takes the audio published on the topic UsefulAudioAzure, sends it to the Azure Speech-to-Text API, receives the text back, and publishes it to the topic RawInput.
- action_selectors/DeepSpeech [python]: A node that takes the audio published on the topic UsefulAudioDeepSpeech, calls DeepSpeech2 to convert it to text, and publishes the result to the topic RawInput.
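As a reference, below is a minimal sketch of how an AudioCapturer-style node could look. Only the topic name rawAudioChunk comes from the description above; the chunk size, sample rate, audio format, and message type (std_msgs/UInt8MultiArray) are illustrative assumptions and may differ from the actual implementation.

```python
#!/usr/bin/env python
# Minimal sketch of an AudioCapturer-style node (assumed parameters, not the actual implementation).
import pyaudio
import rospy
from std_msgs.msg import UInt8MultiArray  # message type is an assumption

CHUNK_SIZE = 1024      # frames per buffer (assumed)
SAMPLE_RATE = 48000    # Hz (assumed; the hear node later resamples to 16 kHz for Azure)

def main():
    rospy.init_node("AudioCapturer")
    pub = rospy.Publisher("rawAudioChunk", UInt8MultiArray, queue_size=10)

    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1,
                     rate=SAMPLE_RATE, input=True,
                     frames_per_buffer=CHUNK_SIZE)

    # Read fixed-size chunks from the microphone and publish them as raw bytes.
    while not rospy.is_shutdown():
        data = stream.read(CHUNK_SIZE, exception_on_overflow=False)
        pub.publish(UInt8MultiArray(data=list(bytearray(data))))

    stream.stop_stream()
    stream.close()
    pa.terminate()

if __name__ == "__main__":
    main()
```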
The whole speech to text pipeline can be launched with:

```
roslaunch src/action_selectors/launch/speech_to_text.launch
```
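The engine-selection step of the hear node can be illustrated with the sketch below. The topic names and the FORCE_ENGINE parameter name come from the list above; the connectivity check, the parameter values ("auto", "online", "offline"), and the message type are assumptions made for illustration, not the actual implementation.

```python
#!/usr/bin/env python
# Sketch of the engine-selection logic in action_selectors/hear (assumed names and types).
import socket
import rospy
from std_msgs.msg import UInt8MultiArray  # placeholder message type (assumption)

def internet_available(host="8.8.8.8", port=53, timeout=2):
    """Best-effort connectivity check (illustrative only)."""
    try:
        socket.setdefaulttimeout(timeout)
        socket.socket(socket.AF_INET, socket.SOCK_STREAM).connect((host, port))
        return True
    except OSError:
        return False

class Hear(object):
    def __init__(self):
        # FORCE_ENGINE can force "online" or "offline", or leave the choice to autodetection.
        self.force_engine = rospy.get_param("~FORCE_ENGINE", "auto")
        self.pub_azure = rospy.Publisher("UsefulAudioAzure", UInt8MultiArray, queue_size=5)
        self.pub_deepspeech = rospy.Publisher("UsefulAudioDeepSpeech", UInt8MultiArray, queue_size=5)
        rospy.Subscriber("UsefulAudio", UInt8MultiArray, self.callback)

    def callback(self, msg):
        use_online = (self.force_engine == "online" or
                      (self.force_engine == "auto" and internet_available()))
        if use_online:
            # The real node also resamples the audio to 16 kHz before relaying it.
            self.pub_azure.publish(msg)
        else:
            self.pub_deepspeech.publish(msg)

if __name__ == "__main__":
    rospy.init_node("hear")
    Hear()
    rospy.spin()
```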
There is an ongoing transition from the ROS package audio_common.
- Retrain LM: To reduce and adapt the language model (LM) to our use case, KenLM is used. With KenLM's `lmplz`, `filter`, and `build_binary` tools, a fine-tuning pass generates a new, adapted LM with specific phrases from the competition. Check it here. A hedged example of this workflow appears after this list.
- Others: An internal dataset, collected through a website, has been created to fine-tune the speech model.
There is ongoing work to update DeepSpeech to the latest releases and use a newer PaddlePaddle version for it, and to review the use of TFLite to implement DS2 for faster inference. Other pending topics are speaker localization and hot-word detection.
Check this wiki page.
- A review of the speech-related technologies we have used and currently use can be found here.