Speech
At the moment, the Speech stack has two parts: speech-to-text and text-to-speech.
The speech-to-text consists of 4 components, each a ROS node communicating through topics:
- devices/AudioCapturer [python]: A node that captures the audio using PyAudio and publishes it to the topic rawAudioChunk (a minimal sketch is given after this list).
- devices/InputAudio [c++]: A node that takes the chunks of audio and, using RNNoise, checks for a voice, removes the noise, and publishes it to the topic UsefulAudio.
- action_selectors/hear [python]: A node that checks if there is an internet connection, to know whether to call the offline or the online engine; this can be overridden.
- The offline engine lives in this node: it calls DeepSpeech2 with the content of the topic UsefulAudio, converts it to text, and publishes it to the topic RawInput.
- For the online engine, this node resamples the UsefulAudio audio to 16 kHz and publishes it to a new topic called UsefulAudio16kHZ (see the resampling sketch after this list).
- action_selectors/azureSpeechText [c++]: A node that takes the audio published in the topic UsefulAudio16kHZ, sends it to the Azure SpeechToText API, receives the text, and publishes it to the topic RawInput.
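
For reference, below is a minimal sketch of what the AudioCapturer node does. It is not the repository's actual code: the message type (audio_common_msgs/AudioData), sampling rate, and chunk size are assumptions made for illustration.

```python
#!/usr/bin/env python
# Hypothetical minimal capturer; the real node may use different message
# types, rates, and topic parameters.
import pyaudio
import rospy
from audio_common_msgs.msg import AudioData  # assumption: raw-bytes message

CHUNK = 1024               # frames per buffer (assumed)
RATE = 16000               # sampling rate in Hz (assumed)
FORMAT = pyaudio.paInt16   # 16-bit PCM
CHANNELS = 1               # mono

def main():
    rospy.init_node("AudioCapturer")
    pub = rospy.Publisher("rawAudioChunk", AudioData, queue_size=10)

    pa = pyaudio.PyAudio()
    stream = pa.open(format=FORMAT, channels=CHANNELS, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)

    # Read fixed-size chunks from the microphone and publish them as raw bytes.
    while not rospy.is_shutdown():
        data = stream.read(CHUNK, exception_on_overflow=False)
        pub.publish(AudioData(data=data))

    stream.stop_stream()
    stream.close()
    pa.terminate()

if __name__ == "__main__":
    main()
```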
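Similarly, the 16 kHz resampling step of the hear node's online path could look like the sketch below. The input rate, message type, and use of audioop are assumptions; the real node may resample differently.

```python
#!/usr/bin/env python
# Hypothetical sketch of the resampling step inside the hear node;
# assumes 16-bit mono PCM captured at 48 kHz (both placeholders).
import audioop
import rospy
from audio_common_msgs.msg import AudioData  # assumption: raw-bytes message

IN_RATE = 48000   # assumed capture rate
OUT_RATE = 16000  # rate required by the online engine

class Resampler(object):
    def __init__(self):
        self.state = None  # ratecv keeps filter state between chunks
        self.pub = rospy.Publisher("UsefulAudio16kHZ", AudioData, queue_size=10)
        rospy.Subscriber("UsefulAudio", AudioData, self.callback)

    def callback(self, msg):
        # width=2 (int16), 1 channel; ratecv returns (converted_bytes, new_state)
        converted, self.state = audioop.ratecv(
            bytes(msg.data), 2, 1, IN_RATE, OUT_RATE, self.state)
        self.pub.publish(AudioData(data=converted))

def main():
    rospy.init_node("hear_resampler")
    Resampler()
    rospy.spin()

if __name__ == "__main__":
    main()
```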
There is a transition from the ROS package audio_commons.
- Retrain LM: To reduce and adapt the LM to our case, KenLM is used. With KenLM's lmplz, filter, and build_binary, a fine-tuning is done to generate a new, adapted LM with specific phrases of the competition (a sketch of this pipeline is given below).
- Others: An internal dataset has been created using a website to fine-tune the speech model.
There is work to update DeepSpeech to the latest releases and to use a newer PaddlePaddle for it. Also under review is the use of TFLite to build DeepSpeech2 and make inference faster. Other pending items are speaker localization and hot-word detection.
Check this wiki page.
Coming soon.