Speech
diegocardozo97 edited this page Jun 1, 2020 · 18 revisions
At present the Speech stack has two parts: speech to text and text to speech.
The speech-to-text part consists of four components, each of which is a ROS node that communicates over topics.
- devices/AudioCapturer [python]: A node that captures audio using PyAudio and publishes it to the topic rawAudioChunk.
- devices/InputAudio [c++]: A node that takes the chunks of audio and, using RNNoise, checks for a voice, removes the noise, and publishes it to the topic UsefulAudio.
- action_selectors/hear [python]: A node that checks whether there is an internet connection in order to decide whether to call the offline or the online engine; this choice can be overridden.
  - The offline engine lives in this node: it calls DeepSpeech2 with the content of the topic UsefulAudio, converts it to text, and publishes the result to the topic RawInput.
  - For the online engine, this node resamples the audio from UsefulAudio to 16 kHz and publishes it to a new topic called UsefulAudio16kHZ.
- action_selectors/azureSpeechText [c++]: A node that takes the audio published in the topic UsefulAudio16kHZ, sends it to the Azure SpeechToText API, receives the text, and publishes it to the topic RawInput.
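The engine selection and the 16 kHz resampling described for the hear node can be sketched in plain Python. This is a minimal illustration, not the node's actual code: the function names, the probe host and port, and the use of linear interpolation are all assumptions, and the ROS publishing and subscribing is left out.

```python
import socket
from typing import Callable, List, Optional


def has_internet(host: str = "8.8.8.8", port: int = 53, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (assumed probe target)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_engine(force: Optional[str] = None,
                  probe: Callable[[], bool] = has_internet) -> str:
    """Pick 'online' or 'offline'; a non-None `force` overrides the probe."""
    if force is not None:
        if force not in ("online", "offline"):
            raise ValueError("force must be 'online', 'offline', or None")
        return force
    return "online" if probe() else "offline"


def resample_linear(samples: List[float], src_rate: int,
                    dst_rate: int = 16000) -> List[float]:
    """Resample by linear interpolation (hypothetical helper, no anti-aliasing filter)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # input samples advanced per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out
```

Passing the probe as a parameter keeps the decision testable without a network; a production resampler would normally low-pass filter before downsampling.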
There is a transition from the ROS package audio_commons.
- Retrain LM: To reduce the LM and adapt it to our case, kenlm is used. With kenlm's `lmplz`, `filter` and `build_binary` tools, a "fine-tuning" is done to generate a new, adapted LM with specific phrases of the competition.
- Others: An internal dataset has been created using a website to fine-tune the speech model.
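The three kenlm tools named above could be chained roughly as follows. This is a sketch under assumptions: the file names, the n-gram order, the `single` filter mode, and the helper names `kenlm_steps` and `run_steps` are hypothetical, and the exact flags should be checked against the installed kenlm build. `filter` is assumed to read the vocabulary from stdin when the model is given as a file.

```python
import subprocess
from typing import Dict, List


def kenlm_steps(corpus: str, vocab: str, prefix: str,
                order: int = 3) -> List[Dict[str, object]]:
    """Build the command lines for lmplz -> filter -> build_binary."""
    arpa = prefix + ".arpa"
    filtered = prefix + ".filtered.arpa"
    binary = prefix + ".binary"
    return [
        # Estimate an n-gram LM from the competition phrases.
        {"argv": ["lmplz", "-o", str(order), "--text", corpus, "--arpa", arpa],
         "stdin": None},
        # Keep only n-grams whose words appear in the vocabulary (fed on stdin).
        {"argv": ["filter", "single", "model:" + arpa, filtered],
         "stdin": vocab},
        # Convert the filtered ARPA file to kenlm's binary format.
        {"argv": ["build_binary", filtered, binary],
         "stdin": None},
    ]


def run_steps(steps: List[Dict[str, object]]) -> None:
    """Run each step in order; requires the kenlm binaries on PATH."""
    for step in steps:
        stdin_path = step["stdin"]
        if stdin_path is None:
            subprocess.run(step["argv"], check=True)
        else:
            with open(stdin_path, "rb") as handle:
                subprocess.run(step["argv"], stdin=handle, check=True)
```

For example, `run_steps(kenlm_steps("phrases.txt", "vocab.txt", "lm"))` would produce `lm.binary`, assuming both input files exist. On very small corpora `lmplz` may need `--discount_fallback`.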
There is ongoing work to update the DeepSpeech code to use the latest releases and the latest PaddlePaddle. The use of TFLite to create DP2 is also under review as a way to make inference faster. Other pending items are speaker localization and hot-word detection.
Check this wiki page.
Coming soon.