Speech
At the moment, the Speech stack has two parts: speech-to-text and text-to-speech.
The speech-to-text part consists of 3 components, each a ROS node communicating over topics*.
- devices/AudioCapturer[python]: A node that captures audio using PyAudio and publishes it to the topic "rawAudioChunk".
- devices/InputAudio[c++]: A node that takes the chunks of audio and, using RNNoise, checks for a voice, removes the noise, and publishes it to the topic "UsefulAudio".
- action_selectors/hear[python]: A node that takes the voice audio, converts it to text (calling DeepSpeech2), and publishes it to the topic "RawInput".
*These are not the actual names of components.
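As a rough illustration of the first stage, here is a minimal sketch of how audio can be read in fixed-size chunks, the way AudioCapturer feeds "rawAudioChunk". This is plain Python without ROS or PyAudio, and the names and chunk size are illustrative assumptions, not the node's actual code:

```python
import io

CHUNK_SIZE = 1024  # bytes per chunk; the real node's size may differ


def audio_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield fixed-size chunks from a binary audio stream.

    In the real AudioCapturer node, each chunk would be wrapped in a
    ROS message and published on the "rawAudioChunk" topic instead.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Simulate a raw audio stream holding 2.5 chunks worth of silence.
fake_stream = io.BytesIO(b"\x00" * 2560)
chunks = list(audio_chunks(fake_stream))
print([len(c) for c in chunks])  # → [1024, 1024, 512]
```

The downstream nodes (InputAudio, hear) would then consume these chunks from the topic rather than from a local generator.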
There is an ongoing transition from the ROS package audio_common.
Work is in progress on the complete implementation of the stack in ROS. There is also work on training the models used for speech recognition: fine-tuning them on a set of phrases specific to the competition. For this, an internal dataset has been created using a website, and tools like kenlm are being used.
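As a hedged sketch of the kind of text preprocessing such a dataset needs before building a kenlm language model (the function name and normalization rules here are illustrative, not the project's actual tooling):

```python
import re


def normalize_line(line):
    """Lowercase, strip punctuation, and collapse whitespace so every
    transcript line matches the tokenization the language model will see."""
    line = line.lower()
    line = re.sub(r"[^a-z0-9' ]+", " ", line)
    return " ".join(line.split())


# Hypothetical competition-style phrases for the domain-specific corpus.
phrases = [
    "Bring me the coke, please!",
    "GO to the   kitchen.",
]
corpus = [normalize_line(p) for p in phrases]
print(corpus)  # → ['bring me the coke please', 'go to the kitchen']
```

The normalized lines would then be written to a text file and passed to kenlm's training tools to produce an ARPA model biased toward the competition vocabulary.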
General
- audio_common (ROS)
RNNoise The version that is being used is here. It is automatically downloaded, built, and added to the code by cmake.
- Linux
- autoconfig
- ar
- make
DeepSpeech2 The one used is the implementation in PaddlePaddle by Baidu, a specific forked version (here) that was copied into the repository. Also, because of some compatibility problems, a specific version of PaddlePaddle is required, and no later version of TensorFlow can be installed alongside it. An example of installing the dependencies is here.
- python2
- paddlepaddle==1.2.1
- pkg-config, libflac-dev, libogg-dev, libvorbis-dev, libboost-dev, swig
- scipy, resampy, SoundFile, python_speech_features
- portaudio19-dev
- pyaudio, pynput
- tensorflow==1.12 (not required, but note that all of this breaks with later TF versions)
Coming soon.