Speech
At the moment, the Speech stack has two parts: speech-to-text and text-to-speech.
The speech-to-text part consists of 3 components, each a ROS node communicating over topics*.
- devices/AudioCapturer[python]: A node that captures audio using PyAudio and publishes it to the topic "rawAudioChunk".
- devices/InputAudio[c++]: A node that takes the chunks of audio and, using RNNoise, checks for a voice, removes the noise, and publishes it to the topic "UsefulAudio".
- action_selectors/hear[python]: A node that takes the voice audio, converts it to text (calling DeepSpeech2), and publishes it to the topic "RawInput".
*These are not the actual names of components.
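As a rough illustration of the first stage, here is a minimal sketch of how audio can be read in fixed-size chunks, the way AudioCapturer feeds "rawAudioChunk". This is plain Python without ROS or PyAudio, and the names and chunk size are illustrative assumptions, not the node's actual code:

```python
import io

CHUNK_SIZE = 1024  # bytes per chunk; the real node's size may differ


def audio_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield fixed-size chunks from a binary audio stream.

    In the real AudioCapturer node, each chunk would be wrapped in a
    ROS message and published on the "rawAudioChunk" topic instead.
    """
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# Simulate a raw audio stream holding 2.5 chunks worth of silence.
fake_stream = io.BytesIO(b"\x00" * 2560)
chunks = list(audio_chunks(fake_stream))
print([len(c) for c in chunks])  # → [1024, 1024, 512]
```

The downstream nodes (InputAudio, hear) would then consume these chunks from the topic rather than from a local generator.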
There is an ongoing transition from the ROS package audio_common.
Work is in progress on the complete implementation of the stack in ROS. There is also work on training the models used for speech recognition: fine-tuning them on a set of phrases specific to the competition. For this, an internal dataset has been created using a website, and tools like kenlm are being used.
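As a hedged sketch of the kind of text preprocessing such a dataset needs before building a kenlm language model (the function name and normalization rules here are illustrative, not the project's actual tooling):

```python
import re


def normalize_line(line):
    """Lowercase, strip punctuation, and collapse whitespace so every
    transcript line matches the tokenization the language model will see."""
    line = line.lower()
    line = re.sub(r"[^a-z0-9' ]+", " ", line)
    return " ".join(line.split())


# Hypothetical competition-style phrases for the domain-specific corpus.
phrases = [
    "Bring me the coke, please!",
    "GO to the   kitchen.",
]
corpus = [normalize_line(p) for p in phrases]
print(corpus)  # → ['bring me the coke please', 'go to the kitchen']
```

The normalized lines would then be written to a text file and passed to kenlm's training tools to produce an ARPA model biased toward the competition vocabulary.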
General
- audio_common (ROS)
RNNoise The version that is being used is here. It is automatically downloaded, built, and added to the code by cmake.
- Linux
- autoconfig
- ar
- make
DeepSpeech2 The one used is the implementation in PaddlePaddle by Baidu, a specific forked version (here) that was copied into the repository. Also, because of some compatibility problems, a specific version of PaddlePaddle is required, and no later version of TensorFlow can be installed alongside it. An example of installing the dependencies is here.
- python2
- paddlepaddle==1.2.1
- pkg-config, libflac-dev, libogg-dev, libvorbis-dev, libboost-dev, swig
- scipy, resampy, SoundFile, python_speech_features
- portaudio19-dev
- pyaudio, pynput
- tensorflow==1.12 (not required, but note that all of this breaks with later TF versions)
Coming soon.