The following are the steps to generate the training and testing data. Several parameters can be changed to suit different purposes.
We will release a speech-separation benchmark on the LRS3 dataset as soon as possible.
This script repository aims to give the multi-modal speech separation task a unified standard for dataset generation, so that follow-up work on multi-modal speech separation can be compared fairly.
We hope that the LRS3 dataset will gain a unified generation standard for speech separation tasks, just as the WSJ0 dataset has for audio-only separation.
- ffmpeg 4.2.1
- sox 14.4.2
- numpy 1.17.2
- opencv-python 4.1.2.30
- librosa 0.7.0
- dlib 19.19.0
- face_recognition 1.3.0
- In this method, we use the Lip Reading Sentences 3 (LRS3) dataset as our training, validation, and test sets.
Afouras, T., Chung, J. S., Senior, A., et al. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
- We use only the train_val and test folders of the LRS3 dataset. These two folders need to be merged before running our scripts; a minimal merge sketch follows.
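The merge itself is left to the user. Below is a minimal sketch of one way to do it; the paths `LRS3/train_val`, `LRS3/test`, and `LRS3/merged` are placeholders for wherever you unpacked the dataset.

```python
# A minimal sketch: copy both LRS3 folders into one merged tree.
import shutil
from pathlib import Path

dst = Path('LRS3/merged')  # placeholder output path
for src in [Path('LRS3/train_val'), Path('LRS3/test')]:
    for f in src.rglob('*'):
        if f.is_file():
            target = dst / f.relative_to(src)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, target)
```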
- Open the ./video_process/ directory:
```
cd video_process
```
- Then run the video_process.py script to extract the video frames, crop the lip region from each frame, and finally resize the crops to 120 × 120.
```
python video_process.py
```
Change the paths in the script to match your data layout:
```python
video_path = 'valid_mouth.txt'  # list of the files in which lips were detected
inpath = '../frames'            # where the extracted video frames are saved
outpath = '../mouth'            # where the cropped mouth images are saved
change_root = '../frames'       # root of the frames to resize

# You can comment out the two lines below first if you do not need the resize step.
print('--------------Resize the frames-------------')
resize_img(change_root, (120, 120))
```
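For orientation, here is a minimal sketch of what the lip-cropping step does with face_recognition (dlib). It is not the repository script: `frame.jpg`, `mouth.png`, and the 10-pixel margin are illustrative choices.

```python
import cv2
import face_recognition

image = face_recognition.load_image_file('frame.jpg')  # RGB ndarray
landmarks = face_recognition.face_landmarks(image)
if landmarks:  # frames without a detected face are skipped
    points = landmarks[0]['top_lip'] + landmarks[0]['bottom_lip']
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    m = 10  # hypothetical padding around the lip bounding box
    mouth = image[max(min(ys) - m, 0):max(ys) + m,
                  max(min(xs) - m, 0):max(xs) + m]
    mouth = cv2.resize(mouth, (120, 120))
    cv2.imwrite('mouth.png', cv2.cvtColor(mouth, cv2.COLOR_RGB2BGR))
```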
- To process the image data faster, we use the following command to store the images in the NumPy ".npz" format.
```
python video_to_np.py
```
The script reads the LRS3 dataset split lists (one sample per line):
```python
train = open('../train.txt', 'r').readlines()
test = open('../test.txt', 'r').readlines()
val = open('../val.txt', 'r').readlines()
```
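As a rough illustration of the packing step (the clip directory, output path, and the `data` key are assumptions, not necessarily what video_to_np.py uses):

```python
import os
import cv2
import numpy as np

clip_dir = '../mouth/xxxxx/00001'  # hypothetical clip folder of 120x120 crops
frames = sorted(os.listdir(clip_dir))
stack = np.stack([cv2.imread(os.path.join(clip_dir, f), cv2.IMREAD_GRAYSCALE)
                  for f in frames])  # shape: (num_frames, 120, 120)
np.savez('../mouth_np/00001.npz', data=stack)
```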
- Run audio_cut.py to cut the audio of each video with the sox tool, producing 2-second speech signals; the core of the cut is sketched below.
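The cut itself boils down to a `sox trim` call. A minimal sketch, assuming the audio track has already been extracted to WAV (e.g. with ffmpeg); the file names are illustrative:

```python
import subprocess

# Keep the first 2 seconds of the utterance: sox <in> <out> trim <start> <duration>.
subprocess.run(['sox', 'utterance.wav', 'utterance_2s.wav', 'trim', '0', '2'],
               check=True)
```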
- Mix the signals. We mix the voices of two speakers at ratios from -5 dB to 5 dB. This part of the code follows the data-mixing method of deep clustering.
```
matlab -nodisplay -r create_wav_2speakers
```
You need to change this part of create_wav_2speakers.m:
```matlab
data_type = {'tr','cv','tt'};
wsj0root = '';       % YOUR_PATH/raw_audio
output_dir16k = '';  % 16 kHz output path
output_dir8k = '';   % 8 kHz output path
```
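For readers without MATLAB, the mixing idea can be sketched in Python as follows. This only illustrates the -5 dB to 5 dB scaling; the released recipe is the MATLAB script above, and the file names are placeholders.

```python
import numpy as np
import librosa

# Load two 2 s segments produced by audio_cut.py (file names are placeholders).
s1, sr = librosa.load('speaker1_2s.wav', sr=16000)
s2, _ = librosa.load('speaker2_2s.wav', sr=16000)
n = min(len(s1), len(s2))
s1, s2 = s1[:n], s2[:n]

snr_db = np.random.uniform(-5.0, 5.0)
# Scale s2 so the power ratio between s1 and the scaled s2 equals snr_db.
scale = np.sqrt(np.sum(s1 ** 2) / (np.sum(s2 ** 2) * 10 ** (snr_db / 10)))
mix = s1 + scale * s2
mix = mix / max(np.abs(mix).max(), 1e-8)  # normalize to avoid clipping
# librosa 0.7 (pinned above) still provides librosa.output.write_wav.
librosa.output.write_wav('mix.wav', mix.astype(np.float32), sr)
```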