Problem with corrupting the clean corpus with noise types: car traffic, crowd, machine #4

Open
BilalDendani opened this issue Jun 23, 2019 · 7 comments

Comments

@BilalDendani

BilalDendani commented Jun 23, 2019

Hello,
I am trying to corrupt the clean TIMIT dataset with noise using the maracas library and the following code:

from maracas.dataset import Dataset
import numpy as np
np.random.seed(42)
d = Dataset()

d.add_speech_files('/home/bilal/krProjects/timit', recursive=True)
d.add_noise_files('/home/bilal/krProjects/noiseTypes/carTrafic.wav', name='carTrafic')
d.add_noise_files('/home/bilal/krProjects/noiseTypes/crowd.wav', name='crowd')
d.add_noise_files('/home/bilal/krProjects/noiseTypes/machine.wav', name='machine')
d.generate_dataset([-6, -3, 0, 3, 6], '/home/bilal/krProjects/noise_dataset', files_per_condition=5)

I got the following error when executing the code:
bilal@myhost$ python corruptCleanDC.py
/home/bilal/krProjects/DAE/DAE_venv/lib64/python3.6/site-packages/numba/decorators.py:29: NumbaDeprecationWarning: autojit is deprecated, use jit instead, which provides the same functionality. For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-numba-autojit
warnings.warn(NumbaDeprecationWarning(msg))
Traceback (most recent call last):
File "corruptCleanDC.py", line 25, in <module>
d.generate_dataset([-6, -3, 0, 3, 6], '/home/bilal/krProjects/noise_dataset', files_per_condition=5)
File "/home/bilal/krProjects/DAE/DAE_venv/lib64/python3.6/site-packages/maracas/dataset.py", line 130, in generate_dataset
files_per_condition=files_per_condition)
File "/home/bilal/krProjects/DAE/DAE_venv/lib64/python3.6/site-packages/maracas/dataset.py", line 100, in generate_condition
speech_files = np.random.choice(self.speech, files_per_condition, replace=False).tolist()
File "mtrand.pyx", line 1168, in mtrand.RandomState.choice
ValueError: Cannot take a larger sample than population when 'replace=False'

@jfsantos
Owner

It looks like your speech dataset is empty, which might mean there is a bug in recursive_glob. Can you share the output of ls /home/bilal/krProjects/timit and ls /home/bilal/krProjects/timit/**/*.WAV with me?

@BilalDendani
Author

BilalDendani commented Jun 23, 2019

@jfsantos thank you for your quick reply.
The contents of my clean timit folder are:

$ ls timit/
sa1.wav sa2.wav
I took just two clean wav files from the TIMIT corpus for testing.
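
For reference, the first traceback shows maracas calling np.random.choice(self.speech, files_per_condition, replace=False), so asking for 5 files when only 2 are loaded raises that ValueError. Below is a minimal sketch of the failure and one possible guard. The helper pick_files is hypothetical and is not part of maracas; it only illustrates clamping the request to the number of available files.

```python
import numpy as np


def pick_files(files, files_per_condition, seed=42):
    """Hypothetical helper (not maracas API): sample without replacement,
    never asking for more files than are available."""
    n = min(files_per_condition, len(files))  # clamp to the population size
    if n == 0:
        return []  # nothing loaded, nothing to sample
    rng = np.random.RandomState(seed)
    return rng.choice(files, n, replace=False).tolist()


speech = ["sa1.wav", "sa2.wav"]  # the two test files from this thread

# Reproduce the error: 5 samples requested from a population of 2.
try:
    np.random.choice(speech, 5, replace=False)
    message = ""
except ValueError as err:
    message = str(err)

# With the clamp, the request silently shrinks to the 2 available files.
picked = pick_files(speech, 5)
```

Whether to clamp silently or to fail with a clearer message is a design choice; with only two speech files, setting files_per_condition to 2 or less avoids the problem without any code change.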

@jfsantos
Owner

jfsantos commented Jun 23, 2019 via email

@BilalDendani
Author

I changed the parameter to files_per_condition=2, and now it shows the following error:
$ python corruptCleanDC.py
/home/bilal/krProjects/DAE/DAE_venv/lib64/python3.6/site-packages/numba/decorators.py:29: NumbaDeprecationWarning: autojit is deprecated, use jit instead, which provides the same functionality. For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-numba-autojit
warnings.warn(NumbaDeprecationWarning(msg))
Condition folder already exists!
-6dB: 0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
File "corruptCleanDC.py", line 25, in <module>
d.generate_dataset([-6, -3, 0, 3, 6], '/home/bilal/krProjects/noise_dataset', files_per_condition=2)
File "/home/bilal/krProjects/DAE/DAE_venv/lib64/python3.6/site-packages/maracas/dataset.py", line 130, in generate_dataset
files_per_condition=files_per_condition)
File "/home/bilal/krProjects/DAE/DAE_venv/lib64/python3.6/site-packages/maracas/dataset.py", line 105, in generate_condition
x, fs = wavread(f)
File "/home/bilal/krProjects/DAE/DAE_venv/lib64/python3.6/site-packages/maracas/utils.py", line 9, in wavread
fs, x = scipy.io.wavfile.read(filename)
File "/home/bilal/krProjects/DAE/DAE_venv/lib64/python3.6/site-packages/scipy/io/wavfile.py", line 236, in read
file_size, is_big_endian = _read_riff_chunk(fid)
File "/home/bilal/krProjects/DAE/DAE_venv/lib64/python3.6/site-packages/scipy/io/wavfile.py", line 168, in _read_riff_chunk
"understood.".format(repr(str1)))
ValueError: File format b'NIST'... not understood.
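
The b'NIST' in this message is the first four bytes of the file: the original TIMIT audio uses a NIST SPHERE header rather than the RIFF header that scipy.io.wavfile.read expects. A small sketch for checking which header a file has before handing it to maracas follows. The function name sniff_audio_container is an assumption made for illustration, and the sketch only inspects the header; it does not convert anything.

```python
import os
import tempfile


def sniff_audio_container(path):
    """Hypothetical helper: classify a file by its first four bytes."""
    with open(path, "rb") as fh:
        magic = fh.read(4)
    if magic == b"NIST":
        return "nist"  # NIST SPHERE header, as in the error above
    if magic in (b"RIFF", b"RIFX"):
        return "riff"  # the header scipy.io.wavfile.read accepts
    return "unknown"


# Demo with two tiny stand-in files (header bytes only, no real audio).
with tempfile.TemporaryDirectory() as tmp:
    nist_path = os.path.join(tmp, "sa1.wav")
    riff_path = os.path.join(tmp, "sa2.wav")
    with open(nist_path, "wb") as fh:
        fh.write(b"NIST_1A\n   1024\n")
    with open(riff_path, "wb") as fh:
        fh.write(b"RIFF\x00\x00\x00\x00WAVE")
    nist_kind = sniff_audio_container(nist_path)
    riff_kind = sniff_audio_container(riff_path)
```

Files reported as "nist" need to be converted to RIFF/WAVE with an external tool before scipy can read them; renaming .WAV to .wav alone does not change the header.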

@jfsantos
Owner

jfsantos commented Jun 23, 2019 via email

@BilalDendani
Author

BilalDendani commented Jun 23, 2019

Thank you so much @jfsantos.
I will try converting the original TIMIT .WAV files to standard .wav files and then run the corruption code again.

@BilalDendani
Author

BilalDendani commented Jun 29, 2019

I converted the TIMIT '.WAV' files from NIST format to standard '.wav' files, and that part now works. I have another issue: I want to generate noisy versions of speech files that share the same file name (many speakers pronounce the same sentence, so the file name is identical).
When I generated the dataset, I got only one file (the last one).
The following is an example:
......................
.......................
d.add_speech_files('/run/media/bilal/Data/datasets/DataSet/TIMIT/TRAIN/DR6/MKES0/SA1.wav', recursive=True)
d.add_speech_files('/run/media/bilal/Data/datasets/DataSet/TIMIT/TRAIN/DR7/MTMN0/SA1.wav', recursive=True)
d.generate_dataset([-15, -10, -5, 0, 5,10,15], '/run/media/bilal/fb8b3d1d-9bbf-42d6-b741-ad7e4940ac3e/noise_dataset', files_per_condition=600)
Not all of the files with the same name were generated.
I want the generated dataset to preserve the directory path of each speech file, mirrored under the output directory "/run/media/bilal/fb8b3d1d-9bbf-42d6-b741-ad7e4940ac3e/noise_dataset". For example:
"/run/media/bilal/fb8b3d1d-9bbf-42d6-b741-ad7e4940ac3e/noise_dataset/NoisyTIMIT/TRAIN/DR6/MKES0/SA1.wav"
"/run/media/bilal/fb8b3d1d-9bbf-42d6-b741-ad7e4940ac3e/noise_dataset/NoisyTIMIT/TRAIN/DR7/MTMN0/SA1.wav"
How can I change the method generate_dataset(self, snrs, output_dir, files_per_condition=None) so that all files with the same name are saved?
Thank you.
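
One way to get the layout described above is to build each output path from the speech file's path relative to the corpus root, instead of from the file name alone. The sketch below shows only that path computation. The function mirrored_output_path and the choice of corpus root are assumptions for illustration; how maracas names its output files internally is not shown in this thread, so this is not a patch for generate_dataset.

```python
from pathlib import PurePosixPath


def mirrored_output_path(speech_path, corpus_root, output_root):
    """Hypothetical helper: keep the sub-directories below corpus_root so
    that files with the same name from different speakers do not collide."""
    relative = PurePosixPath(speech_path).relative_to(corpus_root)
    return PurePosixPath(output_root) / relative


corpus_root = "/run/media/bilal/Data/datasets/DataSet/TIMIT"
output_root = (
    "/run/media/bilal/fb8b3d1d-9bbf-42d6-b741-ad7e4940ac3e"
    "/noise_dataset/NoisyTIMIT"
)

out_a = mirrored_output_path(
    corpus_root + "/TRAIN/DR6/MKES0/SA1.wav", corpus_root, output_root
)
out_b = mirrored_output_path(
    corpus_root + "/TRAIN/DR7/MTMN0/SA1.wav", corpus_root, output_root
)
```

Before writing a file, the parent directory of the computed path would have to be created, for example with os.makedirs(os.path.dirname(path), exist_ok=True).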
