Add list of missing audio files in output log from preprocess #570

marctessier · 2024-10-29T18:07:51Z

Description & Motivation

It would be nice to have a list of all the audio files that were given in the filelist but are missing, presented in an easy-to-read way somewhere in the output log / reports.

Currently, to get this information, we have to grep the error logs. I think the error log can be difficult for some people to read because of the way the information is displayed. It is fine when viewed live with "tail -f", but when viewed in vi or less it jumps around because of the way line breaks are done in the parts with long lines.

`[U20-GPSC5]:$ grep missing PREP_head.e3160702

2024-10-29 10:26:16.933 | WARNING | everyvoice.preprocessor.preprocessor:process_one_audio:469 - File '{'basename': 'LJ00abc-0001', 'language': 'eng', 'speaker': 'speaker_0', 'characters': 'printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the exhibition', 'phones': 'pɹɪntɪŋ, ɪn ðʌ oʊnli sɛns wɪð wɪtʃ wi ɑɹ æt pɹɛzʌnt kʌnsɜ˞nd, dɪfɜ˞z fɹʌm moʊst ɪf nɑt fɹʌm ɔl ðʌ ɑɹts ʌnd kɹæfts ɹɛpɹɪzɛntɪd ɪn ðʌ ɛksʌbɪʃʌn'}' is missing and will not be processed.`
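As a stopgap until such a list exists, the basenames could be pulled out of the warning lines with a small script. This is only a sketch: the regular expression is inferred from the single warning line quoted above, it assumes the part between `File '` and `' is missing` is always a Python dict repr, and `missing_basenames` is a hypothetical helper, not part of EveryVoice.

```python
import ast
import re

# Pattern assumed from the one warning line quoted in this issue.
MISSING_RE = re.compile(r"File '(\{.*\})' is missing and will not be processed")


def missing_basenames(log_lines):
    """Return the basename of every filelist entry reported as missing."""
    basenames = []
    for line in log_lines:
        match = MISSING_RE.search(line)
        if match is None:
            continue
        try:
            # The logged entry looks like a dict repr, so parse it safely.
            entry = ast.literal_eval(match.group(1))
        except (ValueError, SyntaxError):
            continue  # skip lines whose dict text cannot be parsed
        if isinstance(entry, dict) and "basename" in entry:
            basenames.append(entry["basename"])
    return basenames
```

It could be fed the lines of the error log, for example by opening the file with `encoding="utf-8"` and passing the file object to `missing_basenames`. Because it uses `search` rather than matching from the start of the line, a warning that ends up glued to progress-bar output on the same line should still be found.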

Below is an example of what I mean with the line breaks (^M). This is not an issue in itself; it is more of a "feature" for live viewing / pretty display.

it/s]^MProcessing text on 1 CPU: 9007it [00:01, 6875.77it/s]^MProcessing text on 1 CPU: 9695it [00:01, 6875.37it/s]^MProcessing text on 1 CPU: 10385it [00:01, 6882.11it/s]^MProcessing text on 1 CPU: 11074it [00:01, 6744.46it/s]^MProcessing text on 1 CPU: 11750it [00:01, 6708.99it/s]^MProcessing text on 1 CPU: 12422it [00:01, 6703.26it/s]^MProcessing text on 1 CPU: 13093it [00:01, 6654.58it/s]^MProcessing text on 1 CPU: 13099it [00:01, 6768.58it/s]
2024-10-29 10:26:37.259 | INFO     | everyvoice.preprocessor.preprocessor:preprocess:1185 - Processing spec on 40 CPUs...
^MProcessing spec on 40 CPUs:   0%|          | 0/13099 [00:00<?, ?it/s]^MProcessing spec on 40 CPUs:   0%|          | 4/13099 [00:00<28:19,  7.70it/s]^MProcessing spec on 40 CPUs:   0%|          | 32/13099 [00:00<03:18, 65.68it/s]^MProcessing spec on 40 CPUs:   1%|          | 80/13099 [00:00<01:20, 160.79it/s]^MProcessing spec on 40 CPUs:   1%|▏         | 180/13099 [00:02<03:22, 63.80it/s]^MProcessing spec on 40 CPUs:   4%|▎         | 480/13099 [00:02<00:54, 231.01it/s]^MProcessing spec on 40 CPUs:   5%|▌         | 680/13099 [00:02<00:34, 361.82it/s]^MProcessing spec on 40 CPUs:   7%|▋         | 880/13099 [00:03<00:23, 513.52it/s]^MProcessing spec on 40 CPUs:   9%|▉         | 1180/13099 [00:03<00:14, 805.43it/s]^MProcessing spec on 40 CPUs:  11%|█▏        | 1480/13099 [00:03<00:10, 1119.36it/s]^MProcessing spec on 40 CPUs:  14%|█▍        | 1880/13099 [00:03<00:07, 1522.58it/s]^MProcessing spec on 40 CPUs:  17%|█▋        | 2180/13099 [00:03<00:06, 1791.92it/s]^MProcessing spec on 40 CPUs:  20%|█▉        | 2580/13099 [00:03<00:05, 2095.91it/s]^MProcessing spec on 40 CPUs:  22%|██▏       | 2880/13099 [00:03<00:04, 2223.93it/s]^MProcessing spec on 40 CPUs:  24%|██▍       | 3180/13099 [00:03<00:04, 2360.82it/s]^MProcessing spec on 40 CPUs:  27%|██▋       | 3480/13099 [00:04<00:03, 2457.27it/s]^MProcessing spec on 40 CPUs:  30%|██▉       | 3880/13099 [00:04<00:03, 2633.64it/s]^MProcessing spec on 40 CPUs:  33%|███▎      | 4280/13099 [00:04<00:03, 2552.28it/s]^MProcessing spec on 40 CPUs:  35%|███▍      | 4580/13099 [00:04<00:03, 2476.78it/s]^MProcessing spec on 40 CPUs:  37%|███▋      | 4880/13099 [00:04<00:04, 1757.52it/s]^MProcessing spec on 40 CPUs:  40%|███▉      | 5180/13099 [00:04<00:04, 1607.97it/s]^MProcessing spec on 40 CPUs:  41%|████      | 5380/13099 [00:05<00:04, 1632.89it/s]^MProcessing spec on 40 CPUs:

Pitch

List the missing audio files in the output log report, not just the total. This would give a person an easy way to investigate, instead of having to fiddle with the error logs to capture the information.

Alternatives

An alternative could be to generate a separate "filelist" file at the end with all the lines that had missing audio.

This is scope creep, but we could also include the audio files that are too long or too short in that file.

This would inform the user which "data" from the given filelist is not being included in the training / validation data :-) (We might need one report per filelist to make it even easier to decipher.)
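To make the alternative concrete, here is a rough sketch of what writing such a report file could look like. Everything in it is an assumption for illustration: `find_missing` and `write_excluded` are hypothetical helpers, the entries are assumed to be dicts with a `basename` key (as in the warning above), the audio is assumed to live at `<wav_dir>/<basename>.wav`, and the pipe delimiter is just a guess at a convenient format. The `reason` column is there so that too-long or too-short files could go into the same report later.

```python
import csv
from pathlib import Path


def find_missing(entries, wav_dir, extension=".wav"):
    """Return the filelist entries whose audio file is not found in wav_dir."""
    wav_dir = Path(wav_dir)
    return [
        entry
        for entry in entries
        if not (wav_dir / (entry["basename"] + extension)).is_file()
    ]


def write_excluded(entries, reason, out_path, delimiter="|"):
    """Write excluded entries to a delimited file with an extra 'reason' column."""
    # Collect column names in first-seen order across all entries.
    fieldnames = []
    for entry in entries:
        for key in entry:
            if key not in fieldnames and key != "reason":
                fieldnames.append(key)
    fieldnames.append("reason")
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter=delimiter)
        writer.writeheader()
        for entry in entries:
            writer.writerow({**entry, "reason": reason})
    return len(entries)
```

Calling `write_excluded(find_missing(entries, wav_dir), "missing audio", "missing.psv")` once per input filelist would give the one-report-per-filelist layout mentioned above.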

Additional context

No response

marctessier added the enhancement New feature or request label Oct 29, 2024
