Add list of missing audio files in output log from preprocess #570

Open
marctessier opened this issue Oct 29, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@marctessier
Collaborator

Description & Motivation

It would be nice to have a list of all the audio files that are named in the filelist but missing, presented in an easy-to-read form somewhere in the output log or reports.

Currently, to get this information, we have to grep the error logs. The error log can be difficult for some people to read because of the way the information is displayed: it looks fine when viewed live with "tail -f", but in vi or less it jumps around because of the way line breaks are done on some of its long lines.

`[U20-GPSC5]:$ grep missing PREP_head.e3160702

2024-10-29 10:26:16.933 | WARNING | everyvoice.preprocessor.preprocessor:process_one_audio:469 - File '{'basename': 'LJ00abc-0001', 'language': 'eng', 'speaker': 'speaker_0', 'characters': 'printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the exhibition', 'phones': 'pɹɪntɪŋ, ɪn ðʌ oʊnli sɛns wɪð wɪtʃ wi ɑɹ æt pɹɛzʌnt kʌnsɜ˞nd, dɪfɜ˞z fɹʌm moʊst ɪf nɑt fɹʌm ɔl ðʌ ɑɹts ʌnd kɹæfts ɹɛpɹɪzɛntɪd ɪn ðʌ ɛksʌbɪʃʌn'}' is missing and will not be processed.`
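Until such a list exists, the grep above can be turned into a short script that pulls just the basenames out of the warning lines. This is a minimal sketch that assumes the warning text keeps the exact shape shown in the log line above; `missing_basenames` and `MISSING_RE` are hypothetical names, not part of EveryVoice.

```python
import re

# Assumption: the warning keeps the shape seen in the log excerpt, i.e.
# "File '{'basename': '<name>', ...}' is missing and will not be processed."
MISSING_RE = re.compile(
    r"File '\{'basename': '([^']+)'.*is missing and will not be processed"
)


def missing_basenames(log_text: str) -> list[str]:
    """Return the basenames reported as missing, in order, without duplicates."""
    seen: dict[str, None] = {}
    # Progress bars are redrawn with carriage returns, so split on those too.
    for line in re.split(r"[\r\n]+", log_text):
        match = MISSING_RE.search(line)
        if match:
            seen.setdefault(match.group(1), None)
    return list(seen)
```

Calling `missing_basenames(open("PREP_head.e3160702", encoding="utf-8").read())` would then give one basename per missing file instead of the full dictionary dump.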

The example below shows what I mean by the line breaks (^M). This is not a bug; it is more of a "feature" for live viewing and pretty display.

it/s]^MProcessing text on 1 CPU: 9007it [00:01, 6875.77it/s]^MProcessing text on 1 CPU: 9695it [00:01, 6875.37it/s]^MProcessing text on 1 CPU: 10385it [00:01, 6882.11it/s]^MProcessing text on 1 CPU: 11074it [00:01, 6744.46it/s]^MProcessing text on 1 CPU: 11750it [00:01, 6708.99it/s]^MProcessing text on 1 CPU: 12422it [00:01, 6703.26it/s]^MProcessing text on 1 CPU: 13093it [00:01, 6654.58it/s]^MProcessing text on 1 CPU: 13099it [00:01, 6768.58it/s]
2024-10-29 10:26:37.259 | INFO     | everyvoice.preprocessor.preprocessor:preprocess:1185 - Processing spec on 40 CPUs...
^MProcessing spec on 40 CPUs:   0%|          | 0/13099 [00:00<?, ?it/s]^MProcessing spec on 40 CPUs:   0%|          | 4/13099 [00:00<28:19,  7.70it/s]^MProcessing spec on 40 CPUs:   0%|          | 32/13099 [00:00<03:18, 65.68it/s]^MProcessing spec on 40 CPUs:   1%|          | 80/13099 [00:00<01:20, 160.79it/s]^MProcessing spec on 40 CPUs:   1%|▏         | 180/13099 [00:02<03:22, 63.80it/s]^MProcessing spec on 40 CPUs:   4%|▎         | 480/13099 [00:02<00:54, 231.01it/s]^MProcessing spec on 40 CPUs:   5%|▌         | 680/13099 [00:02<00:34, 361.82it/s]^MProcessing spec on 40 CPUs:   7%|▋         | 880/13099 [00:03<00:23, 513.52it/s]^MProcessing spec on 40 CPUs:   9%|▉         | 1180/13099 [00:03<00:14, 805.43it/s]^MProcessing spec on 40 CPUs:  11%|█▏        | 1480/13099 [00:03<00:10, 1119.36it/s]^MProcessing spec on 40 CPUs:  14%|█▍        | 1880/13099 [00:03<00:07, 1522.58it/s]^MProcessing spec on 40 CPUs:  17%|█▋        | 2180/13099 [00:03<00:06, 1791.92it/s]^MProcessing spec on 40 CPUs:  20%|█▉        | 2580/13099 [00:03<00:05, 2095.91it/s]^MProcessing spec on 40 CPUs:  22%|██▏       | 2880/13099 [00:03<00:04, 2223.93it/s]^MProcessing spec on 40 CPUs:  24%|██▍       | 3180/13099 [00:03<00:04, 2360.82it/s]^MProcessing spec on 40 CPUs:  27%|██▋       | 3480/13099 [00:04<00:03, 2457.27it/s]^MProcessing spec on 40 CPUs:  30%|██▉       | 3880/13099 [00:04<00:03, 2633.64it/s]^MProcessing spec on 40 CPUs:  33%|███▎      | 4280/13099 [00:04<00:03, 2552.28it/s]^MProcessing spec on 40 CPUs:  35%|███▍      | 4580/13099 [00:04<00:03, 2476.78it/s]^MProcessing spec on 40 CPUs:  37%|███▋      | 4880/13099 [00:04<00:04, 1757.52it/s]^MProcessing spec on 40 CPUs:  40%|███▉      | 5180/13099 [00:04<00:04, 1607.97it/s]^MProcessing spec on 40 CPUs:  41%|████      | 5380/13099 [00:05<00:04, 1632.89it/s]^MProcessing spec on 40 CPUs:
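As a workaround for reading such a log in vi or less, the carriage-return updates can be collapsed so that only the final state of each progress bar remains. The sketch below is one possible approach, not something EveryVoice provides; `collapse_progress` is a hypothetical helper, and it assumes every regular log message ends with a newline.

```python
def collapse_progress(log_text: str) -> str:
    """Keep only the last non-empty carriage-return segment of each line."""
    cleaned = []
    for line in log_text.split("\n"):
        # A terminal overwrites the line at each "\r"; mimic that by keeping
        # the final segment that still contains visible text.
        segments = [seg for seg in line.split("\r") if seg.strip()]
        cleaned.append(segments[-1] if segments else "")
    return "\n".join(cleaned)
```

Writing the result to a new file leaves one line per progress bar, so the WARNING and INFO messages are easier to find.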

Pitch

List the missing audio files in the output log report, not just the total. This would give a person an easy way to investigate, versus having to fiddle with the error logs to capture the information.

Alternatives

An alternative could be to have a separate "filelist" file generated at the end with all the lines that had missing audio?

This is scope creep, but we could also include the audio files that are too long or too short in that file.

This would inform the user which "data" from the given filelist is not being included in the training / validation data :-) (We might need one report per filelist to make it even easier to decipher.)
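To make the separate-report idea concrete, here is a rough sketch of what writing such a file could look like. Everything in it is an assumption made for illustration: the function name `write_skipped_report`, the pipe-separated layout, the column names, and the reason labels are hypothetical and do not reflect the current EveryVoice code.

```python
import csv
import io
from collections import Counter


def write_skipped_report(skipped, out):
    """Write one row per skipped entry and return a count per reason.

    `skipped` is an iterable of (entry, reason) pairs, where `entry` is a dict
    with at least a 'basename' key (hypothetical structure).
    """
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    writer.writerow(["basename", "speaker", "language", "reason"])
    counts = Counter()
    for entry, reason in skipped:
        writer.writerow(
            [entry["basename"], entry.get("speaker", ""), entry.get("language", ""), reason]
        )
        counts[reason] += 1
    return counts


# Usage: an in-memory buffer stands in for a report file opened for writing.
report = io.StringIO()
counts = write_skipped_report(
    [({"basename": "LJ00abc-0001", "speaker": "speaker_0", "language": "eng"}, "missing")],
    report,
)
```

A reason column would let the same file also carry the too-long and too-short entries, and calling the function once per filelist would give one report per filelist.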

Additional context

No response

@marctessier marctessier added the enhancement New feature or request label Oct 29, 2024