Description & Motivation
It would be nice to have a list of all the audio files that are named in the filelist but missing, presented in an easy-to-read way somewhere in the output log or reports.
Currently, to get this information we have to grep the error logs. The error log can be difficult to read, I think, because of the way information is displayed: it works for viewing live with `tail -f`, but when viewing in `vi` or `less` it jumps around because of the way line breaks are done on the parts with long lines.
`[U20-GPSC5]:$ grep missing PREP_head.e3160702
2024-10-29 10:26:16.933 | WARNING | everyvoice.preprocessor.preprocessor:process_one_audio:469 - File '{'basename': 'LJ00abc-0001', 'language': 'eng', 'speaker': 'speaker_0', 'characters': 'printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the exhibition', 'phones': 'pɹɪntɪŋ, ɪn ðʌ oʊnli sɛns wɪð wɪtʃ wi ɑɹ æt pɹɛzʌnt kʌnsɜ˞nd, dɪfɜ˞z fɹʌm moʊst ɪf nɑt fɹʌm ɔl ðʌ ɑɹts ʌnd kɹæfts ɹɛpɹɪzɛntɪd ɪn ðʌ ɛksʌbɪʃʌn'}' is missing and will not be processed.`
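As a stopgap until the report lists these files, the grep above can be tightened into a short script that prints only the basenames. This is a minimal sketch, assuming the warning keeps the wording shown above ("is missing and will not be processed") and prints the entry as a dict with a `'basename'` key; the function name `missing_basenames` is hypothetical and not part of EveryVoice.

```python
import re

# Matches the 'basename' field inside the dict that the warning prints.
# Assumption: the message format is the one shown in the log excerpt above.
BASENAME_RE = re.compile(r"'basename': '([^']*)'")
MISSING_MARKER = "is missing and will not be processed"


def missing_basenames(log_text: str) -> list[str]:
    """Return the basename from every 'missing' warning, in log order."""
    found = []
    # splitlines() also splits on carriage returns, so progress-bar
    # updates never get glued onto a warning line.
    for line in log_text.splitlines():
        if MISSING_MARKER not in line:
            continue
        match = BASENAME_RE.search(line)
        if match:
            found.append(match.group(1))
    return found
```

Reading the error log into a string (for example `PREP_head.e3160702`) and passing it to `missing_basenames` gives one basename per missing file, which is easier to scan than the full warning lines.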
The example below shows what I mean by the line breaks (`^M`). This is not an issue in itself; it is more of a "feature" for viewing live with a pretty display.
it/s]^MProcessing text on 1 CPU: 9007it [00:01, 6875.77it/s]^MProcessing text on 1 CPU: 9695it [00:01, 6875.37it/s]^MProcessing text on 1 CPU: 10385it [00:01, 6882.11it/s]^MProcessing text on 1 CPU: 11074it [00:01, 6744.46it/s]^MProcessing text on 1 CPU: 11750it [00:01, 6708.99it/s]^MProcessing text on 1 CPU: 12422it [00:01, 6703.26it/s]^MProcessing text on 1 CPU: 13093it [00:01, 6654.58it/s]^MProcessing text on 1 CPU: 13099it [00:01, 6768.58it/s]
2024-10-29 10:26:37.259 | INFO | everyvoice.preprocessor.preprocessor:preprocess:1185 - Processing spec on 40 CPUs...
^MProcessing spec on 40 CPUs: 0%| | 0/13099 [00:00<?, ?it/s]^MProcessing spec on 40 CPUs: 0%| | 4/13099 [00:00<28:19, 7.70it/s]^MProcessing spec on 40 CPUs: 0%| | 32/13099 [00:00<03:18, 65.68it/s]^MProcessing spec on 40 CPUs: 1%| | 80/13099 [00:00<01:20, 160.79it/s]^MProcessing spec on 40 CPUs: 1%|▏ | 180/13099 [00:02<03:22, 63.80it/s]^MProcessing spec on 40 CPUs: 4%|▎ | 480/13099 [00:02<00:54, 231.01it/s]^MProcessing spec on 40 CPUs: 5%|▌ | 680/13099 [00:02<00:34, 361.82it/s]^MProcessing spec on 40 CPUs: 7%|▋ | 880/13099 [00:03<00:23, 513.52it/s]^MProcessing spec on 40 CPUs: 9%|▉ | 1180/13099 [00:03<00:14, 805.43it/s]^MProcessing spec on 40 CPUs: 11%|█▏ | 1480/13099 [00:03<00:10, 1119.36it/s]^MProcessing spec on 40 CPUs: 14%|█▍ | 1880/13099 [00:03<00:07, 1522.58it/s]^MProcessing spec on 40 CPUs: 17%|█▋ | 2180/13099 [00:03<00:06, 1791.92it/s]^MProcessing spec on 40 CPUs: 20%|█▉ | 2580/13099 [00:03<00:05, 2095.91it/s]^MProcessing spec on 40 CPUs: 22%|██▏ | 2880/13099 [00:03<00:04, 2223.93it/s]^MProcessing spec on 40 CPUs: 24%|██▍ | 3180/13099 [00:03<00:04, 2360.82it/s]^MProcessing spec on 40 CPUs: 27%|██▋ | 3480/13099 [00:04<00:03, 2457.27it/s]^MProcessing spec on 40 CPUs: 30%|██▉ | 3880/13099 [00:04<00:03, 2633.64it/s]^MProcessing spec on 40 CPUs: 33%|███▎ | 4280/13099 [00:04<00:03, 2552.28it/s]^MProcessing spec on 40 CPUs: 35%|███▍ | 4580/13099 [00:04<00:03, 2476.78it/s]^MProcessing spec on 40 CPUs: 37%|███▋ | 4880/13099 [00:04<00:04, 1757.52it/s]^MProcessing spec on 40 CPUs: 40%|███▉ | 5180/13099 [00:04<00:04, 1607.97it/s]^MProcessing spec on 40 CPUs: 41%|████ | 5380/13099 [00:05<00:04, 1632.89it/s]^MProcessing spec on 40 CPUs:
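For reading a finished log in `vi` or `less`, one workaround is to collapse each progress bar to its final state before opening the file. This is only a sketch of that idea, assuming the `^M` shown above is a literal carriage return in the log file; `collapse_progress_lines` is a hypothetical helper, not something EveryVoice provides.

```python
def collapse_progress_lines(log_text: str) -> str:
    """Keep only the last carriage-return-separated segment of each line.

    A progress bar rewrites itself by emitting a carriage return followed
    by the new state, so the last segment is the final state of the bar.
    """
    cleaned = []
    for line in log_text.split("\n"):
        # Drop a trailing carriage return (e.g. from CRLF line endings)
        # so it does not leave us with an empty final segment.
        line = line.rstrip("\r")
        cleaned.append(line.rsplit("\r", 1)[-1])
    return "\n".join(cleaned)
```

Writing the result to a new file leaves the regular log messages untouched and reduces each progress bar to a single line.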
Pitch
List the missing audio files in the output log report, not just the total. This would give a person an easy way to investigate, versus having to fiddle with the error logs to capture the information.
Alternatives
An alternative could be to generate a separate "filelist" file at the end containing all the lines that had missing audio.
This is scope creep, but we could also include the audio files that are too long or too short in that file.
This would inform the user which "data" from the given filelist is not being included in the training / validation data :-) (It might need one report per filelist to make it even easier to decipher.)
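To make the alternative concrete, here is a rough sketch of what generating such a file could look like. It is not EveryVoice code: the function name, the pipe-separated layout with a `basename` column, and the `<audio_dir>/<basename>.wav` naming are all assumptions made for illustration. The extra `reason` column is there so that "too long" or "too short" entries could be added to the same file later.

```python
import csv
from pathlib import Path


def write_skipped_filelist(filelist_path, audio_dir, report_path):
    """Copy every filelist row whose audio file is absent into report_path.

    Assumptions (hypothetical, for illustration only): the filelist is
    pipe-separated with a header row containing a 'basename' column, and
    each audio file lives at <audio_dir>/<basename>.wav.

    Returns the number of rows written to the report.
    """
    audio_dir = Path(audio_dir)
    skipped = 0
    with open(filelist_path, newline="", encoding="utf-8") as src, \
         open(report_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter="|")
        # Same columns as the input filelist, plus the reason for skipping.
        fieldnames = list(reader.fieldnames or []) + ["reason"]
        writer = csv.DictWriter(dst, fieldnames=fieldnames, delimiter="|")
        writer.writeheader()
        for row in reader:
            if not (audio_dir / f"{row['basename']}.wav").exists():
                writer.writerow({**row, "reason": "missing"})
                skipped += 1
    return skipped
```

Calling this once per input filelist would give the "one report per filelist" layout suggested above.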
Additional context
No response