Model change: use audio-<md5hash of sentences>.wav instead of audio-<idx>.wav as filenames #106

Torlek · 2024-12-02T22:46:15Z

I’ve been thinking about the current system for numbering the audio files (audio-xxx.wav) using an index, and I believe it’s suboptimal. It complicates tasks like inserting or reordering files, and we don’t gain much since we already have the index in the JSON. After looking at model.update_audiobook(), I think there’s a more efficient solution.

My suggestion is to use the MD5 hash of the sentences as the identifier for audio files instead of the index. This approach would offer several benefits:

- It ensures that audio content matches the corresponding sentences.
- We can still reconstruct the data structure if necessary.
- It allows us to easily rearrange or reorder audio files without worrying about breaking the structure.
- It enables file reuse in cases of duplicate sentences.
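
A minimal sketch of the naming scheme proposed here, assuming the hash is taken over the UTF-8 bytes of the sentence text. The helper name `audio_filename` is hypothetical and not part of the project:

```python
import hashlib


def audio_filename(sentence: str) -> str:
    """Build a filename from the MD5 hex digest of the sentence text.

    Hypothetical helper for illustration only.
    """
    digest = hashlib.md5(sentence.encode("utf-8")).hexdigest()
    return f"audio-{digest}.wav"


# Identical sentences map to the same file; different sentences do not.
first = audio_filename("It was a dark and stormy night.")
second = audio_filename("It was a dark and stormy night.")
other = audio_filename("The rain fell in torrents.")
```

Because the name depends only on the text, inserting or reordering sentences never forces a rename, and the order lives solely in the JSON.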

What do you think? Is there anything I might have overlooked? Would you like me to go ahead and implement this, in a way that does not break old audiobooks?

P.S. I’m also considering adding a feature to mark sentences as chapter starts (either manually or using a regular expression). This would help us navigate the sentence list with a go to next/last chapter button and could be used to split output files or mark chapters in an M4B file. Thoughts?
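
A rough sketch of the regular-expression variant of chapter marking, assuming chapters begin with a sentence such as "Chapter 1". The pattern and the function names `chapter_starts` and `next_chapter` are illustrative assumptions, not existing project code:

```python
import re

# Assumed default pattern; real books would likely need a configurable one.
CHAPTER_PATTERN = re.compile(r"^\s*chapter\s+\w+", re.IGNORECASE)


def chapter_starts(sentences, pattern=CHAPTER_PATTERN):
    """Return the indexes of sentences that look like chapter starts."""
    return [i for i, s in enumerate(sentences) if pattern.match(s)]


def next_chapter(starts, current):
    """Return the first chapter-start index after `current`, or None."""
    for i in starts:
        if i > current:
            return i
    return None


sentences = ["Chapter 1", "Hello there.", "CHAPTER two", "Goodbye."]
starts = chapter_starts(sentences)
```

The same list of start indexes could drive both the navigation buttons and the split points for output files.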

JarodMica · 2024-12-03T09:13:17Z

> I’ve been thinking about the current system for numbering the audio files (audio-xxx.wav) using an index, and I believe it’s suboptimal. It complicates tasks like inserting or reordering files, and we don’t gain much since we already have the index in the JSON. After looking at model.update_audiobook(), I think there’s a more efficient solution.

I would definitely agree there's a more optimal way. For example, the current update method has quirks: depending on the parser, it will still delete audio for sentences that were already present (the parser needs work). Then there's the renaming issue: inserting or reordering requires renaming the audio files for every affected index, which I can see being very inefficient for large books (and I'm assuming this is how you may have run into it).

The idea behind the index suffixes was to keep the filenames human-readable, but that could be redundant as long as we keep track of the audio file.

However, one intentional part of the data structure is that it ties the audio not only to the sentence, but to every parameter in the dictionary. Naming the audio file by hash would be a nice way to avoid renaming it, but the hash would need to identify which dictionary it's tied to as well. In a sense, I'm using the audio filename as the ID of the dictionary.
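
One way this could look, as a sketch only: hash a canonical serialization of the whole dictionary rather than the sentence alone. The keys `sentence`, `speaker_id`, `seed`, and `audio_path` are assumed for illustration and may not match the project's actual JSON:

```python
import hashlib
import json


def entry_id(entry: dict) -> str:
    """Hash the sentence together with its generation parameters.

    The audio path is excluded (assumed key name) so the ID does not
    depend on the filename it is used to build.
    """
    relevant = {k: v for k, v in entry.items() if k != "audio_path"}
    # sort_keys gives a stable serialization regardless of key order.
    canonical = json.dumps(relevant, sort_keys=True, ensure_ascii=False)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


a = entry_id({"sentence": "Hello.", "speaker_id": 1, "seed": 42})
b = entry_id({"seed": 42, "speaker_id": 1, "sentence": "Hello."})
c = entry_id({"sentence": "Hello.", "speaker_id": 1, "seed": 7})
```

With this variant, duplicate sentences share a file only when every parameter also matches, so a different seed or speaker yields a separate audio file.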

Just some of my thoughts below:

> It ensures that audio content matches the corresponding sentences.

> It enables file reuse in cases of duplicate sentences.

For these two points, reusing audio for identical sentences may not be ideal, since the TTS engines are non-deterministic when run with a random seed. Someone may want variation between two occurrences of the same sentence, for example a different speaker_id or seed.

> We can still reconstruct the data structure if necessary.

> It allows us to easily rearrange or reorder audio files without worrying about breaking the structure.

I think my thoughts above also cover these points, but given the coupling between all the parameters in the dictionary, I'm not sure that changing the audio naming alone would resolve this issue.

One thought that came to mind was that I could store all generations in some type of "database", with each generation tied to a unique ID. Then reordering sentences would be a simple matter of changing the order of the IDs.
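
A small in-memory sketch of that idea, assuming random UUIDs as the unique IDs and a separate list that holds the sentence order. The class `GenerationStore` and its fields are hypothetical:

```python
import uuid


class GenerationStore:
    """Hypothetical store: records keyed by ID, order kept separately."""

    def __init__(self):
        self.generations = {}  # generation ID -> record
        self.order = []        # generation IDs in reading order

    def add(self, sentence, audio_path, **params):
        gen_id = uuid.uuid4().hex
        self.generations[gen_id] = {
            "sentence": sentence,
            "audio_path": audio_path,
            "params": params,
        }
        self.order.append(gen_id)
        return gen_id

    def move(self, gen_id, new_index):
        # Reordering touches only the ID list, never the audio files.
        self.order.remove(gen_id)
        self.order.insert(new_index, gen_id)

    def sentences(self):
        return [self.generations[g]["sentence"] for g in self.order]


store = GenerationStore()
store.add("One.", "a.wav", seed=1)
store.add("Two.", "b.wav", seed=2)
third = store.add("Three.", "c.wav", seed=3)
store.move(third, 0)
```

Unlike a content hash, a random ID lets two generations of the same sentence coexist, which fits the point about wanting variation between duplicates.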

I'll have to think this through more, but I'm all ears as well!

> What do you think? Is there anything I might have overlooked? Would you like me to go ahead and implement this, in a way that does not break old audiobooks?

If I'm missing anything, lmk, but based on some of my thoughts above, I'd have to think through more of it before I'd merge a change.

> P.S. I’m also considering adding a feature to mark sentences as chapter starts (either manually or using a regular expression). This would help us navigate the sentence list with a go to next/last chapter button and could be used to split output files or mark chapters in an M4B file. Thoughts?

I think this is a great idea!

Off topic, but txt files are pretty primitive as well, so I'm looking to add EPUB support first, then PDF, sometime in the future.
