
descript/encodec is too slow in dataloader #14

Open · vuong-ts opened this issue Dec 8, 2023 · 7 comments

@vuong-ts commented Dec 8, 2023

Hi @p0p4k,

I see that the DAC encode step on the dev/descript_codec branch is too slow on CPU inside the DataLoader. How can we speed up this process?

def batched_encodec(self, wav):
    with torch.no_grad():
        self.encodec.eval()
        wav = self.resampler(wav)  # resample to 24 kHz
        signal = AudioSignal(wav, 24000)
        x = self.encodec.preprocess(signal.audio_data, signal.sample_rate)
        _, _, latents, _, _ = self.encodec.encode(x)
    return latents

vuong-ts changed the title from "descript/encodec is too slow in dataworker" to "descript/encodec is too slow in dataloader" on Dec 8, 2023
@p0p4k (Owner) commented Dec 8, 2023

Yes, it is supposed to be done in the collate function or inside the model itself, so we can take advantage of batching. In this implementation, I think I am encoding one sample at a time (?)
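
As a rough sketch of that idea (not the repo's code), the encode could move into a collate function so one forward pass handles the whole batch; dac_model, resampler, and the padding convention here are assumptions:

import torch
from audiotools import AudioSignal

def dac_collate_fn(batch, dac_model, resampler):
    # Pad waveforms to the longest in the batch so they stack into one tensor.
    # Each item["wav"] is assumed to be 1 x T.
    wavs = torch.nn.utils.rnn.pad_sequence(
        [item["wav"].squeeze(0) for item in batch], batch_first=True
    ).unsqueeze(1)  # B x 1 x T
    with torch.no_grad():
        dac_model.eval()
        wavs = resampler(wavs)  # resample to 24 kHz
        signal = AudioSignal(wavs, 24000)
        x = dac_model.preprocess(signal.audio_data, signal.sample_rate)
        # One batched forward pass instead of one call per utterance.
        _, _, latents, _, _ = dac_model.encode(x)
    return latents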

@vuong-ts (Author) commented Dec 8, 2023

Running self.encodec on the GPU can speed up the process, but I got a CUDA error:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
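
The error message itself points at one workaround: start the DataLoader workers with 'spawn' instead of 'fork' so CUDA can be initialized inside them. A minimal sketch, assuming a standard PyTorch DataLoader (MyDataset is a placeholder):

from torch.utils.data import DataLoader

loader = DataLoader(
    MyDataset(),                      # placeholder dataset
    batch_size=16,
    num_workers=4,
    multiprocessing_context="spawn",  # workers start via 'spawn', so CUDA init is allowed
)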

@p0p4k (Owner) commented Dec 8, 2023

You can load a DAC encoder on each of your GPUs, then send the data to the corresponding GPU and compute it there. Another temporary option is to precompute the latents for all your audio files in a preprocessing step, save them to disk, and then load them during training.
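
A minimal sketch of the second option, encoding every file once on the GPU and caching the latents to disk (the audio_files list, paths, and file naming are assumptions, not the repo's code):

import torch
from audiotools import AudioSignal

device = "cuda"
dac_model = dac_model.to(device).eval()

for path in audio_files:  # hypothetical list of pathlib.Path objects
    signal = AudioSignal(path).resample(24000).to(device)
    with torch.no_grad():
        x = dac_model.preprocess(signal.audio_data, signal.sample_rate)
        _, _, latents, _, _ = dac_model.encode(x)
    # Cache next to the audio; the training dataset then just torch.load()s these.
    torch.save(latents.cpu(), path.with_suffix(".latents.pt"))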

@vuong-ts (Author) commented Dec 8, 2023

So, I loaded the DAC model into pflowTTS, but I haven't had success training the entire pflowTTS with DAC:

  • random segmentation faults.
  • NaN loss after 1–2 epochs.

@vuong-ts (Author) commented:
So, I managed to train pflow with the DAC codec.

I used the following code to decode audio with DAC.

dataset: ljspeech
epoch: 200
text: On one occasion a disturbance was raised which was not quelled until windows had been broken and forms and tables burnt.

# Encode the reference audio to DAC latents.
with torch.no_grad():
    dac_encodec.eval()
    wav = resampler(wav)  # resample to 24 kHz
    signal = AudioSignal(wav, 24000)
    x = dac_encodec.preprocess(signal.audio_data, signal.sample_rate).to(device)
    _, _, latents, _, _ = dac_encodec.encode(x)

# Predict latents from text with the trained pflow model.
output = synthesise(text, latents)
pred_latents = output["decoder_outputs"]
pred_latents = pred_latents.reshape(1, 32, 8, -1)  # B x N x D x T
pred_latents = pred_latents.to(device)

# Re-quantize the predicted latents through each of the 32 RVQ codebooks.
z_q = 0
for i in range(32):
    z_q_i, indices = dac_encodec.quantizer.quantizers[i].decode_latents(pred_latents[:, i, :, :])
    z_q_i = dac_encodec.quantizer.quantizers[i].out_proj(z_q_i)
    z_q += z_q_i

# Decode to a waveform and play it.
ipd.Audio(dac_encodec.decode(z_q).squeeze(dim=0).detach().cpu().numpy(), rate=24_000)

The audio output is not as good as with the mel-spectrogram representation.
https://drive.google.com/file/d/18hDs-mL8mqwmuVTQd8ZMfFsfWFsfxMW9/view

Can I ask for your comments on this, @p0p4k?

@vuong-ts (Author) commented Dec 14, 2023

@rafaelvalle
Regarding the neural codec representation, can I ask you a couple of questions, since you are one of the authors 😊

  1. Have you tried training Pflow on audio codec codes, as in VALL-E, instead of mel-spectrograms?
  2. Is training on the neural codec representation significantly slower than on mel-spectrograms?

@p0p4k (Owner) commented Dec 14, 2023

@vuong-ts The latest Meta paper, Audiobox, uses OT-CFM on EnCodec latents. But the twist is pre-training the TTS model with lots of EnCodec data (~60k). PflowTTS without the loss mask, MAS, and text conditioning is almost equivalent to that pre-training; we can then fine-tune with text conditioning. That might be the solution: essentially take a masked wav -> masked latent -> train OT-CFM to predict the entire latent (like BERT), then adapt it downstream for TTS.
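
To make that concrete, a minimal sketch of such a masked-latent OT-CFM training step, not the Audiobox or pflow code; the model signature, latent shapes, and masking policy are all assumptions:

import torch

def masked_cfm_step(model, latents, sigma_min=1e-4):
    # latents: B x D x T, the full (unmasked) DAC/EnCodec latents.
    B, D, T = latents.shape
    # Mask a random contiguous span per example (BERT-style infilling condition).
    cond = latents.clone()
    for b in range(B):
        start = torch.randint(0, T // 2, (1,)).item()
        width = torch.randint(T // 4, T // 2, (1,)).item()
        cond[b, :, start:start + width] = 0.0
    # Standard OT-CFM pair: straight path from noise x0 to data x1.
    x1 = latents
    x0 = torch.randn_like(x1)
    t = torch.rand(B, 1, 1, device=latents.device)
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    u = x1 - (1 - sigma_min) * x0       # target vector field
    v = model(x_t, t.squeeze(), cond)   # predict it, conditioned on the masked latent
    # The model must reconstruct the *entire* latent, masked spans included.
    return torch.nn.functional.mse_loss(v, u)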
