
consolidate spectrogram dimensions #572

Open
wants to merge 1 commit into main
Conversation

roedoejet
Member

Previously, there was an issue where:

  • During preprocessing we saved spectrogram tensors as [K, T]
  • During text-to-spec synthesis we saved spectrogram tensors as [T, K]
  • During spec-to-wav synthesis we expected spectrogram tensors as [T, K]

#513 noticed this problem when synthesizing from certain freq-oriented tensors. Our models work with time-oriented tensors, but it's more standard to save spectrograms frequency/Mel-band oriented (this is the default in torchaudio, librosa, etc.). Since the output files should be as interoperable as possible, I've consolidated our read/write operations to use the [K, T] orientation throughout (i.e. text-to-spec synthesis now outputs [K, T] tensors, and spec-to-wav synthesis now expects [K, T] tensors).
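The consolidated convention can be illustrated with a small NumPy sketch (the function name and the shape heuristic are hypothetical, for illustration only, not part of the PR's code):

```python
import numpy as np

N_MELS = 80  # hypothetical Mel-band count

def ensure_freq_oriented(spec: np.ndarray, n_mels: int = N_MELS) -> np.ndarray:
    """Return a spectrogram in the consolidated [K, T] orientation.

    Heuristic: the Mel-band axis has exactly n_mels entries, so a
    [T, K] (time-oriented) array is transposed to [K, T]; an array
    that already has n_mels on the first axis is returned unchanged.
    """
    if spec.ndim != 2:
        raise ValueError(f"expected a 2-D spectrogram, got shape {spec.shape}")
    if spec.shape[0] != n_mels and spec.shape[1] == n_mels:
        return spec.T  # [T, K] -> [K, T]
    return spec

old_style = np.random.randn(696, 80)   # time-oriented, pre-PR layout
new_style = ensure_freq_oriented(old_style)
assert new_style.shape == (80, 696)    # frequency-oriented, post-PR layout
```

This matches the torchaudio/librosa default, where Mel spectrograms come out as (n_mels, n_frames).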

I also moved a log message that read "Loading Vocoder from None", which was annoying. And I switched from scipy to torchaudio for writing the wav files, since I started getting bit-depth errors during spec-to-wav synthesis.
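The bit-depth issue class can be avoided by quantizing float audio to 16-bit PCM explicitly before writing. A stdlib-only sketch of that idea (this is an illustration, not the torchaudio call the PR actually uses):

```python
import wave

import numpy as np

def write_wav_int16(path: str, audio: np.ndarray, sample_rate: int) -> None:
    """Write mono float audio in [-1, 1] as a 16-bit PCM wav file.

    Quantizing to int16 up front pins the bit depth of the output
    file, instead of leaving it to the writer's dtype handling.
    """
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype("<i2")  # little-endian int16
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)        # 2 bytes per sample = 16-bit PCM
        f.setframerate(sample_rate)
        f.writeframes(pcm.tobytes())
```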

PR Goal?

Ideally this should just work going forward. You should be able to:

  1. synthesize text to wav
  2. synthesize text to spec, then synthesize spec to wav from the generated spec
  3. synthesize spec to wav from a preprocessed spec file
  4. synthesize from a spec file generated prior to this PR using the --time-oriented flag
  5. synthesize audio and spectrograms during training

Fixes?

#513

Feedback sought?

Sanity. I've tested expectations 1-5 above, but please try one or more of them to corroborate, and comment on which things you tested.

Priority?

medium-high (synthesizing from non-time-oriented spectrograms currently causes an error)

Tests added?

How to test?

Try some of the things described under "PR Goal?" above.

Confidence?

medium

Version change?

This is a breaking change but we'll just include it in alpha.

Related PRs?

EveryVoiceTTS/FastSpeech2_lightning#94
EveryVoiceTTS/HiFiGAN_iSTFT_lightning#39

Contributor

github-actions bot commented Oct 30, 2024

CLI load time: 0:00.32
Pull Request HEAD: 43f861019bb44d82afef675da547a59801626a2c
Imports that take more than 0.1 s:
import time: self [us] | cumulative | imported package
import time:      1021 |     103039 |     typer.main
import time:       281 |     121786 |   typer
import time:       234 |     101213 |       everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli.cli
import time:       172 |     101815 |     everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli
import time:        17 |     101832 |   everyvoice.model.feature_prediction.FastSpeech2_lightning.fs2.cli.preprocess
import time:      7400 |     257250 | everyvoice.cli


codecov bot commented Oct 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 76.74%. Comparing base (45d0685) to head (43f8610).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #572   +/-   ##
=======================================
  Coverage   76.74%   76.74%           
=======================================
  Files          46       46           
  Lines        3445     3445           
  Branches      470      470           
=======================================
  Hits         2644     2644           
  Misses        700      700           
  Partials      101      101           


@marctessier
Collaborator

@roedoejet I think we might have an issue with this.

For my first test, I trained a vocoder (no issues there). Then I tried to use that vocoder for training the FP model, but it keeps crashing with the message below (see the attachment for the full error log; vocoder_path: ../logs_and_checkpoints/VocoderExperiment/base/checkpoints/voc.ckpt).
LJ-FP.e3162427.txt

I will try the other things you listed to see how they behave.

│   1524 │   │   if not (self._backward_hooks or self._backward_pre_hooks or s │
│   1525 │   │   │   │   or _global_backward_pre_hooks or _global_backward_hoo │
│   1526 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks │
│ ❱ 1527 │   │   │   return forward_call(*args, **kwargs)                      │
│   1528 │   │                                                                 │
│   1529 │   │   try:                                                          │
│   1530 │   │   │   result = None                                             │
│                                                                              │
│ /gpfs/fs5/nrc/nrc-fs1/ict/others/u/tes001/miniforge3/envs/EveryVoice_dev.ap_ │
│ 513/lib/python3.10/site-packages/torch/nn/modules/conv.py:310 in forward     │
│                                                                              │
│    307 │   │   │   │   │   │   self.padding, self.dilation, self.groups)     │
│    308 │                                                                     │
│    309 │   def forward(self, input: Tensor) -> Tensor:                       │
│ ❱  310 │   │   return self._conv_forward(input, self.weight, self.bias)      │
│    311                                                                       │
│    312                                                                       │
│    313 class Conv2d(_ConvNd):                                                │
│                                                                              │
│ /gpfs/fs5/nrc/nrc-fs1/ict/others/u/tes001/miniforge3/envs/EveryVoice_dev.ap_ │
│ 513/lib/python3.10/site-packages/torch/nn/modules/conv.py:306 in             │
│ _conv_forward                                                                │
│                                                                              │
│    303 │   │   │   return F.conv1d(F.pad(input, self._reversed_padding_repea │
│    304 │   │   │   │   │   │   │   weight, bias, self.stride,                │
│    305 │   │   │   │   │   │   │   _single(0), self.dilation, self.groups)   │
│ ❱  306 │   │   return F.conv1d(input, weight, bias, self.stride,             │
│    307 │   │   │   │   │   │   self.padding, self.dilation, self.groups)     │
│    308 │                                                                     │
│    309 │   def forward(self, input: Tensor) -> Tensor:                       │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Given groups=1, weight of size [512, 80, 7], expected input[1, 
696, 80] to have 80 channels, but got 696 channels instead

Loading EveryVoice modules: 100%|██████████| 4/4 [00:13<00:00,  3.35s/it]   
srun: error: ib14gpu-002: task 0: Exited with exit code 1

@marctessier
Collaborator

FYI, I also get the same issue when using hifigan_universal_v1_everyvoice.ckpt.

I managed to get the FP training to work by removing the vocoder reference (vocoder_path:) in config/everyvoice-text-to-spec.yaml.

@roedoejet
Member Author

roedoejet commented Oct 31, 2024

@marctessier - did you maybe not re-run preprocessing? The old Mel spectrograms that were calculated will have to be re-processed (everyvoice preprocess config/everyvoice-text-to-spec.yaml -s spec -O).

EDIT: never mind! I see what you mean. Training worked for me, but when I added a vocoder checkpoint it failed during the validation step - nice catch! I fixed this and it should now be ready to go.

@joanise
Member

joanise commented Nov 1, 2024

Since this is a breaking change, and it's possible some users will have preprocessed files saved, I'd like to see a heuristic check that gives a friendly error message if the input looks transposed, with instructions telling the user what to rerun.
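One possible shape of such a check, as a hedged sketch (the function name and message wording are hypothetical; the actual check lives in the PR's code):

```python
import numpy as np

def validate_spec_orientation(spec: np.ndarray, n_mels: int) -> None:
    """Raise a friendly error if a spectrogram looks time-oriented.

    New-style files are [K, T] with K == n_mels, and frames usually
    outnumber Mel bands, so a 2-D array with n_mels on the second
    axis but not the first very likely predates the change.
    """
    if spec.ndim == 2 and spec.shape[0] != n_mels and spec.shape[1] == n_mels:
        raise ValueError(
            f"Spectrogram has shape {spec.shape}, but {n_mels} Mel bands were "
            "expected on the first axis. This file looks time-oriented "
            "([T, K]); please re-run preprocessing, or pass --time-oriented "
            "if the file was generated before the orientation change."
        )
```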

@joanise
Member

joanise commented Dec 9, 2024

This looks good to me, actually. It'll need rebasing in the submodules and a small conflict resolution.
My request for user-friendly messaging is already addressed in your original code; I just tested it.
