Homebrewed VITS-3 with an extra flow to improve the text encoder's projected normalizing-flow distribution and the prior loss (WIP 🚧)
- Python >= 3.10
- Clone this repository
- Install the Python requirements listed in requirements.txt: `pip install -r requirements.txt`
- You may need to install espeak first:
```sh
apt-get install espeak
```
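Upstream VITS calls espeak through the `phonemizer` package during text cleaning, and this fork presumably does the same; here is a quick sanity check that the backend is visible (a sketch, not part of the repo):
```python
# Sanity check: phonemizer should find the espeak backend after the apt install above.
from phonemizer import phonemize

print(phonemize("hello world", language="en-us", backend="espeak", strip=True))
# prints something along the lines of: həloʊ wɜːld
```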
- Download datasets
- Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder:
```sh
ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
```
- For the multi-speaker setting, download and extract the VCTK dataset, and downsample the wav files to 22050 Hz (see the sketch after this list). Then rename or create a link to the dataset folder:
```sh
ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
```
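The repo does not ship a downsampling script, so here is a minimal sketch, assuming `librosa` and `soundfile` are available; the source and destination paths are placeholders:
```python
# Resample all VCTK wavs to 22050 Hz, mirroring the directory layout.
import os
import librosa
import soundfile as sf

SRC = "/path/to/VCTK-Corpus/wav48"             # original 48 kHz wavs (placeholder path)
DST = "/path/to/VCTK-Corpus/downsampled_wavs"  # output folder linked as DUMMY2 above

for root, _, files in os.walk(SRC):
    for name in files:
        if not name.endswith(".wav"):
            continue
        wav, _ = librosa.load(os.path.join(root, name), sr=22050)  # resamples on load
        out_dir = os.path.join(DST, os.path.relpath(root, SRC))
        os.makedirs(out_dir, exist_ok=True)
        sf.write(os.path.join(out_dir, name), wav, 22050)
```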
- Build Monotonic Alignment Search and run preprocessing if you use your own datasets.
```sh
# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace

# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for LJ Speech and VCTK are already provided.
# python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt
# python preprocess.py --text_index 2 --filelists filelists/vctk_audio_sid_text_train_filelist.txt filelists/vctk_audio_sid_text_val_filelist.txt filelists/vctk_audio_sid_text_test_filelist.txt
```
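For reference, this is the dynamic program that the Cython extension accelerates; a slow pure-NumPy sketch (the function name is mine, not the repo's API):
```python
import numpy as np

def monotonic_alignment_search(value):
    """value: [t_x, t_y] log-likelihoods (t_x text tokens, t_y mel frames, t_x <= t_y).
    Returns a 0/1 path of the same shape maximizing total log-likelihood under the
    monotonic, no-skip alignment constraint."""
    t_x, t_y = value.shape
    # Forward pass: q[i, j] = best cumulative score of a valid path ending at (i, j).
    q = np.full((t_x, t_y), -np.inf)
    q[0, 0] = value[0, 0]
    for j in range(1, t_y):
        for i in range(min(j + 1, t_x)):  # token i needs at least i + 1 frames
            stay = q[i, j - 1]
            advance = q[i - 1, j - 1] if i > 0 else -np.inf
            q[i, j] = value[i, j] + max(stay, advance)
    # Backtrack from the last token at the last frame.
    path = np.zeros((t_x, t_y), dtype=np.int32)
    i = t_x - 1
    for j in range(t_y - 1, 0, -1):
        path[i, j] = 1
        if i > 0 and (i == j or q[i - 1, j - 1] >= q[i, j - 1]):
            i -= 1
    path[i, 0] = 1
    return path
```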
- Model forward pass (dry run):
```python
import torch
from models import SynthesizerTrn

net_g = SynthesizerTrn(
    n_vocab=256,
    spec_channels=80,
    segment_size=8192,
    inter_channels=192,
    hidden_channels=192,
    filter_channels=768,
    n_heads=2,
    n_layers=6,
    kernel_size=3,
    p_dropout=0.1,
    resblock="1",
    resblock_kernel_sizes=[3, 7, 11],
    resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    upsample_rates=[8, 8, 2, 2],
    upsample_initial_channel=512,
    upsample_kernel_sizes=[16, 16, 4, 4],
    n_speakers=0,
    gin_channels=0,
    use_sdp=True,
    use_transformer_flows=True,
    # choose from "pre_conv", "fft", "mono_layer_inter_residual", "mono_layer_post_residual"
    transformer_flow_type="fft",
    use_spk_conditioned_encoder=True,
    use_noise_scaled_mas=True,
    use_duration_discriminator=True,
)

x = torch.LongTensor([[1, 2, 3], [4, 5, 6]])  # token ids, padded to the longest sequence
x_lengths = torch.LongTensor([3, 2])          # valid token lengths per batch item
y = torch.randn(2, 80, 100)                   # mel spectrograms: [batch, n_mels, frames]
y_lengths = torch.LongTensor([100, 80])       # valid frame lengths per batch item

net_g(
    x=x,
    x_lengths=x_lengths,
    y=y,
    y_lengths=y_lengths,
)
# calculate losses from the returned tuples and backpropagate
```
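For synthesis after training, upstream VITS exposes `SynthesizerTrn.infer`; assuming this fork keeps that signature (check models.py before relying on it), a hedged sketch:
```python
# Inference dry run: generate a waveform from token ids alone (no mels needed).
net_g.eval()
with torch.no_grad():
    o, attn, y_mask, _ = net_g.infer(x, x_lengths, noise_scale=0.667, length_scale=1.0)
print(o.shape)  # generated waveform batch: [batch, 1, samples]
```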
- Train the model:
```sh
# LJ Speech
python train.py -c configs/vits3_ljs_nosdp.json -m ljs_base  # no SDP (recommended)
python train.py -c configs/vits3_ljs_base.json -m ljs_base   # with SDP
# VCTK
python train_ms.py -c configs/vits3_vctk_base.json -m vctk_base
```
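Hyperparameters live in the JSON config passed via `-c`; assuming the upstream VITS config layout (`train`, `data`, and `model` sections), you can inspect or tweak them with plain `json`:
```python
# Peek at a training config before launching a run (key names assume the upstream
# VITS layout; adjust if this fork renames sections).
import json

with open("configs/vits3_ljs_nosdp.json") as f:
    hps = json.load(f)
print(hps["train"]["batch_size"], hps["data"]["sampling_rate"])
```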
- [ ] Train for LJ Speech and get sample audio