This is an implementation of the MelGAN network by Marco Pasini developed by Michael Schaufelberger (michaelschufi), Raphael Mailänder (mailaenderli), and Gabriel Koch (elessar-ch) during our Bachelor thesis at the Institute of Embedded Systems at the ZHAW School of Engineering.
The implementation uses Python with TensorFlow and Librosa.
The project framework itself is inspired by unmix-io.
We use conda and Anaconda Project to manage dependencies, which allows specifying conda and pip dependencies at the same time. We install all packages via pip and use conda only to provide Python and the NVIDIA CUDA libraries.
- Install Conda for your operating system (Miniconda should probably also work, but was not tested).

  Windows:

  ```
  choco install -y anaconda3
  ```

- Initialize Conda:

  ```
  conda init {bash, cmd.exe, fish, powershell, tcsh, xonsh, zsh}
  ```

- Install Anaconda Project. Make sure you run this inside the base conda environment on Windows (permission problems otherwise).

  ```
  conda install anaconda-project
  ```

- Specify that anaconda-project should use the global env directory to create its environments. Otherwise, the conda environment doesn't have a name, which makes the integration with tools like VS Code break in certain cases.

  Windows (Anaconda 3):

  ```
  [Environment]::SetEnvironmentVariable("ANACONDA_PROJECT_ENVS_PATH", "YOUR_ANACONDA_PATH\envs", [System.EnvironmentVariableTarget]::Machine)
  ```
We strongly recommend using GPUs or TPUs to train the network.
To install the necessary drivers, consult the TensorFlow documentation.
The implementation was tested with

- cudatoolkit version 11.0.221
- cudnn version 8.0.4

installed via conda:

```
conda install -c nvidia cudatoolkit==11.0.221 cudnn==8.0.4
```
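To verify that TensorFlow picks up the GPU after installing the CUDA libraries, a quick check (assuming TensorFlow 2.x) is:

```python
import tensorflow as tf

# Should list at least one GPU device if cudatoolkit/cudnn are set up correctly.
print(tf.config.list_physical_devices("GPU"))
```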
- Install packages. To install all packages, run the following:

  ```
  anaconda-project prepare
  ```

- Activate the conda environment:

  ```
  conda activate melgan-UwpwzX7tcrj7eAR1RJVec
  ```

  Note: You might need to do this multiple times, since there's a bug where pip dependencies are not linked when installed right after the initial conda environment creation.
An installation of FFmpeg is required for preprocessing. For printing the model summary, pydot (and GraphViz) are required.
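pydot and GraphViz are presumably needed because Keras' model-plotting utility relies on them; a minimal sketch (the model shown is illustrative, not one of the repository's networks):

```python
import tensorflow as tf

# Illustrative model only; plot_model needs pydot and GraphViz to render the diagram.
model = tf.keras.Sequential([tf.keras.layers.Dense(8, input_shape=(4,))])
model.summary()  # the plain-text summary works without pydot
tf.keras.utils.plot_model(model, to_file="model.png", show_shapes=True)
```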
Of the 3 datasets we used during our thesis, one is publicly available and one partially.
- GTZAN - Includes 100 songs for each of 10 genres, 30 seconds each (1 GB)
  - Download from Kaggle
  - We mainly tested the Jazz and Classical genres.
- Guitar and Piano Datasets (instrumental-only)
  - The Maestro Dataset v2 from Kaggle consists of 200 hours of piano recordings with MIDI transcriptions.
  - We used only songs from 2015 as they do not include other instruments.
  - The guitar dataset we used was from our personal collection.
The repository is a simple framework that allows training different model implementations and running predictions with them, as well as pre- and post-processing.
- `main.py` - The main script used to start/resume a training session and/or predict songs.
- `save-predicted-spectrogram.py` - Script to convert predicted spectrograms to audio files.
- `models` - Contains all models/networks. The models implement the base class `BaseModel` in `models/base.py`.
- `logger` - Just a static logging class.
- `helpers` - Some helper functions.
- `data` - All datasets reside in here. See below.
- `config` - Configuration files that specify hyperparameters, the dataset to be used, model architecture (number of filters, layers, etc.) and more.
Before starting the training, the audio files from the dataset have to be turned into spectrograms.
You can do that as follows:
- Unpack the dataset into the `data/source` folder (arbitrarily nested), e.g. `data/source/GTZAN/jazz` or `data/source/GTZAN/classical`.
- Open the `spectrogram_generator.py` file and specify the constants at the top of the file.
- At the very bottom of the file, specify the following:
  - Use `convert(extension: str = "mp3", overwrite=False)` to resample the audio files to the specified sample rate (`SAMPLE_RATE`) and convert from stereo to mono.
  - Use `createDataset(folder: str, limit: int = None, extension: str = "mp3", mel: bool = False)` to create a dataset after resampling the audio files.
- Run the script:

  ```
  python data/spectrogram_generator.py
  ```
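For orientation, this is roughly what such preprocessing boils down to with Librosa; a minimal sketch, where everything except `SAMPLE_RATE` and the `mel` flag is illustrative rather than the repository's actual code:

```python
import librosa
import numpy as np

SAMPLE_RATE = 22050  # illustrative value; must match the constant in spectrogram_generator.py

def audio_to_spectrogram(path: str, mel: bool = False) -> np.ndarray:
    # Load resampled to SAMPLE_RATE and mixed down from stereo to mono
    y, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
    if mel:
        # Mel-scaled magnitude spectrogram
        return librosa.feature.melspectrogram(y=y, sr=SAMPLE_RATE)
    # Plain magnitude spectrogram from the STFT
    return np.abs(librosa.stft(y))
```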
You can test the generated dataset before training with `data/dataset-tester.py`.
Configuration files specify hyperparameters, the dataset to be used, model architecture (number of filters, layers, etc.) and more.
They are stored in `config/files`.
We have prepared two example configuration files for the GTZAN dataset. One uses mel spectrograms and the other does not.
The configuration is loaded at the start of the `main.py` script.
It has basic type validation to prevent, for example, accidental usage of integers instead of floats.
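As an illustration of the idea (the helper and the key shown are hypothetical, not the repository's actual validation code):

```python
def check_type(key: str, value, expected: type):
    # Hypothetical validator illustrating the idea of the config's type checks
    if not isinstance(value, expected):
        raise TypeError(f"{key}: expected {expected.__name__}, got {type(value).__name__}")

check_type("training.learningRate", 1e-4, float)  # passes
check_type("training.learningRate", 1, float)     # raises TypeError: int is not float
```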
The configuration variables are used throughout the code by calling the `Config()` function:

```python
from config.config import Config

print(Config().training.model.name)
```
To start a training you have to specify a configuration file (see above).
Run the following command to start a training run:

```
python main.py --config config/files/example_GTZAN-mel.json
```
To resume a run, specify the path to the run folder created:

```
python main.py --run out/2021-06-01-15-32-30_GTZAN-mel
```
To periodically run predictions during training you have two options:

- Specifying a single file using the command-line argument `--predictfile`, e.g. `--predictfile path/to/file.wav`.
- Specifying a folder which holds files to predict:
  - Create a folder anywhere in the project and place the songs in there.
  - Specify the path to the folder in the configuration file under `environment.rickSamplesDirectory`.
  - Run the training with `--predictricksamples`.
To predict a file, run:

```
python main.py --run out/2021-06-01-15-32-30_GTZAN-mel --predictfile path/to/file.wav --predictonly
```

To predict the prepared folder, run:

```
python main.py --run out/2021-06-01-15-32-30_GTZAN-mel --predictricksamples --predictonly
```
During training, predictions are automatically stored inside the run's output folder.
To avoid slowing down the training process, they are not automatically converted into audio files during training.
Use the `convert-predicted-spectrogram.py` script to do that.
Simply specify the output folder, the epoch in which the predictions were created, and the dataset parameters.
The predictions themselves can be found in the output folder under `out/`.
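For reference, the inversion itself is essentially what Librosa's Griffin-Lim-based utilities provide; a minimal sketch, where the file paths and the use of `.npy` files are assumptions:

```python
import numpy as np
import librosa
import soundfile as sf

SAMPLE_RATE = 22050  # illustrative; must match the dataset parameters used for training

# Load a predicted mel spectrogram (path and .npy format are assumptions)
mel_spec = np.load("out/2021-06-01-15-32-30_GTZAN-mel/prediction.npy")

# Invert it back to a waveform via Griffin-Lim and write a WAV file
audio = librosa.feature.inverse.mel_to_audio(mel_spec, sr=SAMPLE_RATE)
sf.write("prediction.wav", audio, SAMPLE_RATE)
```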