This repository contains the code for the paper MAE-AST: Masked Autoencoding Audio Spectrogram Transformer. Pretrained checkpoints to be hosted in the coming few days.
This repository contains three folders: config, mae_ast, and s3prl. The config folder contains a default pre-training config for the MAE-AST. The mae_ast folder contains the main code for the model, which runs under fairseq; this includes a criterion, a task, data loading, and models. The s3prl folder provides the upstream model and configuration for fine-tuning the MAE-AST on SUPERB tasks under the S3PRL repository. This repository does not include fine-tuning code for AudioSet, Librispeech, and KS2, which are instead evaluated under the SSAST library with no settings changed.
Please email [email protected] for questions.
Below are the two 12-layer models used in the overall results section of the paper, with a masking ratio of 75%. Clicking a link may try to display the model checkpoint as a text file. Instead, use wget, or open the link in a new tab and save the file.
Download | Model | Layers | Masking | AS | ESC-50 | KS2 | KS1 | SID | ER |
---|---|---|---|---|---|---|---|---|---|
Checkpoint | MAE-AST Patch | 12 | Chunked | 0.306 | 0.900 | 0.979 | 0.958 | - | 0.598 |
Checkpoint | MAE-AST Frame | 12 | Random | 0.230 | 0.889 | 0.980 | 0.973 | 0.633 | 0.621 |
Pretraining on fairseq is done as follows.
Run the following commands with conda to set up an environment for pretraining. This assumes that fairseq is downloaded to the home directory.
conda create -n fairseq_mae_ast python=3.9
conda activate fairseq_mae_ast
pip install soundfile
cd ~/fairseq
pip install -e ./
conda install tensorboardX
conda install av -c conda-forge
pip install sortedcontainers
pip install tensorboard
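After running the setup commands above, a quick check like the following can confirm that the installed packages are importable from the new environment. This is a minimal sketch: the helper name missing_packages is hypothetical, and the package list simply mirrors the install commands above.

```python
import importlib.util

# Import names of the packages installed by the setup commands above.
REQUIRED = ["fairseq", "soundfile", "av", "sortedcontainers", "tensorboardX", "tensorboard"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("All packages found." if not missing else f"Missing: {missing}")
```

Run it with the fairseq_mae_ast environment activated; any name it reports as missing points to an install step that did not complete.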
The dataset code takes in a directory which contains the files train.tsv, valid.tsv, and test.tsv, containing paths to the train, valid, and test data respectively. Each of train.tsv, valid.tsv, and test.tsv is a tab-separated value file with a / on the first line, followed by lines of the form (audio file path, tab, length in frames of that audio file). For example, train.tsv starts with:
/
/path/to/AudioSet/unbalanced/6XUF56FlKvg.mkv 479232
/path/to/data/AudioSet/unbalanced/eJS_911G6ps.mkv 477696
and test.tsv starts with:
/
/path/to/LibriSpeech/data/test-other/3331/159609/3331-159609-0002.flac 225600
/path/to/LibriSpeech/data/test-other/3331/159609/3331-159609-0021.flac 165920
The dataset expects either mkv or flac files as input.
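The manifest format described above can be produced with a short script. The following is a sketch under stated assumptions, not code from this repository: the function names are hypothetical, and the default length function uses soundfile, which reads flac but not mkv containers, so a different length function (for example one built on av) would need to be passed in for mkv files.

```python
import os
from typing import Callable, Iterable, List


def soundfile_length(path: str) -> int:
    """Length of an audio file in frames, read with soundfile (flac, not mkv)."""
    import soundfile as sf  # imported lazily so the helpers below work without it
    return sf.info(path).frames


def build_manifest_lines(paths: Iterable[str],
                         get_length: Callable[[str], int]) -> List[str]:
    """Build manifest lines: '/' first, then 'absolute_path<TAB>length' per file."""
    lines = ["/"]
    for path in paths:
        lines.append(f"{os.path.abspath(path)}\t{get_length(path)}")
    return lines


def write_manifest(out_path: str, paths: Iterable[str],
                   get_length: Callable[[str], int] = soundfile_length) -> None:
    """Write a manifest such as train.tsv, valid.tsv, or test.tsv."""
    with open(out_path, "w") as f:
        f.write("\n".join(build_manifest_lines(paths, get_length)) + "\n")
```

For example, write_manifest("train.tsv", list_of_flac_paths) would write one manifest; repeating the call for valid.tsv and test.tsv in the same directory gives the layout the dataset code expects.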
Let MAE-AST-Public be the base directory of this repository.
Run the following to set up environment variables:
conda activate fairseq_mae_ast
cd ~/MAE-AST-Public
export HYDRA_FULL_ERROR=1
data_dir=/path/to/directory_with_train_valid_test_tsv_input_files
config_dir=/path/to/MAE-AST-Public/config/pretrain
user_dir=/path/to/MAE-AST-Public/mae_ast
The following run commands override the default pretrain configuration, and contain the most important settings to change.
The code for configuration settings is at the top of mae_ast/models/mae_ast.py and mae_ast/tasks/mae_ast_pretraining.py. The main model logic (model forward pass) is in the middle of mae_ast/models/mae_ast.py.
Default Model Patch (12 Layer).
fairseq-hydra-train \
--config-dir ${config_dir} --config-name mae_ast common.user_dir=${user_dir} task.data=${data_dir} model._name=mae_ast criterion._name=mae_ast \
model.encoder_layers=12 model.decoder_layers=2 \
model.random_mask_prob=0.75 task.mask_type="chunk_mask" \
model.ast_kernel_size_chan=16 model.ast_kernel_size_time=16 model.ast_kernel_stride_chan=16 model.ast_kernel_stride_time=16 \
criterion.classification_weight=1 criterion.reconstruction_weight=10 \
distributed_training.distributed_world_size=1 distributed_training.nprocs_per_node=1 \
common.log_interval=200 checkpoint.save_interval_updates=25000 \
optimization.max_update=550000 dataset.max_tokens=8388608 optimization.lr=[0.0001] \
hydra.run.dir=/path/to/output_model_directory
Default Model Frame (12 Layer). Changing the kernel sizes and strides determines frame vs patch models.
fairseq-hydra-train \
--config-dir ${config_dir} --config-name mae_ast common.user_dir=${user_dir} task.data=${data_dir} model._name=mae_ast criterion._name=mae_ast \
model.encoder_layers=12 model.decoder_layers=2 \
model.random_mask_prob=0.75 task.mask_type="random_mask" \
model.ast_kernel_size_chan=128 model.ast_kernel_size_time=2 model.ast_kernel_stride_chan=128 model.ast_kernel_stride_time=2 \
criterion.classification_weight=1 criterion.reconstruction_weight=10 \
distributed_training.distributed_world_size=1 distributed_training.nprocs_per_node=1 \
common.log_interval=200 checkpoint.save_interval_updates=25000 \
optimization.max_update=550000 dataset.max_tokens=8388608 optimization.lr=[0.0001] \
hydra.run.dir=/path/to/output_model_directory
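To see how the kernel sizes and strides in the two commands above distinguish patch from frame models, the token grid can be computed with the standard output-size formula for an unpadded convolution. This is an illustrative sketch, not code from the repository: the input size of 128 channels by 1024 time frames is an assumption chosen for the example, and the absence of padding is also assumed.

```python
def num_tokens(n_chan, n_time, k_chan, k_time, s_chan, s_time):
    """Token grid produced by an unpadded 2D convolution over a spectrogram.

    Returns (tokens along channel axis, tokens along time axis, total tokens).
    """
    chan = (n_chan - k_chan) // s_chan + 1
    time = (n_time - k_time) // s_time + 1
    return chan, time, chan * time


# Assumed input for illustration: 128 channels x 1024 time frames.
patch = num_tokens(128, 1024, 16, 16, 16, 16)   # 16x16 kernels: an 8 x 64 grid
frame = num_tokens(128, 1024, 128, 2, 128, 2)   # 128x2 kernels: a 1 x 512 grid
print(patch, frame)
```

Under these assumptions both settings give 512 tokens, but the patch model tiles the spectrogram in both axes, while the frame model spans all channels and splits only along time.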
For Wav2Vec2-style masking (specified by task.mask_type="retain_spans"), the random mask probability is set to 1.45 because the sampled spans overlap, which yields an average masking ratio of 75%. Set the random mask probability to 0.74 for an average of 50% masking. For all other mask types, the random mask probability directly corresponds to the fraction of tokens masked.
fairseq-hydra-train \
--config-dir ${config_dir} --config-name mae_ast common.user_dir=${user_dir} task.data=${data_dir} model._name=mae_ast criterion._name=mae_ast \
model.encoder_layers=12 model.decoder_layers=2 \
model.random_mask_prob=1.45 task.mask_type="retain_spans" \
model.ast_kernel_size_chan=128 model.ast_kernel_size_time=2 model.ast_kernel_stride_chan=128 model.ast_kernel_stride_time=2 \
criterion.classification_weight=1 criterion.reconstruction_weight=10 \
distributed_training.distributed_world_size=1 distributed_training.nprocs_per_node=1 \
common.log_interval=200 checkpoint.save_interval_updates=25000 \
optimization.max_update=550000 dataset.max_tokens=8388608 optimization.lr=[0.0001] \
hydra.run.dir=/path/to/output_model_directory
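The gap between a nominal probability of 1.45 and a realized masking ratio near 75% can be illustrated with a small simulation of overlapping spans. This is only a sketch of the general effect: the span length of 10, the sequence length of 1000, and the uniform sampling of span starts are assumptions made for the example and are not taken from this repository's masking code, so the exact fractions it reports will differ from the figures quoted above.

```python
import random


def masked_fraction(mask_prob, seq_len=1000, span_len=10, trials=20, seed=0):
    """Average fraction of positions covered when spans are allowed to overlap.

    The number of spans is chosen so that, without overlap, they would cover
    mask_prob * seq_len positions. Overlap makes the realized coverage lower.
    """
    rng = random.Random(seed)
    num_spans = round(mask_prob * seq_len / span_len)
    total = 0.0
    for _ in range(trials):
        # Distinct start positions; spans starting close together overlap.
        starts = rng.sample(range(seq_len - span_len + 1), num_spans)
        masked = [False] * seq_len
        for start in starts:
            for i in range(start, start + span_len):
                masked[i] = True
        total += sum(masked) / seq_len
    return total / trials


if __name__ == "__main__":
    for p in (0.74, 1.45):
        print(f"nominal {p:.2f} -> realized {masked_fraction(p):.3f}")
```

In this toy setting a nominal value above 1 still leaves a sizeable share of positions unmasked, which is why the retain_spans command uses 1.45 rather than 0.75.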
The s3prl directory contains an example for fine-tuning the MAE-AST on SUPERB, plus a readme with specific fine-tuning settings. s3prl/mae_ast/hubconf.py takes in a checkpoint generated during pretraining and uses it on downstream tasks.