This repo hosts the code and models of "BAT: Learning to Reason about Spatial Sounds with Large Language Models" (accepted at ICML 2024).
conda env create -f environment.yml
bash timm_patch/patch.sh
We provide the Balanced train and Evaluation sets for your convenience; you can download them from SpatialAudio. For the Unbalanced train set, please refer to the Official AudioSet. Metadata can be downloaded from metadata.
AudioSet
├── balanced_train
│   └── audio
│       ├── Y00M9FhCet6s.wav
│       ├── Y00mE-lhe_R8.wav
│       └── ...
└── eval
    └── audio
        ├── Y007P6bFgRCU.wav
        ├── Y00AGIhlv-w0.wav
        └── ...
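As a quick sanity check after downloading, a small script like the following (the helper name and report format are our own, not part of the repo) can confirm that the expected layout is in place:

```python
import os

def check_audioset_layout(root):
    """Count .wav files under the expected AudioSet splits.

    Expects root/balanced_train/audio and root/eval/audio, each holding
    .wav files (e.g. Y00M9FhCet6s.wav). Returns {split: num_wavs};
    a missing directory yields a count of 0.
    """
    report = {}
    for split in ("balanced_train", "eval"):
        audio_dir = os.path.join(root, split, "audio")
        if os.path.isdir(audio_dir):
            report[split] = sum(f.endswith(".wav") for f in os.listdir(audio_dir))
        else:
            report[split] = 0
    return report
```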
Please refer to weights-generation, or use the weights we provide.
Please visit mp3d_reverberation and download manually. Below is an example of the directory structure of the reverberation data.
/path/to/reverb_root
├── train_reverberation.json
├── eval_reverberation.json
├── binaural
│   ├── 17DRP5sb8fy
│   │   ├── 0.npy
│   │   ├── 10.npy
│   │   ├── 17DRP5sb8fy.json
│   │   └── ...
│   └── 1LXtFkjw3qL
│       ├── 0.npy
│       ├── 10.npy
│       ├── 1LXtFkjw3qL.json
│       └── ...
└── mono
    ├── 17DRP5sb8fy
    └── ...
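The `.npy` files hold room impulse responses (RIRs) for each Matterport3D scene. As a minimal sketch of how a binaural RIR could be applied to a mono clip (the function and the `(2, rir_len)` array shape are assumptions for illustration; check them against the actual data before relying on this):

```python
import numpy as np

def apply_binaural_rir(mono, rir):
    """Convolve a mono waveform with a 2-channel room impulse response.

    mono: (num_samples,) float array, the dry source signal
    rir:  (2, rir_len) float array, e.g. np.load(".../0.npy")
    Returns a (2, num_samples + rir_len - 1) binaural signal.
    """
    return np.stack([np.convolve(mono, rir[ch]) for ch in range(rir.shape[0])])
```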
reverb_type=binaural # or mono / ambisonics (will be supported soon)
bash scripts/finetune-20k.sh $reverb_type
# bash scripts/finetune-2m.sh $reverb_type (if you have the full 2M AudioSet data)
We provide a finetuned checkpoint. You can run inference with:
bash scripts/inf.sh
# [11:38:56] Test: [ 0/34] eta: 0:02:28 time: 4.3705 data: 1.5862 max mem: 3805
# [11:39:15] Test: [33/34] eta: 0:00:00 time: 0.5546 data: 0.0026 max mem: 3850
# [11:39:15] Test: Total time: 0:00:23 (0.6922 s / it)
# [11:39:22] mAP: 0.497411
# [11:39:23] Accuracy of the network on the 17148 test images: 0.4974
# [11:39:23] distance accuracy: 67.62
# [11:39:23] doa error (20 degree): 24.21
# [11:39:23] doa angular error: 18.00
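If you want to collect the final numbers programmatically, a small parser over log lines like the sample above could look like this (the line format is inferred from the sample output, not guaranteed by the repo):

```python
import re

def parse_metrics(log_lines):
    """Extract "name: value" metrics from timestamped log lines.

    Matches lines such as "[11:39:22] mAP: 0.497411" or
    "[11:39:23] distance accuracy: 67.62" and returns {name: float}.
    """
    pattern = re.compile(r"\[\d{2}:\d{2}:\d{2}\]\s+(.+?):\s+([\d.]+)$")
    metrics = {}
    for line in log_lines:
        m = pattern.search(line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics
```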
The TODOs left will be completed before the end of June 2024.
- Environment setup
- Upload pretrained weights
- Fix numba output bug
- Update training data
- Replace tensorboard with W&B
- Inference colab
@article{zheng2024bat,
  author  = {Zheng, Zhisheng and Peng, Puyuan and Ma, Ziyang and Chen, Xie and Choi, Eunsol and Harwath, David},
  title   = {BAT: Learning to Reason about Spatial Sounds with Large Language Models},
  journal = {arXiv preprint arXiv:2402.01591},
  year    = {2024},
}
The codebase is based on the Audio-MAE repo.
This project is under the CC-BY 4.0 license. See LICENSE for details.