🦇 Encoder of BAT (Learning to Reason about Spatial Sounds with Large Language Models)

zszheng147/Spatial-AST

Spatial-AST

This repo hosts the code and models for "BAT: Learning to Reason about Spatial Sounds with Large Language Models" (accepted at ICML 2024; BibTeX under Citation).

Installation

conda env create -f environment.yml
bash timm_patch/patch.sh

Data Preparation

AudioSet (Anechoic Audio Source)

We provide the balanced train and evaluation sets for your convenience; you can download them from SpatialAudio. For the unbalanced train set, please refer to Official AudioSet.

Metadata can be downloaded from metadata.

AudioSet
├── balanced_train
│   └── audio
│   │   ├── Y00M9FhCet6s.wav
│   │   ├── Y00mE-lhe_R8.wav
│   │   ├── ...
├── eval
│   └── audio
│   │   ├── Y007P6bFgRCU.wav
│   │   ├── Y00AGIhlv-w0.wav
│   │   ├── ...
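
After downloading, it can help to confirm that the files landed where the tree above expects them. The following is a minimal sketch, not part of this repo: the helper name `count_wavs` is hypothetical, and it only assumes the `<split>/audio/*.wav` layout shown above.

```python
from pathlib import Path


def count_wavs(audioset_root, splits=("balanced_train", "eval")):
    """Return {split: number of .wav files under <root>/<split>/audio}.

    Hypothetical helper: it only checks the directory layout shown above
    and does not open or validate the audio files themselves.
    """
    root = Path(audioset_root)
    counts = {}
    for split in splits:
        audio_dir = root / split / "audio"
        if not audio_dir.is_dir():
            raise FileNotFoundError(f"missing directory: {audio_dir}")
        # Count only files with a .wav suffix directly inside audio/.
        counts[split] = sum(1 for p in audio_dir.iterdir() if p.suffix == ".wav")
    return counts
```

Calling `count_wavs("/path/to/AudioSet")` returns a dictionary with one count per split, or raises `FileNotFoundError` if a split directory is missing.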

Weights

Please refer to weights-generation to generate the weights yourself, or use the ones we provide.

Reverberation

Please visit mp3d_reverberation and download the data manually. Below is an example of the directory structure of the reverberation data.

/path/to/reverb_root
├── train_reverberation.json
├── eval_reverberation.json
├── binaural
│   ├── 17DRP5sb8fy
│   │   ├── 0.npy
│   │   ├── 10.npy
│   │   ├── 17DRP5sb8fy.json
│   │   ├── ...
│   ├── 1LXtFkjw3qL
│   │   ├── 0.npy
│   │   ├── 10.npy
│   │   ├── 1LXtFkjw3qL.json
│   │   ├── ...
├── mono
│   ├── 17DRP5sb8fy
│   ├── ...
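
To illustrate how an anechoic clip and a reverberation file might be combined, here is a minimal sketch that convolves a mono waveform with a multichannel impulse response. It rests on assumptions: that each `.npy` file holds an impulse response shaped `(channels, length)`, and that plain convolution is the intended operation. The function name `spatialize` is hypothetical, and the training scripts in this repo may do this differently.

```python
import numpy as np


def spatialize(mono, rir):
    """Convolve a mono waveform (T,) with an impulse response (C, L).

    Returns an array of shape (C, T + L - 1), one row per output channel.
    The (C, L) layout of `rir` is an assumption, not confirmed by the repo.
    """
    mono = np.asarray(mono, dtype=np.float64)
    rir = np.atleast_2d(np.asarray(rir, dtype=np.float64))
    # Full linear convolution, applied independently to each channel.
    return np.stack([np.convolve(mono, channel) for channel in rir])


# Hypothetical usage (path taken from the tree above, array layout assumed):
# rir = np.load("/path/to/reverb_root/binaural/17DRP5sb8fy/0.npy")
# wet = spatialize(dry_waveform, rir)
```

For long signals an FFT-based convolution would be faster. `np.convolve` is used here only to keep the sketch dependency-light.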

Train a new model

reverb_type=binaural  # or mono; ambisonics will be supported soon
bash scripts/finetune-20k.sh "$reverb_type"
# If you have the 2M AudioSet data:
# bash scripts/finetune-2m.sh "$reverb_type"

Inference

We provide a finetuned checkpoint. To run inference:

bash scripts/inf.sh

# [11:38:56] Test:  [ 0/34]  eta: 0:02:28    time: 4.3705  data: 1.5862  max mem: 3805
# [11:39:15] Test:  [33/34]  eta: 0:00:00    time: 0.5546  data: 0.0026  max mem: 3850
# [11:39:15] Test: Total time: 0:00:23 (0.6922 s / it)
# [11:39:22] mAP: 0.497411
# [11:39:23] Accuracy of the network on the 17148 test images: 0.4974
# [11:39:23] distance accuracy: 67.62
# [11:39:23] doa error (20 degree): 24.21
# [11:39:23] doa angular error: 18.00
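
The log above reports a direction-of-arrival (DoA) angular error. The exact definition used by the evaluation script is not stated here. As a rough illustration, one common way to measure the angle between a predicted and a reference direction is the great-circle angle computed from azimuth and elevation. The function `angular_error_deg` below is a hypothetical sketch of that definition, not the repo's implementation.

```python
import math


def angular_error_deg(az1, el1, az2, el2):
    """Great-circle angle in degrees between two directions.

    Each direction is given as (azimuth, elevation) in degrees.
    Assumed definition for illustration; the repo's metric may differ.
    """
    az1, el1, az2, el2 = map(math.radians, (az1, el1, az2, el2))
    cos_angle = (math.sin(el1) * math.sin(el2)
                 + math.cos(el1) * math.cos(el2) * math.cos(az1 - az2))
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))
```

Under this definition, two directions at the same elevation of 0 degrees and azimuths 90 degrees apart are 90 degrees apart, and any two directions pointing straight up coincide regardless of azimuth.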

TODO

The remaining TODOs will be completed before the end of June 2024.

  • Environment setup
  • Upload pretrained weights
  • Fix numba output bug
  • Update training data
  • Replace tensorboard with W&B
  • Inference colab

Citation

@article{zheng2024bat,
  author    = {Zheng, Zhisheng and Peng, Puyuan and Ma, Ziyang and Chen, Xie and Choi, Eunsol and Harwath, David},
  title     = {BAT: Learning to Reason about Spatial Sounds with Large Language Models},
  journal   = {arXiv preprint arXiv:2402.01591},
  year      = {2024},
}

Reference

The codebase is based on the Audio-MAE repo.

License

This project is under the CC-BY 4.0 license. See LICENSE for details.
