This repo hosts the code and models of "BAT: Learning to Reason about Spatial Sounds with Large Language Models" (accepted at ICML 2024).
conda env create -f environment.yml
bash timm_patch/patch.sh
We provide the Balanced train and Evaluation sets for your convenience; you can download them from SpatialAudio. For the Unbalanced train set, please refer to the Official AudioSet. Metadata can be downloaded from metadata.
AudioSet
├── balanced_train
│   └── audio
│       ├── Y00M9FhCet6s.wav
│       ├── Y00mE-lhe_R8.wav
│       └── ...
└── eval
    └── audio
        ├── Y007P6bFgRCU.wav
        ├── Y00AGIhlv-w0.wav
        └── ...
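As a quick sanity check after downloading, a small script like the following (the helper name and report format are our own, not part of the repo) can confirm that the expected layout is in place:

```python
import os

def check_audioset_layout(root):
    """Count .wav files under the expected AudioSet splits.

    Expects root/balanced_train/audio and root/eval/audio, each holding
    .wav files (e.g. Y00M9FhCet6s.wav). Returns {split: num_wavs};
    a missing directory yields a count of 0.
    """
    report = {}
    for split in ("balanced_train", "eval"):
        audio_dir = os.path.join(root, split, "audio")
        if os.path.isdir(audio_dir):
            report[split] = sum(f.endswith(".wav") for f in os.listdir(audio_dir))
        else:
            report[split] = 0
    return report
```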
Please refer to weights-generation, or use the weights we provide.
Please visit mp3d_reverberation and download manually. Below is an example of the directory structure of the reverberation data.
/path/to/reverb_root
├── train_reverberation.json
├── eval_reverberation.json
├── binaural
│   ├── 17DRP5sb8fy
│   │   ├── 0.npy
│   │   ├── 10.npy
│   │   ├── 17DRP5sb8fy.json
│   │   └── ...
│   └── 1LXtFkjw3qL
│       ├── 0.npy
│       ├── 10.npy
│       ├── 1LXtFkjw3qL.json
│       └── ...
└── mono
    ├── 17DRP5sb8fy
    └── ...
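The `.npy` files hold room impulse responses (RIRs) for each Matterport3D scene. As a minimal sketch of how a binaural RIR could be applied to a mono clip (the function and the `(2, rir_len)` array shape are assumptions for illustration; check them against the actual data before relying on this):

```python
import numpy as np

def apply_binaural_rir(mono, rir):
    """Convolve a mono waveform with a 2-channel room impulse response.

    mono: (num_samples,) float array, the dry source signal
    rir:  (2, rir_len) float array, e.g. np.load(".../0.npy")
    Returns a (2, num_samples + rir_len - 1) binaural signal.
    """
    return np.stack([np.convolve(mono, rir[ch]) for ch in range(rir.shape[0])])
```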
reverb_type=binaural # or mono / ambisonics (will be supported soon)
bash scripts/finetune-20k.sh $reverb_type
# bash scripts/finetune-2m.sh $reverb_type (if you have the full 2M AudioSet data)
We provide a finetuned checkpoint. You can run inference with:
bash scripts/inf.sh
# [11:38:56] Test: [ 0/34] eta: 0:02:28 time: 4.3705 data: 1.5862 max mem: 3805
# [11:39:15] Test: [33/34] eta: 0:00:00 time: 0.5546 data: 0.0026 max mem: 3850
# [11:39:15] Test: Total time: 0:00:23 (0.6922 s / it)
# [11:39:22] mAP: 0.497411
# [11:39:23] Accuracy of the network on the 17148 test images: 0.4974
# [11:39:23] distance accuracy: 67.62
# [11:39:23] doa error (20 degree): 24.21
# [11:39:23] doa angular error: 18.00
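If you want to collect the final numbers programmatically, a small parser over log lines like the sample above could look like this (the line format is inferred from the sample output, not guaranteed by the repo):

```python
import re

def parse_metrics(log_lines):
    """Extract "name: value" metrics from timestamped log lines.

    Matches lines such as "[11:39:22] mAP: 0.497411" or
    "[11:39:23] distance accuracy: 67.62" and returns {name: float}.
    """
    pattern = re.compile(r"\[\d{2}:\d{2}:\d{2}\]\s+(.+?):\s+([\d.]+)$")
    metrics = {}
    for line in log_lines:
        m = pattern.search(line)
        if m:
            metrics[m.group(1)] = float(m.group(2))
    return metrics
```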
The TODOs left will be completed before the end of June 2024.
- Environment setup
- Upload pretrained weights
- Fix numba output bug
- Update training data
- Replace tensorboard with W&B
- Inference colab
@article{zheng2024bat,
  author  = {Zheng, Zhisheng and Peng, Puyuan and Ma, Ziyang and Chen, Xie and Choi, Eunsol and Harwath, David},
  title   = {BAT: Learning to Reason about Spatial Sounds with Large Language Models},
  journal = {arXiv preprint arXiv:2402.01591},
  year    = {2024},
}
The codebase is based on the Audio-MAE repo.
This project is under the CC-BY 4.0 license. See LICENSE for details.