Audio Understanding Models: Speech + Text → Text
Audio Understanding and Generation Models: Speech → Speech
Rank | Model | Score | ASR | AST |
---|---|---|---|---|
🏅 | Gemini-1.5-Pro | 66 | 94 | 38 |
🥈 | GPT-4o-Realtime | 56 | 85 | 30 |
🥉 | qwen2-audio-instruction | 55 | 78 | 32 |
4 | Gemini-1.5-Flash | 42 | 55 | 29 |
5 | Qwen-Audio-Chat | 3 | -5 | 11 |
Rank | Model | Semantic | Acoustic | AudioArena |
---|---|---|---|---|
🏅 | GPT-4o-Realtime | 69 | 84 | 1212 |
🥈 | GLM-4-Voice | 42 | 80 | 1099 |
🥉 | Mini-Omni | 16 | 64 | 961 |
4 | Llama-Omni | 17 | 54 | 909 |
5 | Moshi | 2 | 66 | 823 |
- [2024/11/11] We support gpt-4o-realtime-preview-2024-10-01 (use it as `gpt4o_audio`)
- [2024/10/8] We support 30+ datasets!
- [2024/9/7] We support the `vocalsound` and `MELD` benchmarks!
- [2024/9/6] We support the `Qwen/Qwen2-Audio-7B` and `Qwen/Qwen2-Audio-7B-Instruct` models!
AudioEvals is an open-source framework designed for the evaluation of large audio models (Audio LLMs). With this tool, you can easily evaluate any Audio LLM in one go.
Not only do we offer a ready-to-use solution that includes a collection of audio benchmarks and evaluation methodologies, but we also provide the capability for you to customize your evaluations.
git clone https://github.com//AduioEval.git
cd AduioEval
conda create -n aduioeval python=3.10 -y
conda activate aduioeval
pip install -r requirments.txt
export PYTHONPATH=$PWD:$PYTHONPATH
mkdir log/
# evaluate the Gemini model (the Gemini API is only accessible from supported regions, e.g. the USA)
export GOOGLE_API_KEY=$your-key
python audio_evals/main.py --dataset KeSpeech-sample --model gemini-pro
# evaluate the qwen-audio API model
export DASHSCOPE_API_KEY=$your-key
python audio_evals/main.py --dataset KeSpeech-sample --model qwen-audio
# evaluate the qwen2-audio model locally (offline)
pip install -r requirments-offline-model.txt
python audio_evals/main.py --dataset KeSpeech-sample --model qwen2-audio-offline
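If you want to run several benchmarks in one go, the same CLI can be driven from a short script. Below is a minimal sketch (not part of the framework) that loops over a few of the supported datasets using only the documented `--dataset` and `--model` flags; adjust the names to whatever you want to evaluate, and make sure the corresponding API keys or offline requirements from above are in place.

```python
import subprocess

# Hypothetical batch runner: invokes the documented CLI once per dataset.
# Dataset and model names below are taken from the lists in this README.
datasets = ["KeSpeech-sample", "librispeech-test-clean", "covost2-en-zh"]
model = "qwen2-audio-offline"

for ds in datasets:
    print(f"=== evaluating {model} on {ds} ===")
    subprocess.run(
        ["python", "audio_evals/main.py", "--dataset", ds, "--model", model],
        check=True,  # stop early if one evaluation fails
    )
```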
After the program has executed, you will see the overall performance in the console, and detailed results are saved as below:
- res
|-- $time-$name-$dataset.jsonl
Numbers in parentheses are the official (reported) performance.
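Each line of the saved `.jsonl` file is one JSON record. As a quick way to inspect a run, the sketch below (not part of the framework) simply loads the files under `res/` and prints the keys of the first few records; the exact fields depend on the dataset and model, so no field names are assumed here.

```python
import json
from pathlib import Path

# Inspect the detailed result files written under res/.
# The actual filename encodes the run time, model name and dataset.
for path in sorted(Path("res").glob("*.jsonl")):
    print(f"--- {path} ---")
    with path.open() as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            print(i, sorted(record.keys()))  # show which fields each record carries
            if i >= 2:  # peek at the first three records only
                break
```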
To run the evaluation script, use the following command:
python audio_evals/main.py --dataset <dataset_name> --model <model_name>
The `--dataset` parameter specifies which dataset to use for evaluation. The following options are available:
- tedlium-release1
- tedlium-release2
- tedlium-release3
- catdog
- audiocaps
- covost2-en-ar
- covost2-en-ca
- covost2-en-cy
- covost2-en-de
- covost2-en-et
- covost2-en-fa
- covost2-en-id
- covost2-en-ja
- covost2-en-lv
- covost2-en-mn
- covost2-en-sl
- covost2-en-sv
- covost2-en-ta
- covost2-en-tr
- covost2-en-zh
- covost2-zh-en
- covost2-it-en
- covost2-fr-en
- covost2-es-en
- covost2-de-en
- GTZAN
- TESS
- nsynth
- meld-emo
- meld-sentiment
- clotho-aqa
- ravdess-emo
- ravdess-gender
- COVID-recognizer
- respiratory-crackles
- respiratory-wheezes
- KeSpeech
- audio-MNIST
- librispeech-test-clean
- librispeech-dev-clean
- librispeech-test-other
- librispeech-dev-other
- mls_dutch
- mls_french
- mls_german
- mls_italian
- mls_polish
- mls_portuguese
- mls_spanish
- heartbeat_sound
- vocalsound
- fleurs-zh
- voxceleb1
- voxceleb2
- chord-recognition
- wavcaps-audioset
- wavcaps-freesound
- wavcaps-soundbible
- air-foundation
- air-chat
- desed
- peoples-speech
- WenetSpeech-test-meeting
- WenetSpeech-test-net
- gigaspeech
- aishell-1
- cv-15-en
- cv-15-zh
- cv-15-fr
- cv-15-yue
<dataset_name> | name | task | domain | metric |
---|---|---|---|---|
tedlium-* | tedlium | ASR (Automatic Speech Recognition) | speech | wer |
clotho-aqa | ClothoAQA | AQA (AudioQA) | sound | acc |
catdog | catdog | AQA | sound | acc |
mls-* | multilingual_librispeech | ASR | speech | wer |
KeSpeech | KeSpeech | ASR | speech | cer |
librispeech-* | librispeech | ASR | speech | wer |
fleurs-* | FLEURS | ASR | speech | wer |
aishell-1 | AISHELL-1 | ASR | speech | wer |
WenetSpeech-* | WenetSpeech | ASR | speech | wer |
covost2-* | covost2 | STT (Speech-to-Text Translation) | speech | BLEU |
GTZAN | GTZAN | MQA(MusicQA) | music | acc |
TESS | TESS | EMO (emotion recognition) | speech | acc |
nsynth | nsynth | MQA | music | acc |
meld-emo | meld | EMO | speech | acc |
meld-sentiment | meld | SEN (sentiment recognition) | speech | acc |
ravdess-emo | ravdess | EMO | speech | acc |
ravdess-gender | ravdess | GEND (gender recognition) | speech | acc |
COVID-recognizer | COVID | MedicineCls | medicine | acc |
respiratory-* | respiratory | MedicineCls | medicine | acc |
audio-MNIST | audio-MNIST | AQA | speech | acc |
heartbeat_sound | heartbeat | MedicineCls | medicine | acc |
vocalsound | vocalsound | MedicineCls | medicine | acc |
voxceleb* | voxceleb | GEND | speech | acc |
chord-recognition | chord | MQA | music | acc |
wavcaps-* | wavcaps | AC (AudioCaption) | sound | acc |
air-foundation | AIR-BENCH | AC, GEND, MQA, EMO | sound, music, speech | acc |
air-chat | AIR-BENCH | AC, GEND, MQA, EMO | sound, music, speech | GPT4-score |
desed | desed | AQA | sound | acc |
peoples-speech | peoples-speech | ASR | speech | wer |
gigaspeech | gigaspeech | ASR | speech | wer |
cv-15-* | common voice 15 | ASR | speech | wer |
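The metrics above follow their standard definitions: `wer` and `cer` are word- and character-level edit distances normalized by the reference length, `acc` is plain accuracy, and `BLEU` is the usual translation metric. The sketch below is only an illustration of how WER/CER are defined; the framework's own implementation may differ, e.g. in text normalization.

```python
# Illustrative WER/CER computation via Levenshtein edit distance.
def edit_distance(ref, hyp):
    # single-row dynamic-programming Levenshtein distance over token sequences
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(
                d[j] + 1,         # deletion
                d[j - 1] + 1,     # insertion
                prev + (r != h),  # substitution (free if tokens match)
            )
    return d[len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(list(reference), list(hypothesis)) / max(len(reference), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```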
To evaluate your own dataset, see `docs/how add a dataset.md`.
The `--model` parameter specifies which model to use for evaluation. The following options are available:
- `qwen2-audio`: Use the Qwen2-Audio model.
- `gemini-pro`: Use the Gemini 1.5 Pro model.
- `gemini-1.5-flash`: Use the Gemini 1.5 Flash model.
- `qwen-audio`: Use the qwen2-audio-instruct Audio API model.
To evaluate your own model, see `docs/how eval your model.md`.
If you have questions, suggestions, or feature requests regarding AudioEvals, please open a GitHub Issue so that we can jointly build an open and transparent UltraEval evaluation community.