This repository contains the implementation of our novel approach to music emotion recognition across multiple datasets using Large Language Model (LLM) embeddings. Our method involves:
- Computing LLM embeddings for emotion labels
- Clustering similar labels across datasets
- Mapping music features (MERT) to the LLM embedding space
- Introducing alignment regularization to improve cluster dissociation and enhance generalization to unseen datasets for zero-shot classification
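The zero-shot idea above can be sketched in a few lines: audio features are projected into the LLM label-embedding space, and a clip is classified by the nearest emotion label under cosine similarity. All names, vectors, and the 4-dimensional embeddings below are illustrative stand-ins, not the repository's actual API.

```python
import math

# Pretend LLM embeddings for three emotion labels (illustrative 4-d vectors).
label_embeddings = {
    "happy": [1.0, 0.2, 0.0, 0.1],
    "sad": [-1.0, 0.1, 0.3, 0.0],
    "calm": [0.0, 1.0, -0.2, 0.4],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_classify(projected_audio, labels):
    """Pick the label whose embedding is closest to the projected audio feature."""
    return max(labels, key=lambda name: cosine(projected_audio, labels[name]))

audio_vec = [0.9, 0.3, 0.05, 0.1]  # stand-in for a MERT feature mapped into label space
print(zero_shot_classify(audio_vec, label_embeddings))  # prints "happy"
```

Because classification reduces to nearest-neighbor search over label embeddings, labels from a dataset never seen during training can be scored the same way, which is what enables the zero-shot evaluation described below.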
- Clone the repository:

  ```bash
  git clone https://github.com/AMAAI-Lab/cross-dataset-emotion-alignment.git
  cd cross-dataset-emotion-alignment
  ```
- Set up a Conda environment:

  ```bash
  conda env create -f environment.yaml
  conda activate cross-dataset-emotion-alignment
  ```
- Option A: Download the MTG-Jamendo, CAL500, and Emotify datasets and place them in the `data/` directory.
- Option B: Use the preprocessed data from Hugging Face.
- If using Option A, preprocess the data and compute MERT features:

  ```bash
  python src/data/preprocess.py --audio_duration 10
  ```
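The `--audio_duration 10` flag presumably controls the length of the audio windows that MERT features are computed over. A minimal stdlib sketch of that kind of fixed-window chunking (not the repository's actual preprocessing code; `segment_audio` is a hypothetical helper):

```python
def segment_audio(waveform, sample_rate, duration=10):
    """Split a 1-D list of samples into fixed-duration segments, dropping the remainder."""
    seg_len = sample_rate * duration
    n_segments = len(waveform) // seg_len
    return [waveform[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

# e.g. 25 samples at 1 Hz with 10 s windows -> two segments, 5 trailing samples dropped
segments = segment_audio([0.0] * 25, sample_rate=1, duration=10)
print(len(segments), len(segments[0]))  # prints "2 10"
```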
Configure your training with various options:
- Baseline dataset training:

  ```bash
  # Single dataset with label augmentation
  python src/train.py data.combine_train_datasets=["mtg"] version=your_version +experiment=label_aug

  # Multi-dataset without label augmentation
  python src/train.py data.combine_train_datasets=["mtg", "CAL500"] version=your_version
  ```
- Label clustering (includes label augmentation by default):

  ```bash
  python src/train.py +cluster=mtg_CAL500 data.train_combine_mode=max_size_cycle
  ```
- Alignment regularization:

  ```bash
  python src/train.py +cluster=reg_mtg_CAL500 model.regularization_alpha=2.5 data.train_combine_mode=max_size_cycle
  ```
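One plausible reading of how `model.regularization_alpha` enters training: the total loss is the classification loss plus `alpha` times an alignment penalty that pulls each projected audio feature toward its label cluster's centroid. The sketch below illustrates that idea only; the function names and the squared-distance penalty are assumptions, not the repository's implementation.

```python
def alignment_penalty(features, centroids, assignments):
    """Mean squared distance from each feature to its assigned cluster centroid."""
    total = 0.0
    for feat, idx in zip(features, assignments):
        total += sum((f - c) ** 2 for f, c in zip(feat, centroids[idx]))
    return total / len(features)

def total_loss(cls_loss, features, centroids, assignments, alpha=2.5):
    # alpha plays the role of regularization_alpha: it trades off fitting the
    # labels against keeping features tight around their cluster centroids.
    return cls_loss + alpha * alignment_penalty(features, centroids, assignments)

# A feature that already sits on its centroid incurs no extra penalty:
print(total_loss(0.5, [[1.0, 0.0]], [[1.0, 0.0]], [0]))  # prints "0.5"
```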
Available options:
- Datasets: `mtg`, `CAL500`, `emotify`
- Clusters: `mtg_CAL500`, `mtg_emotify`, `CAL500_emotify`
- Regularizations: `reg_mtg_CAL500`, `reg_mtg_emotify`, `reg_CAL500_emotify`
To evaluate a trained model, replace `train.py` with `eval.py` in the command and specify the checkpoint path. You can also load the model from a wandb artifact by uncommenting the wandb-related configs in `conf/eval.yaml` and providing your wandb project, entity, and artifact name. By default, evaluation is performed on all three datasets; to evaluate on other datasets, modify `data/multi.yaml`.
Example commands:
- Evaluate on a single dataset:

  ```bash
  python src/eval.py data.combine_train_datasets=["mtg"] ckpt_path=path/to/checkpoint.ckpt
  ```
- Evaluate with clustering:

  ```bash
  python src/eval.py +cluster=mtg_CAL500 data.train_combine_mode=max_size_cycle ckpt_path=path/to/checkpoint.ckpt
  ```
- Evaluate with alignment regularization:

  ```bash
  python src/eval.py +cluster=reg_mtg_CAL500 model.regularization_alpha=2.5 data.train_combine_mode=max_size_cycle ckpt_path=path/to/checkpoint.ckpt
  ```
We present our results for both segment-level and song-level predictions. Song-level results are obtained through majority voting of segment-level predictions.
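The majority-voting step can be written directly with the standard library; the ties and label names here are illustrative:

```python
from collections import Counter

def song_level_prediction(segment_preds):
    """Majority vote over a song's segment-level label predictions."""
    return Counter(segment_preds).most_common(1)[0][0]

print(song_level_prediction(["happy", "sad", "happy", "happy", "calm"]))  # prints "happy"
```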
Segment-level results:

| Phases | Train-MTG+CAL, Test-EMO | Train-CAL+EMO, Test-MTG | Train-EMO+MTG, Test-CAL |
|---|---|---|---|
| Baseline 1 | 0.324 | 0.0215 | 0.255 |
| Baseline 2 (Clustering) | 0.341 | 0.0240 | 0.262 |
| Alignment Regularization | 0.402 (λ = 2.5) | 0.0248 (λ = 2.5) | 0.262 (λ = 1) |
Song-level results:

| Phases | Train-MTG+CAL, Test-EMO | Train-CAL+EMO, Test-MTG | Train-EMO+MTG, Test-CAL |
|---|---|---|---|
| Baseline 1 | 0.315 | 0.0129 | 0.252 |
| Baseline 2 (Clustering) | 0.346 | 0.0175 | 0.267 |
| Alignment Regularization | 0.400 (λ = 2.5) | 0.0202 (λ = 2.5) | 0.229 (λ = 1) |
These results demonstrate the effectiveness of our approach, particularly the alignment regularization technique, in improving cross-dataset generalization for music emotion recognition tasks.