This repository shares the datasets used to create reproducible results for music semantic understanding. We provide a preprocessor that generates KV-style (key-value) annotation files and track split files, along with a resampler. This helps with re-implementing the research. This is the dataset repository for the paper: Toward Universal Text-to-Music Retrieval.
bash scripts/download_splits.sh
MagnaTagATune (MTAT) annotation example:
{
  "2": {
    "track_id": "2",
    "tag": [
      "classical",
      "strings",
      "opera",
      "violin"
    ],
    "extra_tag": [
      "classical",
      "strings",
      "opera",
      "violin"
    ],
    "title": "BWV54 - I Aria",
    "artist_name": "American Bach Soloists",
    "release": "J.S. Bach Solo Cantatas",
    "path": "f/american_bach_soloists-j_s__bach_solo_cantatas-01-bwv54__i_aria-30-59.mp3"
  }
}
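Each KV-style annotation file can be loaded with the standard json module and indexed by track id. A minimal sketch, assuming the MTAT annotation is saved as dataset/mtat/annotation.json (a hypothetical path; the actual location is determined by the preprocessor):

import json

# hypothetical path; the preprocessor determines the actual annotation location
with open("dataset/mtat/annotation.json") as f:
    annotation = json.load(f)

item = annotation["2"]     # look up a clip by its track_id key
print(item["title"])       # BWV54 - I Aria
print(item["tag"])         # ['classical', 'strings', 'opera', 'violin']
print(item["path"])        # relative path to the source mp3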
GTZAN annotation example:
{
  "blues.00029.wav": {
    "artist_name": "Kelly Joe Phelps",
    "title": "The House Carpenter",
    "key": "minor d",
    "tempo": 126,
    "tag": "blues",
    "track_id": "blues.00029.wav"
  }
}
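For single-label datasets such as GTZAN, the tag field holds one genre string, which is typically mapped to a one-hot (binary) target over the label vocabulary, similar in spirit to the tag_to_binary helper in the PyTorch dataset further below. A minimal, self-contained sketch of that mapping (the GENRES list is spelled out here only for illustration):

import numpy as np

GENRES = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]

def genre_to_binary(tag):
    # one-hot vector over the GTZAN genre vocabulary
    binary = np.zeros(len(GENRES), dtype=np.float32)
    binary[GENRES.index(tag)] = 1.0
    return binary

print(genre_to_binary("blues"))   # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]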
The selection criteria are as follows: a dataset is included if it 1) contains commercial music suitable for retrieval, 2) is publicly accessible (at least upon request), and 3) provides categorical single- or multi-label annotations to support text-based retrieval scenarios.
Dataset | # of Clips | # of Labels | Avg. Tags | Task | Source |
---|---|---|---|---|---|
MTAT1 | 25,860 | 50 | 2.70 | Tagging | Link, Split |
MTAT2 | 21,108 | 50 | 3.30 | Tagging | Link, Split |
MTG top50s | 54,380 | 50 | 3.07 | Tagging | Link |
MTG Genre | 55,094 | 87 | 2.44 | Genre | Link |
FMA Small | 8,000 | 8 | 1 | Genre | Link |
GTZAN | 930 | 10 | 1 | Genre | Link, Split |
MTG Inst | 24,976 | 40 | 2.57 | Instrument | Link |
KVT | 6,787 | 42 | 22.78 | Vocal | Link |
MTG Mood Theme | 17,982 | 56 | 1.77 | Mood/Theme | Link |
Emotify | 400 | 9 | 1 | Mood | Link |
We summarize all the datasets and tasks in the table above. MagnaTagATune (MTAT) consists of 25k music clips from 5,223 unique songs. Following previous work, we use the published splits and top-50 tags, and we do not compare results with previous works that use different splits. MTG-Jamendo (MTG) contains 55,094 full audio tracks with 183 tags covering genre, instrument, and mood/theme. We use the official splits (split-0) of each category for the tagging, genre, instrument, and mood/theme tasks. For single-label genre classification, we use the fault-filtered version of GTZAN (GZ) and the "small" version of the Free Music Archive (FMA-Small). For the vocal attribute recognition task, we use the K-pop Vocal Tag (KVT) dataset, which consists of 6,787 vocal segments from K-pop music tracks. All segments are annotated with 42 semantic tags describing various vocal styles, including pitch range, timbre, playing techniques, and gender. For the categorical mood recognition task, we use the Emotify dataset, which consists of 400 excerpts in 4 genres with 9 emotional categories.
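The downloaded split files pair each benchmark with fixed train/validation/test track lists. The exact file layout is defined by the preprocessing scripts; the sketch below only illustrates the idea of filtering a KV-style annotation down to one split, assuming a hypothetical split.json that stores lists of track ids under "train", "valid", and "test" keys:

import json

# hypothetical paths and keys; the real layout is produced by the preprocessor
with open("dataset/gtzan/split.json") as f:
    split = json.load(f)
with open("dataset/gtzan/annotation.json") as f:
    annotation = json.load(f)

train_items = [annotation[tid] for tid in split["train"] if tid in annotation]
print(len(train_items), "training clips")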
We provide PyTorch datasets based on our preprocessing. There are two types of PyTorch Dataset: 1) a waveform loader for training and embedding extraction, and 2) an embedding loader for probing tasks. An example waveform loader for GTZAN, followed by a usage sketch, is shown below.
import os
import random

import numpy as np
import torch
from torch.utils.data import Dataset


class GTZAN_Dataset(Dataset):
    def __init__(self, data_path, split, sr, duration, num_chunks):
        """
        data_path (str): location of the msu-benchmark data
        split (str): one of {TRAIN, VALID, TEST}
        sr (int): sampling rate of the waveform (16000)
        duration (float): length of each training chunk in seconds
        num_chunks (int): number of chunks used for inference audio
        get_split (Fn): load one of the TRAIN, VALID, TEST splits
        get_file_list (Fn): build the list of data files
        """
        self.data_path = data_path
        self.split = split
        self.sr = sr
        self.input_length = int(sr * duration)
        self.num_chunks = num_chunks
        self.get_split()
        self.get_file_list()

    def get_split(self):
        ...

    def get_file_list(self):
        ...

    def tag_to_binary(self, text):
        ...

    def audio_load(self, track_id):
        # memory-map the pre-resampled .npy file and sample a random chunk
        audio = np.load(os.path.join(self.data_path, "gtzan", "npy", track_id.replace(".wav", ".npy")), mmap_mode='r')
        random_idx = random.randint(0, audio.shape[-1] - self.input_length)
        audio = torch.from_numpy(np.array(audio[random_idx:random_idx + self.input_length]))
        return audio

    def get_train_item(self, index):
        item = self.fl[index]
        tag_list = item['tag']
        binary = self.tag_to_binary(tag_list)
        audio_tensor = self.audio_load(str(item['track_id']))
        return {
            "audio": audio_tensor,
            "binary": binary
        }

    def get_eval_item(self, index):
        item = self.fl[index]
        tag_list = item['tag']
        binary = self.tag_to_binary(tag_list)
        text = ", ".join(tag_list)
        tags = self.list_of_label
        track_id = item['track_id']
        # split the full track into fixed-length chunks for chunk-level inference
        audio = np.load(os.path.join(self.data_path, "gtzan", "npy", track_id.replace(".wav", ".npy")), mmap_mode='r')
        hop = (len(audio) - self.input_length) // self.num_chunks
        audio = np.stack([np.array(audio[i * hop: i * hop + self.input_length]) for i in range(self.num_chunks)]).astype('float32')
        return {
            "audio": audio,
            "track_id": track_id,
            "tags": tags,
            "binary": binary,
            "text": text
        }

    def __getitem__(self, index):
        if (self.split == 'TRAIN') or (self.split == 'VALID'):
            return self.get_train_item(index)
        else:
            return self.get_eval_item(index)

    def __len__(self):
        return len(self.fl)
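A minimal usage sketch of the waveform loader with a PyTorch DataLoader, assuming the stubbed get_split / get_file_list / tag_to_binary methods are filled in as in the repository and data_path points to a local copy of the preprocessed benchmark (the path and hyperparameters below are illustrative assumptions):

from torch.utils.data import DataLoader

# hypothetical local path to the preprocessed benchmark
train_set = GTZAN_Dataset(data_path="./dataset", split="TRAIN",
                          sr=16000, duration=3.0, num_chunks=3)
train_loader = DataLoader(train_set, batch_size=16, shuffle=True, num_workers=4)

for batch in train_loader:
    waveform, target = batch["audio"], batch["binary"]  # (B, sr*duration), (B, num_labels)
    break

The embedding loader for probing tasks follows the same interface but returns pre-extracted embeddings instead of raw waveforms.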
# download split and audio files
bash scripts/download.sh
# preprocess all datasets
cd preprocessing
python main.py
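The preprocessing step also resamples the raw audio and caches it as .npy arrays, which is what audio_load above memory-maps. A minimal sketch of that resampling step, assuming librosa is available and a 16 kHz mono target (the file names are placeholders, not the repository's actual layout):

import librosa
import numpy as np

SR = 16000  # target sampling rate used by the waveform loader

def resample_to_npy(src_path, dst_path, sr=SR):
    # decode as mono at the target rate and cache as a float32 numpy array
    y, _ = librosa.load(src_path, sr=sr, mono=True)
    np.save(dst_path, y.astype(np.float32))

resample_to_npy("blues.00029.wav", "blues.00029.npy")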
If you have difficulty accessing the dataset audio files, please contact [email protected].
Please consider citing our paper in your publications if this project helps your research. The BibTeX reference is as follows:
@inproceedings{doh2023toward,
  title={Toward Universal Text-to-Music Retrieval},
  author={Doh, SeungHeon and Won, Minz and Choi, Keunwoo and Nam, Juhan},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2023}
}