This repository shares the datasets used to create reproducible results for music semantic understanding. We provide a preprocessor that generates KV-style (key-value) annotation files and track split files, along with a resampler. This helps with re-implementing the research. This is the dataset repository for the paper: Toward Universal Text-to-Music Retrieval.
bash scripts/download_splits.sh
MagnaTagATune (MTAT) annotation example:
{
  "2": {
    "track_id": "2",
    "tag": [
      "classical",
      "strings",
      "opera",
      "violin"
    ],
    "extra_tag": [
      "classical",
      "strings",
      "opera",
      "violin"
    ],
    "title": "BWV54 - I Aria",
    "artist_name": "American Bach Soloists",
    "release": "J.S. Bach Solo Cantatas",
    "path": "f/american_bach_soloists-j_s__bach_solo_cantatas-01-bwv54__i_aria-30-59.mp3"
  }
}
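Each KV-style annotation file can be loaded with the standard json module and indexed by track id. A minimal sketch, assuming the MTAT annotation is saved as dataset/mtat/annotation.json (a hypothetical path; the actual location is determined by the preprocessor):

import json

# hypothetical path; the preprocessor determines the actual annotation location
with open("dataset/mtat/annotation.json") as f:
    annotation = json.load(f)

item = annotation["2"]     # look up a clip by its track_id key
print(item["title"])       # BWV54 - I Aria
print(item["tag"])         # ['classical', 'strings', 'opera', 'violin']
print(item["path"])        # relative path to the source mp3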
GTZAN annotation example:
{
  "blues.00029.wav": {
    "artist_name": "Kelly Joe Phelps",
    "title": "The House Carpenter",
    "key": "minor d",
    "tempo": 126,
    "tag": "blues",
    "track_id": "blues.00029.wav"
  }
}
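For single-label datasets such as GTZAN, the tag field holds one genre string, which is typically mapped to a one-hot (binary) target over the label vocabulary, similar in spirit to the tag_to_binary helper in the PyTorch dataset further below. A minimal, self-contained sketch of that mapping (the GENRES list is spelled out here only for illustration):

import numpy as np

GENRES = ["blues", "classical", "country", "disco", "hiphop",
          "jazz", "metal", "pop", "reggae", "rock"]

def genre_to_binary(tag):
    # one-hot vector over the GTZAN genre vocabulary
    binary = np.zeros(len(GENRES), dtype=np.float32)
    binary[GENRES.index(tag)] = 1.0
    return binary

print(genre_to_binary("blues"))   # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]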
The selection criteria are as follows: a dataset is included if it 1) contains commercial music suitable for retrieval, 2) is publicly accessible (at least upon request), and 3) provides categorical single- or multi-label annotations to support text-based retrieval scenarios.
Dataset | # of Clips | # of Labels | Avg. Tags | Task | Source |
---|---|---|---|---|---|
MTAT1 | 25,860 | 50 | 2.70 | Tagging | Link, Split |
MTAT2 | 21,108 | 50 | 3.30 | Tagging | Link, Split |
MTG top50s | 54,380 | 50 | 3.07 | Tagging | Link |
MTG Genre | 55,094 | 87 | 2.44 | Genre | Link |
FMA Small | 8,000 | 8 | 1 | Genre | Link |
GTZAN | 930 | 10 | 1 | Genre | Link, Split |
MTG Inst | 24,976 | 40 | 2.57 | Instrument | Link |
KVT | 6,787 | 42 | 22.78 | Vocal | Link |
MTG Mood Theme | 17,982 | 56 | 1.77 | Mood/Theme | Link |
Emotify | 400 | 9 | 1 | Mood | Link |
We summarize all the datasets and tasks in the table above. MagnaTagATune (MTAT) consists of 25k music clips from 5,223 unique songs. Following previous work, we use the published splits and top-50 tags, and we do not compare results with previous works that use different splits. MTG-Jamendo (MTG) contains 55,094 full audio tracks with 183 tags covering genre, instrument, and mood/theme. We use the official splits (split-0) of each category for the tagging, genre, instrument, and mood/theme tasks. For single-label genre classification, we use the fault-filtered version of GTZAN (GZ) and the "small" version of the Free Music Archive (FMA-Small). For the vocal attribute recognition task, we use the K-pop Vocal Tag (KVT) dataset, which consists of 6,787 vocal segments from K-pop music tracks. All segments are annotated with 42 semantic tags describing various vocal styles, including pitch range, timbre, playing techniques, and gender. For the categorical mood recognition task, we use the Emotify dataset, which consists of 400 excerpts in 4 genres with 9 emotional categories.
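The downloaded split files pair each benchmark with fixed train/validation/test track lists. The exact file layout is defined by the preprocessing scripts; the sketch below only illustrates the idea of filtering a KV-style annotation down to one split, assuming a hypothetical split.json that stores lists of track ids under "train", "valid", and "test" keys:

import json

# hypothetical paths and keys; the real layout is produced by the preprocessor
with open("dataset/gtzan/split.json") as f:
    split = json.load(f)
with open("dataset/gtzan/annotation.json") as f:
    annotation = json.load(f)

train_items = [annotation[tid] for tid in split["train"] if tid in annotation]
print(len(train_items), "training clips")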
We provide PyTorch datasets based on our preprocessing. There are two types of PyTorch Dataset: 1) a waveform loader for training and embedding extraction, and 2) an embedding loader for probing tasks. An example waveform loader for GTZAN, followed by a usage sketch, is shown below.
import os
import random

import numpy as np
import torch
from torch.utils.data import Dataset


class GTZAN_Dataset(Dataset):
    def __init__(self, data_path, split, sr, duration, num_chunks):
        """
        data_path (str): location of the msu-benchmark data
        split (str): one of {TRAIN, VALID, TEST}
        sr (int): sampling rate of the waveform (16000)
        duration (float): length of each training chunk in seconds
        num_chunks (int): number of chunks used for inference audio
        get_split (Fn): load one of the TRAIN, VALID, TEST splits
        get_file_list (Fn): build the list of data files
        """
        self.data_path = data_path
        self.split = split
        self.sr = sr
        self.input_length = int(sr * duration)
        self.num_chunks = num_chunks
        self.get_split()
        self.get_file_list()

    def get_split(self):
        ...

    def get_file_list(self):
        ...

    def tag_to_binary(self, text):
        ...

    def audio_load(self, track_id):
        # memory-map the pre-resampled .npy file and sample a random chunk
        audio = np.load(os.path.join(self.data_path, "gtzan", "npy", track_id.replace(".wav", ".npy")), mmap_mode='r')
        random_idx = random.randint(0, audio.shape[-1] - self.input_length)
        audio = torch.from_numpy(np.array(audio[random_idx:random_idx + self.input_length]))
        return audio

    def get_train_item(self, index):
        item = self.fl[index]
        tag_list = item['tag']
        binary = self.tag_to_binary(tag_list)
        audio_tensor = self.audio_load(str(item['track_id']))
        return {
            "audio": audio_tensor,
            "binary": binary
        }

    def get_eval_item(self, index):
        item = self.fl[index]
        tag_list = item['tag']
        binary = self.tag_to_binary(tag_list)
        text = ", ".join(tag_list)
        tags = self.list_of_label
        track_id = item['track_id']
        # split the full track into fixed-length chunks for chunk-level inference
        audio = np.load(os.path.join(self.data_path, "gtzan", "npy", track_id.replace(".wav", ".npy")), mmap_mode='r')
        hop = (len(audio) - self.input_length) // self.num_chunks
        audio = np.stack([np.array(audio[i * hop: i * hop + self.input_length]) for i in range(self.num_chunks)]).astype('float32')
        return {
            "audio": audio,
            "track_id": track_id,
            "tags": tags,
            "binary": binary,
            "text": text
        }

    def __getitem__(self, index):
        if (self.split == 'TRAIN') or (self.split == 'VALID'):
            return self.get_train_item(index)
        else:
            return self.get_eval_item(index)

    def __len__(self):
        return len(self.fl)
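A minimal usage sketch of the waveform loader with a PyTorch DataLoader, assuming the stubbed get_split / get_file_list / tag_to_binary methods are filled in as in the repository and data_path points to a local copy of the preprocessed benchmark (the path and hyperparameters below are illustrative assumptions):

from torch.utils.data import DataLoader

# hypothetical local path to the preprocessed benchmark
train_set = GTZAN_Dataset(data_path="./dataset", split="TRAIN",
                          sr=16000, duration=3.0, num_chunks=3)
train_loader = DataLoader(train_set, batch_size=16, shuffle=True, num_workers=4)

for batch in train_loader:
    waveform, target = batch["audio"], batch["binary"]  # (B, sr*duration), (B, num_labels)
    break

The embedding loader for probing tasks follows the same interface but returns pre-extracted embeddings instead of raw waveforms.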
# download split and audio files
bash scripts/download.sh
# preprocess all datasets
cd preprocessing
python main.py
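The preprocessing step also resamples the raw audio and caches it as .npy arrays, which is what audio_load above memory-maps. A minimal sketch of that resampling step, assuming librosa is available and a 16 kHz mono target (the file names are placeholders, not the repository's actual layout):

import librosa
import numpy as np

SR = 16000  # target sampling rate used by the waveform loader

def resample_to_npy(src_path, dst_path, sr=SR):
    # decode as mono at the target rate and cache as a float32 numpy array
    y, _ = librosa.load(src_path, sr=sr, mono=True)
    np.save(dst_path, y.astype(np.float32))

resample_to_npy("blues.00029.wav", "blues.00029.npy")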
If you have difficulty accessing the dataset audio files, please contact [email protected].
Please consider citing our paper in your publications if this project helps your research. The BibTeX reference is as follows:
@inproceedings{doh2023toward,
  title={Toward Universal Text-to-Music Retrieval},
  author={Doh, SeungHeon and Won, Minz and Choi, Keunwoo and Nam, Juhan},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2023}
}