Mixat is a dataset of Emirati speech code-mixed with English. It consists of 15 hours of speech derived from two public podcasts featuring native Emirati speakers. The data collection process, annotation, and dataset statistics are described in detail in the accompanying paper.
If you use this dataset, please cite the following paper:
@inproceedings{al-ali-aldarmaki-2024-mixat,
title = "Mixat: A Data Set of Bilingual Emirati-{E}nglish Speech",
author = "Al Ali, Maryam Khalifa and
Aldarmaki, Hanan",
booktitle = "Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.sigul-1.26",
pages = "222--226"
}
- Total duration: 15 hours
- Number of sentences: 5,307
- Percentage of code-switched sentences: 36%
- Average Code Mixing Index (CMI): 0.11
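For readers unfamiliar with the Code Mixing Index, the sketch below shows one widely used utterance-level formulation (Das and Gambäck, 2014), rescaled to the range 0 to 1. Whether the accompanying paper computes CMI in exactly this way is an assumption; the language tags `"ar"`, `"en"`, and `"other"` and both function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, Sequence


def code_mixing_index(lang_tags: Sequence[str], neutral_tag: str = "other") -> float:
    """Utterance-level CMI in [0, 1], following Das and Gambäck (2014).

    lang_tags holds one language label per token. Tokens labelled with
    neutral_tag (a hypothetical label for language-independent tokens such
    as numbers or punctuation) are excluded from the computation.
    """
    counts = Counter(tag for tag in lang_tags if tag != neutral_tag)
    tagged = sum(counts.values())  # n - u: tokens that carry a language label
    if tagged == 0:
        return 0.0  # nothing to mix
    # One minus the share of the dominant language among tagged tokens.
    return 1.0 - max(counts.values()) / tagged


def average_cmi(utterances: Iterable[Sequence[str]]) -> float:
    """Mean of the utterance-level CMI values (0.0 for an empty corpus)."""
    scores = [code_mixing_index(tags) for tags in utterances]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this definition a monolingual sentence scores 0, and a sentence split evenly between two languages scores 0.5, so a low corpus average is compatible with a substantial share of code-switched sentences.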
Mixat - Part 1
This part consists of conversational, multi-speaker utterances from The Direction podcast.
- Total sentences: 3,723
- Code-switched sentences: 1,258
Mixat - Part 2
This part consists of narrated utterances by a single female speaker, from the Think With Hessa podcast.
- Total sentences: 1,584
- Code-switched sentences: 805
The Mixat dataset is publicly available for research purposes. We recommend using Part 1 for training, and Part 2 for testing.
- Mixat - Part 1.zip contains the audio files, in .wav format, for Part 1.
- Mixat - Part 2.zip contains the audio files, in .wav format, for Part 2.
- metadata.csv contains the text transcriptions for both parts.
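As a rough illustration of how the files fit together, the sketch below pairs each transcription in metadata.csv with its audio file. The column names `file_name` and `transcription` and the function name `load_transcripts` are hypothetical; this page does not document the layout of metadata.csv, so check the actual header row and adjust the arguments accordingly.

```python
import csv
from pathlib import Path
from typing import List, Tuple


def load_transcripts(
    metadata_path: str,
    audio_dir: str,
    file_col: str = "file_name",      # hypothetical column name
    text_col: str = "transcription",  # hypothetical column name
) -> List[Tuple[Path, str]]:
    """Return (audio path, transcription) pairs read from a metadata CSV."""
    pairs = []
    # UTF-8 is needed because the transcriptions mix Arabic and Latin script.
    with open(metadata_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            pairs.append((Path(audio_dir) / row[file_col], row[text_col]))
    return pairs
```

Calling `load_transcripts` once per unzipped audio directory, for example one call for Part 1 and one for Part 2, would yield separate lists that match the recommended training and testing split, provided the rows for each part can be told apart in the metadata.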
This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
We thank Mr. Mohammad Al Awadhi, host of The Direction podcast, and Ms. Hessa Alsuwaidi, host of the Think With Hessa podcast, for allowing us to use their content to create a dataset that supports academic research on Emirati speech.