Mixat: A Data Set of Bilingual Emirati-English Speech

Mixat is a dataset of Emirati speech code-mixed with English. The dataset consists of 15 hours of speech derived from two public podcasts featuring native Emirati speakers. The data collection process, annotation, and dataset statistics are described in detail in the accompanying paper. If you use this data set, please cite the following paper:

@inproceedings{al-ali-aldarmaki-2024-mixat,
    title = "Mixat: A Data Set of Bilingual Emirati-{E}nglish Speech",
    author = "Al Ali, Maryam Khalifa  and
      Aldarmaki, Hanan",
    booktitle = "Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.sigul-1.26",
    pages = "222--226"
}

Data Set Statistics

Total duration: 15 hours
Number of sentences: 5,307
Percentage of code-switched sentences: 36%
Average Code Mixing Index (CMI): 0.11
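
This README does not define the Code Mixing Index. As a rough illustration, one widely used utterance-level formulation takes one minus the share of tokens in the dominant language, ignoring language-neutral tokens. The sketch below implements that on a 0-1 scale. The scale, the tag names (`ar`, `en`, `other`), and the function name are assumptions made for illustration, and the accompanying paper may use a different variant.

```python
from collections import Counter


def code_mixing_index(lang_tags, neutral_tag="other"):
    """Utterance-level CMI on a 0-1 scale (assumed formulation, see lead-in).

    lang_tags holds one language tag per token. Tokens tagged with
    neutral_tag are treated as language-independent and are not counted.
    """
    counted = [tag for tag in lang_tags if tag != neutral_tag]
    if not counted:
        # No language-specific tokens: treat the utterance as not mixed.
        return 0.0
    dominant_count = Counter(counted).most_common(1)[0][1]
    return 1.0 - dominant_count / len(counted)


# Hypothetical token tags for one utterance: three Arabic tokens,
# one English token, and one language-neutral token.
tags = ["ar", "ar", "ar", "en", "other"]
print(code_mixing_index(tags))  # 0.25
```

Under this formulation a monolingual utterance scores 0, and a corpus-level average would be the mean of the per-utterance scores.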

Breakdown by podcast:

Mixat - Part 1: Download
This part consists of conversational, multi-speaker utterances from The Direction podcast.
- Total sentences: 3,723
- Code-switched sentences: 1,258

Mixat - Part 2: Download
This part consists of narrated utterances by a single female speaker, from the Think With Hessa podcast.
- Total sentences: 1,584
- Code-switched sentences: 805

Usage

The Mixat dataset is publicly available for research purposes. We recommend using Part 1 for training, and Part 2 for testing.
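
A minimal sketch of the recommended split follows. It assumes each metadata row carries a field identifying its part. The column name `part` and the values `"1"` and `"2"` are hypothetical, since this README does not list the columns of metadata.csv; adapt them to the real header (for example, by deriving the part from the audio file path).

```python
def split_by_part(rows, part_column="part"):
    """Return (train_rows, test_rows): Part 1 for training, Part 2 for testing."""
    train = [row for row in rows if row[part_column] == "1"]
    test = [row for row in rows if row[part_column] == "2"]
    return train, test


# Hypothetical rows; the real metadata.csv may identify the part differently.
rows = [
    {"file_name": "a.wav", "part": "1"},
    {"file_name": "b.wav", "part": "1"},
    {"file_name": "c.wav", "part": "2"},
]
train_rows, test_rows = split_by_part(rows)
print(len(train_rows), len(test_rows))  # 2 1
```

Because Part 1 is multi-speaker conversation and Part 2 is single-speaker narration, results on this split likely reflect generalization across both speakers and speaking styles.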

File Structure

Mixat - Part 1.zip contains the audio files, in .wav format, for Part 1.
Mixat - Part 2.zip contains the audio files, in .wav format, for Part 2.
metadata.csv contains the text transcriptions for both parts.
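
As a starting point for working with these files, the sketch below reads transcription rows with Python's standard `csv` module and measures clip duration with the standard `wave` module. The column names `file_name` and `transcription`, the sample paths, and the 16 kHz synthetic clip are assumptions made for illustration; check the actual header of metadata.csv and the real audio format before relying on them.

```python
import csv
import io
import wave


def read_metadata(source):
    """Read transcription rows from an open CSV text file into a list of dicts."""
    return list(csv.DictReader(source))


def wav_duration_seconds(source):
    """Return the duration in seconds of a PCM .wav file (path string or binary file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Hypothetical two-row sample; the real metadata.csv header may differ.
sample = io.StringIO(
    "file_name,transcription\n"
    "part1/utt_0001.wav,example transcript one\n"
    "part2/utt_0001.wav,example transcript two\n"
)
rows = read_metadata(sample)
print(rows[0]["file_name"], "->", rows[0]["transcription"])

# Synthetic one-second mono clip (16-bit, 16 kHz) standing in for a real file.
clip = io.BytesIO()
with wave.open(clip, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)
    writer.setframerate(16000)
    writer.writeframes(b"\x00\x00" * 16000)
clip.seek(0)
duration = wav_duration_seconds(clip)
print(duration)  # 1.0
```

With the real data, `read_metadata(open("metadata.csv", newline="", encoding="utf-8"))` would load the rows, and summing `wav_duration_seconds` over the extracted .wav files gives a way to check the total duration reported above.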

License

This dataset is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Acknowledgement

We thank Mr. Mohammad Al Awadhi, host of The Direction podcast, and Ms. Hessa Alsuwaidi, host of the Think With Hessa podcast, for allowing us to use their content to create a dataset that supports academic research on Emirati speech.
