PolyWER: A Holistic Evaluation Framework for Code-Switched Speech Recognition
Code-switched speech can be correctly transcribed in several forms, including different transliterations of the embedded language. Traditional metrics such as Word Error Rate (WER) are too strict to handle this, because they penalize any hypothesis that does not match a single reference form. We introduce PolyWER, a framework for evaluating ASR systems on code-switched (language-mixed) speech. PolyWER accepts transcriptions of code-mixed segments in different forms, including transliterations and translations.
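To illustrate why a single-reference WER is too strict, here is a minimal sketch that scores a hypothesis against several acceptable references and keeps the lowest word error rate. This is not the PolyWER implementation, which lives in this repository; the function names (`word_edit_distance`, `wer`, `min_wer`) and the example sentences are invented for illustration only.

```python
def word_edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance between two token lists (row-by-row DP)."""
    prev = list(range(len(hyp_tokens) + 1))
    for i, r in enumerate(ref_tokens, start=1):
        curr = [i]
        for j, h in enumerate(hyp_tokens, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of a hypothesis against one reference string."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)


def min_wer(references, hypothesis):
    """Lowest WER over a list of acceptable references (illustrative only)."""
    return min(wer(ref, hypothesis) for ref in references)


# Invented example: one utterance in three acceptable forms.
references = [
    "انا رايح ال meeting",   # embedded English word in Latin script
    "انا رايح ال ميتنج",      # embedded word transliterated
    "انا رايح الاجتماع",      # embedded word translated
]
hypothesis = "انا رايح ال ميتنج"

single_ref_wer = wer(references[0], hypothesis)
multi_ref_wer = min_wer(references, hypothesis)
print(single_ref_wer, multi_ref_wer)
```

Against the Latin-script reference alone, the transliterated hypothesis is charged one substitution even though it is a valid transcription; once the other forms are accepted, the penalty disappears.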
Python version: 3.10+
git clone https://github.com/mbzuai-nlp/PolyWER.git
cd PolyWER
conda create -n polywer python=3.10
conda activate polywer
pip install -r requirements.txt
python toy_example.py
The toy example contains two sentences, each with its three transcription dimensions (original transcript, transliteration, and translation), and prints the metrics used in our evaluation (PolyWER, WER, CER, BLEU, BERTScore). To run multiRefWER, do the following:
git clone https://github.com/qcri/multiRefWER
cd multiRefWER
mrwer.py -e <polywer_path>/ref_og <polywer_path>/ref_lit <polywer_path>/ref_lat <polywer_path>/hyp
Note that we had to modify the mrwer code slightly to run it under Python 3: we added parentheses to the print statements and commented out sys.reload.
We used the Mixat dataset for our experiments. The original dataset contains only transcriptions in which the English code-switched words are written in Latin characters. We augment these transcriptions with two additional dimensions: transliterations and translations. The augmented dataset is available on Hugging Face.
>>> from datasets import load_dataset
>>> mixat = load_dataset("sqrk/mixat-tri")
>>> mixat
DatasetDict({
train: Dataset({
features: ['audio', 'transcript', 'transliteration', 'translation', 'language', 'duration_ms'],
num_rows: 3727
})
test: Dataset({
features: ['audio', 'transcript', 'transliteration', 'translation', 'language', 'duration_ms'],
num_rows: 1585
})
})
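Once the dataset is loaded, each row can be indexed like a dictionary using the feature names listed above. The sketch below shows one way to pull out the three text dimensions and to total the audio duration; the helper names (`text_dimensions`, `total_hours`) and the stand-in row are hypothetical, and the `load_dataset` calls are left as comments so the block runs without a download.

```python
TEXT_FIELDS = ("transcript", "transliteration", "translation")


def text_dimensions(row):
    """Return the three text dimensions of one dataset row as a dict."""
    return {field: row[field] for field in TEXT_FIELDS}


def total_hours(durations_ms):
    """Convert an iterable of durations in milliseconds to total hours."""
    return sum(durations_ms) / 3_600_000


# Stand-in row with placeholder values, shaped like the features shown above
# (the 'audio' field is omitted here).
example_row = {
    "transcript": "<original transcript>",
    "transliteration": "<transliterated transcript>",
    "translation": "<translated transcript>",
    "language": "<language tag>",
    "duration_ms": 5400,
}

dims = text_dimensions(example_row)
print(dims)

# With the real dataset (requires the `datasets` package and network access):
#   from datasets import load_dataset
#   mixat = load_dataset("sqrk/mixat-tri")
#   text_dimensions(mixat["test"][0])
#   total_hours(mixat["test"]["duration_ms"])
```

Reading the `duration_ms` column directly, rather than iterating over whole rows, avoids decoding the audio for every example.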
This data is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
If you use PolyWER, please cite the following paper:
@inproceedings{,
}
If you use the Mixat dataset (audio and/or text), please also cite the following paper:
@inproceedings{al-ali-aldarmaki-2024-mixat,
title = "Mixat: A Data Set of Bilingual Emirati-{E}nglish Speech",
author = "Al Ali, Maryam Khalifa and
Aldarmaki, Hanan",
booktitle = "Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.sigul-1.26",
pages = "222--226"
}