PolyWER

This repository contains the implementation of the paper:

PolyWER: A Holistic Evaluation Framework for Code-Switched Speech Recognition

    CC BY-NC-SA 4.0


EMNLP 2024

Code-switched speech can be correctly transcribed in several forms, including different transliterations of the embedded language. Traditional metrics such as Word Error Rate (WER) are too strict to account for this variation. We introduce PolyWER, a framework for evaluating how ASR systems handle language mixing. PolyWER accepts transcriptions of code-mixed segments in different forms, including transliterations and translations.
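To illustrate why plain WER is too strict here, the sketch below scores one hypothesis against two equally valid references. It is not the PolyWER algorithm (see the paper and the code in this repository for that); it is a simplified sentence-level baseline that takes the lowest WER over the acceptable references. The example sentences and the function names `edit_distance`, `wer` and `min_wer` are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of a hypothesis against a single reference."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def min_wer(references, hypothesis):
    """Simplified multi-reference score: best WER over all references."""
    return min(wer(ref, hypothesis) for ref in references)


# Hypothetical example: the English word appears in Latin script in one
# reference and as an Arabic-script transliteration in the other.
ref_original = "عندي meeting اليوم"
ref_transliterated = "عندي ميتنغ اليوم"
hypothesis = "عندي ميتنغ اليوم"

print(f"WER vs. original reference: {wer(ref_original, hypothesis):.3f}")
print(f"Min WER over both references: {min_wer([ref_original, ref_transliterated], hypothesis):.3f}")
```

Against the original reference alone, the transliterated word counts as a substitution even though the transcription is acceptable; scoring against both references removes that penalty.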

Environment & Installation

Python version: 3.10+

git clone https://github.com/mbzuai-nlp/PolyWER.git
cd PolyWER
conda create -n polywer python=3.10
conda activate polywer
pip install -r requirements.txt
python toy_example.py

The toy example includes two sentences, each with its three transcription dimensions, and outputs the metrics used in our evaluation (PolyWER, WER, CER, BLEU, BERTScore). To run multiRefWER, do the following:

git clone https://github.com/qcri/multiRefWER 
cd multiRefWER
mrwer.py -e <polywer_path>/ref_og <polywer_path>/ref_lit <polywer_path>/ref_lat <polywer_path>/hyp 

Note that we had to modify the mrwer code slightly to run it: we added parentheses to the print statements and commented out sys.reload.

Dataset

We used the Mixat dataset for our experiments. The original dataset contains only transcriptions in which the English code-switched words are written in Latin characters. We augment these transcriptions with two additional dimensions: transliterations and translations. These can be found on Hugging Face.

>>> from datasets import load_dataset
>>> mixat = load_dataset("sqrk/mixat-tri")
>>> mixat
DatasetDict({
    train: Dataset({
        features: ['audio', 'transcript', 'transliteration', 'translation', 'language', 'duration_ms'],
        num_rows: 3727
    })
    test: Dataset({
        features: ['audio', 'transcript', 'transliteration', 'translation', 'language', 'duration_ms'],
        num_rows: 1585
    })
})
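As a rough sketch of how the three transcription dimensions of one row might be collected before scoring, the helper below pulls them out by the field names shown above. The helper name `references_for` and the placeholder row are assumptions for illustration; the row stands in for an actual example such as `mixat["test"][0]`.

```python
# Field names taken from the dataset features listed above.
REFERENCE_FIELDS = ("transcript", "transliteration", "translation")


def references_for(example):
    """Return the three reference forms of one dataset row as a dict."""
    return {field: example[field] for field in REFERENCE_FIELDS}


# Placeholder row with made-up strings, standing in for a real dataset row.
row = {
    "transcript": "<original transcript with English in Latin characters>",
    "transliteration": "<same utterance with English transliterated>",
    "translation": "<same utterance with English translated>",
    "language": "<language tag>",
    "duration_ms": 0,
}

refs = references_for(row)
for name, text in refs.items():
    print(f"{name}: {text}")
```

With the dataset loaded as shown above, the same call can be applied to a real row, for example `references_for(mixat["test"][0])`.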

License

This data is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Acknowledgements

If you use PolyWER, please cite the following paper:

@inproceedings{,
  
}

If you use the Mixat dataset (audio and/or text), please also cite the following paper:

@inproceedings{al-ali-aldarmaki-2024-mixat,
    title = "Mixat: A Data Set of Bilingual Emirati-{E}nglish Speech",
    author = "Al Ali, Maryam Khalifa  and
      Aldarmaki, Hanan",
    booktitle = "Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.sigul-1.26",
    pages = "222--226"
}
