Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer. While there has been much work in evaluating these models for their performance on a variety of tasks and languages, little attention has been paid to how well calibrated these models are with respect to the confidence in their predictions. We first investigate the calibration of MMLMs in the zero-shot setting and observe a clear case of miscalibration in low-resource languages or those which are typologically diverse from English. Next, we empirically show that calibration methods like temperature scaling and label smoothing do reasonably well towards improving calibration in the zero-shot scenario. We also find that few-shot examples in the language can further help reduce the calibration errors, often substantially. Overall, our work contributes towards building more reliable multilingual models by highlighting the issue of their miscalibration, understanding what language- and model-specific factors influence it, and pointing out strategies to improve it.
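Calibration here refers to how closely a model's predicted confidence tracks its actual accuracy. As a rough reference for what is being measured (a generic sketch, not this repository's own metric code), a binned Expected Calibration Error (ECE) over softmax confidences can be computed as below; the choice of 10 bins is an assumed default:

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """Binned ECE: per-bin |accuracy - mean confidence|, weighted by the bin's share of examples."""
    confidences = np.asarray(confidences)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        accuracy = (predictions[in_bin] == labels[in_bin]).mean()
        avg_confidence = confidences[in_bin].mean()
        ece += in_bin.mean() * abs(accuracy - avg_confidence)
    return ece
```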
- Compatible with Python 3.7
- The necessary packages can be installed through requirements.txt.
Install VirtualEnv using the following (optional):
$ [sudo] pip install virtualenv
We recommend creating a virtual environment (optional):
$ virtualenv -p python3 venv
$ source venv/bin/activate
Finally, install the required packages by running:
$ pip install -r requirements.txt
Download the XNLI and PAWS-X datasets from the link and place them in the data/ directory. There is no need to download the XCOPA, COPA, and SIQA datasets; they are loaded using Hugging Face's datasets library. The MARC dataset should be downloaded from here and placed in the data/ directory.
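As an illustration of the datasets-library loading path (the dataset name and language code below are examples, not a prescription from this repo), XCOPA can be pulled directly:

```python
from datasets import load_dataset

# Example only: the Estonian XCOPA validation split, fetched via Hugging Face's datasets library.
xcopa_et = load_dataset("xcopa", "et", split="validation")
print(xcopa_et[0])
```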
For vanilla fine-tuning of mBERT on XNLI and measuring calibration across languages, run:
python -m src.run_sentence_cls \
--mmlm bert-base-multilingual-uncased \
--dataset xnli \
--lr 3e-5 \
--num_epochs 3 \
--max_train_samples 40000
To run with label smoothing (LS), use the command below (a sketch of the smoothed loss follows it):
python -m src.run_sentence_cls \
--mmlm bert-base-multilingual-uncased \
--dataset xnli \
--lr 3e-5 \
--num_epochs 3 \
--max_train_samples 40000 \
--alpha_smoothing 0.1
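The --alpha_smoothing 0.1 flag sets the label smoothing strength. Conceptually, smoothing with alpha replaces the one-hot target with (1 - alpha) on the gold class and alpha spread uniformly over all classes; the following is a minimal sketch of such a loss (not necessarily how src.run_sentence_cls implements it):

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, alpha=0.1):
    """Cross-entropy against softened targets: (1 - alpha) on the gold class,
    alpha distributed uniformly over all classes."""
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes).float()
    soft_targets = (1.0 - alpha) * one_hot + alpha / num_classes
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```

Recent PyTorch releases expose the same idea directly as torch.nn.CrossEntropyLoss(label_smoothing=0.1).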
To run with label smoothing + temperature scaling (LS + TS), use the command below (a sketch of the temperature-fitting step follows it):
python -m src.run_sentence_cls \
--mmlm bert-base-multilingual-uncased \
--dataset xnli \
--lr 3e-5 \
--num_epochs 3 \
--max_train_samples 40000 \
--alpha_smoothing 0.1 \
--temp_scaling \
--cal_lang en \
--cal_size 500
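Here --temp_scaling, together with --cal_lang and --cal_size, presumably selects a held-out calibration set (500 English examples in this command) on which a single temperature is fit. A minimal sketch of the standard post-hoc temperature-scaling fit (Guo et al., 2017), assuming you already have calibration-set logits and labels as tensors and not claiming to be the repository's exact code:

```python
import torch

def fit_temperature(cal_logits, cal_labels, max_iter=50):
    """Learn a scalar temperature T by minimizing NLL of (logits / T) on a calibration set."""
    temperature = torch.nn.Parameter(torch.ones(1))
    nll = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = nll(cal_logits / temperature, cal_labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return temperature.item()

# At test time, probabilities are taken as softmax(logits / T) instead of softmax(logits).
```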
Other examples can be found in scripts/experiments_xnli.sh and scripts/experiments_xcopa.sh.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.