This repo contains the code, data, and instructions to reproduce the results in the paper RoBERTa Low Resource Fine Tuning for Sentiment Analysis in Albanian.
If you want to use the sentiment model in your own work, follow the instructions in the zensols.edusenti repository. If you use our model or API, please cite our paper.
The source code was run with Python 3.9.9 and the CUDA 11 drivers. To reproduce the results:
- Clone the paper repo:
git clone https://github.com/uic-nlp-lab/edusenti
- Go into it and prepare the corpus:
cd edusenti ; mkdir -p corpus/finetune
- Download and extract the Pretraining Corpus:
wget -O - https://zenodo.org/records/10778230/files/albanian-sq.sqlite3.bz2 | bzip2 -cd > corpus/finetune/sq.sqlite3
- Install dependencies:
pip install --use-deprecated=legacy-resolver -r src/requirements.txt
- Confirm the fine-tune sentiment corpus is readable:
./harness.py finestats
- Vectorize English sentiment corpus batches:
./harness.py batch --override edusenti_default.lang=en
- Vectorize Albanian sentiment corpus batches:
./harness.py batch --override edusenti_default.lang=sq
- Train and test the Albanian model on GloVe 50D embeddings:
./harness.py traintest
- Train and test the English model:
./harness.py traintest --override edusenti_default.lang=en
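The harness steps above can also be run back to back from a small script. The following is a minimal sketch rather than part of the repository: `HARNESS_STEPS` and `run_steps` are hypothetical names, the commands are copied verbatim from the list above, and the script is assumed to run from the repository root after the corpus download and dependency installation.

```python
import subprocess

# Harness invocations copied from the reproduction steps, in order.
HARNESS_STEPS = [
    ["./harness.py", "finestats"],
    ["./harness.py", "batch", "--override", "edusenti_default.lang=en"],
    ["./harness.py", "batch", "--override", "edusenti_default.lang=sq"],
    ["./harness.py", "traintest"],
    ["./harness.py", "traintest", "--override", "edusenti_default.lang=en"],
]


def run_steps(steps=HARNESS_STEPS, runner=subprocess.run):
    """Run each step in order and stop at the first failing command.

    `runner` defaults to subprocess.run; with check=True a non-zero exit
    status raises subprocess.CalledProcessError.
    """
    for cmd in steps:
        print("running:", " ".join(cmd))
        runner(cmd, check=True)
```

Calling `run_steps()` executes all five commands; passing a shorter list reruns only the steps of interest.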
Use the Jupyter Notebook to train all the variations (and configurations) of the model and print the results.
Note that the repository has a lot of commands and code for creating the Pretraining Corpus. However, those steps can be skipped with the wget download command above.
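Where wget or bzip2 are not available, the download step can be approximated with the Python standard library. This is a minimal sketch under stated assumptions: `decompress_bz2_stream` and `download_corpus` are hypothetical helper names that do not exist in the repository, while the URL and output path are copied from the download command above.

```python
import bz2
import shutil
import urllib.request
from pathlib import Path

# URL and destination copied from the wget step in the instructions.
CORPUS_URL = "https://zenodo.org/records/10778230/files/albanian-sq.sqlite3.bz2"
CORPUS_PATH = Path("corpus/finetune/sq.sqlite3")


def decompress_bz2_stream(src, dest: Path) -> int:
    """Decompress a bzip2 stream read from the file-like object `src`
    into the file `dest` and return the number of bytes written."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # bz2.open accepts an already open binary file object as well as a path.
    with bz2.open(src, "rb") as reader, open(dest, "wb") as writer:
        shutil.copyfileobj(reader, writer)
    return dest.stat().st_size


def download_corpus(url: str = CORPUS_URL, dest: Path = CORPUS_PATH) -> int:
    """Stream the compressed corpus from `url` and decompress it on the fly."""
    with urllib.request.urlopen(url) as response:
        return decompress_bz2_stream(response, dest)
```

Calling `download_corpus()` from the repository root should leave the database at the same path the wget pipeline writes to.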
Important: The focus of this work was Albanian; English was used only for comparison. For this reason, reproducibility efforts concentrated on the Albanian results, and the English sentiment dataset splits were not recorded.
Both the Albanian (sq) and English (en) EduSenti corpora are available in this file.
The Albanian pretraining corpus used for pretraining large language models is an SQLite (v3) database with the following tables:
- corp_src: the sources of the Albanian text
- corp_doc: the corpus source (names) and source files
- doc: joins from sentences to corpus document source (corp_doc)
- sent: the Albanian sentences with tokenization and token length
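To check a downloaded copy against the table list above, the schema can be read directly from the database with Python's standard `sqlite3` module. This is a minimal sketch: `describe_tables` is a hypothetical helper name, and because the column names are not documented here, the sketch looks them up instead of assuming them.

```python
import sqlite3


def describe_tables(db_path: str) -> dict:
    """Return a mapping of each table name to the list of its column names."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "select name from sqlite_master where type = 'table' order by name"
            )
        ]
        # Each `pragma table_info` row is (cid, name, type, notnull, dflt_value, pk).
        return {
            name: [col[1] for col in conn.execute(f'pragma table_info("{name}")')]
            for name in names
        }
    finally:
        conn.close()
```

Assuming the database was saved to `corpus/finetune/sq.sqlite3` as in the reproduction steps, `describe_tables("corpus/finetune/sq.sqlite3")` should list the four tables described above together with their columns.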
This query shows how to get the corpus sources and constituent counts:
select cs.id as name, cs.url, count(*) as count
from corp_src as cs, corp_doc as cd, doc as d, sent as s
where cd.name = cs.id and
cd.doc_id = d.rowid and
cd.doc_id = s.doc_id
group by cs.id;
See the corpus creation SQL for useful queries and to see how it was procured/cleaned.
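The source-count query can also be run from Python with the standard `sqlite3` module. This is a minimal sketch: `SOURCE_COUNT_QUERY` and `source_counts` are hypothetical names, the SQL is the query shown above, and the database path passed in is assumed to be wherever the corpus was downloaded.

```python
import sqlite3

# Same query as above: one output row per corpus source.
SOURCE_COUNT_QUERY = """
select cs.id as name, cs.url, count(*) as count
from corp_src as cs, corp_doc as cd, doc as d, sent as s
where cd.name = cs.id and
      cd.doc_id = d.rowid and
      cd.doc_id = s.doc_id
group by cs.id
"""


def source_counts(db_path: str) -> list:
    """Return (name, url, count) tuples, one per corpus source.

    Because the join produces one row per sentence, `count` is the
    number of joined sentence rows for that source.
    """
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(SOURCE_COUNT_QUERY).fetchall()
    finally:
        conn.close()
```

For example, `source_counts("corpus/finetune/sq.sqlite3")` uses the path from the reproduction steps.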
If you use this project in your research please use the following BibTeX entry:
@inproceedings{nuci-etal-2024-roberta-low,
title = "{R}o{BERT}a Low Resource Fine Tuning for Sentiment Analysis in {A}lbanian",
author = "Nuci, Krenare Pireva and
Landes, Paul and
Di Eugenio, Barbara",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italy",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.1233",
pages = "14146--14151"
}
Copyright (c) 2024 Paul Landes and Krenare Pireva Nuci