Publicly available largest Sentiment Analysis dataset for Bangla with example code

AridHasan/MUBASE

MUBASE - MUltiplatform BAngla SEntiment

MUBASE is a multiplatform dataset of tweets and Facebook posts that are manually annotated with sentiment polarity. The inter-annotator agreement score is 0.84, which indicates almost perfect agreement among the annotators.

Dataset

Data format

Each file is in TSV (tab-separated values) format. Each row has the following structure:

id	text	label

Where:

  • id: an index or id of the text
  • text: the content of the tweet or Facebook post
  • label: Positive, Negative, or Neutral
Example
14737	এখান থেকে সবাই শিক্ষা নিতে পারি ।	Positive
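
As a minimal sketch of reading this format, the Python snippet below parses tab-separated rows and counts the labels. The function name `read_mubase_tsv` is hypothetical, and whether the files start with a header row is an assumption, so the reader simply skips any row whose third field is not one of the three labels.

```python
import csv
import io
from collections import Counter

# The three label values listed in the data format description.
VALID_LABELS = {"Positive", "Negative", "Neutral"}


def read_mubase_tsv(handle):
    """Yield (id, text, label) tuples from a tab-separated file object."""
    # QUOTE_NONE keeps quotation marks inside posts as literal characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        row_id, text, label = row
        if label not in VALID_LABELS:
            continue  # assumption: this also skips a header row, if one exists
        yield row_id, text, label


# In-memory sample built from the example row above, plus an assumed header.
sample = "id\ttext\tlabel\n14737\tএখান থেকে সবাই শিক্ষা নিতে পারি ।\tPositive\n"
rows = list(read_mubase_tsv(io.StringIO(sample)))
label_counts = Counter(label for _, _, label in rows)
print(rows[0][0], dict(label_counts))
```

For a file on disk, the same function can be given `open("./Dataset/MUBASE_train.tsv", encoding="utf-8", newline="")` in place of the in-memory sample.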

The code folder includes all the scripts that we used for our experiments. To run the scripts, use the commands below:

Classical Algorithm

To run SVM:

python code/classical_model.py -a svm -i ./Dataset/MUBASE_train.tsv -v ./Dataset/MUBASE_dev.tsv -t ./Dataset/MUBASE_test.tsv -o svm.txt -m model/

To run RF:

python code/classical_model.py -a rf -i ./Dataset/MUBASE_train.tsv -v ./Dataset/MUBASE_dev.tsv -t ./Dataset/MUBASE_test.tsv -o rf.txt -m model/

Fine-tuned Feed Forward Net with Embeddings

To run Fine-tuned FF:

python code/ffnet_pytorch_lightning.py --train data/train_ada_embeddings.jsonl --dev data/dev_ada_embeddings.jsonl --test data/test_embeddings.jsonl --model-dir ./model/ --results-dir em_output/

Fine-Tuned LLMs
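
The commands in this section read .csv files, while the Data format section describes .tsv files. If the CSV copies have to be produced from the TSV files, a conversion could look like the sketch below. The function name `tsv_to_csv` and the header row `id,text,label` are assumptions; this README does not state which column names or layout `run_glue_v1.py` expects, so check the script before relying on the result.

```python
import csv
import io


def tsv_to_csv(tsv_handle, csv_handle, header=("id", "text", "label")):
    """Copy three-column TSV rows to CSV and return the number of data rows."""
    reader = csv.reader(tsv_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    writer = csv.writer(csv_handle, lineterminator="\n")
    writer.writerow(header)  # assumed header; adjust to what the script expects
    count = 0
    for row in reader:
        # Skip malformed lines and an existing header row, if the TSV has one.
        if len(row) != 3 or tuple(row) == tuple(header):
            continue
        writer.writerow(row)  # fields containing commas are quoted automatically
        count += 1
    return count


# In-memory demonstration with a made-up row that contains a comma.
source = io.StringIO("1\tgood, really\tPositive\n")
target = io.StringIO()
converted = tsv_to_csv(source, target)
print(converted)
print(target.getvalue(), end="")
```

With real files, open both handles with `encoding="utf-8"` and `newline=""` so that Bangla text and line endings are preserved.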

To run BERT-m:

python code/run_glue_v1.py \
  --model_name_or_path bert-base-multilingual-uncased \
  --train_file ./Dataset/MUBASE_train.csv \
  --validation_file ./Dataset/MUBASE_dev.csv \
  --test_file ./Dataset/MUBASE_test.csv \
  --do_predict \
  --max_seq_length 256 \
  --per_device_train_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 10 \
  --output_dir ./bert-base-multilingual-uncased/ \
  --do_eval \
  --do_train \
  --overwrite_output_dir

To run BERT-bn:

python code/run_glue_v1.py \
  --model_name_or_path csebuetnlp/banglabert \
  --train_file ./Dataset/MUBASE_train.csv \
  --validation_file ./Dataset/MUBASE_dev.csv \
  --test_file ./Dataset/MUBASE_test.csv \
  --do_predict \
  --max_seq_length 256 \
  --per_device_train_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 10 \
  --output_dir ./banglabert/ \
  --do_eval \
  --do_train \
  --overwrite_output_dir

To run XLM-RoBERTa-base:

python code/run_glue_v1.py \
  --model_name_or_path xlm-roberta-base \
  --train_file ./Dataset/MUBASE_train.csv \
  --validation_file ./Dataset/MUBASE_dev.csv \
  --test_file ./Dataset/MUBASE_test.csv \
  --do_predict \
  --max_seq_length 256 \
  --per_device_train_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 10 \
  --output_dir ./xlm-roberta-base/ \
  --do_eval \
  --do_train \
  --overwrite_output_dir

To run XLM-RoBERTa-large:

python code/run_glue_v1.py \
  --model_name_or_path xlm-roberta-large \
  --train_file ./Dataset/MUBASE_train.csv \
  --validation_file ./Dataset/MUBASE_dev.csv \
  --test_file ./Dataset/MUBASE_test.csv \
  --do_predict \
  --max_seq_length 256 \
  --per_device_train_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 10 \
  --output_dir ./xlm-roberta-large/ \
  --do_eval \
  --do_train \
  --overwrite_output_dir
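
The four commands above differ only in the model name and the output directory. As a convenience sketch, the Python snippet below builds the same argument lists from one table. The names `MODELS` and `build_command` are hypothetical helpers, not part of the repository, and the snippet only prints the commands rather than running them.

```python
import shlex

# Model name -> output directory, copied from the four commands above.
MODELS = {
    "bert-base-multilingual-uncased": "./bert-base-multilingual-uncased/",
    "csebuetnlp/banglabert": "./banglabert/",
    "xlm-roberta-base": "./xlm-roberta-base/",
    "xlm-roberta-large": "./xlm-roberta-large/",
}


def build_command(model_name, output_dir):
    """Return the fine-tuning command for one model as an argument list."""
    return [
        "python", "code/run_glue_v1.py",
        "--model_name_or_path", model_name,
        "--train_file", "./Dataset/MUBASE_train.csv",
        "--validation_file", "./Dataset/MUBASE_dev.csv",
        "--test_file", "./Dataset/MUBASE_test.csv",
        "--do_predict",
        "--max_seq_length", "256",
        "--per_device_train_batch_size", "16",
        "--learning_rate", "2e-5",
        "--num_train_epochs", "10",
        "--output_dir", output_dir,
        "--do_eval",
        "--do_train",
        "--overwrite_output_dir",
    ]


commands = [build_command(name, out_dir) for name, out_dir in MODELS.items()]
for command in commands:
    print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` to launch the runs one after another.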

Bloomz

Running the Bloomz model requires hosting it on a server.

Citation

@article{hasan2023zero,
  title={Zero- and Few-Shot Prompting with LLMs: A Comparative Study with Fine-tuned Models for Bangla Sentiment Analysis},
  author={Hasan, Md Arid and Das, Shudipta and Anjum, Afiyat and Alam, Firoj and Anjum, Anika and Sarker, Avijit and Noori, Sheak Rashed Haider},
  journal={arXiv preprint arXiv:2308.10783},
  year={2023}
}

License

MUBASE is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

You should have received a copy of the license along with this work. If not, see http://creativecommons.org/licenses/by-nc-sa/4.0/.
