POS Tagging with Hidden Markov Models
This repository contains an implementation of a part-of-speech (POS) tagger based on Hidden Markov Models (HMMs). For a closer look at the internal structure of the modules, check the Jupyter notebook in this repository.
The HMMTagger class in HMMTagger.py relies on a well-documented implementation of another class that represents a discrete probability distribution with different smoothing techniques. The probabilistic components of the HMM (the transition and emission distributions) can be smoothed using simple add-one smoothing or the more advanced absolute discounting.
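As a rough illustration of the two smoothing options, here is a minimal sketch with a hypothetical class name; it is not the actual API in HMMTagger.py:

from collections import Counter

class SmoothedDist:
    """Discrete distribution with add-one or absolute-discounting smoothing.

    A hypothetical stand-in for the distribution class used by HMMTagger;
    names and details here are illustrative only.
    """

    def __init__(self, counts, vocab_size, smoothing="add_one", d=0.3):
        self.counts = Counter(counts)
        self.total = sum(self.counts.values())
        self.vocab_size = vocab_size      # size of the full event space
        self.smoothing = smoothing
        self.d = d                        # discount used by absolute discounting

    def prob(self, event):
        c = self.counts[event]
        if self.smoothing == "add_one":
            # Add-one (Laplace): give every event one pseudo-count.
            return (c + 1) / (self.total + self.vocab_size)
        # Absolute discounting: subtract a fixed discount d from every seen
        # count and spread the freed probability mass uniformly over the
        # whole event space.
        n_seen = len(self.counts)
        freed_mass = self.d * n_seen / self.total
        return max(c - self.d, 0) / self.total + freed_mass / self.vocab_size

The -smoothing and -d flags of train_tagger.py below correspond to these two choices.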
Assuming we have training data in CoNLL format (check the data/ folder for more info; an illustrative excerpt of the expected file format follows the command below), a POS tagger can be trained as follows:
$ python train_tagger.py \
-train_dir data/ \
-train_fileid 'de-train.tt' \
-smoothing 'abs_disc' \
-d 0.3 \
-tagger_file 'POS_tagger'
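The .tt files referenced above are expected to be two-column CoNLL-style files: one token and its tag per line, separated by a tab, with blank lines between sentences. An illustrative (made-up) German excerpt:

Der	DET
Hund	NOUN
schläft	VERB
.	.

Conceptually, training boils down to collecting tag-transition and word-emission counts from such data and smoothing them. A minimal sketch under those assumptions (the actual train_tagger.py may differ):

from collections import Counter, defaultdict

def read_conll(path):
    # Read a two-column token/tag file; blank lines separate sentences.
    sents, sent = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                if sent:
                    sents.append(sent)
                sent = []
            else:
                token, tag = line.split("\t")
                sent.append((token, tag))
    if sent:
        sents.append(sent)
    return sents

def collect_counts(sents):
    # Count the events behind the two HMM components:
    # transitions P(tag_i | tag_{i-1}) and emissions P(word | tag).
    transitions = defaultdict(Counter)
    emissions = defaultdict(Counter)
    for sent in sents:
        prev = "<s>"                      # sentence-start pseudo-tag
        for word, tag in sent:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            prev = tag
        transitions[prev]["</s>"] += 1    # sentence-end pseudo-tag
    return transitions, emissions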
After the training step above, the tagger object is saved to disk and can be used to tag a test dataset:
$ python test_tagger.py \
-test_dir data/ \
-test_fileid de-test.t \
-POS_tagger POS_tagger \
-tagged_output tagged_abs_disc.tt
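The saved tagger is presumably a serialized (pickled) Python object, so it can also be loaded and used programmatically. A small sketch; the tag method name here is an assumption, check HMMTagger.py for the real interface:

import pickle

# Load the object saved by train_tagger.py (name given via -tagger_file).
with open("POS_tagger", "rb") as f:
    tagger = pickle.load(f)

# Hypothetical call; the actual method name and signature may differ.
print(tagger.tag(["Der", "Hund", "schläft", "."]))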
The script eval.py can be used to quantitatively evaluate the performance of the tagger (a sketch of the underlying metrics follows the output below):
$ python eval.py data/de-eval.tt tagged_abs_disc.tt
Output:
Comparing gold file "data/de-eval.tt" and system file "tagged_abs_disc.tt"
Precision, recall, and F1 score:
DET 0.9092 0.9761 0.9415
NOUN 0.8476 0.9835 0.9105
VERB 0.9605 0.8712 0.9137
ADP 0.9632 0.9762 0.9697
. 0.9983 0.9992 0.9987
CONJ 0.9544 0.8974 0.9250
PRON 0.9391 0.8309 0.8817
ADV 0.9234 0.7893 0.8511
ADJ 0.7993 0.6485 0.7160
NUM 0.9906 0.7778 0.8714
PRT 0.8730 0.8730 0.8730
X 0.2000 0.0909 0.1250
Accuracy: 0.9136
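For reference, the per-tag scores above follow the standard token-level definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is the harmonic mean of the two. A minimal sketch of that computation (not the actual eval.py code):

from collections import Counter

def per_tag_prf(gold_tags, system_tags):
    # Token-level precision/recall/F1 per tag, as in the eval output above.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, s in zip(gold_tags, system_tags):
        if g == s:
            tp[g] += 1
        else:
            fp[s] += 1   # system predicted s where gold says g
            fn[g] += 1   # gold tag g was missed
    scores = {}
    for tag in tp | fp | fn:
        p = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        r = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[tag] = (p, r, f1)
    return scores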