# POS Tagging with Hidden Markov Models

This repository contains an elegant implementation of a POS tagger based on Hidden Markov Models. For a closer look at the internal structure of the modules, check the Jupyter notebook in this repository.

The HMMTagger class in HMMTagger.py relies on a clean, well-documented implementation of a second class that represents a discrete probability distribution with different smoothing techniques. The probabilistic components of the HMM (the transition and emission distributions) can be smoothed with simple add-one smoothing or with the more advanced absolute discounting.
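
The exact class interface is not reproduced here, but as a rough illustration, a minimal sketch of such a smoothed distribution might look as follows (class and parameter names are hypothetical, not the ones in HMMTagger.py):

```python
from collections import Counter

# Illustrative sketch only; the actual class in this repository may differ.
class SmoothedDistribution:
    def __init__(self, counts, vocab_size, smoothing="add_one", d=0.3):
        self.counts = Counter(counts)           # observed outcome counts
        self.total = sum(self.counts.values())  # total number of observations
        self.vocab_size = vocab_size            # number of possible outcomes
        self.smoothing = smoothing
        self.d = d                              # discount used by absolute discounting

    def prob(self, outcome):
        c = self.counts[outcome]
        if self.smoothing == "add_one":
            # Add-one (Laplace) smoothing: one pseudo-count for every outcome
            return (c + 1) / (self.total + self.vocab_size)
        if self.smoothing == "abs_disc":
            # Absolute discounting: subtract d from every seen count and
            # spread the freed probability mass uniformly over unseen outcomes
            seen = len(self.counts)
            unseen = self.vocab_size - seen
            if c > 0:
                return (c - self.d) / self.total
            return (self.d * seen) / (self.total * unseen) if unseen else 0.0
        return c / self.total if self.total else 0.0
```

With `smoothing="abs_disc"` and `d=0.3`, this mirrors the `-smoothing` and `-d` options used in the training command below.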

## (1) Training a POS Tagger

Assuming that we have training data in CoNLL format (see the data folder for more info), a POS tagger can be trained as follows:

```bash
$ python train_tagger.py \
  -train_dir data/ \
  -train_fileid 'de-train.tt' \
  -smoothing 'abs_disc' \
  -d 0.3 \
  -tagger_file 'POS_tagger'
```
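
Under the hood, training a first-order HMM tagger amounts to collecting tag-transition and word-emission counts from the tagged sentences and wrapping them in smoothed distributions. A rough sketch of that counting step, assuming the usual .tt layout of one word&lt;TAB&gt;tag pair per line with blank lines between sentences (function names are hypothetical, not those of train_tagger.py):

```python
from collections import defaultdict

# Hypothetical helpers; train_tagger.py's actual internals may differ.
def read_tt(path):
    """Yield sentences as lists of (word, tag) pairs from a CoNLL-style .tt file."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                 # blank line ends the current sentence
                if sentence:
                    yield sentence
                    sentence = []
            else:
                word, tag = line.split("\t")
                sentence.append((word, tag))
    if sentence:
        yield sentence

def collect_counts(sentences):
    """Count tag transitions and word emissions for a first-order HMM."""
    transitions = defaultdict(int)        # (previous_tag, tag) -> count
    emissions = defaultdict(int)          # (tag, word) -> count
    for sent in sentences:
        prev = "<s>"                      # sentence-start marker
        for word, tag in sent:
            transitions[(prev, tag)] += 1
            emissions[(tag, word)] += 1
            prev = tag
        transitions[(prev, "</s>")] += 1  # sentence-end transition
    return transitions, emissions
```

These counts are then turned into smoothed transition and emission probabilities, in the spirit of the `SmoothedDistribution` sketch above.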

## (2) Using a trained POS Tagger

After training a tagger in the previous step, the tagger object is saved to disk and can be used to tag a test dataset:

```bash
$ python test_tagger.py \
  -test_dir data/ \
  -test_fileid de-test.t \
  -POS_tagger POS_tagger \
  -tagged_output tagged_abs_disc.tt
```
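
The saved tagger can also be reloaded programmatically. Assuming the `POS_tagger` file written by train_tagger.py is a plain Python pickle (and noting that the `tag_sentence` method name below is a guess, not the documented API), usage might look like this:

```python
import pickle

from HMMTagger import HMMTagger  # the class definition must be importable for unpickling

# Assumption: the 'POS_tagger' file written by train_tagger.py is a standard pickle.
with open("POS_tagger", "rb") as f:
    tagger = pickle.load(f)

# Hypothetical method name; check HMMTagger.py for the actual tagging interface.
print(tagger.tag_sentence(["Das", "ist", "ein", "Beispiel", "."]))
```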

## (3) Evaluating the tagger

The Python script eval.py can be used to quantitatively evaluate the performance of the tagger:

```bash
$ python eval.py data/de-eval.tt tagged_abs_disc.tt
```

Output:

```
Comparing gold file "data/de-eval.tt" and system file "tagged_abs_disc.tt"

Precision, recall, and F1 score:

  DET 0.9092 0.9761 0.9415
 NOUN 0.8476 0.9835 0.9105
 VERB 0.9605 0.8712 0.9137
  ADP 0.9632 0.9762 0.9697
    . 0.9983 0.9992 0.9987
 CONJ 0.9544 0.8974 0.9250
 PRON 0.9391 0.8309 0.8817
  ADV 0.9234 0.7893 0.8511
  ADJ 0.7993 0.6485 0.7160
  NUM 0.9906 0.7778 0.8714
  PRT 0.8730 0.8730 0.8730
    X 0.2000 0.0909 0.1250

Accuracy: 0.9136
```
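
eval.py's actual implementation is not reproduced here, but the numbers it reports correspond to the standard per-tag precision, recall, and F1 together with overall token accuracy. A minimal sketch of that computation, assuming both files have been read into parallel lists of tags (names are illustrative):

```python
from collections import Counter

def evaluate(gold_tags, sys_tags):
    """Per-tag precision/recall/F1 and overall accuracy over parallel tag lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    correct = 0
    for gold, pred in zip(gold_tags, sys_tags):
        if gold == pred:
            tp[gold] += 1
            correct += 1
        else:
            fp[pred] += 1   # predicted tag counts as a false positive
            fn[gold] += 1   # missed gold tag counts as a false negative
    for tag in sorted(set(tp) | set(fp) | set(fn)):
        p = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        r = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        print(f"{tag:>5} {p:.4f} {r:.4f} {f1:.4f}")
    print(f"Accuracy: {correct / len(gold_tags):.4f}")
```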