KArgen is the generalization implementation for my Master's Thesis: *Automatic Knowledge Acquisition for the Special Cargo Services Domain with Unsupervised Entity and Relation Extraction*.
Code structure adapted from anago.
The generalization part provides a model that can be used for entity/relation extraction from special cargo text. The training set was created automatically via KArgo. The model architecture can be seen here:
This repository contains the following folders and files:
- data/kargo: all datasets for NER/EE/RE in CoNLL format, used for multi-task modeling as proposed by Bekoulis et al. (2018); a minimal reader sketch follows this list
  - train: training sets as produced by KArgo
    - not_terms_only: dataset contains all sentences, including sentences without entities (for EE)
    - terms_only: dataset contains only sentences with at least one entity (for EE)
  - dev_rel, test_rel: development set and test set 1
  - online_rel: test set 2 (online documents, based on HTML/PDF excerpts)
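The exact column layout of these files is determined by KArgo's export, but CoNLL-style data is conventionally one token per line with tab-separated tag columns and blank lines between sentences. Below is a minimal, hypothetical reader under that assumption; the function name and column interpretation are illustrative and not taken from the repository.

```python
from typing import List, Tuple

Sentence = Tuple[List[str], List[List[str]]]  # (tokens, per-token tag columns)

def read_conll(path: str) -> List[Sentence]:
    """Read a CoNLL-style file: one token per line, tab-separated columns
    (token first, then its tag columns), with blank lines between sentences."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    tags: List[List[str]] = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line.strip():          # blank line -> sentence boundary
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            columns = line.split("\t")
            tokens.append(columns[0])     # surface token
            tags.append(columns[1:])      # e.g. entity and relation tag columns
    if tokens:                            # file may not end with a blank line
        sentences.append((tokens, tags))
    return sentences
```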
- kargen: source code folder for KArgen
  - crf.py: CRF layer implementation for Keras, based on keras-contrib (see the BiLSTM-CRF sketch after this list)
  - models.py: model structure and a wrapper for simplified Hierarchical Multi-Task Learning from hmtl
  - preprocessing.py: preprocessing pipeline for the sequential deep learning model
  - trainer.py: training routine for the KArgen model, including callbacks
- main.py: example of the KArgen training and evaluation routine, including saving/loading models
- infer.ipynb: example of extraction with the trained models, visualized with displaCy (see the rendering sketch after this list)
- results.ipynb: notebook for visualizing model training/evaluation results, which can be seen here
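For orientation, crf.py is based on the keras-contrib CRF layer, which is commonly stacked on top of a BiLSTM encoder for sequence tagging. The snippet below is a generic minimal sketch of that pattern, not KArgen's actual architecture: the sequence length, vocabulary size, layer sizes, and tag count are made-up placeholders, and the hierarchical multi-task heads from models.py are not shown.

```python
from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, TimeDistributed, Dense
from keras_contrib.layers import CRF

# Illustrative sizes only -- not the values used in KArgen.
MAX_LEN, VOCAB_SIZE, N_TAGS = 75, 20000, 9

# Token ids in, one tag per token out: a standard BiLSTM-CRF sequence tagger.
word_ids = Input(shape=(MAX_LEN,), dtype="int32")
x = Embedding(input_dim=VOCAB_SIZE, output_dim=100, mask_zero=True)(word_ids)
x = Bidirectional(LSTM(units=100, return_sequences=True))(x)
x = TimeDistributed(Dense(50, activation="relu"))(x)
crf = CRF(N_TAGS)                  # CRF output layer from keras-contrib
tag_scores = crf(x)

model = Model(word_ids, tag_scores)
model.compile(optimizer="adam", loss=crf.loss_function, metrics=[crf.accuracy])
model.summary()
```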
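infer.ipynb uses displaCy for visualization; one way to render model output that is not backed by a spaCy Doc is displaCy's manual mode with character-offset spans. The sentence, offsets, and labels below are invented for illustration and do not come from the trained KArgen models.

```python
from spacy import displacy

# A manually assembled prediction (character offsets into the text);
# the labels here are illustrative, not KArgen's actual tag set.
prediction = {
    "text": "The perishable cargo was flown from Amsterdam to Nairobi.",
    "ents": [
        {"start": 4, "end": 20, "label": "CARGO"},
        {"start": 36, "end": 45, "label": "LOCATION"},
        {"start": 49, "end": 56, "label": "LOCATION"},
    ],
    "title": None,
}

# In a notebook, jupyter=True renders the highlighted entities inline.
displacy.render(prediction, style="ent", manual=True, jupyter=True)
```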
A comparison of Precision/Recall/F-score for the model trained on the automatic training set (Auto) and on the development set (Manual), for test set 1 (holdout news articles):

and for test set 2 (online documents):