Historical documents with annotated named entities for evaluating NER taggers

ole-meiners/NER-evaluation

Evaluation corpus for Named Entity Recognition using late medieval and early modern German ego-documents

This repository contains a gold-standard evaluation corpus for Named Entity Recognition. The corpus consists of randomized samples of four late medieval and early modern German ego-documents:

  • NUL: Ankenbauer, Norbert, Hrsg. Paesi novamente retrovati - Newe unbekanthe landte. Eine digitale Edition früher Entdeckerberichte. Bd. 10. Editiones Electronicae Guelferbytanae. Wolfenbüttel: Herzog August Bibliothek, 2012.
  • AD: Ralle, Inga Hanna, David Maus, und Jacqueline Krone. Selbstzeugnisse der Frühen Neuzeit in der Herzog August Bibliothek. Digitale Edition des Diariums von Herzog August dem Jüngeren. Herausgegeben von Herzog August Bibliothek, 2017. http://diglib.hab.de/edoc/ed000225/start.htm.
  • LP: Forschungsstelle für Personalschriften, Marburg, Hrsg. AutoThür - Eine digitale Edition autobiographischer Texte aus Thüringer Leichenpredigten, 2013. http://www.personalschriften.de/leichenpredigten/digitale-editionen/autothuer.html.
  • BE: Prell, Martin, und Julia Schmidt-Funke, Hrsg. Digitale Edition der Briefe Erdmuthe Benignas von Reuß-Ebersdorf (1670-1732). Jena, 2017 [Work in Progress]. http://erdmuthe.thulb.uni-jena.de.

The corpus comprises 24,332 tokens (753 sentences) with annotations for persons (413) and places (757). Organizations were not annotated because no sufficiently clear-cut concept of organization could be defined for late medieval and early modern texts. Because of the particular characteristics of the texts and of premodern and early modern names, dedicated annotation guidelines were developed and used for the annotation. They should be regarded as a draft and can be found here:

The files are formatted according to the specifications of the HIPE Scorer (version 2.0) used in the HIPE ('Identifying Historical People, Places and other Entities') shared tasks on named entity processing in historical documents. The format is similar to CoNLL-U. Entities are annotated using the IOB tagging scheme. Further information on the format and the HIPE Scorer can be found at: https://github.com/hipe-eval/HIPE-scorer
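As a rough illustration of how IOB-tagged files of this kind could be read, the following Python sketch collects entity spans from tab-separated lines. It assumes a header row with `TOKEN` and `NE-COARSE-LIT` columns (as in the HIPE format) and comment lines starting with `# `. The sample sentence, the labels `pers` and `loc`, and the helper names `read_tokens` and `iob_spans` are hypothetical and not taken from the corpus files, so check the actual column names and labels before relying on it.

```python
import csv
from typing import Iterable, Iterator, List, Tuple


def read_tokens(lines: Iterable[str],
                token_col: str = "TOKEN",
                tag_col: str = "NE-COARSE-LIT") -> Iterator[Tuple[str, str]]:
    """Yield (token, tag) pairs from HIPE-style TSV lines.

    Assumption: the first non-comment line is a header row, and comment
    lines start with "# ". Blank lines (sentence breaks) are skipped.
    """
    rows = (ln for ln in lines if ln.strip() and not ln.startswith("# "))
    reader = csv.DictReader(rows, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row[token_col], row[tag_col]


def iob_spans(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (label, text) entity spans."""
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in pairs:
        prefix, _, kind = tag.partition("-")
        if prefix == "I" and kind == label:
            tokens.append(token)          # continue the open entity
            continue
        if label is not None:
            spans.append((label, " ".join(tokens)))  # close the open entity
        if prefix in ("B", "I") and kind:
            tokens, label = [token], kind  # start a new entity (lenient on stray I-)
        else:
            tokens, label = [], None       # outside any entity
    if label is not None:
        spans.append((label, " ".join(tokens)))
    return spans


# Hypothetical sample, not taken from the corpus.
SAMPLE = """TOKEN\tNE-COARSE-LIT
# hypothetical sample
Herzog\tB-pers
August\tI-pers
reiste\tO
nach\tO
Wolfenbüttel\tB-loc
.\tO
"""

if __name__ == "__main__":
    for label, text in iob_spans(read_tokens(SAMPLE.splitlines())):
        print(label, text)
```

To read a real file, the open file object can be passed in place of `SAMPLE.splitlines()`, for example `read_tokens(open(path, encoding="utf-8"))`, where `path` points to one of the corpus files.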

The corpus was created as part of my bachelor's thesis in "Information Management and Information Technology", submitted in 2022 at the Institute of Library and Information Science at HU Berlin. Further information on the corpus, its compilation, the source texts, the annotations and annotation guidelines, and the use of the corpus for evaluating NER models for historical German can be found here (in German only): [Link]
