Reichsanzeiger NLP

This is work in progress. The goal is to create an NLP ground truth corpus based on the OCR ground truth data for the historical newspaper Deutscher Reichsanzeiger und Preußischer Staatsanzeiger (1819-1945), which was scanned and OCR-ed at UB Mannheim.

Ongoing work

  • ✅ Convert the unprocessed text lines from Reichsanzeiger PAGE XML files to separate lines in TXT files [via blatt to_txt]. See data/text_raw/.
  • ✅ Remove hyphens & line breaks from the text lines from Reichsanzeiger files and save them as plain text in TXT files [via blatt to_txt]. See data/text_unhyphenated/.
  • ✅ Split the plain text (without line breaks and hyphens) into sentences and save them as TSV files with one sentence per line [via blatt to_tsv]. See data/sentences_raw/.
  • ✅ Correct sentence splitting manually and remove "noisy data" (e.g., tables). See data/sentences_checked/.
  • ✅ Import plain text (one sentence per line) to INCEpTION
  • ✅ Create the annotation guidelines
  • ✅ Create a tagset and annotation layer in INCEpTION according to the annotation guidelines. See inception/tagsets/ and inception/layers.
  • ✅ Annotate plain text according to the annotation guidelines
  • ✅ Export the annotations in INCEpTION formats (e.g., UIMA CAS XMI). See data/.
  • ✅ Create a converter from XMI to IOB format (cas2iob) and convert the XMI files into IOB files
  • ⏳ Curate the annotations from two annotators
  • 🔜 Train baseline models for NER/NEL
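
The hyphen and line break removal step in the list above can be illustrated with a minimal sketch. This is not the implementation used by blatt; the function name `unhyphenate`, the set of hyphen characters and the lowercase/uppercase heuristic are assumptions made for illustration only.

```python
# Hypothetical sketch, NOT the blatt implementation.
# Assumption: a word split across lines ends in one of these characters.
HYPHENS = ("-", "⸗", "¬")


def unhyphenate(lines):
    """Join text lines into one string, merging words split at line ends."""
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if text.endswith(HYPHENS) and line[0].islower():
            # Lowercase continuation: treat as a split word, drop the hyphen.
            text = text[:-1] + line
        elif text.endswith(HYPHENS) and line[0].isupper():
            # Uppercase continuation: assume a hyphenated compound, keep the hyphen.
            text += line
        elif text:
            text += " " + line
        else:
            text = line
    return text


print(unhyphenate(["Der Reichs-", "anzeiger erscheint", "täglich."]))
```

A heuristic like this misreads some cases, for example a suspended hyphen at a line end ("Ein-" followed by "und Ausfuhr"), so its output would still need checking.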

Annotation Software: INCEpTION

We tested INCEpTION, neat and MedTator, and chose INCEpTION as the most advanced of the three.

When old German plain text is annotated in INCEpTION or MedTator and the annotations are exported in IOB format, the tokenization is often incorrect. In these cases, neat can be used to correct the tokenization.

If we import plain text with one sentence per line into INCEpTION, instead of unsegmented plain text, the annotations exported in IOB format have a decent tokenization quality.
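
To show why tokenization matters for the IOB export, here is a minimal sketch that turns character-offset entity spans into IOB tags for one sentence. It is not the cas2iob converter and not INCEpTION's exporter; the function name `spans_to_iob`, the whitespace tokenization and the labels `PER` and `LOC` are assumptions chosen for illustration and do not reflect the project's tagset.

```python
import re


def spans_to_iob(sentence, spans):
    """Return (token, tag) pairs for one sentence.

    spans: list of (start, end, label) character offsets, end exclusive.
    Assumption: tokens are whitespace-separated.
    """
    rows = []
    for match in re.finditer(r"\S+", sentence):
        tag = "O"
        for start, end, label in spans:
            # A token is tagged only if it lies fully inside an entity span.
            if match.start() >= start and match.end() <= end:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        rows.append((match.group(), tag))
    return rows


sentence = "Herr Müller reiste nach Berlin ."
spans = [(5, 11, "PER"), (24, 30, "LOC")]  # hypothetical annotations
for token, tag in spans_to_iob(sentence, spans):
    print(f"{token}\t{tag}")
```

In this sketch, a token with attached punctuation such as "Berlin." would extend past the annotated span and end up tagged `O`, which is one way poor tokenization degrades an IOB export.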

Annotation Guidelines

We decided to develop the annotation guidelines iteratively, building on existing annotation guidelines for historical German texts and on an analysis of sample pages from the Reichsanzeiger.

Related work

HIPE competition on historical texts – Identifying Historical People, Places and other Entities

Existing NER/NEL corpora for historical German

| Dataset | Text type | Century | Project | Annotation Guidelines | Annotation Tool | Tasks | Tagset | License |
|---|---|---|---|---|---|---|---|---|
| AjMC | Commentaries | XIX | Ajax MultiCommentary | Zenodo | INCEpTION | NER, NEL | pers, work, loc, object, date, scope | CC BY 4.0 |
| HIPE-2020 | Newspaper | mid XIX - mid XX | CLEF-HIPE-2020 | Zenodo | INCEpTION | NER, NEL | pers, org, prod, time, loc | CC BY-NC-SA 4.0 |
| Newseye | Newspaper | mid XIX - mid XX | Newseye | Zenodo | Transkribus | NER, NEL | PER, LOC, ORG, HumanProd | CC BY 4.0 |
| SoNAR | Newspaper | mid XIX - mid XX | SoNAR | Zenodo | neat | NER, NEL | PER, LOC, ORG | CC BY 4.0 |

Reichsanzeiger at UB Mannheim