NER for Laura's accounts from FromThePage #17

Open
chpollin opened this issue Nov 25, 2020 · 0 comments

chpollin commented Nov 25, 2020

  • Phase 1: Training the Model

Because Laura's accounts are not tabular, they are easy to transcribe in FromThePage but hard to mark up. Fortunately, her accounts are very standardized in their wording, making them ideal for NER based on statistical models. Here's what we'd recommend for training a model:

1. Define the kinds of entities you are looking for, e.g. debtor, creditor, purchase date, execution date. (You should do this together.)
2. Create an NER project in DataTurks.io.
3. Export Laura's text from FromThePage as plaintext and upload it to the DataTurks project. (We guess 10 pages would be enough, but you could try fewer.)
4. Laura would use DataTurks to mark up each of the entities on those 10 pages as one of the defined types.
5. Once this is done, DataTurks can generate a spaCy model which Christopher can download (a rough sketch of the underlying training step follows below).
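
If the tagging platform only hands back the annotations rather than a ready-made model, the training step itself is small. Here is a minimal sketch using the spaCy 2.x training API; the example sentence, entity labels, character offsets, and output path are invented placeholders standing in for the platform's real export:

```python
import random
import spacy

# Placeholder training data in spaCy's (text, {"entities": [(start, end, label)]}) format.
# Real data would come from the annotation platform's export of Laura's pages.
TRAIN_DATA = [
    ("Received of John Smith five dollars on 3 March 1858.",
     {"entities": [(12, 22, "DEBTOR"), (39, 51, "PURCHASE_DATE")]}),
]

nlp = spacy.blank("en")                 # start from a blank English pipeline (spaCy 2.x)
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, annotations in TRAIN_DATA:
    for _start, _end, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for _ in range(30):                     # a few dozen passes is plenty for a small corpus
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)

nlp.to_disk("laura_ner_model")          # hypothetical output directory
```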

A caveat: We wrote this based on some work we did in 2018; the service we reference in Phase 1, DataTurks, no longer seems to be available (or you have to run it yourself). The same basic approach should work, but you'll have to find a good tagging platform. (LightTag might be an option, or something from this list: https://bohemian.ai/blog/text-annotation-tools-which-one-pick-2020/)

Useful References

A similar spaCy model was created for this project:

But I'm not seeing much that is specific to the model generation.

  • Phase 2: Running the Model

Christopher takes the model and applies it to the rest of the text: he can script a Python program (or notebook) that loads the model and runs it over the exported transcription pages (see the sketch below).
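
As a rough sketch of what that script might look like (the directory and model names here are assumptions, not project decisions):

```python
from pathlib import Path
import spacy

nlp = spacy.load("laura_ner_model")      # the model trained in Phase 1 (hypothetical path)

# Apply the model to every exported plaintext page and print one row per entity.
for page in sorted(Path("exported_pages").glob("*.txt")):
    doc = nlp(page.read_text(encoding="utf-8"))
    for ent in doc.ents:
        print(page.name, ent.label_, ent.text, ent.start_char, ent.end_char, sep="\t")
```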

Useful References:
https://content.fromthepage.com/machine-learning-to-extract-entities-from-ancient-greek-and-other-languages/
https://github.com/prosopograthon2019/informationextraction/blob/master/spacy.ipynb

  • Phase 3: Applying the Results

1. Export a TEI file of Laura's project from FromThePage.
2. Read the spaCy output format and apply the identified entities and entity types to the TEI file (a rough sketch of this step follows below).
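
One way to do this is to parse the TEI export with lxml and wrap matched entity strings in TEI elements. The sketch below only tags matches in a paragraph's leading text and uses an assumed label-to-element mapping and file names; a fuller version would work from the character offsets produced in Phase 2 and also walk the mixed content (element tails, line breaks):

```python
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Assumed mapping from NER labels to TEI elements; adjust to the project's schema.
LABEL_TO_TAG = {"DEBTOR": "persName", "CREDITOR": "persName", "PURCHASE_DATE": "date"}

# Entities as produced in Phase 2 (label, surface text); hard-coded here for illustration.
ENTITIES = [("DEBTOR", "John Smith"), ("PURCHASE_DATE", "3 March 1858")]

tree = etree.parse("laura_export.xml")   # hypothetical TEI export from FromThePage

for p in tree.iter("{%s}p" % TEI_NS):
    for label, surface in ENTITIES:
        # Only tag occurrences in the paragraph's leading text (p.text);
        # entities inside child elements' tails are left untouched in this sketch.
        if p.text and surface in p.text:
            before, _, after = p.text.partition(surface)
            el = etree.Element("{%s}%s" % (TEI_NS, LABEL_TO_TAG[label]))
            el.text = surface
            el.tail = after
            p.insert(0, el)              # insert before any existing children
            p.text = before

tree.write("laura_export_tagged.xml", encoding="utf-8", xml_declaration=True)
```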

Useful References:
See the NDAR Stanford NLP code where we apply TEI tags.
