Phase 1: Training the Model
Because Laura's accounts are not tabular, they are easy to transcribe in FromThePage but hard to mark up. Fortunately, her accounts are very standardized in their wording, making them ideal for NER based on statistical models. Here's what we'd recommend for training a model:
Define the kinds of entities you are looking for, e.g. debtor, creditor, purchase date, execution date. (You should do this together.)
Create an NER project in DataTurks.io.
Export Laura's text from FromThePage as plaintext and upload it to the DataTurks project. (We guess 10 pages would be enough, but you could try fewer.)
Laura would use DataTurks to mark up each of the entities on those 10 pages as one of the defined types.
Once this is done, DataTurks can generate a SpaCy model which Christopher can download.
A caveat: we wrote this based on some work we did in 2018, and the service we reference in Phase 1, DataTurks, no longer seems to be available (or you have to run it yourself). The same basic approach should work, but you'll have to find a good tagging platform. (LightTag might be an option, or something from this list: https://bohemian.ai/blog/text-annotation-tools-which-one-pick-2020/)
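Since DataTurks handled the model-generation step for us in 2018, whatever platform you pick will most likely just hand you the annotations, and the SpaCy model would then be trained from that export. Here is a rough sketch of that step, assuming spaCy 3.x and a platform that can export each page as text plus character-offset entity spans; the sample sentence, offsets, label names, and output directory below are invented placeholders:

```python
import random

import spacy
from spacy.training import Example

# Annotated pages exported from the tagging platform, as
# (text, {"entities": [(start_char, end_char, LABEL)]}) tuples.
# This single example is a made-up placeholder; the real data
# would be Laura's ~10 tagged pages.
TRAIN_DATA = [
    ("John Smith owes Laura $5 for cloth, due 1 March 1815.",
     {"entities": [(0, 10, "DEBTOR"), (16, 21, "CREDITOR"),
                   (40, 52, "EXECUTION_DATE")]}),
]

nlp = spacy.blank("en")
nlp.add_pipe("ner")

# Convert the annotations into spaCy Example objects
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in TRAIN_DATA]

# Initialize the pipeline from the examples (this also registers the entity labels)
optimizer = nlp.initialize(lambda: examples)

# Simple training loop; with only ~10 pages the iteration count
# will take some experimentation.
for iteration in range(30):
    random.shuffle(examples)
    losses = {}
    for example in examples:
        nlp.update([example], sgd=optimizer, losses=losses)
    print(iteration, losses)

# Save the model so it can be loaded in Phase 2
nlp.to_disk("laura_ner_model")
```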
Useful References
A similar SpaCy model was created for this project: https://content.fromthepage.com/machine-learning-to-extract-entities-from-ancient-greek-and-other-languages/ (But I'm not seeing much specific to the model generation.)
Phase 2: Running the Model
Christopher takes the model and applies it to the rest of the text: he can script a Python program (or notebook) that loads the model and runs it over the exported transcription pages.
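A rough sketch of what that script might look like, assuming the pages were exported from FromThePage as individual plaintext files in a pages/ directory and the Phase 1 model was saved to laura_ner_model (both names are placeholders); it writes the recognized entities to a CSV for review:

```python
import csv
from pathlib import Path

import spacy

# Load the model trained in Phase 1 (directory name is a placeholder)
nlp = spacy.load("laura_ner_model")

# Run the model over each exported transcription page
rows = []
for page in sorted(Path("pages").glob("*.txt")):
    doc = nlp(page.read_text(encoding="utf-8"))
    for ent in doc.ents:
        rows.append({
            "page": page.name,
            "entity": ent.text,
            "type": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
        })

# Dump everything to a CSV so the results can be spot-checked
with open("entities.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["page", "entity", "type", "start", "end"])
    writer.writeheader()
    writer.writerows(rows)
```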
Useful References:
https://github.com/prosopograthon2019/informationextraction/blob/master/spacy.ipynb
Phase 3: Applying the Entities to the TEI Export
Export a TEI file of Laura's project from FromThePage.
Read the SpaCy output and apply the identified entities & entity types to the TEI file (see the sketch below).
Useful References:
See the NDAR Stanford NLP code where we apply TEI tags.
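As a rough illustration of that step (not the NDAR code itself), here is a sketch using lxml that wraps each recognized entity in a TEI <rs type="..."> element. It assumes entities fall inside plain <p> elements (any existing inline markup within a paragraph gets flattened), and the file names and model path are placeholders:

```python
from lxml import etree
import spacy

TEI_NS = "http://www.tei-c.org/ns/1.0"

nlp = spacy.load("laura_ner_model")          # model from Phase 1 (placeholder path)
tree = etree.parse("laura_project_tei.xml")  # TEI export from FromThePage (placeholder name)

for p in tree.iter("{%s}p" % TEI_NS):
    text = "".join(p.itertext())             # flatten the paragraph to plain text
    if not text.strip():
        continue
    doc = nlp(text)
    if not doc.ents:
        continue

    # Rebuild the paragraph with each entity wrapped in <rs type="LABEL">
    for child in list(p):
        p.remove(child)
    p.text = None
    cursor = 0
    last = None
    for ent in doc.ents:
        plain = text[cursor:ent.start_char]  # untagged text before this entity
        if last is None:
            p.text = plain
        else:
            last.tail = plain
        rs = etree.SubElement(p, "{%s}rs" % TEI_NS, type=ent.label_)
        rs.text = ent.text
        last = rs
        cursor = ent.end_char
    last.tail = text[cursor:]                # trailing text after the last entity

tree.write("laura_project_tagged.xml", encoding="utf-8", xml_declaration=True)
```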