NER for Laura's accounts from FromThePage #17

Open
chpollin opened this issue Nov 25, 2020 · 0 comments

chpollin commented Nov 25, 2020

  • Phase 1: Training the Model

Because Laura's accounts are not tabular, they are easy to transcribe in FromThePage but hard to mark up. Fortunately, her accounts are very standardized in their wording, making them ideal for NER based on statistical models. Here's what we'd recommend for training a model:

1. Define the kinds of entities you are looking for, e.g. debtor, creditor, purchase date, execution date. (You should do this together.)
2. Create an NER project in DataTurks.io.
3. Export Laura's text from FromThePage as plaintext and upload it to the DataTurks project. (We guess 10 pages would be enough, but you could try fewer.)
4. Laura would use DataTurks to mark up each of the entities on those 10 pages as one of the defined types.
5. Once this is done, DataTurks can generate a spaCy model which Christopher can download (a rough sketch of the underlying training step follows below).
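
If the tagging platform only hands back the annotations rather than a ready-made model, the training step itself is small. Here is a minimal sketch using the spaCy 2.x training API; the example sentence, entity labels, character offsets, and output path are invented placeholders standing in for the platform's real export:

```python
import random
import spacy

# Placeholder training data in spaCy's (text, {"entities": [(start, end, label)]}) format.
# Real data would come from the annotation platform's export of Laura's pages.
TRAIN_DATA = [
    ("Received of John Smith five dollars on 3 March 1858.",
     {"entities": [(12, 22, "DEBTOR"), (39, 51, "PURCHASE_DATE")]}),
]

nlp = spacy.blank("en")                 # start from a blank English pipeline (spaCy 2.x)
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, annotations in TRAIN_DATA:
    for _start, _end, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for _ in range(30):                     # a few dozen passes is plenty for a small corpus
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)

nlp.to_disk("laura_ner_model")          # hypothetical output directory
```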

A caveat: We wrote this based on some work we did in 2018; the service we reference in Phase 1, DataTurks, no longer seems to be available (or you have to run it yourself). The same basic approach should work, but you'll have to find a good tagging platform. (LightTag might be an option, or something from this list: https://bohemian.ai/blog/text-annotation-tools-which-one-pick-2020/)

Useful References

A similar spaCy model was created for this project:

But I'm not seeing much that is specific to the model generation.

  • Phase 2: Running the Model

Christopher takes the model and applies it to the rest of the text: he can script a Python program (or notebook) that loads the model and runs it over the exported transcription pages (see the sketch below).
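
As a rough sketch of what that script might look like (the directory and model names here are assumptions, not project decisions):

```python
from pathlib import Path
import spacy

nlp = spacy.load("laura_ner_model")      # the model trained in Phase 1 (hypothetical path)

# Apply the model to every exported plaintext page and print one row per entity.
for page in sorted(Path("exported_pages").glob("*.txt")):
    doc = nlp(page.read_text(encoding="utf-8"))
    for ent in doc.ents:
        print(page.name, ent.label_, ent.text, ent.start_char, ent.end_char, sep="\t")
```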

Useful References:
https://content.fromthepage.com/machine-learning-to-extract-entities-from-ancient-greek-and-other-languages/
https://github.com/prosopograthon2019/informationextraction/blob/master/spacy.ipynb

  • Phase 3: Applying the Results

1. Export a TEI file of Laura's project from FromThePage.
2. Read the spaCy output format and apply the identified entities and entity types to the TEI file (a rough sketch of this step follows below).
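
One way to do this is to parse the TEI export with lxml and wrap matched entity strings in TEI elements. The sketch below only tags matches in a paragraph's leading text and uses an assumed label-to-element mapping and file names; a fuller version would work from the character offsets produced in Phase 2 and also walk the mixed content (element tails, line breaks):

```python
from lxml import etree

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Assumed mapping from NER labels to TEI elements; adjust to the project's schema.
LABEL_TO_TAG = {"DEBTOR": "persName", "CREDITOR": "persName", "PURCHASE_DATE": "date"}

# Entities as produced in Phase 2 (label, surface text); hard-coded here for illustration.
ENTITIES = [("DEBTOR", "John Smith"), ("PURCHASE_DATE", "3 March 1858")]

tree = etree.parse("laura_export.xml")   # hypothetical TEI export from FromThePage

for p in tree.iter("{%s}p" % TEI_NS):
    for label, surface in ENTITIES:
        # Only tag occurrences in the paragraph's leading text (p.text);
        # entities inside child elements' tails are left untouched in this sketch.
        if p.text and surface in p.text:
            before, _, after = p.text.partition(surface)
            el = etree.Element("{%s}%s" % (TEI_NS, LABEL_TO_TAG[label]))
            el.text = surface
            el.tail = after
            p.insert(0, el)              # insert before any existing children
            p.text = before

tree.write("laura_export_tagged.xml", encoding="utf-8", xml_declaration=True)
```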

Useful References:
See the NDAR Stanford NLP code where we apply TEI tags.
