Sub/superscript are displayed as plain text characters in the TEI output #160
Comments
Hey there, I had a quick question. I just started tinkering with GROBID and was wondering whether superscript/subscript identification can be added through training, for example by giving the following training data:
β-cell Endoplasmic Reticulum Ca2+
<titleStmt>
<title level="a" type="main">β-cell Endoplasmic Reticulum Ca<sup>2+</sup></title>
</titleStmt>
Thanks for the input.
Subscript and superscript flags are attached to the tokens, so we could serialize with …
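To make the idea of serializing from per-token flags concrete, here is a minimal sketch in Python (not GROBID's Java code). It assumes a hypothetical `Token` class carrying `superscript` and `subscript` booleans, and it assumes the output uses TEI's generic `<hi>` element with a `rend` attribute; the thread does not confirm which element is actually emitted. Consecutive tokens with the same style are merged into one element.

```python
from dataclasses import dataclass
from itertools import groupby
from xml.sax.saxutils import escape


@dataclass
class Token:
    # Hypothetical stand-in for a layout token; not GROBID's actual class.
    text: str
    superscript: bool = False
    subscript: bool = False


def style_of(token):
    """Return the assumed rend value for a token, or None for plain text."""
    if token.superscript:
        return "superscript"
    if token.subscript:
        return "subscript"
    return None


def serialize(tokens):
    """Merge runs of tokens sharing a style into a single <hi> element."""
    parts = []
    for style, run in groupby(tokens, key=style_of):
        # Escape XML special characters so the output stays well-formed.
        text = escape("".join(t.text for t in run))
        if style is None:
            parts.append(text)
        else:
            parts.append(f'<hi rend="{style}">{text}</hi>')
    return "".join(parts)


tokens = [Token("Ca"), Token("2", superscript=True), Token("+", superscript=True)]
print(serialize(tokens))  # Ca<hi rend="superscript">2+</hi>
```

Grouping by style before wrapping avoids emitting one element per token, so "2" and "+" end up in a single superscript span.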
I'm starting to work on implementing this feature. What should be done when a token contains combinations? Like … Also, it seems that the place to add this part would be in the … @kermitt2, any advice on this?
With the current recognition, the "style" features could indeed support, in principle, at least italic, bold, and superscript/subscript. The TEI guidelines introduce … In TEI, there's also … I think what's complicated are the relations and the possible clashes with other structures/tagging.
I think it's now implemented by injecting … I've also tried to modularise the code a bit into methods, so that they can be unit tested as separate components. I tried not to run the realignment of the code 😅, which usually makes a mess... I'm sending some examples:
First reflection: identify pieces of text as sub/superscript based on position, fonts, etc.
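As a rough illustration of identifying sub/superscript from position and font, here is a heuristic sketch in Python. Everything in it is an assumption made for illustration: the `Glyph` class, the thresholds `size_ratio` and `shift_ratio`, and the convention that `y` grows downward. It is not how GROBID computes its flags. A piece of text is labelled as a script when it is noticeably smaller than the line's median font size and its baseline is shifted away from the line's median baseline.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    # Hypothetical token geometry; y is assumed to grow downward on the page.
    text: str
    font_size: float
    baseline_y: float


def classify(glyphs, size_ratio=0.85, shift_ratio=0.15):
    """Label each glyph 'superscript', 'subscript' or 'normal' relative to its line."""
    line_size = median(g.font_size for g in glyphs)
    line_base = median(g.baseline_y for g in glyphs)
    labels = []
    for g in glyphs:
        shift = line_base - g.baseline_y  # positive: glyph sits above the line baseline
        small = g.font_size <= size_ratio * line_size
        if small and shift >= shift_ratio * line_size:
            labels.append("superscript")
        elif small and shift <= -shift_ratio * line_size:
            labels.append("subscript")
        else:
            labels.append("normal")
    return labels


line = [
    Glyph("β-cell", 10.0, 100.0),
    Glyph("Endoplasmic", 10.0, 100.0),
    Glyph("Reticulum", 10.0, 100.0),
    Glyph("Ca", 10.0, 100.0),
    Glyph("2+", 6.5, 96.0),
]
print(classify(line))
```

Using the median as the reference keeps a few small glyphs from skewing the line's "normal" size and baseline, though it would be unreliable on very short lines made up mostly of scripts.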