Problem with superscript/subscript #1005
Comments
@keto33 Pdfalto does recognise superscript/subscript. Grobid does too (the information can be accessed via the LayoutToken objects), but it does not yet output it in the XML (the change is currently being worked on in PR #936). The change is quite complex and will take some time to be merged into master, but you can try it out nevertheless.
@lfoppiano, I had seen #936 before when I was looking for issues related to subscript. However, I did not quite catch how to implement it. Is it already implemented in 0.7.3?
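For readers who want to check what the ALTO layer carries before the TEI serialization lands, here is a minimal Python sketch. It assumes the subscript/superscript flag is exposed the way the ALTO schema describes it, as a `FONTSTYLE` value on a `TextStyle` that each `String` points to through `STYLEREFS`. The sample document is hand-written for illustration (not actual pdfalto output, and without the XML namespace a real ALTO file declares), and the helper names `script_styles` and `tag_line` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written ALTO-like sample (assumption: not real pdfalto output).
SAMPLE_ALTO = """<alto>
  <Styles>
    <TextStyle ID="font0" FONTSIZE="10"/>
    <TextStyle ID="font1" FONTSIZE="7" FONTSTYLE="subscript"/>
  </Styles>
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="H" STYLEREFS="font0"/>
      <String CONTENT="2" STYLEREFS="font1"/>
      <String CONTENT="O" STYLEREFS="font0"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def script_styles(root):
    """Map each TextStyle ID to 'sub', 'sup', or None."""
    styles = {}
    for style in root.iter("TextStyle"):
        flags = style.get("FONTSTYLE", "").split()
        if "subscript" in flags:
            styles[style.get("ID")] = "sub"
        elif "superscript" in flags:
            styles[style.get("ID")] = "sup"
        else:
            styles[style.get("ID")] = None
    return styles


def tag_line(line, styles):
    """Serialize one TextLine, wrapping sub/superscript strings in tags."""
    parts = []
    for child in line:
        if child.tag == "SP":  # ALTO marks inter-word spaces explicitly
            parts.append(" ")
        elif child.tag == "String":
            text = child.get("CONTENT", "")
            tag = None
            # STYLEREFS may list several style IDs separated by spaces.
            for ref in child.get("STYLEREFS", "").split():
                tag = styles.get(ref) or tag
            parts.append(f"<{tag}>{text}</{tag}>" if tag else text)
    return "".join(parts)


root = ET.fromstring(SAMPLE_ALTO)
styles = script_styles(root)
lines = [tag_line(line, styles) for line in root.iter("TextLine")]
print(lines)
```

Because spaces are only emitted for explicit `SP` elements, the subscript stays attached to its neighbours instead of being separated by spaces.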
Hi @keto33
Serialization of superscript/subscript in the TEI XML is implemented in PR #936, but it is expected to be merged in version 0.8.0, not 0.7.3. Superscript/subscript should be working well, but in this branch serializing bold/italic is more complicated and requires more tests.
Currently yes.
It's possible, yes: some document editors/publishers generate PDFs where the flow of the PDF elements does not follow the reading order for some special tokens. Could you maybe share the landing page of the articles where you saw these problems? I might have a subscription to access them and reproduce the error.
@kermitt2, thanks for following up. I encountered the problem of subscript/superscript in several papers. More complicated subscripts (e.g., This paper https://doi.org/10.1016/S0167-2738(00)00327-1 has most of the problems I mentioned. You just need to take a look at the abstract. If you do not have a subscription, I can post it here. I just didn't want to upload copyrighted materials on your project page without your permission. And if you are interested in more examples, I can provide them.
@keto33 There is a version that is accessible without a subscription here: https://zenodo.org/record/1259881 In general, the subscripts are recognized; however, it does depend on how the PDF document was constructed. Are these papers all from Elsevier? See the picture below: by highlighting the text, you can see that the subscripts are sorted after the line and not within it. This issue should be reported to Pdfalto, if I'm not wrong.
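To make the ordering problem concrete, here is a minimal Python sketch of a post-processing workaround: tokens on a line are re-sorted by their horizontal position so that a subscript emitted after the line goes back to where it is drawn. This is only an illustration and not how Grobid or Pdfalto handle it. The `Token` class, its fields, the coordinates, and the `min_gap` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    x: float          # left edge of the token on the page (hypothetical units)
    width: float      # rendered width of the token
    script: str = ""  # "", "sub" or "sup"


def reorder(tokens):
    """Sort the tokens of a single line by horizontal position."""
    return sorted(tokens, key=lambda t: t.x)


def render(tokens, min_gap=1.0):
    """Join tokens, adding a space only where the layout shows a visible gap."""
    out = []
    prev_right = None
    for tok in tokens:
        if prev_right is not None and tok.x - prev_right > min_gap:
            out.append(" ")
        if tok.script:
            out.append(f"<{tok.script}>{tok.text}</{tok.script}>")
        else:
            out.append(tok.text)
        prev_right = tok.x + tok.width
    return "".join(out)


# Stream order as described above: the subscript arrives after the line.
stream = [
    Token("MnO", 10.0, 20.0),
    Token("film", 38.0, 18.0),
    Token("2", 30.0, 4.0, "sub"),
]
fixed = render(reorder(stream))
print(fixed)
```

Deciding spaces from the gap between token boxes, rather than inserting one between every pair of tokens, is also what keeps a subscript attached to the preceding token. Sorting by position only works within a single line; multi-column layouts would need the line segmentation to be right first.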
The documentation states PDFALTO recognises superscript/subscript.
First, how does GROBID format superscript/subscript? I have not seen `<sub>` or `<sup>` in the output.
Second, in my practice, superscript/subscript is printed with space even in formula blocks in the form of `H 2 O` with no effect. Is it the intended behaviour?
Third, I noticed that superscript/subscript is sometimes misplaced. For example, `MnO<sub>2</sub> film` is printed as `MnO film 2`. I can share examples but cannot upload the PDFs as I am not the copyright holder.
I use GROBID 0.7.2 using the command: