Help with PDF generation from Word #839
Comments
Hi @pakojil, thank you very much for the feedback on Grobid! Having an XSLT to transform Grobid's TEI into JATS would be really nice. It was discussed at some point (see issue #98), but it does not seem to be progressing (a JATS -> TEI transformation is available, on the contrary ;). On my side, I left this work to others in order to concentrate on more core issues in Grobid (unfortunately I don't have a lot of time for Grobid).
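To illustrate the kind of TEI to JATS mapping discussed here, below is a minimal sketch. It is written in Python with the standard library rather than as an XSLT stylesheet, and it only covers the article title and author names. The function name `tei_to_jats_front` and the sample document are hypothetical, and the element paths assume a typical Grobid TEI header (`titleStmt/title`, `sourceDesc//author/persName`); real output should be checked before relying on them.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI documents, including Grobid's output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily reduced TEI header used only for illustration.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc>
    <titleStmt><title level="a" type="main">A Sample Title</title></titleStmt>
    <sourceDesc><biblStruct><analytic>
      <author><persName>
        <forename type="first">Jane</forename><surname>Doe</surname>
      </persName></author>
    </analytic></biblStruct></sourceDesc>
  </fileDesc></teiHeader>
</TEI>"""


def tei_to_jats_front(tei_xml: str) -> str:
    """Map the title and author names of a TEI header to a JATS <front> fragment."""
    root = ET.fromstring(tei_xml)
    front = ET.Element("front")
    meta = ET.SubElement(front, "article-meta")

    # TEI titleStmt/title -> JATS title-group/article-title
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title_group = ET.SubElement(meta, "title-group")
    article_title = ET.SubElement(title_group, "article-title")
    if title is not None and title.text:
        article_title.text = title.text.strip()

    # TEI author/persName -> JATS contrib/name (surname, then given-names)
    contrib_group = ET.SubElement(meta, "contrib-group")
    for pers in root.iterfind(".//tei:sourceDesc//tei:author/tei:persName", TEI_NS):
        contrib = ET.SubElement(contrib_group, "contrib", {"contrib-type": "author"})
        name = ET.SubElement(contrib, "name")
        surname = pers.find("tei:surname", TEI_NS)
        forenames = [f.text.strip() for f in pers.findall("tei:forename", TEI_NS) if f.text]
        ET.SubElement(name, "surname").text = surname.text.strip() if surname is not None and surname.text else ""
        ET.SubElement(name, "given-names").text = " ".join(forenames)

    return ET.tostring(front, encoding="unicode")


if __name__ == "__main__":
    print(tei_to_jats_front(SAMPLE_TEI))
```

A complete conversion would also need to handle affiliations, the abstract, body sections, and references, which is where a dedicated XSLT would be more appropriate.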
About your second question, I think I am also not going to be very helpful... First, it's better to export the PDF from Word. There is a working branch that supports docx input by converting it to PDF with Apache POI and then processing the PDF with Grobid (#515). However, the performance was not satisfactory: for me the failure rate was around 5% and the transformation was very slow. I was planning to test docx4j. There is also no open source solution for ... Then, I don't think there are many options for the Word "save as PDF". The quality setting of the PDF has no impact (it only affects the quality of embedded images). It would indeed be interesting to identify Word templates that work better with Grobid.
Hi @kermitt2, thank you very much for your kind answer. I am working on two fronts. On the one hand, we have a portal with more than 40 scientific journals. On the other hand, and this is what interests me the most right now, we are trying to provide authors with a Word template so that the correct tagging of the document is automated as much as possible, leaving only minor adjustments for the journal editors or the technical staff involved. With your information I am better oriented, and I will keep working on it. Thank you very much again, and greetings.
Hi Patrice,
First of all, I would like to thank you and the rest of the project collaborators for the great effort you make.
I am thoroughly evaluating Grobid for the production of JATS versions of our articles.
Regarding the conversion of historical PDFs, it is clear to me that, setting aside the training (which I honestly don't know how to use properly), everything can be solved with an XSLT transformation from TEI to JATS.
However, I am trying to create a Word template for future articles. This is because I have been unable to find any tool that works properly for direct conversion from docx to JATS, and I have tried practically all of them.
My question is whether there is any existing template and, in addition, whether there is some way to generate the PDF so that Grobid better identifies the fields when generating the TEI version.
I mean, is any version of Word better than another? What is the most suitable PDF producer for this?
Thank you very much, and I apologize for any inconvenience.