Help with PDF generation from Word #839

Open
pakojil opened this issue Oct 12, 2021 · 2 comments
Labels
enhancement question There's no such thing as a stupid question

Comments

@pakojil

pakojil commented Oct 12, 2021

Hi Patrice
First of all, I would like to thank you and the rest of the project collaborators for the great effort you make.
I am thoroughly evaluating Grobid for producing JATS versions of articles.

Regarding the conversion of historical PDFs, it is clear to me that, even without the training (which I honestly don't know how to use properly), everything can be solved via an XSLT transformation from TEI to JATS.

However, I am trying to create a Word template for future articles. This is because I have been unable to find any tool that works properly for direct conversion from docx to JATS, and I have tried practically all of them.

My question is whether there is an existing template and, beyond that, whether there is some way to generate the PDF so that Grobid better identifies the fields when generating the TEI version.
I mean, is one version of Word better than another? What is the most suitable PDF producer for this?

Thank you very much, and apologies for any inconvenience.

@kermitt2
Owner

Hi @pakojil

Thank you very much for the feedback on Grobid!

Having an XSLT to transform Grobid's TEI into JATS would be really nice. It was discussed at some point (see issue #98), but it does not seem to be progressing (a JATS -> TEI transformation is available, on the other hand ;). On my side, I left this work to others so I could concentrate on more core issues in Grobid (unfortunately I don't have a lot of time for Grobid).
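To illustrate the kind of mapping such a transformation would perform, here is a minimal sketch in Python rather than XSLT. It only copies the title and abstract from a Grobid-style TEI header into a JATS `front` skeleton. The function name `tei_to_jats_front` is hypothetical, and the element paths assume the usual TEI header layout (`titleStmt/title`, `profileDesc/abstract`); a real converter would need to cover authors, affiliations, body, and references as well.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI documents (including Grobid output).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_jats_front(tei_xml: str) -> str:
    """Build a bare JATS <article><front> skeleton from a TEI string.

    Hypothetical helper: only the title and abstract are mapped.
    """
    root = ET.fromstring(tei_xml)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    abstract = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)

    article = ET.Element("article")
    front = ET.SubElement(article, "front")
    meta = ET.SubElement(front, "article-meta")

    # TEI title -> JATS title-group/article-title
    title_group = ET.SubElement(meta, "title-group")
    article_title = ET.SubElement(title_group, "article-title")
    if title is not None:
        article_title.text = "".join(title.itertext()).strip()
    else:
        article_title.text = ""

    # TEI abstract -> JATS abstract/p, flattened to one paragraph
    if abstract is not None:
        jats_abstract = ET.SubElement(meta, "abstract")
        paragraph = ET.SubElement(jats_abstract, "p")
        paragraph.text = " ".join("".join(abstract.itertext()).split())

    return ET.tostring(article, encoding="unicode")
```

The output is not validated against the JATS DTD; it is only meant to show how TEI header fields line up with their JATS counterparts.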

> if there is some way to generate the PDF so that Grobid better identify the fields for the generation of the TEI version.
> I mean, is there any version of Word better than another? What is the most suitable PDF producer for this?

About this second question, I am afraid I am not going to be very helpful either...

First, it's better to export the PDF from Word. There is a working branch (#515) that supports docx input by converting it to PDF with Apache POI and then processing the PDF with Grobid. However, the performance was not satisfactory: the failure rate was around 5% for me, and the conversion was very slow. I was planning to test docx4j. There is also no open source solution for .doc at the moment. So I would say the best solution is to use Word's own proprietary PDF export.

Second, I don't think there are many options for Word's "save as PDF". The PDF quality setting has no impact on Grobid (it only affects the quality of embedded images).

It would be interesting indeed to identify Word templates that work better with Grobid.
In general, using different font sizes for the title and section headers, and using large paragraph separation and indents, always helps Grobid :)
Due to the lack of training data for the social sciences and the humanities, references given as footnotes are not well supported for the moment.
Finally, using common fonts (avoiding proprietary fonts that can't be easily embedded) and avoiding special characters where possible (they might not be resolved properly via Unicode mapping) always helps.
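As a rough way to spot characters that were not resolved properly, one could scan the text extracted from a test PDF for code points that typically signal a failed mapping. The sketch below is only an assumption about what counts as "suspicious" (the Unicode replacement character, private-use code points, and unassigned code points); the function name `suspicious_characters` is hypothetical and this is not part of Grobid.

```python
import unicodedata


def suspicious_characters(text):
    """Return (index, code point, Unicode category) for characters that
    often indicate a failed font-to-Unicode mapping.

    Assumption: U+FFFD, private-use ("Co") and unassigned ("Cn")
    code points are treated as suspicious.
    """
    findings = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if ch == "\ufffd" or category in ("Co", "Cn"):
            findings.append((index, f"U+{ord(ch):04X}", category))
    return findings
```

Running this over the text of the TEI produced from a sample document built with the template gives a quick indication of whether the chosen fonts survive the PDF export.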

@pakojil
Author

pakojil commented Oct 19, 2021

Hi @kermitt2

Thank you very much for your kind answer.

I am working on two fronts. On the one hand, we have a portal with more than 40 scientific journals.
Of these, the vast majority are from the humanities and social sciences.
Part of the challenge is trying to get valid JATS from historically uploaded PDFs (some of them are scanned, and the others were generated directly from various sources).
I'm working on that and thinking it through.

On the other hand, and this is what interests me most right now, we are trying to provide a Word template to authors, so that the correct tagging of the document is automated as far as possible, leaving only minor adjustments for the journal editors or the technical staff involved.

Your information has given me a better sense of direction, and I will continue along those lines.
I will report on my progress, if there is any.

Thank you very much again, and best regards
