Word document instead of PDF #313

Open
sarankup opened this issue Apr 28, 2018 · 5 comments

Comments

@sarankup

Hi,

At present, I have all documents as DOCX (Microsoft Word files) which I convert to PDF in order to run the GROBID XML conversion. Is there any possibility of using DOCX as input?

If PDF is the only input option, is the full-text extraction reliable in terms of 100% content integrity, even when the XML markup is incorrect? We can accept incorrect XML markup, but not any loss of content.

@lfoppiano
Collaborator

@sarankup Currently, PDF is the only input format for documents in Grobid.
Supporting several formats is quite demanding to implement and, moreover, to maintain, so Grobid supports the most widely used format for articles and monographs.

@kermitt2
Owner

kermitt2 commented Jun 6, 2018

One way to support docx files would be to create an XML parser similar to the current parser for pdf2xml (https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/sax/PDF2XMLSaxHandler.java) or the future one for ALTO (https://github.com/kermitt2/grobid/blob/pdfalto_integration/grobid-core/src/main/java/org/grobid/core/sax/PDFALTOSaxHandler.java).

An alternative is to create a stylesheet for transforming docx into an ALTO file.

But I have no clue if any of these options are doable :)
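
To give a rough idea of what the first option involves, here is a minimal sketch of a SAX handler that reads the main part of a docx archive (`word/document.xml`) and collects the text of each paragraph. It is written in Python rather than Java for brevity and is not part of Grobid; `DocxParagraphHandler` and `read_docx_paragraphs` are hypothetical names. It only recovers text, not the layout coordinates that the pdf2xml and ALTO handlers rely on, so it shows the parsing side only.

```python
import io
import zipfile
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# Namespace of WordprocessingML elements such as w:p (paragraph) and w:t (text).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


class DocxParagraphHandler(ContentHandler):
    """Hypothetical handler: collects the text of every w:p paragraph."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None   # text pieces of the paragraph being read
        self._in_text = False  # True while inside a w:t element

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        if uri != W_NS:
            return
        if local == "p":
            self._current = []
        elif local == "t" and self._current is not None:
            self._in_text = True

    def endElementNS(self, name, qname):
        uri, local = name
        if uri != W_NS:
            return
        if local == "t":
            self._in_text = False
        elif local == "p" and self._current is not None:
            self.paragraphs.append("".join(self._current))
            self._current = None

    def characters(self, content):
        if self._in_text:
            self._current.append(content)


def read_docx_paragraphs(docx_file):
    """Return the paragraph texts of a docx given as a path or file object."""
    with zipfile.ZipFile(docx_file) as archive:
        data = archive.read("word/document.xml")
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)  # enables the *NS callbacks
    handler = DocxParagraphHandler()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(data))
    return handler.paragraphs
```

Nested paragraphs (text boxes), tables, footnotes and headers live in other elements or other archive parts and are ignored here, which is part of why a full parser would be a larger job.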

@axfelix

axfelix commented Jun 20, 2018

It would probably be more sustainable long term to just add an unoconv service to Grobid for converting Word documents to PDF so they can go through the existing Grobid toolchain. I'm a longtime contributor to https://github.com/MartinPaulEve/meTypeset which parses Word documents to JATS XML through a series of complicated python and XSLT rules -- it's included in https://github.com/pkp/ots along with Grobid, and different parsers are used depending on input -- but it's harder to support in the long term because Microsoft's XML is very idiosyncratic and changes often.
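
A minimal sketch of that idea, done outside Grobid as a client-side script: convert the Word file to PDF with unoconv, then post the PDF to a running Grobid service. The assumptions are that `unoconv` is installed and on the PATH, and that Grobid is listening locally on port 8070 with the full-text endpoint at `/api/processFulltextDocument` taking the PDF in a multipart field named `input`; adjust these to your setup. The helper names (`unoconv_command`, `encode_multipart`, `docx_to_tei`) are made up for illustration.

```python
import subprocess
import urllib.request
import uuid
from pathlib import Path

# Assumed address of a locally running Grobid service.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"


def unoconv_command(docx_path, pdf_path):
    """Build the unoconv call that converts a Word file to PDF."""
    return ["unoconv", "-f", "pdf", "-o", str(pdf_path), str(docx_path)]


def encode_multipart(field, filename, payload, boundary=None):
    """Encode a single file as a multipart/form-data body (stdlib only)."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def docx_to_tei(docx_path, grobid_url=GROBID_URL):
    """Convert docx to PDF with unoconv, then return Grobid's TEI XML."""
    docx_path = Path(docx_path)
    pdf_path = docx_path.with_suffix(".pdf")
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(unoconv_command(docx_path, pdf_path), check=True)
    body, content_type = encode_multipart(
        "input", pdf_path.name, pdf_path.read_bytes()
    )
    request = urllib.request.Request(
        grobid_url,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling `docx_to_tei("paper.docx")` would write `paper.pdf` next to the input and return the TEI as a string. Any content lost in the Word-to-PDF step is lost before Grobid sees the file, so the conversion output is worth spot-checking.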

@kermitt2
Owner

kermitt2 commented Jul 5, 2019

My idea now would be to convert docx into ALTO. In principle any fixed-layout document could be transformed into ALTO, which is now the GROBID standard input.

@kermitt2
Owner

see PR #515

@kermitt2 kermitt2 mentioned this issue Aug 31, 2020