Word document instead of PDF #313
Comments
@sarankup Currently, PDF is the only input format for documents in Grobid.
One way to support docx files would be to write an XML parser similar to the current parser for pdf2xml (https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/sax/PDF2XMLSaxHandler.java) or the future one for ALTO (https://github.com/kermitt2/grobid/blob/pdfalto_integration/grobid-core/src/main/java/org/grobid/core/sax/PDFALTOSaxHandler.java). An alternative is to write a stylesheet that transforms docx into an ALTO file. But I have no clue whether either of these options is doable :)
It would probably be more sustainable in the long term to just add an unoconv service to Grobid that converts Word documents to PDF, so they can go through the existing Grobid toolchain. I'm a longtime contributor to https://github.com/MartinPaulEve/meTypeset, which parses Word documents to JATS XML through a series of complicated Python and XSLT rules. It's included in https://github.com/pkp/ots along with Grobid, and different parsers are used depending on the input. That approach is harder to support in the long term, though, because Microsoft's XML is very idiosyncratic and changes often.
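For anyone who wants to try the convert-to-PDF route outside of Grobid first, here is a minimal sketch. It is not part of Grobid, and it swaps unoconv for a direct call to LibreOffice in headless mode (unoconv itself drives LibreOffice). The function names `build_convert_command` and `docx_to_pdf` are made up for this example, and it assumes a `soffice` binary is on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir, soffice="soffice"):
    """Return the LibreOffice command line that converts one DOCX file to PDF."""
    return [
        soffice,
        "--headless",            # run without a GUI
        "--convert-to", "pdf",   # target format
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def docx_to_pdf(docx_path, out_dir, soffice="soffice"):
    """Convert docx_path to PDF and return the path of the PDF that was written.

    Hypothetical helper: assumes LibreOffice is installed locally.
    """
    docx_path, out_dir = Path(docx_path), Path(out_dir)
    if shutil.which(soffice) is None:
        raise RuntimeError(f"{soffice!r} not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(docx_path, out_dir, soffice), check=True)
    # LibreOffice names the output after the input file's stem.
    pdf_path = out_dir / (docx_path.stem + ".pdf")
    if not pdf_path.exists():
        raise RuntimeError(f"conversion produced no file at {pdf_path}")
    return pdf_path
```

The PDF returned by `docx_to_pdf("paper.docx", "converted")` can then be fed to Grobid like any other PDF.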
My idea now would be to convert docx into ALTO. In principle, any fixed-layout document could be transformed into ALTO, which is now the standard input for GROBID.
See PR #515.
Hi,
At present, I have all documents as DOCX (Microsoft Word files) which I convert to PDF in order to run the GROBID XML conversion. Is there any possibility of using DOCX as input?
If PDF is the only input option, can we rely on the full-text extraction to always preserve 100% of the content, even when the XML markup is incorrect? We can live with incorrect XML markup, but not with any loss of content.
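One rough way to check for content loss, sketched here under several assumptions, is to compare the words in the original DOCX against the words in the TEI XML that Grobid returns. The helpers `docx_text`, `tei_text` and `missing_words` are hypothetical names for this example. The check only reads the main document part of the DOCX (no footnotes, headers or comments), and words split by hyphenation or inline markup will show up as false positives.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(docx_path):
    """Return the text of the main document part, one line per paragraph."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    parts = []
    for el in root.iter():  # document order
        if el.tag == W_NS + "p":
            parts.append("\n")           # paragraph boundary
        elif el.tag == W_NS + "t":
            parts.append(el.text or "")  # text run (may be a word fragment)
    return "".join(parts)


def tei_text(tei_xml):
    """Return all character data of a TEI XML string, markup stripped."""
    return " ".join(ET.fromstring(tei_xml).itertext())


def missing_words(source_text, extracted_text):
    """Count words that occur more often in source_text than in extracted_text."""
    def tokenize(s):
        return re.findall(r"\w+", s.lower())
    return Counter(tokenize(source_text)) - Counter(tokenize(extracted_text))
```

An empty result from `missing_words(docx_text("paper.docx"), tei_text(tei))` would suggest that no words were dropped. A non-empty one lists candidates to inspect by hand, not confirmed losses.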