Word document instead of PDF #313

sarankup · 2018-04-28T14:00:41Z

Hi,

At present, I have all documents as DOCX (Microsoft Word files) which I convert to PDF in order to run the GROBID XML conversion. Is there any possibility of using DOCX as input?

If PDF is the only input option, is the full-text extraction reliable in terms of 100% content integrity, even when the XML markup is incorrect? We can live with incorrect XML markup, but not with any content loss.
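To make the "no content loss" requirement concrete, here is a minimal sketch of a check that could be run after each conversion. It assumes the source text has already been extracted from the DOCX by some other means, and it compares word counts against all text found in the GROBID TEI output. The function names `word_counts` and `missing_words` are hypothetical, not part of GROBID. Hyphenation, ligatures, or words split across elements can produce false positives, so this is a rough signal rather than a proof of integrity.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def word_counts(text):
    """Count lowercase word tokens in a string."""
    return Counter(re.findall(r"\w+", text.lower()))


def missing_words(source_text, tei_xml):
    """Return words (with counts) that occur more often in the source
    text than in the text content of the TEI XML string."""
    # itertext() walks every text node, whatever the element names are,
    # so this still works when the markup itself is wrong.
    tei_text = " ".join(ET.fromstring(tei_xml).itertext())
    # Counter subtraction keeps only positive differences.
    return dict(word_counts(source_text) - word_counts(tei_text))
```

An empty result means every word of the source appears at least as often in the TEI output; it says nothing about word order.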

lfoppiano · 2018-05-28T18:37:22Z

@sarankup Currently, PDF is the only input format for documents in Grobid.
Supporting several formats is quite demanding to implement and, moreover, to maintain, so Grobid supports the most widely used format for articles and monographs.

kermitt2 · 2018-06-06T20:52:08Z

One way to support docx files would be to create an XML parser similar to the current parser for pdf2xml (https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/sax/PDF2XMLSaxHandler.java) or the future one for ALTO (https://github.com/kermitt2/grobid/blob/pdfalto_integration/grobid-core/src/main/java/org/grobid/core/sax/PDFALTOSaxHandler.java).

An alternative is to create a stylesheet for transforming docx into an ALTO file.

But I have no clue if any of these options are doable :)
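To give a rough idea of what the first option involves, here is a minimal sketch of a SAX handler that reads the main XML part of a docx file (`word/document.xml` inside the ZIP container) and collects the text of each paragraph. It is written in Python for brevity, not in Java like the Grobid handlers linked above, and the names `DocxTextHandler` and `read_docx_paragraphs` are hypothetical. It only recovers text: the layout information (coordinates, fonts, pages) that the pdf2xml and ALTO handlers rely on is not stored in this form in docx, and nested structures such as text boxes are not handled.

```python
import io
import xml.sax
import zipfile
from xml.sax.handler import ContentHandler, feature_namespaces

# WordprocessingML namespace used by the w:p (paragraph) and w:t (text) elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


class DocxTextHandler(ContentHandler):
    """Collects the text of each w:p paragraph."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._runs = []
        self._in_text = False

    def startElementNS(self, name, qname, attrs):
        if name == (W_NS, "p"):
            self._runs = []
        elif name == (W_NS, "t"):
            self._in_text = True

    def characters(self, content):
        # Only keep character data that sits inside a w:t element.
        if self._in_text:
            self._runs.append(content)

    def endElementNS(self, name, qname):
        if name == (W_NS, "t"):
            self._in_text = False
        elif name == (W_NS, "p"):
            self.paragraphs.append("".join(self._runs))
            self._runs = []


def read_docx_paragraphs(docx_file):
    """Return the paragraph texts of a docx file (path or file-like object)."""
    handler = DocxTextHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)  # enables the *NS callbacks
    parser.setContentHandler(handler)
    with zipfile.ZipFile(docx_file) as archive:
        parser.parse(io.BytesIO(archive.read("word/document.xml")))
    return handler.paragraphs
```

Calling `read_docx_paragraphs("paper.docx")` (a hypothetical file name) returns one string per paragraph; a real handler would also have to build the token and block structures that the rest of Grobid expects.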

axfelix · 2018-06-20T00:03:49Z

It would probably be more sustainable in the long term to just add an unoconv service to Grobid for converting Word documents to PDF, so they can go through the existing Grobid toolchain. I'm a longtime contributor to meTypeset (https://github.com/MartinPaulEve/meTypeset), which parses Word documents to JATS XML through a series of complicated Python and XSLT rules. It's included in https://github.com/pkp/ots along with Grobid, and different parsers are used depending on the input, but it's harder to support in the long term because Microsoft's XML is very idiosyncratic and changes often.
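For illustration, a minimal sketch of such a conversion step, assuming LibreOffice is installed and its `soffice` binary is on the PATH. It calls the LibreOffice headless command line directly instead of unoconv (unoconv is itself a wrapper around LibreOffice), and the function names are hypothetical. The resulting PDF can then be sent to Grobid exactly like any other PDF.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir, binary="soffice"):
    """Build the LibreOffice headless command that converts one file to PDF."""
    return [binary, "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(docx_path)]


def docx_to_pdf(docx_path, out_dir, binary="soffice", timeout=120):
    """Convert a Word document to PDF and return the path of the PDF."""
    docx_path, out_dir = Path(docx_path), Path(out_dir)
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(docx_path, out_dir, binary),
                   check=True, timeout=timeout)
    # LibreOffice writes <input stem>.pdf into the output directory.
    pdf_path = out_dir / (docx_path.stem + ".pdf")
    if not pdf_path.exists():
        raise RuntimeError(f"conversion produced no file at {pdf_path}")
    return pdf_path
```

The layout of the PDF depends on the fonts available to LibreOffice, so the rendering may differ from what Word itself would produce.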

kermitt2 · 2019-07-05T20:36:09Z

My idea now would be to convert docx into ALTO. In principle any fixed-layout document could be transformed into ALTO, which is now the GROBID standard input.

kermitt2 · 2019-12-27T15:42:22Z

see PR #515

kermitt2 added the enhancement label Jul 5, 2019

kermitt2 mentioned this issue Aug 31, 2020

Docx support #631

Closed
