TrueViz extraction fails silently for some PDFs #88

afs25 · 2019-08-20T13:38:47Z

First of all, thank you for developing CERMINE. I am very impressed by what it can do.

One of the projects I am working on at the moment relies on identifying some elements of the layout of PDF files, so I am particularly interested in parsing the TrueViz XML output of CERMINE. I noticed that for some PDFs, CERMINE silently fails to produce any content in TrueViz format: the resulting .cermstr file contains no Zone, Word or Character elements inside any of its Page elements.
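For anyone who wants to check their own output for the same symptom, here is a minimal sketch in Python. It assumes the .cermstr file is plain XML without namespaces and uses only the element names mentioned above (Page, Zone, Word, Character); the function names are hypothetical and not part of CERMINE.

```python
import xml.etree.ElementTree as ET

# Element names taken from the description above; assumed to be un-namespaced.
LAYOUT_TAGS = ("Zone", "Word", "Character")


def count_layout_elements(xml_data):
    """Return a list of (page_index, {tag: count}) for every Page element."""
    root = ET.fromstring(xml_data)
    report = []
    for index, page in enumerate(root.iter("Page")):
        # Count descendants of each layout type inside this page.
        counts = {tag: sum(1 for _ in page.iter(tag)) for tag in LAYOUT_TAGS}
        report.append((index, counts))
    return report


def empty_pages(xml_data):
    """Return the indices of pages with no Zone, Word or Character elements."""
    return [
        index
        for index, counts in count_layout_elements(xml_data)
        if not any(counts.values())
    ]
```

To use it on a real file, read the file in binary mode and pass the bytes, for example `empty_pages(open("paper.cermstr", "rb").read())`, where `paper.cermstr` is a placeholder name. A result listing every page index would match the failure described here.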

Unfortunately I cannot post the problematic PDF here because it is copyrighted (I am happy to send the PDF in a personal message if requested), but I will post an example as soon as I come across one that can be shared.

Is there any way I can inspect debug information from CERMINE to try to understand what is special about this PDF and how I can go about fixing this? In other words, can the verbosity of CERMINE be increased somehow? Perhaps pre-processing the PDF with pdftk or ghostscript might solve the problem, but it is difficult to implement that without understanding the underlying problem.
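As an illustration of the pre-processing idea, this is a rough sketch of rewriting a PDF with Ghostscript's `pdfwrite` device before passing it to CERMINE. It assumes a `gs` executable is on the PATH; the helper names are hypothetical, and whether re-writing the file actually helps with this problem is untested.

```python
import subprocess


def ghostscript_rewrite_command(input_pdf, output_pdf):
    """Build a Ghostscript command that re-writes input_pdf to output_pdf."""
    # -o sets the output file and implies batch, non-interactive mode.
    return ["gs", "-o", output_pdf, "-sDEVICE=pdfwrite", input_pdf]


def rewrite_pdf(input_pdf, output_pdf):
    """Run Ghostscript; raises CalledProcessError if gs exits with an error."""
    subprocess.run(ghostscript_rewrite_command(input_pdf, output_pdf), check=True)
```

The rewritten file could then be given to CERMINE in place of the original to see whether the TrueViz output is still empty.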

Thank you in advance for any help!
