TrueViz extraction fails silently for some PDFs #88

Open
afs25 opened this issue Aug 20, 2019 · 0 comments

First of all, thank you for developing CERMINE. I am very impressed by what it can do.

One of the projects I am working on at the moment relies on identifying some elements of the layout of PDF files, so I am particularly interested in parsing the TrueViz XML output of CERMINE. I noticed that for some PDFs, CERMINE silently fails to output the content in TrueViz format. The resulting .cermstr file does not contain any Zone, Word or Character elements inside each of the Page elements.
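
A quick way to detect affected files might be to scan the TrueViz XML for pages without content. The following is only a minimal sketch: it assumes the element names mentioned above (`Page`, `Zone`, `Word`, `Character`) with no XML namespace, and the `sample` document is a made-up stand-in, not real CERMINE output.

```python
import xml.etree.ElementTree as ET

# Element names as mentioned in this issue; the real TrueViz schema may differ.
CONTENT_TAGS = ("Zone", "Word", "Character")


def empty_pages(xml_text):
    """Return 0-based indices of Page elements that have no
    Zone, Word or Character descendants."""
    root = ET.fromstring(xml_text)
    empty = []
    for index, page in enumerate(root.iter("Page")):
        has_content = any(
            page.find(f".//{tag}") is not None for tag in CONTENT_TAGS
        )
        if not has_content:
            empty.append(index)
    return empty


# Hypothetical two-page document: the first page is empty, the second is not.
sample = """<Document>
  <Page><PageID Value="0"/></Page>
  <Page><Zone><Word><Character/></Word></Zone></Page>
</Document>"""

print(empty_pages(sample))
```

For a real file, the contents of the .cermstr file could be read with `open(path, encoding="utf-8").read()` and passed to `empty_pages`; a non-empty result would flag the silent failure described here.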

Unfortunately I cannot post the problematic PDF here because it is copyrighted (I am happy to send the PDF in a personal message if requested), but I will post an example as soon as I come across one that can be shared.

Is there any way I can inspect debug information from CERMINE to try to understand what is special about this PDF and how I can go about fixing this? In other words, can the verbosity of CERMINE be increased somehow? Perhaps pre-processing the PDF with pdftk or ghostscript might solve the problem, but it is difficult to implement that without understanding the underlying problem.
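
For reference, the Ghostscript pre-processing idea could look roughly like the sketch below, which re-writes a PDF through the `pdfwrite` device before handing it to CERMINE. Whether this actually fixes the extraction is untested and purely an assumption; the function names and file names are hypothetical, and `gs` is assumed to be on the `PATH`.

```python
import shutil
import subprocess


def ghostscript_rewrite_command(src, dst, gs_binary="gs"):
    """Build a Ghostscript command that re-writes src into dst.

    -o sets the output file and implies batch, non-interactive mode;
    -sDEVICE=pdfwrite makes Ghostscript emit a new PDF.
    """
    return [gs_binary, "-o", dst, "-sDEVICE=pdfwrite", src]


def rewrite_pdf(src, dst):
    """Run Ghostscript to normalise a PDF; raises if gs is missing or fails."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) not found on PATH")
    subprocess.run(ghostscript_rewrite_command(src, dst), check=True)


# Hypothetical file names, shown only to illustrate the resulting command.
print(ghostscript_rewrite_command("problem.pdf", "problem-rewritten.pdf"))
```

The re-written file would then be passed to CERMINE in place of the original, to see whether the Page elements are populated.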

Thank you in advance for any help!
