Evaluate quality of metadata extracted via pdftohtml and ocr2xml #69

Open
wo opened this issue Apr 6, 2016 · 0 comments


wo commented Apr 6, 2016

At the moment, paperparser.py invokes ocr2xml if the metadata extracted via pdftohtml has low confidence; it then returns the metadata extracted via OCR. Neither step is ideal. Sometimes pdftohtml produces much better results than ocr2xml, sometimes it's the other way round, and metadata confidence is not a good way to distinguish between the two cases. It would be better to have a separate, quick sanity check of author/title/abstract that decides (1) whether ocr2xml needs to be invoked, and (2) whether to use the metadata extracted via pdftohtml or via OCR.
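A rough sketch of how that two-step decision could look. Everything here is hypothetical: `choose_metadata`, `run_ocr`, `badness` and `max_badness` are illustrative names, not anything that exists in paperparser.py, and the fallback rule (keep whichever source scores better) is just one possible choice.

```python
from typing import Callable, Dict, Tuple

Metadata = Dict[str, str]  # assumed shape, e.g. {"title": ..., "authors": ...}


def choose_metadata(
    html_meta: Metadata,
    run_ocr: Callable[[], Metadata],
    badness: Callable[[Metadata], float],
    max_badness: float = 0.2,
) -> Tuple[str, Metadata]:
    """Return (source, metadata); OCR is only run if pdftohtml looks bad.

    `badness` is any quick sanity score where 0.0 means "looks fine"
    and higher values mean "looks more garbled".
    """
    html_score = badness(html_meta)
    if html_score <= max_badness:
        # (1) pdftohtml output passes the sanity check: skip the OCR run.
        return "pdftohtml", html_meta

    # (2) pdftohtml output looks bad: run OCR and keep the better result.
    ocr_meta = run_ocr()
    if badness(ocr_meta) < html_score:
        return "ocr", ocr_meta
    return "pdftohtml", html_meta
```

Passing the OCR step in as a callable keeps the expensive run lazy, so it only happens when the pdftohtml result fails the check.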

Here, for example, extraction via pdftohtml gets the title right, while extraction via ocr yields "9103 ‘9 [inV uo qSanuipg J0 AlissoAiuf] 112 filo'spetuno[plogxo'bd/pduq mos; popeopimoq" (perhaps because of the unusual font): http://pq.oxfordjournals.org/content/early/2016/04/04/pq.pqw028.full.pdf

On the other hand, here are some cases where pdftohtml yields really bad titles or authors:
http://web.ics.purdue.edu/~drkelly/MallonKellyMakingRaceNothing2012.pdf ("Making Race Out O f nO thing: Psych O l O gically cO nst R ained sO cial R O les")
http://www3.nd.edu/~dhoward1/Lost%20Wanderers.pdf ("Lost Wandere rs i n the Fore st o f Knowledg e: S ome Thought s on t he Disco very ) Just ifi cat ion Di sti ncti on")
http://www.consciousness.it/Docs/Lavazza,%20Manzotti%20-%202011%20-%20A%20New%20Mind%20for%20a%20New%20Aesthetics.pdf ("A N d REA L AVA zz A * | R I cc AR d O M AN z OTTI *")

It shouldn't be hard to recognize at least ridiculous cases like these.
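As a starting point, here is a minimal sketch of such a check. It is only a heuristic under stated assumptions: the function names, the token rules and the 0.2 threshold are made up for illustration and have not been tuned on real data. A token counts as suspicious if it contains no letters, is a stray single letter other than "a" or "I", has a lowercase letter directly followed by an uppercase one, or contains a digit.

```python
import re

ALLOWED_SINGLE = {"a", "i"}              # legitimate one-letter English words
LOWER_THEN_UPPER = re.compile(r"[a-z][A-Z]")  # ASCII only, e.g. "nO", "qSanuipg"


def suspicious_fraction(text: str) -> float:
    """Fraction of whitespace-separated tokens that look like extraction debris."""
    tokens = text.split()
    if not tokens:
        return 1.0  # an empty title or author string is treated as bad
    bad = 0
    for tok in tokens:
        letters = "".join(ch for ch in tok if ch.isalpha())
        if not letters:
            bad += 1  # pure digits or punctuation, e.g. "9103", "*", ")"
        elif len(letters) == 1 and letters.lower() not in ALLOWED_SINGLE:
            bad += 1  # stray single letter, e.g. "O", "f", "J0"
        elif LOWER_THEN_UPPER.search(tok) or any(ch.isdigit() for ch in tok):
            bad += 1  # odd capitalization or digits mixed into a word
    return bad / len(tokens)


def looks_garbled(text: str, threshold: float = 0.2) -> bool:
    """True if at least `threshold` of the tokens look suspicious."""
    return suspicious_fraction(text) >= threshold
```

This flags all four bad strings quoted above and passes the correct forms of the first and third titles. It will also misfire on some legitimate titles (lots of initials, years, or names like "McDowell" in a short title), so it is meant for catching the ridiculous cases only.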

wo added this to the someday maybe milestone Apr 6, 2016
wo added the pdfparser label Apr 6, 2016