Evaluate quality of metadata extracted via pdftohtml and ocr2xml #69

wo · 2016-04-06T20:08:06Z

At the moment, paperparser.py invokes ocr2xml if the metadata extracted via pdftohtml has low confidence; it then returns the metadata extracted via OCR. Neither step is ideal. Sometimes pdftohtml produces much better results than ocr2xml, sometimes it's the other way round, and metadata confidence is not a good way to distinguish between the two cases. It would be better to have a separate quick sanity check of author/title/abstract that decides (1) whether ocr2xml needs to be invoked, and (2) whether to use the metadata extracted via pdftohtml or via OCR.

Here, for example, extraction via pdftohtml gets the title right, while extraction via ocr yields "9103 ‘9 [inV uo qSanuipg J0 AlissoAiuf] 112 filo'spetuno[plogxo'bd/pduq mos; popeopimoq" (perhaps because of the unusual font): http://pq.oxfordjournals.org/content/early/2016/04/04/pq.pqw028.full.pdf

On the other hand, here are some cases where pdftohtml yields really bad titles or authors:
http://web.ics.purdue.edu/~drkelly/MallonKellyMakingRaceNothing2012.pdf ("Making Race Out O f nO thing: Psych O l O gically cO nst R ained sO cial R O les")
http://www3.nd.edu/~dhoward1/Lost%20Wanderers.pdf ("Lost Wandere rs i n the Fore st o f Knowledg e: S ome Thought s on t he Disco very ) Just ifi cat ion Di sti ncti on")
http://www.consciousness.it/Docs/Lavazza,%20Manzotti%20-%202011%20-%20A%20New%20Mind%20for%20a%20New%20Aesthetics.pdf ("A N d REA L AVA zz A * | R I cc AR d O M AN z OTTI *")

It shouldn't be hard to recognize at least ridiculous cases like these.
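
A rough sketch of what such a sanity check could look like, using only the failure patterns visible in the examples above: stray single letters, case flips inside a word, letters fused with digits, and brackets or slashes inside a word. All names here (`suspicious_token`, `garbled_ratio`, `looks_garbled`, `choose_title`), the 0.2 threshold, and the idea of passing the OCR step in as a callable are assumptions for illustration; none of this is existing paperparser.py code.

```python
import re

# Single-letter words that legitimately occur in English titles (assumption).
ALLOWED_SINGLE_LETTERS = {"a", "A", "I"}


def suspicious_token(token):
    """Return True if a whitespace-separated token looks like extraction debris."""
    letters = re.sub(r"[^A-Za-z]", "", token)
    if len(letters) == 1 and letters not in ALLOWED_SINGLE_LETTERS:
        return True  # stray letter, e.g. "O" and "f" from a split "Of"
    if re.search(r"[a-z][A-Z]", token):
        return True  # case flip inside a word, e.g. "nO" or "inV"
    if re.search(r"[A-Za-z]\d|\d[A-Za-z]", token):
        return True  # letters fused with digits, e.g. "J0"
    if re.search(r"[A-Za-z][\[\]/|][A-Za-z]", token):
        return True  # bracket, slash or pipe inside a word
    return False


def garbled_ratio(text):
    """Fraction of tokens that look suspicious; empty text counts as fully garbled."""
    tokens = text.split()
    if not tokens:
        return 1.0
    return sum(suspicious_token(t) for t in tokens) / len(tokens)


def looks_garbled(text, threshold=0.2):
    """Crude yes/no check; the threshold is a guess, not a tuned value."""
    return garbled_ratio(text) >= threshold


def choose_title(pdftohtml_title, run_ocr):
    """Pick a title, calling run_ocr (a zero-argument callable, hypothetical)
    only if the pdftohtml title fails the sanity check."""
    if not looks_garbled(pdftohtml_title):
        return pdftohtml_title
    ocr_title = run_ocr()
    # Keep whichever candidate looks less damaged.
    if garbled_ratio(ocr_title) < garbled_ratio(pdftohtml_title):
        return ocr_title
    return pdftohtml_title
```

This will misfire on names like "McDowell" (case flip inside a word), which is why it scores the whole string by ratio rather than rejecting on a single hit. The same check could be applied to author and abstract strings, probably with different thresholds.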

wo added this to the someday maybe milestone Apr 6, 2016

wo added the pdfparser label Apr 6, 2016
