Convert all the index files from PDF to machine-readable format #2

siddharthkulkarni · 2016-04-07T00:37:21Z

List of indexes and corresponding years

Index Title	Years
351-85-0041	1958-1978
351-85-0050	1958-1978
351-90-0005	1976-1983
351-94-0005	1984-1989
351-99-002	1990-1993
110-99-002	1994-1995
351-85-0051	N/A

ghost · 2016-04-24T16:23:09Z

Made a first attempt at converting the PDFs to text files with the command-line OCR tool tesseract 3.03. I first converted each PDF to a TIFF image and then ran it through tesseract-ocr, but I get typos on every other line. Doubling the density of the TIFF made the output worse. Next I'm going to try tesseract 3.04 (v3.03 is two years old), built from source, and then keep tweaking the tesseract command line.

Update 2016-04-25: Running tesseract 3.04 now. Any error causes a segmentation fault, so troubleshooting the command line is challenging. I hit a snag with the training files, but I feel I may be close to a really accurate, machine-readable format from the command line.
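For anyone trying to reproduce the PDF to TIFF to text steps described above, here is a minimal Python sketch of that pipeline. It is not the exact command line used in this comment, which was not posted. It assumes ImageMagick's `convert` is the rasterizer (the comment does not say which tool produced the TIFF) and that the English language data is installed for tesseract. The function names `build_commands` and `run_pipeline`, and the default of 300 DPI, are illustrative choices only.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(pdf_path, dpi=300, lang="eng"):
    """Return two argv lists: rasterize the PDF to TIFF, then OCR the TIFF.

    Assumption: ImageMagick's `convert` does the rasterizing; the original
    comment does not name the tool that was actually used.
    """
    pdf = Path(pdf_path)
    tiff = pdf.with_suffix(".tiff")
    out_base = pdf.with_suffix("")  # tesseract appends ".txt" to this base name
    rasterize = [
        "convert",
        "-density", str(dpi),  # render resolution; worth tuning per scan
        str(pdf),
        "-depth", "8",         # 8 bits per channel
        "-alpha", "off",       # drop the alpha channel before OCR
        str(tiff),
    ]
    ocr = ["tesseract", str(tiff), str(out_base), "-l", lang]
    return [rasterize, ocr]


def run_pipeline(pdf_path, dpi=300, lang="eng"):
    """Run both steps and return the path of the resulting text file."""
    for argv in build_commands(pdf_path, dpi, lang):
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(f"{argv[0]} is not on PATH")
        # check=True raises CalledProcessError on a non-zero exit status,
        # which also covers a crash such as a segmentation fault.
        subprocess.run(argv, check=True)
    return Path(pdf_path).with_suffix(".txt")
```

Calling `run_pipeline("351-85-0041.pdf")` would write `351-85-0041.tiff` and `351-85-0041.txt` next to the PDF. Keeping `dpi` as a parameter makes it easy to compare densities, since the higher density tried above gave worse results.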
