Convert all the index files from PDF to machine-readable format #2

Open
siddharthkulkarni opened this issue Apr 7, 2016 · 1 comment

Comments


List of indexes and corresponding years

| Index Title | Years |
| --- | --- |
| 351-85-0041 | 1958-1978 |
| 351-85-0050 | 1958-1978 |
| 351-90-0005 | 1976-1983 |
| 351-94-0005 | 1984-1989 |
| 351-99-002 | 1990-1993 |
| 110-99-002 | 1994-1995 |
| 351-85-0051 | N/A |

ghost commented Apr 24, 2016

Made a first attempt to convert the PDFs to text files with the command-line OCR tool Tesseract 3.03. I first converted each PDF to a TIFF image, then ran the TIFF through tesseract-ocr, but I get typos on every other line. I doubled the density of the TIFF, and the result is worse. Next I'm going to try Tesseract 3.04 (v3.03 is two years old), built from source, and then keep tweaking the tesseract command line.

Update 2016-04-25: Running Tesseract 3.04 now. Any error causes a segmentation fault, so troubleshooting the command line is challenging. I hit a snag with the training files, but I feel I may be close to a really accurate, machine-readable format from the command line.
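For anyone who wants to reproduce the PDF to TIFF to text pipeline described above, here is a minimal sketch. It is not the exact command line used in this comment. It assumes ImageMagick's `convert` handles the rasterizing step (the comment does not say which tool was used) and that `tesseract` is on the `PATH`. The function names, the 300 DPI default, and the `eng` language are illustrative assumptions.

```python
import subprocess
from pathlib import Path


def build_convert_cmd(pdf_path, tiff_path, density=300):
    """Build an ImageMagick command that rasterizes a PDF to TIFF.

    Assumption: ImageMagick's `convert` is the rasterizer. `-density`
    sets the DPI and must come before the input file to take effect.
    """
    return ["convert", "-density", str(density), str(pdf_path),
            "-depth", "8", str(tiff_path)]


def build_tesseract_cmd(tiff_path, output_base, lang="eng"):
    """Build a Tesseract command; output goes to `<output_base>.txt`."""
    return ["tesseract", str(tiff_path), str(output_base), "-l", lang]


def pdf_to_text(pdf_path, density=300, lang="eng"):
    """Run both steps and return the path of the resulting text file."""
    pdf_path = Path(pdf_path)
    tiff_path = pdf_path.with_suffix(".tiff")
    output_base = pdf_path.with_suffix("")  # Tesseract appends ".txt" itself

    # check=True raises CalledProcessError on any non-zero exit status,
    # including a process killed by a signal such as a segmentation fault.
    subprocess.run(build_convert_cmd(pdf_path, tiff_path, density), check=True)
    subprocess.run(build_tesseract_cmd(tiff_path, output_base, lang), check=True)
    return pdf_path.with_suffix(".txt")
```

Calling `pdf_to_text("351-85-0041.pdf", density=300)` (a hypothetical file name based on the index list) would leave `351-85-0041.txt` next to the PDF. Keeping the command builders separate from the runner makes it easier to print and inspect the exact command line when a run crashes.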
