-
Notifications
You must be signed in to change notification settings - Fork 230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read pdf image via tesseract #19
Conversation
fixed @dhowden |
@dhowden pls code review! |
@@ -17,6 +115,11 @@ func ConvertPDF(r io.Reader) (string, map[string]string, error) { | |||
} | |||
defer f.Done() | |||
|
|||
// Verify if pdf has images or is pdf only-text | |||
if PDFHasImage(f.Name()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you missed a comment from an earlier diff:
Does this mean that if a PDF has an embedded image then it will ignore completely the text of the document?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in our case, we use this check for PDFs typically generated by scanners, many of them generate PDF's with each page being the first photo of the original document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so you need to move this so that it's only enabled when OCR is enabled (otherwise PDFs which have images and text will be ignored if OCR hasn't been built in).
👍 |
filepath.Walk(tmpDir, walkFunc) | ||
|
||
var wg sync.WaitGroup | ||
m := make(map[int]string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to create an anonymous type here with the map and mutex inside, rather than having a global pdfMutex
which is only used in one place.
Sorry that it has been a while. The main problem is that PDFs with images aren't parsed at all when OCR support hasn't been enabled, which is a huge problem for us (and anyone else not using the OCR support!). Also: in a number of places you log errors instead of acting on them, so even with OCR support enabled the absence of files/errors parsing images will stop any further work being done on PDF contents. |
@marioidival is it possible to address these issues with the PR? We'd like to resolve this, but can't merge it while it will impact PDF's with images that aren't using OCR. |
What is the status of this PR ? Are you open to contributions to finish the job ? |
@onemartini: Absolutely open to contributions to finish this, just need to address the issues mentioned above. The main concern is that these changes break existing PDF parsing when the tesseract code is not enabled (i.e. when the If this can be resolved and verified, then we'd be happy to merge. |
@dhowden I finally got around to working on this. About ready to open a PR. Cool if I open a new one and reference this one ? |
Closing in favour of #40 |
#18