Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read pdf image via tesseract #19

Closed
wants to merge 7 commits into from

Conversation

marioidival
Copy link
Contributor

#18

@marioidival
Copy link
Contributor Author

fixed @dhowden

@avelino
Copy link

avelino commented Oct 19, 2015

@dhowden pls code review!

@@ -17,6 +115,11 @@ func ConvertPDF(r io.Reader) (string, map[string]string, error) {
}
defer f.Done()

// Verify if pdf has images or is pdf only-text
if PDFHasImage(f.Name()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you missed a comment from an earlier diff:

Does this mean that if a PDF has an embedded image then it will ignore completely the text of the document?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in our case, we use this check for PDFs typically generated by scanners, many of them generate PDF's with each page being the first photo of the original document.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so you need to move this so that it's only enabled when OCR is enabled (otherwise PDFs which have images and text will be ignored if OCR hasn't been built in).

@guilhermebr
Copy link
Contributor

👍

filepath.Walk(tmpDir, walkFunc)

var wg sync.WaitGroup
m := make(map[int]string)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to create an anonymous type here with the map and mutex inside, rather than having a global pdfMutex which is only used in one place.

@dhowden
Copy link
Member

dhowden commented Nov 22, 2015

Sorry that it has been a while. The main problem is that PDFs with images aren't parsed at all when OCR support hasn't been enabled, which is a huge problem for us (and anyone else not using the OCR support!).

Also: in a number of places you log errors instead of acting on them, so even with OCR support enabled the absence of files/errors parsing images will stop any further work being done on PDF contents.

@mish15
Copy link
Member

mish15 commented Sep 28, 2016

@marioidival is it possible to address these issues with the PR? We'd like to resolve this, but can't merge it while it will impact PDF's with images that aren't using OCR.

@onemartini
Copy link
Contributor

What is the status of this PR ? Are you open to contributions to finish the job ?

@dhowden
Copy link
Member

dhowden commented Oct 13, 2017

@onemartini: Absolutely open to contributions to finish this, just need to address the issues mentioned above.

The main concern is that these changes break existing PDF parsing when the tesseract code is not enabled (i.e. when the docconv library is used without the ocr build tag). In this case, the behaviour should be as before, instead it causes parsing of PDFs to fail if they include any images.

If this can be resolved and verified, then we'd be happy to merge.

@onemartini
Copy link
Contributor

@dhowden I finally got around to working on this. About ready to open a PR. Cool if I open a new one and reference this one ?

@onemartini onemartini mentioned this pull request Nov 15, 2017
@onemartini
Copy link
Contributor

@dhowden check out #40 when you get a chance :)

@mish15
Copy link
Member

mish15 commented Nov 21, 2017

Closing in favour of #40

@mish15 mish15 closed this Nov 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants