Read pdf image via tesseract #19

marioidival · 2015-10-07T14:15:21Z

marioidival · 2015-10-13T19:57:43Z

avelino · 2015-10-19T16:50:42Z

@dhowden pls code review!

dhowden · 2015-10-19T21:44:13Z

pdf.go

@@ -17,6 +115,11 @@ func ConvertPDF(r io.Reader) (string, map[string]string, error) {
 	}
 	defer f.Done()

+	// Verify if pdf has images or is pdf only-text
+	if PDFHasImage(f.Name()) {


Looks like you missed a comment from an earlier diff: "Does this mean that if a PDF has an embedded image then the text of the document will be ignored completely?"

marioidival · 2015-10-28

Yes. In our case, we use this check for PDFs typically generated by scanners; many of them generate PDFs in which each page is a photo of the original document.

dhowden · 2015-11-22

OK, so you need to move this so that it's only enabled when OCR is enabled (otherwise the text of PDFs which have both images and text will be ignored if OCR hasn't been built in).
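One way to picture the requested change is a guard that only takes the image path when OCR support is actually present. The sketch below is illustrative only: `ocrEnabled`, `convertPDF` and the `ocr` callback are hypothetical names, not the PR's code, and the flag stands in for whatever mechanism (for example a build tag) switches OCR on.

```go
package main

import "fmt"

// ocrEnabled is a hypothetical switch standing in for OCR support.
// In a real package it could be set to true from a file that is only
// compiled when OCR is built in.
var ocrEnabled = false

// convertPDF is a simplified, hypothetical stand-in for a PDF converter.
// text is the PDF's embedded text layer, hasImage reports whether the PDF
// contains images, and ocr is a hypothetical function that runs OCR.
func convertPDF(text string, hasImage bool, ocr func() (string, error)) (string, error) {
	// Only take the image path when OCR is available. Without this guard,
	// a PDF that has both images and text would lose its text whenever
	// OCR has not been built in.
	if ocrEnabled && hasImage {
		ocrText, err := ocr()
		if err != nil {
			return "", fmt.Errorf("ocr: %w", err)
		}
		return ocrText, nil
	}
	// Previous behaviour: return the embedded text unchanged.
	return text, nil
}

func main() {
	scan := func() (string, error) { return "text from scanned page", nil }

	out, err := convertPDF("embedded text", true, scan)
	fmt.Println(out, err)

	ocrEnabled = true
	out, err = convertPDF("embedded text", true, scan)
	fmt.Println(out, err)
}
```

Whether the OCR result should replace the embedded text or be combined with it is a separate design question that this sketch does not settle.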

guilhermebr · 2015-11-20T15:55:40Z

👍

dhowden · 2015-11-22T05:31:01Z

pdf.go

+	filepath.Walk(tmpDir, walkFunc)
+
+	var wg sync.WaitGroup
+	m := make(map[int]string)


Better to create an anonymous type here with the map and mutex inside, rather than having a global pdfMutex which is only used in one place.

dhowden · 2015-11-22T05:51:45Z

Sorry that it has been a while. The main problem is that PDFs with images aren't parsed at all when OCR support hasn't been enabled, which is a huge problem for us (and anyone else not using the OCR support!).

Also, in a number of places you log errors instead of acting on them, so even with OCR support enabled, missing files or errors while parsing images will stop any further work being done on the PDF contents.

mish15 · 2016-09-28T00:00:39Z

@marioidival is it possible to address these issues with the PR? We'd like to resolve this, but can't merge it while it affects PDFs with images for users who aren't using OCR.

onemartini · 2017-10-13T01:04:41Z

What is the status of this PR? Are you open to contributions to finish the job?

dhowden · 2017-10-13T01:18:22Z

@onemartini: Absolutely open to contributions to finish this, just need to address the issues mentioned above.

The main concern is that these changes break existing PDF parsing when the tesseract code is not enabled (i.e. when the docconv library is used without the ocr build tag). In this case, the behaviour should be as before; instead, it causes parsing of PDFs to fail if they include any images.

If this can be resolved and verified, then we'd be happy to merge.

onemartini · 2017-11-15T18:24:20Z

@dhowden I finally got around to working on this. About ready to open a PR. Cool if I open a new one and reference this one?

onemartini · 2017-11-16T19:02:32Z

@dhowden check out #40 when you get a chance :)

mish15 · 2017-11-21T06:50:32Z

Closing in favour of #40

marioidival added 4 commits September 28, 2015 15:58

recognizing pdf with images

8729c97

use goroutines to process ConvertImage function

5bc56ae

fixing code review

8c05c20

remove map to race condition

c4246a8

marioidival added 3 commits October 13, 2015 18:46

add Lock on goroutine

80f9689

add sync.Mutex in anonymous struct

fec59c2

remove old things

5b66790

dhowden reviewed Oct 19, 2015

dhowden reviewed Nov 22, 2015

Southclaws mentioned this pull request Aug 9, 2017

Extract image-based PDFs #33

Closed

onemartini mentioned this pull request Nov 15, 2017

Convert Images in PDFs #40

Merged

mish15 closed this Nov 21, 2017
