Convert Images in PDFs #40

onemartini · 2017-11-15T22:16:49Z

This is based on the work in #19. Many thanks to @marioidival for getting that started.

The objective is to enable this tool to perform character recognition on images within PDFs in addition to its current pdftotext capabilities.

When the project is built with the ocr tag, ConvertPDF will detect images within the document and invoke ConvertImage on each of them.

Note that our gosseract dependency just released a v2 with a breaking change. In order to preserve the current integration, I've updated the import statement to use gosseract/v1/gosseract as recommended in their current README.

mish15 · 2017-11-16T21:06:21Z

@onemartini thank you for this! We're blitzed with a big release currently, but will jump on this in the next few days. Sorry for the delay in advance!

onemartini · 2017-11-20T17:00:46Z

@mish15 no worries. Best of luck with the release!

mish15

This is getting there! A few things to tidy up, but not far away

mish15 · 2017-11-21T03:13:42Z

pdf_tools.go

+
+	wg.Add(len(files))
+	for indx, p := range files {
+		go func(idx int, pathFile string, ww *sync.WaitGroup) {


The WaitGroup doesn't need to be passed into the goroutine, as it's the same pointer for all of them.
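A minimal sketch of what this suggestion could look like, not the code from the PR: the goroutine closes over `wg` from the enclosing scope, and only the per-iteration path is passed as an argument. `processAll` and `work` are hypothetical names used for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// processAll runs work on every path concurrently and returns how many
// calls completed. The goroutines capture wg from the enclosing scope, so
// it never has to be passed in as an argument.
func processAll(paths []string, work func(string)) int {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		done int
	)

	wg.Add(len(paths))
	for _, p := range paths {
		// Only the value that differs per iteration is a parameter.
		go func(pathFile string) {
			defer wg.Done()
			work(pathFile)

			mu.Lock()
			done++
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return done
}

func main() {
	n := processAll([]string{"a.png", "b.png", "c.png"}, func(string) {})
	fmt.Println(n)
}
```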

mish15 · 2017-11-21T03:14:38Z

pdf_tools.go

+		fileMap map[int]string
+	}{}
+
+	m.fileMap = make(map[int]string)


You can move this initialisation into the struct above, no need to separate
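One way to read this suggestion, sketched under the assumption that the anonymous struct embeds a `sync.Mutex` (the hunk only shows the `fileMap` field and a later `m.Unlock()` call): the map is allocated inside the composite literal, so no separate `make` statement follows. `collect` is a hypothetical wrapper.

```go
package main

import (
	"fmt"
	"sync"
)

// collect stores each page under its index in a mutex-guarded map.
// The embedded sync.Mutex is an assumption; the point is that fileMap
// is initialised in the struct literal itself.
func collect(pages []string) map[int]string {
	m := struct {
		sync.Mutex
		fileMap map[int]string
	}{fileMap: make(map[int]string)}

	for i, p := range pages {
		m.Lock()
		m.fileMap[i] = p
		m.Unlock()
	}
	return m.fileMap
}

func main() {
	fmt.Println(len(collect([]string{"page one", "page two"})))
}
```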

mish15 · 2017-11-21T03:16:03Z

pdf_tools.go

+			defer ww.Done()
+			f, err := os.Open(pathFile)
+			if err != nil {
+				log.Println(err)


This will fail below if .Open returns an error. The func below continues with no file. Return?

mish15 · 2017-11-21T03:17:19Z

pdf_tools.go

+		return bodyResult, err
+	}
+	tmpDir := fmt.Sprintf("%s/", tmp)
+	defer os.RemoveAll(tmpDir)


os.RemoveAll returns an error which isn't being handled.

mish15 · 2017-11-21T03:20:41Z

pdf_tools.go

@@ -0,0 +1,180 @@
+package docconv


It would be better if all OCR related funcs were moved to the same file and excluded with the build tags. This file has a mix of both

mish15 · 2017-11-21T03:25:03Z

pdf_ocr.go

+			return "", nil, bodyResult.err
+		}
+
+		return bodyResult.body, nil, nil


meta is being ignored for OCR based PDFs.

mish15 · 2017-11-21T03:27:32Z

pdf_ocr.go

+			return "", nil, bodyResult.err
+		}
+
+		return bodyResult.body, nil, nil


If a PDF has an image, the text is totally ignored. This will fail for mixed PDFs. Needs to handle both or explicitly prevent

mish15 · 2017-11-21T03:30:49Z

pdf_tools.go

+			m.fileMap[idx] = out
+			m.Unlock()
+
+			f.Close()


Needs to be a defer if the file is successfully opened.

mish15 · 2017-11-21T03:48:52Z

pdf_tools.go

+	o := make([]string, len(m.fileMap))
+
+	for i := 0; i < len(m.fileMap); i++ {
+		o = append(o, m.fileMap[i])


No need to create an intermediate slice here. Note: the fileMap itself could also just be a channel to send the body back on, which would also remove the need for the mutex.

for _, str := range m.fileMap { bodyResult += str + " " }

onemartini · 2017-11-22T03:28:00Z

Thanks for the feedback @mish15. Here's another iteration.

maintain meta data for both pure text and PDFs with images
convert text and images in PDFs with images
move all OCR functions to same file
replace anonymous struct and mutex with a channel
don't pass WaitGroup into goroutine
return error on os.Open
defer f.Close

mish15

This is much better. Couple of minor issues.

mish15 · 2017-11-26T22:23:46Z

pdf_ocr.go

+		go func(pathFile string) {
+
+			f, err := os.Open(pathFile)
+


remove empty line

mish15 · 2017-11-26T22:26:43Z

pdf_ocr.go

+				bodyResult.err = err
+			}
+
+			wg.Done()


This should be a defer in the beginning of the func. It's not a problem now, but if any early return is added this can cause a deadlock.

mish15 · 2017-11-26T23:16:08Z

pdf_ocr.go

+				bodyResult.err = err
+			}
+
+			defer f.Close()


If the file failed to open, you can't close it, this will panic. Need an early return

mish15 · 2017-11-26T23:18:14Z

pdf_ocr.go

+
+	wg.Wait()
+
+	go func() {


no need for a goroutine here
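A sketch of the shape this comment points toward, with hypothetical names and toy arithmetic in place of OCR: once `wg.Wait()` returns, every sender has finished, so the channel can be closed inline. The sketch assumes a channel buffered to one slot per sender; with an unbuffered channel the senders would block while the caller is still waiting.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares squares each input in its own goroutine and sums the results.
func sumSquares(nums []int) int {
	// One buffer slot per sender, so no goroutine blocks on send while
	// the caller is inside wg.Wait.
	ch := make(chan int, len(nums))
	var wg sync.WaitGroup

	wg.Add(len(nums))
	for _, n := range nums {
		go func(v int) {
			defer wg.Done()
			ch <- v * v
		}(n)
	}

	// All senders are done once Wait returns, so close can be called
	// directly; a goroutine around it adds nothing.
	wg.Wait()
	close(ch)

	total := 0
	for sq := range ch {
		total += sq
	}
	return total
}

func main() {
	fmt.Println(sumSquares([]int{1, 2, 3}))
}
```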

onemartini · 2017-11-29T19:35:59Z

@mish15: new iteration:

defer wg.Done in goroutine
return early from goroutine if file cannot be opened
remove unnecessary goroutine around channel close

convert images in pdfs when package is build with ocr tags

e603ee4

onemartini mentioned this pull request Nov 16, 2017

Read pdf image via tesseract #19

Closed

mish15 assigned dhowden Nov 16, 2017

mish15 requested changes Nov 21, 2017


mish15 mentioned this pull request Nov 21, 2017

Extract image-based PDFs #33

Closed

onemartini added 2 commits November 21, 2017 16:39

iterates on code review

9243e97

cleanup

b857c25

onemartini added 3 commits November 21, 2017 19:29

defer f.Close

d7bc55d

fixes send closed channel issue

7c7ae5a

removes logs

4b2f10e

mish15 requested changes Nov 26, 2017


fixes channel buffering issue

9df101c

mish15 approved these changes Dec 1, 2017


mish15 merged commit e675275 into sajari:master Dec 1, 2017
