Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert Images in PDFs #40

Merged
merged 7 commits into from
Dec 1, 2017
Merged

Conversation

onemartini
Copy link
Contributor

@onemartini onemartini commented Nov 15, 2017

This is based on the work in #19. Many thanks to @marioidival for getting that started.

The objective is to enable this tool to perform character recognition on images within PDFs in addition to its current pdftotext capabilities.

When the project is built with the ocr tag, ConvertPDF will detect images within the document and invoke ConvertImage on each of them.

Note that our gosseract dependency just released a v2 with a breaking change. In order to preserve the current integration, I've updated the import statement to use gosseract/v1/gosseract as recommended in their current README.

@mish15
Copy link
Member

mish15 commented Nov 16, 2017

@onemartini thank you for this! We're blitzed with a big release currently, but will jump on this in the next few days. Sorry for the delay in advance!

@onemartini
Copy link
Contributor Author

@mish15 no worries. Best of luck with the release !

Copy link
Member

@mish15 mish15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting there! A few things to tidy up, but not far away

pdf_tools.go Outdated

wg.Add(len(files))
for indx, p := range files {
go func(idx int, pathFile string, ww *sync.WaitGroup) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the WaitGroup doesn't need to be passed into the goroutine as it's the same pointer for all

pdf_tools.go Outdated
fileMap map[int]string
}{}

m.fileMap = make(map[int]string)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can move this initialisation into the struct above, no need to separate

pdf_tools.go Outdated
defer ww.Done()
f, err := os.Open(pathFile)
if err != nil {
log.Println(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail below if .Open returns an error. The func below continues with no file. Return?

pdf_tools.go Outdated
return bodyResult, err
}
tmpDir := fmt.Sprintf("%s/", tmp)
defer os.RemoveAll(tmpDir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.RemoveAll returns an error which isn't being handled.

pdf_tools.go Outdated
@@ -0,0 +1,180 @@
package docconv
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better if all OCR related funcs were moved to the same file and excluded with the build tags. This file has a mix of both

pdf_ocr.go Outdated
return "", nil, bodyResult.err
}

return bodyResult.body, nil, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meta is being ignored for OCR based PDFs.

pdf_ocr.go Outdated
return "", nil, bodyResult.err
}

return bodyResult.body, nil, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a PDF has an image, the text is totally ignored. This will fail for mixed PDFs. Needs to handle both or explicitly prevent

pdf_tools.go Outdated
m.fileMap[idx] = out
m.Unlock()

f.Close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be a defer if the file is successfully opened.

pdf_tools.go Outdated
o := make([]string, len(m.fileMap))

for i := 0; i < len(m.fileMap); i++ {
o = append(o, m.fileMap[i])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to create an intermediate slice here. Note: the fileMap itself could also just be a channel to send the body back on, which would also remove the need for the mutex.

for _, str := range m.fileMap {
    bodyResult += str + " " 
}

@mish15 mish15 mentioned this pull request Nov 21, 2017
@onemartini
Copy link
Contributor Author

Thanks for the feedback @mish15. Here's another iteration.

  • maintain meta data for both pure text and PDFs with images
  • convert text and images in PDFs with images
  • move all OCR functions to same file
  • replace anonymous struct and mutex with a channel
  • don't pass WaitGroup into goroutine
  • return error on os.Open
  • defer f.Close

Copy link
Member

@mish15 mish15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much better. Couple of minor issues.

pdf_ocr.go Outdated
go func(pathFile string) {

f, err := os.Open(pathFile)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove empty line

pdf_ocr.go Outdated
bodyResult.err = err
}

wg.Done()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a defer in the beginning of the func. It's not a problem now, but if any early return is added this can cause a deadlock.

pdf_ocr.go Outdated
bodyResult.err = err
}

defer f.Close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the file failed to open, you can't close it, this will panic. Need an early return

pdf_ocr.go Outdated

wg.Wait()

go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for a goroutine here

@onemartini
Copy link
Contributor Author

@mish15: new iteration:

  • defer wg.Done in goroutine
  • return early from goroutine if file cannot be opened
  • remove unnecessary goroutine around channel close

@mish15 mish15 merged commit e675275 into sajari:master Dec 1, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants