-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert Images in PDFs #40
Conversation
@onemartini thank you for this! We're blitzed with a big release currently, but will jump on this in the next few days. Sorry for the delay in advance! |
@mish15 no worries. Best of luck with the release ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is getting there! A few things to tidy up, but not far away
pdf_tools.go
Outdated
|
||
wg.Add(len(files)) | ||
for indx, p := range files { | ||
go func(idx int, pathFile string, ww *sync.WaitGroup) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the WaitGroup doesn't need to be passed into the goroutine as it's the same pointer for all
pdf_tools.go
Outdated
fileMap map[int]string | ||
}{} | ||
|
||
m.fileMap = make(map[int]string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can move this initialisation into the struct above, no need to separate
pdf_tools.go
Outdated
defer ww.Done() | ||
f, err := os.Open(pathFile) | ||
if err != nil { | ||
log.Println(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail below if .Open
returns an error. The func below continues with no file. Return?
pdf_tools.go
Outdated
return bodyResult, err | ||
} | ||
tmpDir := fmt.Sprintf("%s/", tmp) | ||
defer os.RemoveAll(tmpDir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
os.RemoveAll
returns an error which isn't being handled.
pdf_tools.go
Outdated
@@ -0,0 +1,180 @@ | |||
package docconv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better if all OCR related funcs were moved to the same file and excluded with the build tags. This file has a mix of both
pdf_ocr.go
Outdated
return "", nil, bodyResult.err | ||
} | ||
|
||
return bodyResult.body, nil, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meta is being ignored for OCR based PDFs.
pdf_ocr.go
Outdated
return "", nil, bodyResult.err | ||
} | ||
|
||
return bodyResult.body, nil, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a PDF has an image, the text is totally ignored. This will fail for mixed PDFs. Needs to handle both or explicitly prevent
pdf_tools.go
Outdated
m.fileMap[idx] = out | ||
m.Unlock() | ||
|
||
f.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be a defer
if the file is successfully opened.
pdf_tools.go
Outdated
o := make([]string, len(m.fileMap)) | ||
|
||
for i := 0; i < len(m.fileMap); i++ { | ||
o = append(o, m.fileMap[i]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to create an intermediate slice here. Note: the fileMap itself could also just be a channel to send the body back on, which would also remove the need for the mutex.
for _, str := range m.fileMap {
bodyResult += str + " "
}
Thanks for the feedback @mish15. Here's another iteration.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much better. Couple of minor issues.
pdf_ocr.go
Outdated
go func(pathFile string) { | ||
|
||
f, err := os.Open(pathFile) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove empty line
pdf_ocr.go
Outdated
bodyResult.err = err | ||
} | ||
|
||
wg.Done() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a defer in the beginning of the func. It's not a problem now, but if any early return is added this can cause a deadlock.
pdf_ocr.go
Outdated
bodyResult.err = err | ||
} | ||
|
||
defer f.Close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the file failed to open, you can't close it, this will panic. Need an early return
pdf_ocr.go
Outdated
|
||
wg.Wait() | ||
|
||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need for a goroutine here
@mish15: new iteration:
|
This is based on the work in #19. Many thanks to @marioidival for getting that started.
The objective is to enable this tool to perform character recognition on images within PDFs in addition to its current
pdftotext
capabilities.When the project is built with the
ocr
tag,ConvertPDF
will detect images within the document and invokeConvertImage
on each of them.Note that our gosseract dependency just released a
v2
with a breaking change. In order to preserve the current integration, I've updated the import statement to usegosseract/v1/gosseract
as recommended in their current README.