how to increase processing speed of tesseract OCR? #160

svijayakumar1 · 2019-08-14T21:00:53Z

Hi Quan,

Hope you're doing well. I have developed a Tesseract OCR application in Spring Boot. It must process 600,000 scanned PDFs. I am currently using Tess4J 4.4.0, and it takes 1 hour to process 275 PDFs, which works out to 6,600 PDFs per day. Could you suggest how to increase the processing speed of Tesseract OCR so that the scanning can be completed? I need to finish this task as soon as possible. Please help.
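For scale, the figures quoted above can be turned into a rough completion estimate. The sketch below is plain arithmetic; the function name is made up for illustration, and it assumes the measured rate of 275 PDFs per hour holds around the clock.

```kotlin
// Hypothetical helper: days needed to process `total` documents
// at a steady rate of `perHour` documents, running 24 hours a day.
fun daysToFinish(total: Int, perHour: Int): Double {
    val perDay = perHour * 24          // 275 * 24 = 6600
    return total.toDouble() / perDay   // 600000 / 6600 is about 90.9
}

fun main() {
    println("%.1f days".format(daysToFinish(600_000, 275)))
}
```

At the current rate, the full set would need about 91 days of continuous processing.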

nguyenq · 2019-08-16T00:44:07Z

More CPU cores, more RAM, multi-threading? Keep an instance of Tesseract engine to process several images instead of repeatedly instantiating for each image. Use GS to convert PDFs for speed.

Other users/developers, please chime in.
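To illustrate the point about keeping one engine instance, here is a minimal sketch. `OcrEngine` and `ocrAll` are hypothetical names, not part of Tess4J; the interface stands in for an engine such as Tess4J's `Tesseract` so that the pattern can be shown without the library on the classpath.

```kotlin
import java.io.File

// Hypothetical stand-in for an OCR engine. With Tess4J, an implementation
// could wrap a single Tesseract object and delegate to its doOCR(File) method.
fun interface OcrEngine {
    fun recognize(image: File): String
}

// Creates the engine once and reuses it for every image, instead of
// paying the setup cost again for each file.
fun ocrAll(images: List<File>, createEngine: () -> OcrEngine): Map<File, String> {
    val engine = createEngine()  // one instantiation for the whole batch
    return images.associateWith { engine.recognize(it) }
}
```

With Tess4J on the classpath, a call might look like `ocrAll(files) { OcrEngine { f -> tesseract.doOCR(f) } }`, where `tesseract` is an already configured engine instance.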

nguyenq · 2019-10-09T00:12:23Z

New release 4.4.1 bundles tessdata_fast data, which significantly cuts down processing time.

ChristianSchwarz · 2020-01-20T15:51:26Z

@svijayakumar1
You can hack PDFBox so that it renders the pages in parallel into an array of BufferedImages. Then you can OCR the pages in parallel (1 page per core). This reduced the OCR time dramatically for me.

Yogeshmsharma-architect · 2021-04-07T19:16:12Z

More CPU cores, more RAM, multi-threading? Keep an instance of Tesseract engine to process several images instead of repeatedly instantiating for each image. Use GS to convert PDFs for speed.

Other users/developers, please chime in.

Quan,
"Keep an instance of Tesseract engine to process": are you suggesting that I avoid calling new Tesseract1() for each image, or do you mean something else?
"Use GS to convert PDFs": I have tried this, but it takes more time. I am splitting the PDF into single pages using PDFBox and then sending them for processing. Does that sound good, or would you still suggest using GS?

nguyenq · 2021-04-25T17:19:49Z

@Yogeshmsharma-architect Yes, setting up and shutting down the OCR engine for each image can take a significant amount of time. If you can send in a list of images to be processed all at once, it could help. There's an overload of doOCR that accepts a List<IIOImage> as input that you can use.

Or you can extend Tesseract or Tesseract1, or come up with an alternative implementation, so that it accepts a list of files or buffered images. Those classes are just applications of the base TessAPI classes.

If PDFBox is faster than GS for you, then, by all means, stick with it. Our own experience showed that GS has generally been faster.

ChristianSchwarz · 2021-04-26T08:35:40Z

Here is what I did: extract the pages from the PDF in parallel, one page per core, then pass every page image to the callback onImageExtracted for further processing. Note: you should not use more threads than cores, otherwise the whole process will get slower rather than faster; see Executors.newFixedThreadPool(...). This sped up the image extraction by a factor of about 7 for me.

The following sample is written in Kotlin:

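    import java.awt.image.BufferedImage
    import java.io.File
    import java.io.IOException
    import java.util.concurrent.Executors
    import java.util.concurrent.TimeUnit
    import org.apache.pdfbox.pdmodel.PDDocument
    import org.apache.pdfbox.rendering.ImageType
    import org.apache.pdfbox.rendering.PDFRenderer
    import org.slf4j.LoggerFactory

    // Assumption: an SLF4J logger; the original sample does not show how its logger is created.
    private val logger = LoggerFactory.getLogger("PdfToImages")

    /**
     * Renders every page of a PDF to a BufferedImage in parallel and hands each
     * image to [onImageExtracted], which is invoked on the worker threads.
     */
    @Throws(IOException::class)
    fun convertPdfToBufferedImages(inputPdfFile: File, onImageExtracted: (image: BufferedImage, pageIndex: Int) -> Unit) {
        // One thread per core; more threads than cores makes the process slower.
        val executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors())
        try {
            PDDocument.load(inputPdfFile).use { document ->
                val pdfRenderer = PDFRenderer(document)
                for (pageIndex in 0 until document.numberOfPages) {
                    executor.submit {
                        try {
                            val pageImage = pdfRenderer.renderImageWithDPI(pageIndex, 300f, ImageType.GRAY)
                            onImageExtracted(pageImage, pageIndex)
                        } catch (e: IOException) {
                            logger.error("Error extracting PDF document pageIndex $pageIndex", e)
                        }
                    }
                }
                // Wait for all pages to finish before the document is closed.
                executor.shutdown()
                executor.awaitTermination(5, TimeUnit.HOURS)
            }
        } finally {
            // Ensures the pool is released even if loading the PDF fails.
            executor.shutdownNow()
        }
    }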
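The sample above leaves the OCR step to the callback. One possible way to arrange that step is sketched below: a pool sized to the number of cores, with a separate engine per worker thread. This is not taken from the thread; `PageOcr` and `ocrPagesInParallel` are hypothetical names, the interface stands in for a real engine such as a Tess4J `Tesseract` object, and the sketch assumes that an engine instance should not be shared between threads.

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.Executors

// Hypothetical stand-in for an OCR engine; with Tess4J this could wrap
// one Tesseract object per worker thread.
fun interface PageOcr {
    fun recognize(pageIndex: Int): String
}

// Runs OCR for pages 0 until pageCount on a pool sized to the CPU cores.
// Each worker thread lazily creates its own engine via ThreadLocal,
// so no engine is shared between threads.
fun ocrPagesInParallel(pageCount: Int, createEngine: () -> PageOcr): List<String> {
    val threads = Runtime.getRuntime().availableProcessors()
    val executor = Executors.newFixedThreadPool(threads)
    val engine = ThreadLocal.withInitial { createEngine() }
    try {
        val tasks = (0 until pageCount).map { page ->
            Callable { engine.get().recognize(page) }
        }
        // invokeAll blocks until every task is done and returns results in task order.
        return executor.invokeAll(tasks).map { it.get() }
    } finally {
        executor.shutdown()
    }
}
```

Because the engines live in a ThreadLocal, at most one engine is created per pool thread, no matter how many pages are submitted.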

alisdev · 2024-05-22T15:06:06Z

@ChristianSchwarz Thanks for the example! Do you then call a single instance of Tess4J, or do you also have 8 instances in a pool?

Repository owner deleted a comment from Yogeshmsharma-architect Apr 25, 2021
