Commit
fix: Set OMP_THREAD_LIMIT for better tesseract performance (#185)
I spent some time experimenting with this variable and put together [this
gist](https://gist.github.com/awalker4/8581d76d373c1bc51e0f2676a6ad816c).
I ran it on a 4-core EC2 instance. Processing 3 pages takes 153s without
the limit and 5s with it 😍. When the number of pages is higher than the
number of cores, processing just hangs without this variable.
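
For illustration, here is a minimal sketch of the mechanism this change relies on: a child process inherits the parent's environment, so setting `OMP_THREAD_LIMIT` before the OCR binary is launched is enough to cap its OpenMP threads. The helper names `env_with_thread_limit` and `child_sees_limit` are hypothetical and not part of this commit, and the child here is a plain Python process standing in for the tesseract binary.

```python
import os
import subprocess
import sys
from typing import Dict, Optional


def env_with_thread_limit(limit: int = 1, base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and set OMP_THREAD_LIMIT unless it is already set.

    Hypothetical helper: mirrors the "only set it if the user has not" check
    in this commit, but on a copy rather than on os.environ itself.
    """
    env = dict(os.environ if base is None else base)
    env.setdefault("OMP_THREAD_LIMIT", str(limit))
    return env


def child_sees_limit(env: Dict[str, str]) -> str:
    """Spawn a child Python process and return the OMP_THREAD_LIMIT it inherited."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ.get('OMP_THREAD_LIMIT', ''))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # An existing user-provided value is respected; otherwise the default applies.
    print(child_sees_limit(env_with_thread_limit()))
```

Because the value is only a default, anyone who wants a different limit can still export `OMP_THREAD_LIMIT` before starting the process.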
awalker4 authored Aug 25, 2023
1 parent 9b6aa8e commit 080ccfa
Showing 3 changed files with 11 additions and 1 deletion.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,7 @@
## 0.5.17

* Use `OMP_THREAD_LIMIT` to improve tesseract performance

## 0.5.16

* Fix to no longer create a directory for storing processed images
2 changes: 1 addition & 1 deletion unstructured_inference/__version__.py
@@ -1 +1 @@
-__version__ = "0.5.16" # pragma: no cover
+__version__ = "0.5.17" # pragma: no cover
6 changes: 6 additions & 0 deletions unstructured_inference/models/tesseract.py
@@ -1,3 +1,4 @@
import os
from typing import Dict

import pytesseract
@@ -9,6 +10,11 @@

TesseractError = pytesseract.pytesseract.TesseractError

# Force tesseract to be single threaded,
# otherwise we see major performance problems
if "OMP_THREAD_LIMIT" not in os.environ:
    os.environ["OMP_THREAD_LIMIT"] = "1"


def load_agent(languages: str = "eng"):
"""Loads the Tesseract OCR agent as a global variable to ensure that we only load it once.