fix: Set OMP_THREAD_LIMIT for better tesseract performance (#185)

I've spent some time playing with this var, and I came up with [this gist](https://gist.github.com/awalker4/8581d76d373c1bc51e0f2676a6ad816c). I ran this on a 4-core EC2 instance. Processing 3 pages without the limit takes 153s; with the limit, it takes 5s 😍. When the number of pages is higher than the number of cores, it just hangs without this var.
Unstructured-IO · Aug 25, 2023 · 080ccfa
1 parent 9b6aa8e
commit 080ccfa

Showing 3 changed files with 11 additions and 1 deletion.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,7 @@
+## 0.5.17
+
+* Use `OMP_THREAD_LIMIT` to improve tesseract performance
+
 ## 0.5.16
 
 * Fix to no longer create a directory for storing processed images

diff --git a/unstructured_inference/__version__.py b/unstructured_inference/__version__.py
@@ -1 +1 @@
-__version__ = "0.5.16"  # pragma: no cover
+__version__ = "0.5.17"  # pragma: no cover
diff --git a/unstructured_inference/models/tesseract.py b/unstructured_inference/models/tesseract.py
@@ -1,3 +1,4 @@
+import os
 from typing import Dict
 
 import pytesseract
@@ -9,6 +10,11 @@
 
 TesseractError = pytesseract.pytesseract.TesseractError
 
+# Force tesseract to be single threaded,
+# otherwise we see major performance problems
+if "OMP_THREAD_LIMIT" not in os.environ:
+    os.environ["OMP_THREAD_LIMIT"] = "1"
+
 
 def load_agent(languages: str = "eng"):
     """Loads the Tesseract OCR agent as a global variable to ensure that we only load it once.
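
The guard added to `tesseract.py` only sets `OMP_THREAD_LIMIT` when the variable is absent, so a value exported by the caller before the module is imported still wins. Below is a minimal sketch of that behavior, not the project's code: the helper name `ensure_omp_thread_limit` and its `env` and `default` parameters are hypothetical, introduced here so the logic can be exercised on plain dictionaries instead of the real process environment.

```python
import os
from typing import MutableMapping


def ensure_omp_thread_limit(
    env: MutableMapping[str, str] = os.environ,
    default: str = "1",
) -> str:
    """Set OMP_THREAD_LIMIT only if no value has been chosen yet.

    Hypothetical helper: equivalent to the `if ... not in os.environ`
    guard in the diff, written with `setdefault`.
    """
    return env.setdefault("OMP_THREAD_LIMIT", default)


# A value chosen by the caller is left alone.
preset = {"OMP_THREAD_LIMIT": "4"}
# With nothing set, the single-threaded default is applied.
unset = {}

ensure_omp_thread_limit(preset)
ensure_omp_thread_limit(unset)

print(preset["OMP_THREAD_LIMIT"], unset["OMP_THREAD_LIMIT"])  # prints: 4 1
```

Because the assignment happens at import time, anyone who wants a different limit would need to export `OMP_THREAD_LIMIT` before importing `unstructured_inference.models.tesseract`. The setting reaches Tesseract because child processes inherit the parent's environment by default.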