refactor: remove remaining table OCR logic in inference (#302)

### Summary

Remove all OCR related code:

* table OCR code: table structure now requires OCR tokens to be passed in
* parameter `extract_tables`: already moved to `unstructured`, which now decides whether to extract tables and calls the table model
* function `interpret_table_block`: this was a wrapper that called the table model in inference at the block level; the logic has moved to `unstructured`
* PaddleOCR related code and README instructions

### Test

* shouldn't affect anything, since it only removes deprecated logic
* added some tests for coverage
* CCT metrics comparison (no change):

before (main on core product):

```
metric          average    sample_sd    population_sd    count
---------------------------------------------------------------
cct-accuracy    0.665      0.278        0.277            109
cct-%missing    0.094      0.176        0.176            109
```

after (inference checked out to this branch):

```
metric          average    sample_sd    population_sd    count
---------------------------------------------------------------
cct-accuracy    0.665      0.278        0.277            109
cct-%missing    0.094      0.176        0.176            109
```
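
To illustrate the Summary items about OCR tokens and `extract_tables`, here is a minimal sketch of the resulting split of responsibilities. It is not the actual unstructured-inference or `unstructured` API: `predict_table_structure`, `process_element`, `fake_ocr`, and the token dictionary shape are all hypothetical names assumed for the example. The table model no longer runs OCR itself, and the caller decides whether to extract a table at all.

```python
from typing import Callable, Dict, List, Optional

# All names below are hypothetical and exist only for illustration.
OcrToken = Dict[str, object]  # assumed shape: {"text": str, "bbox": [x1, y1, x2, y2]}


def predict_table_structure(image: object, ocr_tokens: Optional[List[OcrToken]]) -> str:
    """Stand-in for a table model that requires OCR tokens from the caller."""
    if ocr_tokens is None:
        raise ValueError("ocr_tokens must be supplied by the caller")
    # Toy rendering: one row with one cell per token.
    cells = "".join(f"<td>{token['text']}</td>" for token in ocr_tokens)
    return f"<table><tr>{cells}</tr></table>"


def process_element(
    element_type: str,
    image: object,
    extract_tables: bool,
    run_ocr: Callable[[object], List[OcrToken]],
) -> Optional[str]:
    """Caller-side logic: decide whether to extract, run OCR, then call the model."""
    if not (extract_tables and element_type == "Table"):
        return None
    return predict_table_structure(image, ocr_tokens=run_ocr(image))


def fake_ocr(image: object) -> List[OcrToken]:
    """Placeholder OCR agent returning fixed tokens."""
    return [
        {"text": "a", "bbox": [0, 0, 1, 1]},
        {"text": "b", "bbox": [1, 0, 2, 1]},
    ]
```

In this sketch the `extract_tables` flag lives entirely on the caller side, which mirrors the move described in the Summary; the model function only validates that tokens were provided.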
Unstructured-IO · Dec 15, 2023 · 4e5c4e6
1 parent d4785df
commit 4e5c4e6

Showing 15 changed files with 585 additions and 340 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,3 +1,7 @@
+## 0.7.19
+
+* refactor: remove all OCR related code
+
 ## 0.7.18
 
 * refactor: remove all image extraction related code

diff --git a/Makefile b/Makefile
@@ -22,7 +22,7 @@ install-base: install-base-pip-packages
 install: install-base-pip-packages install-dev install-detectron2
 
 .PHONY: install-ci
-install-ci: install-base-pip-packages install-test install-paddleocr
+install-ci: install-base-pip-packages install-test
 
 .PHONY: install-base-pip-packages
 install-base-pip-packages:
@@ -32,12 +32,6 @@ install-base-pip-packages:
 install-detectron2:
 	pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@57bdb21249d5418c130d54e2ebdc94dda7a4c01a"
 
-.PHONY: install-paddleocr
-install-paddleocr:
-	pip install --no-cache-dir paddlepaddle
-	pip install --no-cache-dir paddlepaddle-gpu
-	pip install --no-cache-dir "unstructured.PaddleOCR"
-
 .PHONY: install-test
 install-test: install-base
 	pip install -r requirements/test.txt

diff --git a/README.md b/README.md
@@ -34,24 +34,6 @@ Windows is not officially supported by Detectron2, but some users are able to in
 See discussion [here](https://layout-parser.github.io/tutorials/installation#for-windows-users) for 
 tips on installing Detectron2 on Windows.
 
-### PaddleOCR
-
-[PaddleOCR](https://github.com/Unstructured-IO/unstructured.PaddleOCR) is suggested for table processing. Please set
-environment variable `TABLE_OCR`
-to `paddle` if you wish to use paddle for table processing instead of default `tesseract`.
-
-PaddleOCR may be with installed with:
-
-```shell
-pip install paddlepaddle
-pip install "unstructured.PaddleOCR"
-```
-
-We suggest that you install paddlepaddle-gpu with `pip install paddepaddle-gpu` if you have gpu devices available for better OCR performance.
-
-Please note that **paddlepaddle does not work on MacOS with Apple Silicon**. So if you want it running on Apple M1/M2 chip, we have a custom wheel of paddlepaddle for aarch64 architecture, you can install it with `pip install unstructured.paddlepaddle`, and run it inside a docker container.
-
-
 ### Repository
 
 To install the repository for development, clone the repo and run `make install` to install dependencies.

diff --git a/test_unstructured_inference/inference/test_layout.py b/test_unstructured_inference/inference/test_layout.py
@@ -215,13 +215,11 @@ def __init__(
         number=1,
         image=None,
         model=None,
-        extract_tables=False,
         detection_model=None,
     ):
         self.image = image
         self.layout = layout
         self.model = model
-        self.extract_tables = extract_tables
         self.number = number
         self.detection_model = detection_model
 
@@ -596,7 +594,6 @@ def test_process_file_with_model_routing(monkeypatch, model_type, is_detection_m
             detection_model=detection_model,
             element_extraction_model=element_extraction_model,
             fixed_layouts=None,
-            extract_tables=False,
             pdf_image_dpi=200,
         )
 

diff --git a/test_unstructured_inference/inference/test_layout_element.py b/test_unstructured_inference/inference/test_layout_element.py
@@ -1,33 +1,21 @@
-import pytest
 from layoutparser.elements import TextBlock
 from layoutparser.elements.layout_elements import Rectangle as LPRectangle
 
 from unstructured_inference.constants import Source
 from unstructured_inference.inference.layoutelement import LayoutElement, TextRegion
 
 
-@pytest.mark.parametrize("is_table", [False, True])
 def test_layout_element_extract_text(
     mock_layout_element,
     mock_text_region,
-    mock_pil_image,
-    is_table,
 ):
-    if is_table:
-        mock_layout_element.type = "Table"
-
     extracted_text = mock_layout_element.extract_text(
         objects=[mock_text_region],
-        image=mock_pil_image,
-        extract_tables=True,
     )
 
     assert isinstance(extracted_text, str)
     assert "Sample text" in extracted_text
 
-    if mock_layout_element.type == "Table":
-        assert hasattr(mock_layout_element, "text_as_html")
-
 
 def test_layout_element_do_dict(mock_layout_element):
     expected = {