refactor: remove remaining table OCR logic in inference (#302)
### Summary

Remove all OCR-related code:
* table OCR code -> OCR tokens must now be passed in for table structure recognition
* parameter `extract_tables` -> already moved to `unstructured`, which decides whether to extract tables and calls the table model
* function `interpret_table_block` -> this was a wrapper that called the table model in inference at the block level; the logic has moved to `unstructured`
* PaddleOCR-related code and README instructions
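
The contract described in the first bullet (the caller runs OCR and hands the tokens to the table-structure step) can be sketched roughly as follows. This is a hypothetical illustration only: `OcrToken`, `recognize_table_structure`, and the row-grouping heuristic are assumptions made for this example and are not the `unstructured-inference` API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class OcrToken:
    """Hypothetical OCR token: text plus its (x1, y1, x2, y2) bounding box."""

    text: str
    bbox: Tuple[float, float, float, float]


def recognize_table_structure(
    ocr_tokens: Optional[List[OcrToken]],
    row_tolerance: float = 5.0,
) -> List[List[str]]:
    """Group caller-supplied tokens into rows by vertical position.

    Stand-in for a table-structure model: it never runs OCR itself and
    fails loudly when the caller did not supply tokens.
    """
    if ocr_tokens is None:
        raise ValueError("ocr_tokens must be provided by the caller")

    rows: List[List[OcrToken]] = []
    for token in sorted(ocr_tokens, key=lambda t: (t.bbox[1], t.bbox[0])):
        # Append to the current row when the top edges are close; else start a new row.
        if rows and abs(rows[-1][0].bbox[1] - token.bbox[1]) <= row_tolerance:
            rows[-1].append(token)
        else:
            rows.append([token])
    # Order cells left to right within each row.
    return [[t.text for t in sorted(row, key=lambda t: t.bbox[0])] for row in rows]


# The caller (not the table step) is responsible for producing these tokens.
tokens = [
    OcrToken("Name", (0, 0, 40, 10)),
    OcrToken("Qty", (60, 1, 80, 11)),
    OcrToken("Apple", (0, 20, 40, 30)),
    OcrToken("3", (60, 21, 70, 31)),
]
table = recognize_table_structure(tokens)
```

Raising on missing tokens, rather than silently falling back to an internal OCR call, is one way to make the moved responsibility explicit to callers.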

### Test
* should not affect behavior, since this only removes deprecated logic
* added some tests for coverage
* CCT metrics comparison (no change):

before (main on core product):
```
metric       average sample_sd population_sd count
--------------------------------------------------
cct-accuracy 0.665   0.278     0.277         109  
cct-%missing 0.094   0.176     0.176         109 
```

after (inference checked out to this branch):
```
metric       average sample_sd population_sd count
--------------------------------------------------
cct-accuracy 0.665   0.278     0.277         109  
cct-%missing 0.094   0.176     0.176         109 
```
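
For readers unfamiliar with the columns: `sample_sd` and `population_sd` presumably differ only in the divisor (n - 1 versus n), which would explain why they nearly coincide at a count of 109. A minimal sketch using Python's standard `statistics` module, with made-up scores rather than the real per-document CCT values:

```python
import statistics

# Hypothetical per-document accuracy scores (not the real evaluation data).
scores = [0.2, 0.4, 0.6, 0.8]

average = statistics.mean(scores)
sample_sd = statistics.stdev(scores)       # divides by n - 1
population_sd = statistics.pstdev(scores)  # divides by n
count = len(scores)

print(f"{average:.3f} {sample_sd:.3f} {population_sd:.3f} {count}")
```

With only four values the two deviations differ visibly; as the count grows the gap shrinks toward zero.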
yuming-long authored Dec 15, 2023
1 parent d4785df commit 4e5c4e6
Showing 15 changed files with 585 additions and 340 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,7 @@
## 0.7.19

* refactor: remove all OCR related code

## 0.7.18

* refactor: remove all image extraction related code
8 changes: 1 addition & 7 deletions Makefile
@@ -22,7 +22,7 @@ install-base: install-base-pip-packages
install: install-base-pip-packages install-dev install-detectron2

.PHONY: install-ci
-install-ci: install-base-pip-packages install-test install-paddleocr
+install-ci: install-base-pip-packages install-test

.PHONY: install-base-pip-packages
install-base-pip-packages:
@@ -32,12 +32,6 @@ install-base-pip-packages:
install-detectron2:
pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@57bdb21249d5418c130d54e2ebdc94dda7a4c01a"

-.PHONY: install-paddleocr
-install-paddleocr:
-pip install --no-cache-dir paddlepaddle
-pip install --no-cache-dir paddlepaddle-gpu
-pip install --no-cache-dir "unstructured.PaddleOCR"

.PHONY: install-test
install-test: install-base
pip install -r requirements/test.txt
18 changes: 0 additions & 18 deletions README.md
@@ -34,24 +34,6 @@ Windows is not officially supported by Detectron2, but some users are able to in
See discussion [here](https://layout-parser.github.io/tutorials/installation#for-windows-users) for
tips on installing Detectron2 on Windows.

### PaddleOCR

[PaddleOCR](https://github.com/Unstructured-IO/unstructured.PaddleOCR) is suggested for table processing. Please set
environment variable `TABLE_OCR`
to `paddle` if you wish to use paddle for table processing instead of default `tesseract`.

PaddleOCR may be with installed with:

```shell
pip install paddlepaddle
pip install "unstructured.PaddleOCR"
```

We suggest that you install paddlepaddle-gpu with `pip install paddepaddle-gpu` if you have gpu devices available for better OCR performance.

Please note that **paddlepaddle does not work on MacOS with Apple Silicon**. So if you want it running on Apple M1/M2 chip, we have a custom wheel of paddlepaddle for aarch64 architecture, you can install it with `pip install unstructured.paddlepaddle`, and run it inside a docker container.


### Repository

To install the repository for development, clone the repo and run `make install` to install dependencies.
3 changes: 0 additions & 3 deletions test_unstructured_inference/inference/test_layout.py
@@ -215,13 +215,11 @@ def __init__(
number=1,
image=None,
model=None,
extract_tables=False,
detection_model=None,
):
self.image = image
self.layout = layout
self.model = model
self.extract_tables = extract_tables
self.number = number
self.detection_model = detection_model

@@ -596,7 +594,6 @@ def test_process_file_with_model_routing(monkeypatch, model_type, is_detection_m
detection_model=detection_model,
element_extraction_model=element_extraction_model,
fixed_layouts=None,
extract_tables=False,
pdf_image_dpi=200,
)

12 changes: 0 additions & 12 deletions test_unstructured_inference/inference/test_layout_element.py
@@ -1,33 +1,21 @@
import pytest
from layoutparser.elements import TextBlock
from layoutparser.elements.layout_elements import Rectangle as LPRectangle

from unstructured_inference.constants import Source
from unstructured_inference.inference.layoutelement import LayoutElement, TextRegion


@pytest.mark.parametrize("is_table", [False, True])
def test_layout_element_extract_text(
mock_layout_element,
mock_text_region,
mock_pil_image,
is_table,
):
if is_table:
mock_layout_element.type = "Table"

extracted_text = mock_layout_element.extract_text(
objects=[mock_text_region],
image=mock_pil_image,
extract_tables=True,
)

assert isinstance(extracted_text, str)
assert "Sample text" in extracted_text

if mock_layout_element.type == "Table":
assert hasattr(mock_layout_element, "text_as_html")


def test_layout_element_do_dict(mock_layout_element):
expected = {