Skip to content

Commit

Permalink
Change the text processing to remove empty blocks full of whitespace
Browse files Browse the repository at this point in the history
  • Loading branch information
dmacko232 committed Oct 13, 2023
1 parent 05f91e3 commit 0ca0d9a
Showing 1 changed file with 6 additions and 2 deletions.
8 changes: 6 additions & 2 deletions src/sec_certs/utils/pdf.py
Original file line number Diff line number Diff line change
Expand Up @@ -176,8 +176,12 @@ def segmented_pdf_to_text(segmented_pdf: list[dict[str, Any]]) -> str:
spans = []
for span in line:
spans.append(span.strip())
lines.append(" ".join(spans))
block_texts.append("\n".join(lines)) # lines are separated by "\n"
line = " ".join(spans)
if len(line.strip()) > 0:
lines.append(line)
block_text = "\n".join(lines) # TODO maybe change to " ", depends how we wanna view it
if len(block_text.strip()) > 0:
block_texts.append(block_text) # lines are separated by "\n"
# deal with table which has header and rows
elif block["type"] == "table":
row_texts = []
Expand Down

0 comments on commit 0ca0d9a

Please sign in to comment.