page.get_text("blocks") returns an empty list of blocks for the attached PDFs. Could you please help me? #4213

What looks like text is not text! Each page contains small vector graphics, one per character. You can picture the approach like this:
To draw a capital "A", the file draws the lines "/", "-" and "\" to form "/-\". Characters with curved lines are drawn in a similar way. Because the page holds only drawing commands and no character data, there is nothing for page.get_text("blocks") to return.
The only way to access the text is OCR.
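If you want to confirm this diagnosis on other files, a rough check is to look for pages that yield no extractable text but contain many vector drawings. The sketch below is only a heuristic: the helper name `looks_like_vector_text` and the `min_drawings` threshold of 50 are assumptions made for illustration and are not part of PyMuPDF. It relies only on `page.get_text()` and `page.get_drawings()`.

```python
def looks_like_vector_text(page, min_drawings=50):
    """Heuristic: True if the page has no extractable text but many drawings.

    `page` is expected to be a pymupdf.Page. The threshold is an arbitrary
    assumption; tune it for your own documents.
    """
    # Plain text extraction returns an empty (or whitespace-only) string
    # when the page contains no real character data.
    has_text = bool(page.get_text("text").strip())
    if has_text:
        return False
    # get_drawings() returns one dict per vector path found on the page.
    return len(page.get_drawings()) >= min_drawings


# Hypothetical usage (requires PyMuPDF to be installed):
#   import pymupdf
#   doc = pymupdf.open("Alcohol.Withdrawal.1-5-2025.pdf")
#   print(looks_like_vector_text(doc[0]))
```

A page that passes this check is a candidate for OCR, but a page consisting of charts or line art would pass it too, so treat the result as a hint rather than proof.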

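import pymupdf

doc = pymupdf.open("Alcohol.Withdrawal.1-5-2025.pdf")
page = doc[0]
# OCR the full page, rendered at 150 dpi (requires a Tesseract installation)
tp = page.get_textpage_ocr(dpi=150, full=True)
# Extract text from the OCR text page, sorted top-to-bottom, left-to-right
print(page.get_text(textpage=tp, sort=True))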
eis Advocate Health Care  | © Aurora Health Care

Understanding Alcohol Withdrawal

 Alcohol affects your brain and body. When you stop drinking alcohol after regular or heavy
 drinking, changes happen in your body. T…
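Since the original question was about `page.get_text("blocks")`, the same OCR text page can be passed to that call as well. The following is a minimal sketch, assuming you want to try normal extraction first and fall back to OCR only when no text blocks come back. The function name `get_blocks_with_ocr_fallback` is hypothetical, and the `dpi=150` default is simply copied from the snippet above.

```python
def get_blocks_with_ocr_fallback(page, dpi=150):
    """Return text blocks for a pymupdf.Page, using OCR only if needed."""
    blocks = page.get_text("blocks", sort=True)
    # Each block is a tuple:
    # (x0, y0, x1, y1, text, block_no, block_type); block_type 0 means text.
    if any(b[6] == 0 and b[4].strip() for b in blocks):
        return blocks
    # No usable text found: OCR the full page and extract blocks from
    # the resulting text page. This step requires a Tesseract installation.
    tp = page.get_textpage_ocr(dpi=dpi, full=True)
    return page.get_text("blocks", textpage=tp, sort=True)


# Hypothetical usage (requires PyMuPDF to be installed):
#   import pymupdf
#   doc = pymupdf.open("Alcohol.Withdrawal.1-5-2025.pdf")
#   for block in get_blocks_with_ocr_fallback(doc[0]):
#       print(block[4])
```

OCR is much slower than normal extraction, which is why this sketch runs it only for pages where the regular call returns nothing usable.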

Answer selected by krish-tech02