page.get_text("blocks") on the attached PDFs returns an empty array of blocks, could you please help me? #4213
krish-tech02 asked this question in Looking for help. Answered by JorjMcKie.
@JorjMcKie I am trying to read the text of the attached PDFs with code that extracts the text blocks via page.get_text("blocks") and processes them, but the output is empty. Could you please help?
Answered by JorjMcKie on Jan 8, 2025:
What looks like text is no text! The pages show little vector graphics, one for each character, so there are no text blocks for page.get_text("blocks") to return. Running OCR on the page does recover the text:

```python
import pymupdf

doc = pymupdf.open("Alcohol.Withdrawal.1-5-2025.pdf")
page = doc[0]
# OCR the full page (rendered at 150 dpi), then extract text from the OCR result
tp = page.get_textpage_ocr(dpi=150, full=True)
print(page.get_text(textpage=tp, sort=True))
```

Output:

```
eis Advocate Health Care | © Aurora Health Care
Understanding Alcohol Withdrawal
Alcohol affects your brain and body. When you stop drinking alcohol after regular or heavy
drinking, changes happen in your body. This can lead to withdrawal symptoms.
Quitting alcohol may be tough. There is supportto help you.
...
```
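If only some pages in a document are affected, one option is to try normal extraction first and run OCR only when it comes back empty. The following is a minimal sketch of that idea, not part of the original answer: the function names `text_with_ocr_fallback` and `extract_document` and the `min_chars` threshold are hypothetical, and it assumes Tesseract language data is installed so that `get_textpage_ocr` can run.

```python
def text_with_ocr_fallback(page, min_chars=1, dpi=150):
    """Return the page's embedded text, or OCR text if (almost) none exists.

    `page` is expected to be a pymupdf.Page. `min_chars` is an assumed,
    tunable threshold for "this page has real text".
    """
    text = page.get_text(sort=True)
    if len(text.strip()) >= min_chars:
        return text
    # No usable embedded text: OCR the full page and read from that textpage.
    tp = page.get_textpage_ocr(dpi=dpi, full=True)
    return page.get_text(textpage=tp, sort=True)


def extract_document(path):
    """Hypothetical driver: apply the fallback to every page of a PDF."""
    import pymupdf  # imported here so the helper above stays dependency-free

    with pymupdf.open(path) as doc:
        return [text_with_ocr_fallback(page) for page in doc]
```

OCR is much slower than plain extraction, so limiting it to pages that need it keeps mixed documents fast to process.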
Answer selected by
krish-tech02
You can imagine the drawing approach like this: to draw the capital letter "A", draw the lines "/", "-", "\" to achieve "/-\". The same goes for any character with curved lines ... you get the argument. The only way to access the text is using OCR.
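Whether a page falls into this category can be checked before reaching for OCR by comparing its text blocks with its vector drawings. The sketch below is a rough heuristic under stated assumptions: `diagnose_page`, its return labels, and the `min_drawings` threshold are hypothetical names and values chosen for illustration. It relies on `page.get_text("blocks")` returning tuples of the form `(x0, y0, x1, y1, text, block_no, block_type)` with `block_type` 0 for text, and on `page.get_drawings()` returning the list of vector paths on the page.

```python
def diagnose_page(page, min_drawings=50):
    """Roughly classify what a pymupdf.Page contains (heuristic sketch).

    Returns "text" if real text blocks exist, "vector-only" if there is no
    text but many vector paths (possibly glyphs drawn as graphics),
    otherwise "empty". `min_drawings` is an assumed threshold.
    """
    blocks = page.get_text("blocks")
    # Keep only text blocks (block_type == 0) that contain visible characters.
    text_blocks = [b for b in blocks if b[6] == 0 and b[4].strip()]
    if text_blocks:
        return "text"
    if len(page.get_drawings()) >= min_drawings:
        return "vector-only"
    return "empty"
```

A "vector-only" result on a page that visibly shows text suggests the characters were drawn as graphics, in which case OCR is the way to get at them.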