Lots of whitespaces in between words #1067
fintech132
started this conversation in
Ask for help with specific PDFs
Replies: 2 comments 2 replies
-
This PDF uses character sizes that are unusually large:

    import pdfplumber
    from collections import Counter

    pdf = pdfplumber.open("./SampleOutput.pdf")
    page = pdf.pages[0]
    print(Counter(sorted(c["size"] for c in page.chars)))

Result:

For that reason, you'll need to use

    page.extract_text(
        layout=True,
        x_density=75,
        y_density=90,
        y_tolerance=10
    )

... produces this:
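To illustrate why large characters combined with a small `x_density` can lead to wide gaps, here is a simplified model of grid-based layout. It is not pdfplumber's actual implementation; the helper `render_line` and the word coordinates are hypothetical. It assumes each output column stands for `x_density` points and each word is padded out to the column of its left edge.

```python
def render_line(words, x_density):
    """Place (x0, text) pairs on a character grid.

    Simplified model: a word whose left edge is at x0 points is padded
    out to column round(x0 / x_density), with at least one space
    between consecutive words.
    """
    out = ""
    for x0, text in sorted(words):
        col = round(x0 / x_density)
        min_pad = 1 if out else 0
        out += " " * max(min_pad, col - len(out)) + text
    return out


# Hypothetical left edges (in points) for words set in a very large font.
words = [(0, "This"), (200, "is"), (320, "a")]

# A small density maps the large x-offsets to far-apart columns.
print(repr(render_line(words, 7.25)))

# A density closer to the glyph width keeps the words adjacent.
print(repr(render_line(words, 40)))  # 'This is a'
```

Under this model, raising the density values shrinks the number of columns between words, which is consistent with the larger `x_density` and `y_density` suggested above.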
-
Thank you very much. But is it possible to set x_density and y_density dynamically according to the font size?
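One way this might be approached, as a rough sketch rather than a tested recipe: take the median character size on the page and scale the density values proportionally. The function names below are hypothetical. The values 7.25 and 13 are, to my knowledge, the documented defaults for `x_density` and `y_density`; the 10-point baseline is purely an assumption and would likely need tuning per document.

```python
from statistics import median

# Documented pdfplumber defaults for extract_text (to my knowledge).
DEFAULT_X_DENSITY = 7.25
DEFAULT_Y_DENSITY = 13
# Assumption: the defaults suit text of roughly this size, in points.
ASSUMED_BASELINE_SIZE = 10.0


def estimate_densities(char_sizes, baseline=ASSUMED_BASELINE_SIZE):
    """Scale the default densities by median font size / baseline."""
    sizes = [s for s in char_sizes if s > 0]
    if not sizes:
        # No usable characters: fall back to the defaults.
        return DEFAULT_X_DENSITY, DEFAULT_Y_DENSITY
    scale = median(sizes) / baseline
    return DEFAULT_X_DENSITY * scale, DEFAULT_Y_DENSITY * scale


def extract_text_scaled(page):
    """Run layout extraction on a pdfplumber page with scaled densities."""
    x_density, y_density = estimate_densities(c["size"] for c in page.chars)
    return page.extract_text(
        layout=True, x_density=x_density, y_density=y_density
    )
```

It could then be called as `extract_text_scaled(pdf.pages[0])`. A page that mixes very different font sizes may still need manual values, since a single median cannot fit every line.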
-
Describe the bug
A clear and concise description of what the bug is.
When I try to extract text from the PDF file, there are lots of empty spaces in between words.
Code to reproduce the problem
```python
# Utility to test pdfplumber
import sys

import pdfplumber


def pdf_to_text(input_path, output_path):
    page1_text = ''
    with pdfplumber.open(input_path) as pdf:
        # Deal with multiple pages
        for i, p in enumerate(pdf.pages, start=1):
            # Append (rather than overwrite) so earlier pages are kept
            page1_text += '\n\npage ' + str(i) + '\n'
            tmp = p.dedupe_chars().extract_text(
                x_tolerance=3, y_tolerance=3,
                x_density=7.25, y_density=7.25, layout=True,
            ).split('\n')
            page1_text += str(tmp)
    # Write the collected text to the output file
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(page1_text)


if len(sys.argv) != 3:
    print("Usage: python script.py <input_path> <output_path>")
    sys.exit(1)
pdf_to_text(sys.argv[1], sys.argv[2])
```
PDF file
Please attach any PDFs necessary to reproduce the problem.
See attached pdf file
SampleOutput.pdf
Expected behavior
What did you expect the result should have been?
The sentence should be extracted as the following:
This is a sample employment application form. Please carefully read
This issue affects all of the extracted text; I am pointing out just one sentence as an example.
Actual behavior
What actually happened, instead?
The output text is (note there are lots of whitespaces):
This is a sample employment application form. Please carefully read
Screenshots
If applicable, add screenshots to help explain your problem.
Environment