Extracting text through dict gets rid of Spaces #978
Hello, I have a small problem with text extraction through a dict. If I extract the text with the basic `getText()` method,
I get output strings such as "Comparison of Machine Learning algorithms". If I use dictionaries instead,
the same text becomes "ComparisonofMachineLearningalgorithms". I need to use dicts because I extract text based on rectangles to get the right reading order (which has worked decently well so far). Is there a way to resolve this? Note that this doesn't happen on all PDF files, only on roughly 25% of them. Some other files have no spaces at all even with the basic `getText()` method, but in those cases I think the problem comes from the PDF file itself, since I also get no spaces when I copy and paste the content from Acrobat.
Replies: 2 comments 2 replies
Yes, PDF offers the option to take control of the distance between any two characters - independent of what the chosen font has to say about a character's width. This is widely used by PDF creators - unfortunately, in my opinion.
So a parser (like PDF readers, and also MuPDF) has to apply some heuristics to make sense of all that.
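To make the idea of such a heuristic concrete, here is a minimal sketch that inserts a space whenever the horizontal gap between two consecutive character boxes exceeds a fraction of the font size. This is only an illustration and not MuPDF's actual algorithm. The function name `rebuild_text` and the `gap_ratio` threshold are hypothetical, and the input shape (a `"c"` character plus a `"bbox"` tuple) is loosely modelled on the per-character entries of PyMuPDF's `"rawdict"` output.

```python
def rebuild_text(chars, size, gap_ratio=0.2):
    """Rebuild a line of text, guessing spaces from horizontal gaps.

    chars: list of {"c": str, "bbox": (x0, y0, x1, y1)} in reading order.
    size: font size of the span, used to scale the gap threshold.
    gap_ratio: hypothetical tuning knob; a gap wider than gap_ratio * size
               is treated as a word boundary.
    """
    out = []
    prev_x1 = None
    for ch in chars:
        x0, _, x1, _ = ch["bbox"]
        gap_is_wide = prev_x1 is not None and (x0 - prev_x1) > gap_ratio * size
        # Do not double up if the PDF already contains a real space here.
        if gap_is_wide and ch["c"] != " " and out[-1] != " ":
            out.append(" ")
        out.append(ch["c"])
        prev_x1 = x1
    return "".join(out)


# Three glyphs: "a" and "b" nearly touch, "c" sits 5 units further right.
sample = [
    {"c": "a", "bbox": (0.0, 0.0, 5.0, 10.0)},
    {"c": "b", "bbox": (5.5, 0.0, 10.0, 10.0)},
    {"c": "c", "bbox": (15.0, 0.0, 20.0, 10.0)},
]
print(rebuild_text(sample, size=10.0))
```

Any threshold of this kind is a guess: set it too low and letters of tightly kerned words get split, set it too high and words run together, which is the effect described in the question.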
You need to play with the `flags` argument.
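The original snippet for this reply did not survive, so the following is only a sketch of what experimenting with `flags` might look like, not the code that was posted. It assumes a recent PyMuPDF where the method is named `get_text` (older releases spell it `getText`, as in the question). It also assumes that the relevant bit is `TEXT_INHIBIT_SPACES`, which it makes sure is cleared. The helper names `spans_to_lines` and `extract_lines` are hypothetical.

```python
def spans_to_lines(page_dict):
    """Join the span texts of each line in a get_text("dict") result."""
    lines = []
    for block in page_dict.get("blocks", []):
        # Image blocks have no "lines" key, so they are skipped here.
        for line in block.get("lines", []):
            lines.append("".join(span["text"] for span in line["spans"]))
    return lines


def extract_lines(pdf_path, page_number=0):
    """Extract one page as text lines via "dict" output with explicit flags."""
    import fitz  # PyMuPDF; imported lazily so spans_to_lines works without it

    # Assumption: keep ligatures and whitespace, and explicitly clear
    # TEXT_INHIBIT_SPACES so MuPDF may add spaces where glyphs are far apart.
    flags = fitz.TEXT_PRESERVE_LIGATURES | fitz.TEXT_PRESERVE_WHITESPACE
    flags &= ~fitz.TEXT_INHIBIT_SPACES
    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        return spans_to_lines(page.get_text("dict", flags=flags))
```

Calling `extract_lines("some.pdf")` with a few different flag combinations and comparing the results against the plain-text output is probably the quickest way to see which setting affects a given file.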