Extracting text through dict gets rid of Spaces #978

tristancatteeuw · 2021-03-29T11:00:42Z

tristancatteeuw
Mar 29, 2021

Hello,

I have a little problem regarding the text extraction through a dict.

If I simply write the following piece of code

import fitz  # PyMuPDF

doc = fitz.open(path)  # path: location of the PDF file
for page in doc:
    print(page.getText())  # plain-text extraction

As output I get a sequence of strings such as "Comparison of Machine Learning algorithms".

If I now try to use dictionaries instead, like this:

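import fitz  # PyMuPDF

doc = fitz.open(path)  # path: location of the PDF file
for page in doc:
    blocks = page.getText("dict", flags=11)["blocks"]
    # order blocks by left edge (x0), then by bottom edge (y1)
    blocks.sort(key=lambda block: (block["bbox"][0], block["bbox"][3]))
    for b in blocks:
        for l in b["lines"]:
            for s in l["spans"]:
                # skip spans that contain no letters or digits
                if any((a.isdigit() or a.isalpha()) for a in s["text"]):
                    print(s["text"])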

With this, the text becomes "ComparisonofMachineLearningalgorithms".

I need to use dicts because I extract text based on rectangles to try to get the right reading order (which has worked decently well so far). Is there a way to resolve this? Note that this doesn't happen with all PDF files, only with roughly 25% of them. Some others have no spaces at all even with the basic getText() method, but in those cases I think the problem comes from the PDF file itself, as I also get no spaces when I copy and paste the content from Acrobat.

JorjMcKie · 2021-03-29T11:52:04Z

JorjMcKie
Mar 29, 2021
Maintainer

You need to play with the flags argument:

11 = 8 + 2 + 1
fitz.TEXT_INHIBIT_SPACES = 8  # <==  this is the problem!
fitz.TEXT_PRESERVE_WHITESPACE = 2
fitz.TEXT_PRESERVE_LIGATURES = 1
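
To make the arithmetic above concrete, here is a minimal sketch of how the value 11 breaks down into bits and how the space-inhibiting bit can be cleared. The three constants are copied from the values quoted above so that the snippet runs without PyMuPDF installed; `describe_flags` and `without_inhibit_spaces` are hypothetical helper names used only for illustration, not part of the library.

```python
# Flag values as quoted in the answer above, mirrored locally
# (assumption: they match the constants exposed by the fitz module).
TEXT_PRESERVE_LIGATURES = 1
TEXT_PRESERVE_WHITESPACE = 2
TEXT_INHIBIT_SPACES = 8

FLAG_NAMES = {
    TEXT_PRESERVE_LIGATURES: "TEXT_PRESERVE_LIGATURES",
    TEXT_PRESERVE_WHITESPACE: "TEXT_PRESERVE_WHITESPACE",
    TEXT_INHIBIT_SPACES: "TEXT_INHIBIT_SPACES",
}


def describe_flags(flags):
    """Return the names of the known bits set in `flags` (hypothetical helper)."""
    return [name for bit, name in sorted(FLAG_NAMES.items()) if flags & bit]


def without_inhibit_spaces(flags):
    """Clear the TEXT_INHIBIT_SPACES bit and leave all other bits untouched."""
    return flags & ~TEXT_INHIBIT_SPACES


print(describe_flags(11))          # all three names: 11 = 1 + 2 + 8
print(without_inhibit_spaces(11))  # 3
```

With the real library, the resulting value would then be passed along the lines of page.getText("dict", flags=3), or equivalently fitz.TEXT_PRESERVE_LIGATURES | fitz.TEXT_PRESERVE_WHITESPACE, so that spaces are no longer suppressed.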

1 reply

tristancatteeuw Mar 29, 2021
Author

Oh thank you, I found the 11 in the documentation but didn't know what it stood for!

JorjMcKie · 2021-03-29T12:15:40Z

JorjMcKie
Mar 29, 2021
Maintainer

> I need to use dicts because I extract text based on rectangles to try to get the right reading order (which has worked decently well so far). Is there a way to resolve this? Note that this doesn't happen with all PDF files, only with roughly 25% of them. Some others have no spaces at all even with the basic getText() method, but in those cases I think the problem comes from the PDF file itself, as I also get no spaces when I copy and paste the content from Acrobat.

Yes, PDF lets a creator take control of the distance between any two characters, independent of what the chosen font has to say about a character's width. This is widely used by PDF creators, unfortunately in my opinion.
So MuPDF offers a set of indicators to influence how text is parsed. For example, in PDF you can

- simply encode "PyMuPDF", or
- explicitly tell PDF to reduce the distance between some of those characters, e.g. between "M" and "u", by inserting a negative displacement before the "u", or
- provide that text as single characters "P", "y", "M", ..., each with its own coordinates.
  BTW: in this last case, you can even specify the characters in any of the 7! (= 5040) different sequences, and more if you do that all over the complete page. A PDF viewer will still display the text just fine, but any copy / paste (and PyMuPDF text extraction!) will deliver nonsense. This can be used to protect against unwanted copies ...
- ... and more of this crab

So, a parser (like PDF readers, and also MuPDF) has to provide some heuristics to make sense out of all that.
This is what you are experiencing ...
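
To illustrate what such a heuristic might look like, here is a toy sketch. It is not MuPDF's actual algorithm: `join_chars`, the `(char, x0, x1)` tuple layout, and the `gap_ratio` threshold are all assumptions made for this example. Characters are sorted by their left edge, and a space is generated whenever the horizontal gap to the previous character is large relative to the average character width.

```python
def join_chars(chars, gap_ratio=0.3):
    """Rebuild one line of text from (char, x0, x1) tuples given in any order.

    gap_ratio is an assumed threshold: a space is inserted when the gap to
    the previous character exceeds gap_ratio * average character width.
    """
    if not chars:
        return ""
    ordered = sorted(chars, key=lambda c: c[1])  # left to right by x0
    avg_width = sum(x1 - x0 for _, x0, x1 in ordered) / len(ordered)
    out = [ordered[0][0]]
    for (_, _, prev_x1), (ch, x0, _) in zip(ordered, ordered[1:]):
        if x0 - prev_x1 > gap_ratio * avg_width:
            out.append(" ")  # gap is wide enough to count as a word break
        out.append(ch)
    return "".join(out)


# "PyMuPDF is" as individually placed characters, stored out of order.
# The "u" overlaps the "M" by one unit (a negative displacement).
chars = [
    ("M", 20, 30), ("P", 0, 10), ("y", 10, 20), ("u", 29, 39),
    ("P", 39, 49), ("D", 49, 59), ("F", 59, 69),
    ("s", 84, 94), ("i", 74, 84),
]
print(join_chars(chars))                 # PyMuPDF is
print(join_chars(chars, gap_ratio=1.0))  # PyMuPDFis
```

The second call shows how a stricter threshold swallows the word gap. Suppressing generated spaces altogether has a similar effect on the output, and when the PDF itself positions words with very small gaps, no threshold will recover the spaces reliably.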

1 reply

tristancatteeuw Mar 29, 2021
Author

Thank you for the in-depth explanation! Honestly, PDF is really a troublesome format sometimes. Thank you for the amazing library, which makes it easier to make sense of this mess.
