1.24.6&1.24.13 get_text different #4032

tenghian · 2024-11-08T19:41:20Z


Block extraction in version 1.24.6 is perfect. How can I make version 1.24.13 produce the same blocks as 1.24.6? Thank you!

text_blocks = page.get_text("dict", flags=pymupdf.TEXTFLAGS_BLOCKS)["blocks"]

pdf1246-cut.pdf
pdf12413-cut.pdf

JorjMcKie · 2024-11-12T16:13:28Z

Maintainer

You cannot, except with your own code of course.

PyMuPDF does not decide block segmentation; it is the result of MuPDF's algorithms. The next MuPDF version, 1.25.0, will bring significant improvements here. With a new text extraction option, MuPDF can be asked to search for recognizable page layout segments, each of which will be turned into a block for PyMuPDF.
That should bring back some of the earlier results.
In a later version, MuPDF will also add guessing of correct paragraph breaks and other structures.
But that is further in the future.
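
As a rough illustration of what "your own code" could look like in the meantime, here is a minimal sketch that post-processes the block list from `page.get_text("dict")`. It merges text blocks that overlap horizontally and sit within a small vertical distance of each other. The function name `merge_close_blocks` and the `max_gap` threshold are hypothetical, and the heuristic is an assumption: it is not how MuPDF segments blocks and it will not reproduce the 1.24.6 output exactly.

```python
def merge_close_blocks(blocks, max_gap=5.0):
    """Merge text blocks that overlap horizontally and are vertically close.

    `blocks` is a list of dicts with a "bbox" (x0, y0, x1, y1) and a "lines"
    list, as found in page.get_text("dict")["blocks"].
    `max_gap` (in points) is a hypothetical threshold; tune it per document.
    """
    merged = []
    # Process blocks top to bottom, then left to right.
    for block in sorted(blocks, key=lambda b: (b["bbox"][1], b["bbox"][0])):
        if block.get("type", 0) != 0:  # skip non-text (image) blocks
            continue
        x0, y0, x1, y1 = block["bbox"]
        for target in merged:
            tx0, ty0, tx1, ty1 = target["bbox"]
            overlaps_horizontally = x0 < tx1 and tx0 < x1
            vertical_gap = y0 - ty1  # negative if the boxes overlap vertically
            if overlaps_horizontally and vertical_gap <= max_gap:
                # Grow the existing block and append the new block's lines.
                target["bbox"] = (min(tx0, x0), min(ty0, y0),
                                  max(tx1, x1), max(ty1, y1))
                target["lines"].extend(block.get("lines", []))
                break
        else:
            # No nearby block found: start a new merged block.
            merged.append({"type": 0,
                           "bbox": tuple(block["bbox"]),
                           "lines": list(block.get("lines", []))})
    return merged


# Intended usage with PyMuPDF (not executed here):
#   blocks = page.get_text("dict", flags=pymupdf.TEXTFLAGS_BLOCKS)["blocks"]
#   text_blocks = merge_close_blocks(blocks, max_gap=5.0)
```

Because each block is compared against all blocks merged so far, blocks in separate columns stay separate as long as they do not overlap horizontally. Pages with more complex layouts would need a more careful rule.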
