PyMuPDF Pro 1.25.0: A 1-page .docx file is split into 4 pages #4157

trianxy · 2024-12-17T13:26:40Z

Description of the bug

For some documents, PyMuPDF Pro produces many more pages than Google Docs, Mac Pages, or LibreOffice show for the same file.

This causes several downstream problems. For example, exporting the first page as a PNG via page.get_pixmap().tobytes(output="png") does not produce the expected first page.

How to reproduce the bug

Download the attached 1page-is-split-into-4pages.docx and run:

import pymupdf.pro
pymupdf.pro.unlock()  # use a trial key to see output of 4th page etc.

document = pymupdf.open("1page-is-split-into-4pages.docx")
for page in document:
    print(page)
    print(page.get_text())

and observe that PyMuPDF reports 4 pages, although the same file shows as 1 page when opened in Google Docs, Mac Pages, or LibreOffice.
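
To help narrow down where the page counts diverge, it may be useful to look at the layout hints stored in the .docx package itself. The following is a minimal sketch using only the Python standard library, not part of PyMuPDF. The helper name docx_layout_hints is hypothetical. It assumes the file is a regular OOXML package, that the authoring application wrote a Pages value into docProps/app.xml (this value is optional and can be stale or missing), and that page sizes are declared in w:pgSz elements in twentieths of a point.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard OOXML namespaces for WordprocessingML and extended properties.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
EP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def docx_layout_hints(path_or_file):
    """Hypothetical helper: collect layout hints stored inside a .docx package.

    Returns a dict with:
      recorded_pages       - page count written by the authoring app, or None
      page_sizes_twips     - list of (width, height) strings from w:pgSz
      explicit_page_breaks - number of <w:br w:type="page"/> elements
    """
    hints = {
        "recorded_pages": None,
        "page_sizes_twips": [],
        "explicit_page_breaks": 0,
    }
    with zipfile.ZipFile(path_or_file) as package:
        # docProps/app.xml is optional; its Pages value is only a hint.
        if "docProps/app.xml" in package.namelist():
            app = ET.fromstring(package.read("docProps/app.xml"))
            pages = app.find(f"{{{EP_NS}}}Pages")
            if pages is not None and pages.text and pages.text.strip().isdigit():
                hints["recorded_pages"] = int(pages.text)

        body = ET.fromstring(package.read("word/document.xml"))
        # Declared page size per section, in twentieths of a point.
        for size in body.iter(f"{{{W_NS}}}pgSz"):
            hints["page_sizes_twips"].append(
                (size.get(f"{{{W_NS}}}w"), size.get(f"{{{W_NS}}}h"))
            )
        # Hard page breaks inserted by the author.
        for br in body.iter(f"{{{W_NS}}}br"):
            if br.get(f"{{{W_NS}}}type") == "page":
                hints["explicit_page_breaks"] += 1
    return hints
```

Calling docx_layout_hints("1page-is-split-into-4pages.docx") and comparing the result with document.page_count from the script above might show whether the extra pages come from explicit breaks in the file or from how the content is laid out.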

PyMuPDF version

1.25.0

Operating system

Linux

Python version

3.9
