Hl 1.25.1 #4143

Draft
wants to merge 18 commits into
base: main
7 changes: 7 additions & 0 deletions changes.txt
@@ -22,6 +22,13 @@ Change Log
* **Fixed** `4004 <https://github.com/pymupdf/PyMuPDF/issues/4004>`_: Segmentation Fault When Updating PDF Form Field Value
* **Fixed** `3751 <https://github.com/pymupdf/PyMuPDF/issues/3751>`_: apply_redactions causes part of the page content to be hidden / transparent

* Other:

* New Page method "recolor" which changes the color component count of text, image and vector graphic objects.
* New Document method "recolor" invokes the same-named "Page" method for all pages in the PDF.
* Image support for "Stamp" annotations.
* Accessing the object definition for an (orphaned) cross reference number no longer raises an exception.


**Changes in version 1.24.14 (2024-11-19)**

22 changes: 19 additions & 3 deletions docs/document.rst
@@ -96,6 +96,7 @@ For details on **embedded files** refer to Appendix 3.
:meth:`Document.pdf_catalog` PDF only: :data:`xref` of catalog (root)
:meth:`Document.pdf_trailer` PDF only: trailer source
:meth:`Document.prev_location` return (chapter, pno) of preceding page
:meth:`Document.recolor` PDF only: execute :meth:`Page.recolor` for all pages
:meth:`Document.reload_page` PDF only: provide a new copy of a page
:meth:`Document.resolve_names` PDF only: Convert destination names into a Python dict
:meth:`Document.save` PDF only: save the document
@@ -594,6 +595,16 @@ For details on **embedded files** refer to Appendix 3.

To maintain a consistent API, for document types not supporting a chapter structure (like PDFs), :attr:`Document.chapter_count` is 1, and pages can also be loaded via tuples *(0, pno)*. See this [#f3]_ footnote for comments on performance improvements.


.. method:: recolor(components=1)

PDF only: Change the color component count of all text, image and vector graphic objects on all pages.

:arg int components: desired color space, given as the number of color components: 1 = DeviceGray, 3 = DeviceRGB, 4 = DeviceCMYK.

The typical use case is 1 (DeviceGray), which converts the PDF to grayscale.
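As a minimal sketch of how the method might be called, the helper below validates the component count and then delegates to ``recolor``. The helper name ``recolor_document`` and the lookup table are illustrative assumptions, not part of PyMuPDF. It only relies on the ``recolor(components=1)`` signature documented above.

```python
# Mapping taken from the parameter description above.
COLORSPACES = {1: "DeviceGray", 3: "DeviceRGB", 4: "DeviceCMYK"}


def recolor_document(doc, components=1):
    """Recolor all pages of ``doc`` and return the target color space name.

    ``doc`` is expected to offer ``recolor(components=...)`` as documented
    for Document. This wrapper is a hypothetical convenience helper.
    """
    if components not in COLORSPACES:
        raise ValueError("components must be 1, 3 or 4")
    doc.recolor(components=components)
    return COLORSPACES[components]


# Possible usage with a real file (requires a PyMuPDF version that
# provides Document.recolor):
#
#   import pymupdf
#   doc = pymupdf.open("input.pdf")
#   recolor_document(doc, components=1)
#   doc.save("input-gray.pdf")
```

Saving under a new file name keeps the original colors available, which seems prudent given that the page-level method describes its changes as permanent.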


.. method:: reload_page(page)

* New in v1.16.10
@@ -924,14 +935,14 @@ For details on **embedded files** refer to Appendix 3.

.. method:: get_page_fonts(pno, full=False)

PDF only: Return a list of all fonts (directly or indirectly) referenced by the page.
PDF only: Return a list of all fonts (directly or indirectly) referenced by the page object definition.

:arg int pno: page number, 0-based, `-∞ < pno < page_count`.
:arg bool full: whether to also include the referencer's :data:`xref`. If *True*, the returned items are one entry longer. Use this option if you need to know whether the page directly references the font. In this case the last entry is 0. If the font is referenced by an `/XObject` of the page, you will find its :data:`xref` here.

:rtype: list

:returns: a list of fonts referenced by this page. Each entry looks like
:returns: a list of fonts referenced by the object definition of the page. Each entry looks like

**(xref, ext, type, basefont, name, encoding, referencer)**,

@@ -959,7 +970,12 @@ For details on **embedded files** refer to Appendix 3.

.. note::
* This list has no duplicate entries: the combination of :data:`xref`, *name* and *referencer* is unique.
* In general, this is a superset of the fonts actually in use by this page. The PDF creator may e.g. have specified some global list, of which each page only makes partial use.
* In general, this is a true superset of the fonts actually in use by this page. The PDF creator may e.g. have specified some global list, of which each page makes only partial use.
* Be aware that font names returned by some variants of :meth:`Page.get_text` (respectively :ref:`TextPage` methods) need not (exactly) equal the base font name shown here. Reasons for any differences include:

- This method always shows any subset prefixes (the pattern ``ABCDEF+``), whereas text extractions do not do this by default.
- Text extractions use the base library to access the font name, which has a length cap of 31 bytes and generally interrogates the font file binary to access the name. Method ``get_page_fonts()`` however looks at the PDF definition source.
- Text extractions work for all supported document types in exactly the same way -- not just for PDFs. Consequently they do not contain PDF-specifics.
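The subset-prefix difference mentioned above can be bridged with a small normalization step. The following is a sketch only: the helper names are hypothetical, and it handles just the ``ABCDEF+`` prefix (six uppercase letters and a plus sign), not the 31-byte length cap or other naming differences.

```python
import re

# A subset prefix consists of six uppercase letters followed by "+".
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def strip_subset_prefix(fontname):
    """Return the font name without a leading subset prefix, if any."""
    return _SUBSET_PREFIX.sub("", fontname)


def same_font(basefont, extracted_name):
    """Compare a basefont from get_page_fonts() with an extracted font name.

    Both names are normalized, so the comparison works whether or not
    either side carries a subset prefix.
    """
    return strip_subset_prefix(basefont) == strip_subset_prefix(extracted_name)
```

A comparison like ``same_font("ABCDEF+Helvetica-Bold", "Helvetica-Bold")`` then treats both names as the same font.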

.. method:: get_page_text(pno, output="text", flags=3, textpage=None, sort=False)

Binary file added docs/images/img-imagestamp.png
44 changes: 31 additions & 13 deletions docs/page.rst
@@ -106,6 +106,7 @@ In a nutshell, this is what you can do with PyMuPDF:
:meth:`Page.load_widget` PDF only: load a specific field
:meth:`Page.load_links` return the first link on a page
:meth:`Page.new_shape` PDF only: create a new :ref:`Shape`
:meth:`Page.recolor` PDF only: change the colorspace of objects
:meth:`Page.remove_rotation` PDF only: set page rotation to 0
:meth:`Page.replace_image` PDF only: replace an image
:meth:`Page.search_for` search for a string
@@ -491,7 +492,7 @@ In a nutshell, this is what you can do with PyMuPDF:
* ``bbox``: the bounding box of the table as a tuple `(x0, y0, x1, y1)`.
* ``cells``: bounding boxes of the table's cells (list of tuples). A cell may also be `None`.
* ``extract()``: this method returns the text content of each table cell as a list of lists of strings.
* ``to_markdown()``: this method returns the table as a **string in markdown format** (compatible to Github). Supporting viewers can render the string as a table. This output is optimized for **small token** sizes, which is especially beneficial for LLM/RAG feeds. Pandas DataFrames (see method `to_pandas()` below) offer an equivalent markdown table output which however is better readable for the human eye.
* ``to_markdown()``: this method returns the table as a **string in markdown format** compatible with GitHub. Supporting viewers can render the string as a table. This output is optimized for **small token sizes**, which is especially beneficial for LLM/RAG feeds. Pandas DataFrames (see method `to_pandas()` below) also offer markdown output. It is easier for humans to read, but generally a larger string than the one produced by the native method.
* `to_pandas()`: this method returns the table as a `pandas <https://pypi.org/project/pandas/>`_ `DataFrame <https://pandas.pydata.org/docs/reference/frame.html>`_. DataFrames are very versatile objects allowing a plethora of table manipulation methods and outputs to almost 20 well-known formats, among them Excel files, CSV, JSON, markdown-formatted tables and more. `DataFrame.to_markdown()` generates a Github-compatible markdown format optimized for human readability. This method however requires the package `tabulate <https://pypi.org/project/tabulate/>`_ to be installed in addition to pandas itself.
* ``header``: a `TableHeader` object containing header information of the table.
* ``col_count``: an integer containing the number of table columns.
@@ -503,11 +504,11 @@ In a nutshell, this is what you can do with PyMuPDF:
* ``bbox``: the bounding box of the header.
* `cells`: a list of bounding boxes containing the name of the respective column.
* `names`: a list of strings containing the text of each of the cell bboxes. They represent the column names -- which are used when exporting the table to pandas DataFrames, markdown, etc.
* `external`: a bool indicating whether the header bbox is outside the table body (`True`) or not. Table headers are never identified by the `TableFinder` logic. Therefore, if `external` is true, then the header cells are not part of any cell identified by `TableFinder`. If `external == False`, then the first table row is the header.
* `external`: a bool indicating whether the header bbox is outside the table body (`True`) or not. Table headers are never identified by the `TableFinder` logic. Therefore, if `external` is true, then the header cells are not part of any cell identified by `TableFinder`. If `external == False`, then the first original table row is the header.

Please have a look at these `Jupyter notebooks <https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/table-analysis>`_, which cover standard situations like multiple tables on one page or joining table fragments across multiple pages.

.. caution:: The lifetime of the `TableFinder` object, as well as that of all its tables **equals the lifetime of the page**. If the page object is deleted or reassigned, all tables are no longer valid.
.. caution:: The lifetime of the `TableFinder` object, as well as that of all its tables **equals the lifetime of the page**. If the page object is deleted or reassigned, all **table objects are no longer valid.**

The only way to keep table content beyond the page's availability is to **extract it** via methods `Table.to_markdown()`, `Table.to_pandas()` or a copy of `Table.extract()` (e.g. `Table.extract()[:]`).
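The caution above can be turned into a small helper that copies all cell text while the page is still alive. This is a sketch under assumptions: ``snapshot_tables`` is a hypothetical name, and the code expects an object with a ``tables`` list whose items offer ``extract()``, as the `TableFinder` and its tables are described here.

```python
def snapshot_tables(finder):
    """Return independent copies of the cell text of every table.

    Each table becomes a list of rows and each row a new list, so the
    result no longer depends on the page or the table objects.
    """
    return [[list(row) for row in table.extract()] for table in finder.tables]


# Possible usage (assumes an opened PyMuPDF document ``doc``):
#
#   page = doc[0]
#   content = snapshot_tables(page.find_tables())
#   del page  # ``content`` stays usable
```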

@@ -535,24 +536,33 @@ In a nutshell, this is what you can do with PyMuPDF:
There is also the `pdf2docx extract tables method`_ which is capable of table extraction if you prefer.


.. method:: add_stamp_annot(rect, stamp=0)
.. method:: add_stamp_annot(rect, stamp=0, *, image=None)

PDF only: Add a "rubber stamp" like annotation to e.g. indicate the document's intended use ("DRAFT", "CONFIDENTIAL", etc.).
PDF only: Add a "rubber stamp"-like annotation to e.g. indicate the document's intended use ("DRAFT", "CONFIDENTIAL", etc.). Instead of text, an image may also be shown.

:arg rect_like rect: rectangle where to place the annotation.

:arg int stamp: id number of the stamp text. For available stamps see :ref:`StampIcons`.
:arg multiple image: if not ``None``, an image specification is assumed and the ``stamp`` parameter will be ignored. Valid argument types are

* a string specifying an image file path,
* a ``bytes``, ``bytearray`` or ``io.BytesIO`` object for an image in memory, and
* a :ref:`Pixmap`.

1. **Text-based stamps**

.. note::

* The stamp's text and its border line will automatically be sized and be put horizontally and vertically centered in the given rectangle. :attr:`Annot.rect` is automatically calculated to fit the given **width** and will usually be smaller than this parameter.
* :attr:`Annot.rect` is automatically calculated as the largest rectangle with an aspect ratio of ``width/height = 3.8`` that fits in the provided ``rect``. Its position is vertically and horizontally centered.
* The font chosen is "Times Bold" and the text will be upper case.
* The appearance can be changed using :meth:`Annot.set_opacity` and by setting the "stroke" color (no "fill" color supported).
* This can be used to create watermark images: on a temporary PDF page create a stamp annotation with a low opacity value, make a pixmap from it with *alpha=True* (and potentially also rotate it), discard the temporary PDF page and use the pixmap with :meth:`insert_image` for your target PDF.
* The appearance can be modified using :meth:`Annot.set_opacity` and by setting the "stroke" color. By PDF specification, stamp annotations have no "fill" color.

.. image:: images/img-stampannot.*

.. image:: images/img-stampannot.*
:scale: 80
2. **Image-based stamps**

* First, a rectangle is computed as for text stamps: vertically and horizontally centered, aspect ratio ``width/height = 3.8``.
* The image is inserted into that rectangle, aligned left and vertically centered. The resulting image bounding box becomes :attr:`Annot.rect`.
* The annotation can be modified via :meth:`Annot.set_opacity`. This makes it possible to display images that have no alpha channel with transparency. Setting colors has no effect on image stamps.

.. image:: images/img-imagestamp.*
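The rectangle rules described for both stamp types can be sketched in plain Python. This is an illustration and not the library's code: the function names are hypothetical, rectangles are plain ``(x0, y0, x1, y1)`` tuples, and the image placement assumes that the image is scaled to fit while keeping its aspect ratio, which the text above does not state explicitly.

```python
STAMP_ASPECT = 3.8  # width / height, as stated above


def stamp_rect(rect):
    """Largest centered rectangle with width/height = 3.8 inside ``rect``."""
    x0, y0, x1, y1 = rect
    width, height = x1 - x0, y1 - y0
    if width <= 0 or height <= 0:
        raise ValueError("rect must have positive width and height")
    # Either the full height fits, or the width limits the height.
    h = min(height, width / STAMP_ASPECT)
    w = h * STAMP_ASPECT
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def image_stamp_rect(rect, img_width, img_height):
    """Image box: left-aligned and vertically centered in the stamp rectangle.

    Assumption: the image is scaled uniformly to fit the stamp rectangle.
    """
    sx0, sy0, sx1, sy1 = stamp_rect(rect)
    scale = min((sx1 - sx0) / img_width, (sy1 - sy0) / img_height)
    w, h = img_width * scale, img_height * scale
    cy = (sy0 + sy1) / 2
    return (sx0, cy - h / 2, sx0 + w, cy + h / 2)
```

For a wide input rectangle the height is the limiting side, for a narrow one the width. In both cases the result is centered in the rectangle that was passed in.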

.. method:: add_widget(widget)

@@ -1924,6 +1934,14 @@ In a nutshell, this is what you can do with PyMuPDF:

:arg int rotate: An integer specifying the required rotation in degrees. Must be an integer multiple of 90. Values will be converted to one of 0, 90, 180, 270.

.. method:: recolor(components=1)

PDF only: Change the colorspace components of all objects on the page.

:arg int components: The desired count of color components. Must be one of 1, 3 or 4, which results in color spaces DeviceGray, DeviceRGB or DeviceCMYK respectively. The method affects text, images and vector graphics. For instance, with the default value 1, a page will be converted to grayscale.

The changes made are **permanent** and cannot be reverted.
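If only some pages should change, the page-level method can be applied selectively. The sketch below is an assumption-laden illustration: ``recolor_pages`` is a hypothetical helper that expects a document-like object supporting 0-based indexing (``doc[pno]``) and pages offering ``recolor(components)`` as documented above.

```python
def recolor_pages(doc, page_numbers, components=1):
    """Recolor only the given 0-based pages and return how many were changed."""
    if components not in (1, 3, 4):
        raise ValueError("components must be 1, 3 or 4")
    count = 0
    for pno in page_numbers:
        page = doc[pno]  # load the page by its 0-based number
        page.recolor(components)
        count += 1
    return count


# Possible usage (assumes a PyMuPDF version that provides Page.recolor):
#
#   import pymupdf
#   doc = pymupdf.open("input.pdf")
#   recolor_pages(doc, [0, 2])       # pages 1 and 3 become grayscale
#   doc.save("input-partial.pdf")    # keep the original file untouched
```

Because the changes cannot be reverted, writing the result to a new file is the safer choice.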

.. method:: remove_rotation()

PDF only: Set page rotation to 0 while maintaining appearance and page content.
4 changes: 3 additions & 1 deletion docs/pymupdf4llm/api.rst
@@ -16,14 +16,16 @@ The |PyMuPDF4LLM| API

Prints the version of the library.

.. method:: to_markdown(doc: pymupdf.Document | str, *, pages: list | range | None = None, hdr_info: Any = None, write_images: bool = False, embed_images: bool = False, dpi: int = 150, image_path="", image_format="png", image_size_limit=0.05, force_text=True, margins=(0, 50, 0, 50), page_chunks: bool = False, page_width: float = 612, page_height: float = None, table_strategy="lines_strict", graphics_limit: int = None, ignore_code: bool = False, extract_words: bool = False, show_progress: bool = True) -> str | list[dict]
.. method:: to_markdown(doc: pymupdf.Document | str, *, pages: list | range | None = None, filename=None, hdr_info: Any = None, write_images: bool = False, embed_images: bool = False, dpi: int = 150, image_path="", image_format="png", image_size_limit=0.05, force_text=True, margins=(0, 50, 0, 50), page_chunks: bool = False, page_width: float = 612, page_height: float = None, table_strategy="lines_strict", graphics_limit: int = None, ignore_code: bool = False, extract_words: bool = False, show_progress: bool = True) -> str | list[dict]

Reads the pages of the file and outputs their text in |Markdown| format. How this should happen in detail can be influenced by a number of parameters. Please note that there exists **support for building page chunks** from the |Markdown| text.

:arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| Document (created via `pymupdf.open`). In order to use `pathlib.Path` specifications, Python file-like objects, documents in memory etc. you **must** use a |PyMuPDF| Document.

:arg list pages: optional, the pages to consider for output (caution: specify 0-based page numbers). If omitted all pages are processed.

:arg filename: optional. Use this if you want to provide or override the file name. This may especially be useful when the document is opened from memory streams (which have no name and where thus ``doc.name`` is the empty string). This parameter will be used in all places where normally ``doc.name`` would have been used.

:arg hdr_info: optional. Use this if you want to provide your own header detection logic. This may be a callable or an object having a method named `get_header_id`. It must accept a text span (a span dictionary as contained in :meth:`~.extractDICT`) and a keyword parameter "page" (which is the owning :ref:`Page <page>` object). It must return a string "" or up to 6 "#" characters followed by 1 space. If omitted, a full document scan will be performed to find the most popular font sizes and derive header levels based on them. To completely avoid this behavior specify `hdr_info=lambda s, page=None: ""` or `hdr_info=False`.

:arg bool write_images: when encountering images or vector graphics, images will be created from the respective page area and stored in the specified folder. Markdown references will be generated pointing to these images. Any text contained in these areas will not be included in the text output (but appear as part of the images). Therefore, if for instance your document has text written on full page images, make sure to set this parameter to `False`.
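To illustrate the ``hdr_info`` contract described above, here is a minimal sketch of a custom header detector. The function name and the font size thresholds are arbitrary assumptions chosen for illustration. The code relies only on the span dictionary's ``"size"`` entry and on the keyword parameter ``page``.

```python
def header_by_font_size(span, page=None):
    """Map a text span to a Markdown header prefix based on its font size.

    Returns "" for body text, otherwise "#" characters plus one space.
    The thresholds are illustrative and need tuning per document.
    """
    size = span.get("size", 0)
    if size >= 20:
        return "# "
    if size >= 16:
        return "## "
    if size >= 13:
        return "### "
    return ""


# Possible usage (requires pymupdf4llm; file names are placeholders):
#
#   import pymupdf, pymupdf4llm
#   doc = pymupdf.open(stream=pdf_bytes, filetype="pdf")
#   md = pymupdf4llm.to_markdown(
#       doc, filename="report.pdf", hdr_info=header_by_font_size
#   )
```

Passing a callable like this replaces the document-wide font size scan. The usage comment also sets ``filename`` because a document opened from memory has no name.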