upload v1.18.11

pymupdf · Apr 10, 2021 · 8537b18
1 parent a1d8963
commit 8537b18

Showing 19 changed files with 418 additions and 104 deletions.
diff --git a/PKG-INFO b/PKG-INFO
@@ -10,7 +10,7 @@ Home-page: https://github.com/pymupdf/PyMuPDF
 Download-url: https://github.com/pymupdf/PyMuPDF
 Summary: PyMuPDF is a Python binding for the document renderer and toolkit MuPDF
 Description:
-        Release date: March 26, 2021
+        Release date: April 10, 2021
 
         Authors
         =======
@@ -25,7 +25,7 @@ Description:
 
         MuPDF can access files in PDF, XPS, OpenXPS, CBZ, EPUB and FB2 (e-books) formats, and it is known for its top performance and high rendering quality.
 
-        With PyMuPDF you can access files with extensions like “.pdf”, “.xps”, “.oxps”, “.cbz”, “.fb2” or “.epub”. In addition, about 10 popular image formats can also be opened and handled like documents.
+        With PyMuPDF you can access files with extensions like .pdf, .xps, .oxps, .cbz, .fb2 or .epub. In addition, about 10 popular image formats can also be handled like documents: .png, .bmp, .gif, .tiff, etc.
 
         PyMuPDF should run on all platforms that are supported by both, MuPDF and Python 3.6+. These include, but are not limited to, Windows, Mac OSX and Linux, 32-bit or 64-bit. If you can generate MuPDF on a Python supported platform, then also PyMuPDF can be used there.
 
@@ -59,7 +59,7 @@ Description:
         License and Copyright Information
         ==================================
 
-        In order to comply with MuPDF’s dual licensing model, PyMuPDF has entered into an agreement with Artifex who has the right to sublicense PyMuPDF to third parties.
+        In order to comply with MuPDF's dual licensing model, PyMuPDF has entered into an agreement with Artifex who has the right to sublicense PyMuPDF to third parties.
 
         PyMuPDF and MuPDF are now available under both open-source AGPL and commercial license agreements. Please read the full text of the AGPL license agreement, available in the distribution material (file COPYING) and `here <https://www.gnu.org/licenses/agpl-3.0.html>`_, to ensure that your use case complies with the guidelines of the license. If you determine you cannot meet the requirements of the AGPL, please contact `Artifex <https://artifex.com/contact/>`_ for more information regarding a commercial license.
 

diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@
 
 ![logo](https://github.com/pymupdf/PyMuPDF/blob/master/demo/pymupdf.jpg)
 
-Release date: March 22, 2021
+Release date: April 10, 2021
 
 **Travis-CI:** [![Build Status](https://travis-ci.org/JorjMcKie/py-mupdf.svg?branch=master)](https://travis-ci.org/JorjMcKie/py-mupdf)
 
@@ -19,9 +19,9 @@ PyMuPDF (current version 1.18.11) is a Python binding with support for [MuPDF](h
 
 MuPDF can access files in PDF, XPS, OpenXPS, CBZ, EPUB and FB2 (e-books) formats, and it is known for its top performance and high rendering quality.
 
-With PyMuPDF you can access files with extensions like “.pdf”, “.xps”, “.oxps”, “.cbz”, “.fb2” or “.epub”. In addition, about 10 popular image formats can also be opened and handled like documents: ".png", ".jpg", ".bmp", ".tiff", etc..
+With PyMuPDF you can access files with extensions like ".pdf", ".xps", ".oxps", ".cbz", ".fb2" or ".epub". In addition, about 10 popular image formats can also be handled like documents: ".png", ".jpg", ".bmp", ".tiff", etc.
 
-> In partnership with [Artifex](https://artifex.com/), PyMuPDF is now also available for commercial licensing. This agreement has no impact on use cases, that are compliant with the open-source license AGPL. Please see the “License and Copyright” section below for additional information.
+> In partnership with [Artifex](https://artifex.com/), PyMuPDF is now also available for commercial licensing. This agreement has no impact on use cases that are compliant with the open-source license AGPL. Please see the "License and Copyright" section below for additional information.
 
 # Usage and Documentation
 For all supported document types (i.e. **_including images_**) you can
@@ -79,7 +79,7 @@ Before you can do that, you must first build MuPDF. For most platforms, the MuPD
   - Now MuPDF can be generated.
 
 * Please note that you will need the interface generator [SWIG](http://www.swig.org/) when building PyMuPDF from the sources of this repository (please refer to issue #312 for some background on this).
-    - PyMuPDF wheels are being generated using **SWIG v4.0.1**.
+    - PyMuPDF wheels are being generated using **SWIG v4.0.2**.
 
 * If you do **not use SWIG**, please download the **sources from PyPI** - they contain sources pre-processed by SWIG, so installation should work like any other Python extension generation on your system.
 

diff --git a/docs/app2.rst b/docs/app2.rst
@@ -135,12 +135,12 @@ To address the font issue, you can use a simple utility script to scan through t
          testn = font_sans                     # use Helvetica
      elif test.endswith(",monospace"):         # monospaced font?
          testn = font_mono                     # becomes Courier
- 
+
      if testn != "":                           # any of the above found?
          otext = otext.replace(test, testn)    # change the source
          found_one = True
          pos1 = 0                              # start over
- 
+
  if found_one:
      ofile = open(filename + ".html", "w")
      ofile.write(otext)
@@ -217,7 +217,7 @@ XML
 ~~~
 
 The :meth:`TextPage.extractXML` (or *Page.get_text("xml")*) version extracts text (no images) with the detail level of RAWDICT::
-  
+
     >>> for line in page.get_text("xml").splitlines():
         print(line)
 
@@ -261,17 +261,19 @@ Text Extraction Flags Defaults
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 *(New in version 1.16.2)* Method :meth:`Page.get_text` supports a keyword parameter *flags* *(int)* to control the amount and the quality of extracted data. The following table shows the default settings (flags parameter omitted or None) for each extraction variant. If you specify flags with a value other than *None*, be aware that you must set **all desired** options. A description of the respective bit settings can be found in :ref:`TextPreserve`.
 
-=================== ==== ==== ===== === ==== ======= ===== ======
-Indicator           text html xhtml xml dict rawdict words blocks
-=================== ==== ==== ===== === ==== ======= ===== ======
-preserve ligatures  1    1    1     1   1    1       1     1
-preserve whitespace 1    1    1     1   1    1       1     1
-preserve images     n/a  1    1     n/a 1    1       n/a   0
-inhibit spaces      0    0    0     0   0    0       0     0
-dehyphenate         0    0    0     0   0    0       0     0
-=================== ==== ==== ===== === ==== ======= ===== ======
-
+=================== ==== ==== ===== === ==== ======= ===== ====== ======
+Indicator           text html xhtml xml dict rawdict words blocks search
+=================== ==== ==== ===== === ==== ======= ===== ====== ======
+preserve ligatures  1    1    1     1   1    1       1     1      0
+preserve whitespace 1    1    1     1   1    1       1     1      1
+preserve images     n/a  1    1     n/a 1    1       n/a   0      0
+inhibit spaces      0    0    0     0   0    0       0     0      0
+dehyphenate         0    0    0     0   0    0       0     0      1
+=================== ==== ==== ===== === ==== ======= ===== ====== ======
+
+* **search** refers to the text search function.
 * **"json"** is handled exactly like **"dict"** and is hence left out.
+* **"rawjson"** is handled exactly like **"rawdict"** and is hence left out.
 * An "n/a" specification means a value of 0 and setting this bit never has any effect on the output (but an adverse effect on performance).
 * If you are not interested in images when using an output variant which includes them by default, then by all means set the respective bit off: You will experience a better performance and much lower space requirements.
 
@@ -291,7 +293,7 @@ To show the effect of *TEXT_INHIBIT_SPACES* have a look at this example::
     in English
     ... let's see
     what happens.
-    >>> 
+    >>>
 
 
 Performance
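
The flags table in the hunk above can be read as bit combinations. The following is a minimal sketch of that reading, not PyMuPDF code: the numeric values are stand-ins assumed to mirror PyMuPDF's `TEXT_*` constants (which follow MuPDF's structured-text flags), and `DEFAULTS` and `describe` are hypothetical names introduced only for illustration.

```python
# Stand-in bit values, ASSUMED to mirror PyMuPDF's TEXT_* constants.
# Verify them against the installed version before relying on them.
TEXT_PRESERVE_LIGATURES = 1
TEXT_PRESERVE_WHITESPACE = 2
TEXT_PRESERVE_IMAGES = 4
TEXT_INHIBIT_SPACES = 8
TEXT_DEHYPHENATE = 16

FLAG_NAMES = {
    TEXT_PRESERVE_LIGATURES: "preserve ligatures",
    TEXT_PRESERVE_WHITESPACE: "preserve whitespace",
    TEXT_PRESERVE_IMAGES: "preserve images",
    TEXT_INHIBIT_SPACES: "inhibit spaces",
    TEXT_DEHYPHENATE: "dehyphenate",
}

# Default combinations for a few variants, transcribed from the table
# (hypothetical helper dict, not part of the library).
DEFAULTS = {
    "text": TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE,
    "dict": TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_PRESERVE_IMAGES,
    "blocks": TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE,
    "search": TEXT_PRESERVE_WHITESPACE | TEXT_DEHYPHENATE,
}


def describe(flags):
    """Return the names of all bits that are set in 'flags'."""
    return [name for bit, name in FLAG_NAMES.items() if flags & bit]


# Because all desired options must be set explicitly, start from the
# default of a variant and switch off only the bit that is not wanted.
flags_no_images = DEFAULTS["dict"] & ~TEXT_PRESERVE_IMAGES

print(flags_no_images, describe(flags_no_images))
```

With PyMuPDF installed, such a value would be passed as `page.get_text("dict", flags=flags_no_images)`, preferably built from the library's own constants rather than the stand-ins used here.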

diff --git a/docs/app4.rst b/docs/app4.rst
@@ -3,6 +3,60 @@
 ================================================
 Appendix 4: Assorted Technical Information
 ================================================
+This section deals with various technical topics that are not necessarily related to each other.
+
+------------
+
+.. _ImageTransformation:
+
+Image Transformation Matrix
+----------------------------
+Starting with version 1.18.11, the image transformation matrix is returned by some methods for text and image extraction: :meth:`Page.get_text` and :meth:`Page.get_image_bbox`.
+
+The transformation matrix contains information about how an image was transformed to fit into the rectangle (its "boundary box" = "bbox") on some document page. By inspecting the image's bbox on the page and this matrix, one can determine, for example, whether and how the image is displayed scaled or rotated on a page.
+
+The relationship between image width and height and the bbox on a page is the following:
+
+1. Using the original image's width and height, we can define the image rectangle ``imgrect = fitz.Rect(0, 0, width, height)`` and a "shrink matrix" ``shrink = fitz.Matrix(1/width, 0, 0, 1/height, 0, 0)``.
+2. Transforming the image rectangle with its shrink matrix will result in the unit rectangle: ``imgrect * shrink = fitz.Rect(0, 0, 1, 1)``.
+3. Using the image **transformation matrix** "transform", the following steps will compute the bbox::
+
+    imgrect = fitz.Rect(0, 0, width, height)
+    shrink = fitz.Matrix(1/width, 0, 0, 1/height, 0, 0)
+    bbox = imgrect * shrink * transform
+
+4. Inspecting the matrix product ``shrink * transform`` will reveal all information about what happened to the image rectangle to make it fit into the bbox on the page: rotation, scaling of its sides and translation of its origin. Let us look at an example:
+
+    >>> imginfo = page.get_images()[0]  # get an image item on a page
+    >>> imginfo
+    (5, 0, 439, 501, 8, 'DeviceRGB', '', 'fzImg0', 'DCTDecode')
+    >>> #------------------------------------------------
+    >>> # define image shrink matrix and rectangle
+    >>> #------------------------------------------------
+    >>> shrink = fitz.Matrix(1 / 439, 0, 0, 1 / 501, 0, 0)
+    >>> imgrect = fitz.Rect(0, 0, 439, 501)
+    >>> #------------------------------------------------
+    >>> # determine image bbox and transformation matrix:
+    >>> #------------------------------------------------
+    >>> bbox, transform = page.get_image_bbox("fzImg0", transform=True)
+    >>> #------------------------------------------------
+    >>> # confirm equality - permitting rounding errors
+    >>> #------------------------------------------------
+    >>> bbox
+    Rect(100.0, 112.37525939941406, 300.0, 287.624755859375)
+    >>> imgrect * shrink * transform
+    Rect(100.0, 112.375244140625, 300.0, 287.6247253417969)
+    >>> #------------------------------------------------
+    >>> shrink * transform
+    Matrix(0.0, -0.39920157194137573, 0.3992016017436981, 0.0, 100.0, 287.6247253417969)
+    >>> #------------------------------------------------
+    >>> # the above shows:
+    >>> # image sides scaled by same factor 0.4
+    >>> # image rotated by 90 degrees anti-clockwise
+    >>> #------------------------------------------------
+
+
+------------
 
 .. _Base-14-Fonts:
 

diff --git a/docs/changes.rst b/docs/changes.rst
@@ -1,6 +1,17 @@
 Change Logs
 ===============
 
+Changes in Version 1.18.11
+---------------------------
+* **Fixed** issue `#972 <https://github.com/pymupdf/PyMuPDF/issues/972>`_. Improved layout of source distribution material.
+* **Fixed** issue `#962 <https://github.com/pymupdf/PyMuPDF/issues/962>`_. Stabilized Linux distribution detection for generating PyMuPDF from sources.
+* **Added:** :meth:`Page.get_xobjects` delivers the result of :meth:`Document.get_page_xobjects`.
+* **Added:** :meth:`Page.get_image_info` delivers meta information for all images shown on the page.
+* **Added:** :meth:`Tools.mupdf_display_warnings` allows switching the display of MuPDF-generated warnings on or off. The default is off.
+* **Added:** :meth:`Document.ez_save`, a convenience variant of :meth:`Document.save` with some different defaults.
+* **Changed:** Image extractions of document pages now also contain the image's **transformation matrix**. This concerns :meth:`Page.get_image_bbox` and the DICT, JSON, RAWDICT, and RAWJSON variants of :meth:`Page.get_text`.
+
+
 Changes in Version 1.18.10
 ---------------------------
 * **Fixed** issue `#941 <https://github.com/pymupdf/PyMuPDF/issues/941>`_. Added old aliases for :meth:`DisplayList.get_pixmap` and :meth:`DisplayList.get_textpage`.

diff --git a/docs/conf.py b/docs/conf.py
@@ -42,7 +42,7 @@
 # built documents.
 #
 # The full version, including alpha/beta/rc tags.
-release = "1.18.10"
+release = "1.18.11"
 
 # The short X.Y version
 version = release

diff --git a/docs/document.rst b/docs/document.rst
@@ -43,6 +43,7 @@ For details on **embedded files** refer to Appendix 3.
 :meth:`Document.embfile_info`           PDF only: metadata of an embedded file
 :meth:`Document.embfile_names`          PDF only: list of embedded files
 :meth:`Document.embfile_upd`            PDF only: change an embedded file
+:meth:`Document.ez_save`                PDF only: :meth:`Document.save` with different defaults
 :meth:`Document.find_bookmark`          retrieve page location after layouting
 :meth:`Document.fullcopy_page`          PDF only: duplicate a page
 :meth:`Document.get_oc_states`          PDF only: lists of OCGs in ON, OFF, RBGroups
@@ -706,7 +707,7 @@ For details on **embedded files** refer to Appendix 3.
 
       PDF only: Return the PDF dictionary keys of the object provided by its xref number.
 
-      :arg int xref: the :data:`xref`. *(Changed in v1.18.10)* Use ``-1`` if you want to access the special dictionary "PDF trailer" (it has no identifying xref).
+      :arg int xref: the :data:`xref`. *(Changed in v1.18.10)* Use ``-1`` to access the special dictionary "PDF trailer" (it has no identifying xref).
 
       :returns: a tuple of dictionary keys present in object :data:`xref`. Examples:
 
@@ -727,7 +728,7 @@ For details on **embedded files** refer to Appendix 3.
 
       PDF only: Return type and value of a PDF dictionary key of an xref.
 
-      :arg int xref: the :data:`xref`. *(Changed in v1.18.10)* Use ``-1`` if you want to access the special dictionary "PDF trailer" (it has no identifying xref).
+      :arg int xref: the :data:`xref`. *Changed in v1.18.10:* Use ``-1`` to access the special dictionary "PDF trailer" (it has no identifying xref).
       :arg str key: the desired PDF key. Must **exactly** match (case-sensitive) one of the keys contained in :meth:`Document.xref_get_keys`.
 
       :returns: a tuple (type, value), where type is one of "xref", "array", "dict", "int", "float", "null", "bool", "name", "string" or "unknown" (should not occur). Independent of "type", the value of the key is **always** formatted as a string -- see the following example -- and a faithful reflection of what is stored in the PDF. An argument like the return value can be used to modify the value of a key of :data:`xref`.
@@ -739,7 +740,7 @@ For details on **embedded files** refer to Appendix 3.
           Resources = ('xref', '1296 0 R')
           MediaBox = ('array', '[0 0 612 792]')
           Parent = ('xref', '1301 0 R')
-          >>> # no the same thing for the PDF trailer:
+          >>> # same thing for the PDF trailer:
           >>> for key in doc.xref_get_keys(-1):
                   print(key, "=", doc.xref_get_key(-1, key))
           Type = ('name', '/XRef')
@@ -790,17 +791,19 @@ For details on **embedded files** refer to Appendix 3.
 
     .. method:: get_page_xobjects(pno)
 
+      *(Changed in v1.18.11)*
+
       PDF only: *(New in v1.16.13)* Return a list of all XObjects referenced by a page.
 
       :arg int pno: page number, 0-based, *-inf < pno < page_count*.
 
       :rtype: list
-      :returns: a list of (non-image) XObjects. These objects typically represent pages *embedded* (not copied) from other PDFs. For example, :meth:`Page.show_pdf_page` will create this type of object. An item of this list has the following layout: **(xref, name, invoker, bbox)**, where
+      :returns: a list of (non-image) XObjects. These objects typically represent pages *embedded* (not copied) from other PDFs. For example, :meth:`Page.show_pdf_page` will create this type of object. An item of this list has the following layout: ``(xref, name, invoker, bbox)``, where
 
-        * **xref** (*int*) is the XObject's :data:`xref`
-        * **name** (*str*) is the symbolic name to reference the XObject
-        * **invoker** (*int*) the :data:`xref` of the invoking XObject or zero if the page directly invokes it
-        * **bbox** (*tuple*) the boundary box of the XObject's location on the page **in untransformed coordinates**. To get actual, non-rotated page coordinates, multiply with the page's transformation matrix :attr:`Page.transformation_matrix`.
+        * **xref** (*int*) is the XObject's :data:`xref`.
+        * **name** (*str*) is the symbolic name to reference the XObject.
+        * **invoker** (*int*) the :data:`xref` of the invoking XObject or zero if the page directly invokes it.
+        * **bbox** (:ref:`Rect`) the boundary box of the XObject's location on the page **in untransformed coordinates**. To get actual, non-rotated page coordinates, multiply with the page's transformation matrix :attr:`Page.transformation_matrix`. *Changed in v1.18.11:* the bbox is now formatted as :ref:`Rect`.
 
 
     .. method:: get_page_images(pno, full=False)
@@ -1095,11 +1098,19 @@ For details on **embedded files** refer to Appendix 3.
 
       :arg str user_pw: *(new in version 1.16.0)* set the document's user password.
 
+    .. method:: ez_save(*args, **kwargs)
+
+      *(New in v1.18.11)*
+
+      PDF only: The same as :meth:`Document.save` but with the changed defaults ``deflate=True, garbage=3``.
+
     .. method:: saveIncr()
 
       PDF only: saves the document incrementally. This is a convenience abbreviation for *doc.save(doc.name, incremental=True, encryption=PDF_ENCRYPT_KEEP)*.
 
 
+    .. method:: ez_save()
+
     .. method:: tobytes(garbage=0, clean=False, deflate=False, deflate_images=False, deflate_fonts=False, ascii=False, expand=0, linear=False, pretty=False, encryption=PDF_ENCRYPT_NONE, permissions=-1, owner_pw=None, user_pw=None)
 
       *(Changed in v1.18.7)*
@@ -1397,10 +1408,17 @@ For details on **embedded files** refer to Appendix 3.
 
     .. method:: xref_object(xref, compressed=False, ascii=False)
 
-      *(New in version 1.16.8)*
+      *(New in version 1.16.8, changed in v1.18.10)*
 
       PDF only: Return the definition source of a PDF object.
 
+      :arg int xref: the object's :data:`xref`. *Changed in v1.18.10:* A value of -1 returns the PDF trailer source.
+      :arg bool compressed: whether to generate a compact output with no line breaks or spaces.
+      :arg bool ascii: whether to ASCII-encode binary data.
+
+      :rtype: str
+      :returns: The object definition source.
+
     .. method:: pdf_catalog()
 
       *(New in version 1.16.8)*
@@ -1412,7 +1430,7 @@ For details on **embedded files** refer to Appendix 3.
 
       *(New in version 1.16.8)*
 
-      PDF only: Return the trailer source of the PDF (UTF-8), which is usually located at the PDF file's end. This is similar to :meth:`Document.xref_object` except that this object has no identifier to access it.
+      PDF only: Return the trailer source of the PDF, which is usually located at the PDF file's end. This is equivalent to :meth:`Document.xref_object` with an *xref* argument of -1.
 
 
     .. method:: xref_xml_metadata()
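
The `ez_save` entry added in this file is described as `Document.save` with the changed defaults `deflate=True, garbage=3`. One way to picture such a wrapper is sketched below. This is not the library's implementation: `save_with_ez_defaults` and `StubDocument` are hypothetical names, and the stub only records the arguments it receives so that the sketch runs without PyMuPDF.

```python
class StubDocument:
    """Hypothetical stand-in for a document object; it only records save() calls."""

    def __init__(self):
        self.calls = []

    def save(self, filename, **kwargs):
        self.calls.append((filename, kwargs))


def save_with_ez_defaults(doc, filename, **kwargs):
    """Call doc.save() with garbage=3 and deflate=True unless the caller overrides them."""
    kwargs.setdefault("garbage", 3)
    kwargs.setdefault("deflate", True)
    return doc.save(filename, **kwargs)


doc = StubDocument()
save_with_ez_defaults(doc, "out.pdf")             # uses the changed defaults
save_with_ez_defaults(doc, "out.pdf", garbage=0)  # an explicit value wins
print(doc.calls)
```

Using `setdefault` keeps explicit caller arguments intact, which matches the idea of changed defaults rather than forced values.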

diff --git a/docs/images/img-line-dir.png b/docs/images/img-line-dir.png
diff --git a/docs/images/img-textpage.png b/docs/images/img-textpage.png