Use PyMuPDF to solve most congestions issues in /tmp (client & Server) #622

deeplow · 2023-11-16T16:54:13Z

Alternative to #619, but with PyMuPDF. This changes both the doc-to-pixels and the pixels-to-PDF conversion.

Missing

assess the impact of PyMuPDF (performance, security, PDF rendering impact) (see results in PyMuPDF integration #658)
change Dangerzone license due to PyMuPDF being AGPL

apyrgio · 2023-12-04T17:49:23Z

While running the tests, I stumbled on another issue:

ImportError while loading conftest '/home/user/dangerzone/tests/conftest.py'.
tests/__init__.py:8: in <module>
    from dangerzone.document import SAFE_EXTENSION
dangerzone/__init__.py:16: in <module>
    from .gui import gui_main as main
dangerzone/gui/__init__.py:28: in <module>
    from ..isolation_provider.qubes import Qubes, is_qubes_native_conversion
dangerzone/isolation_provider/qubes.py:15: in <module>
    from ..conversion.pixels_to_pdf import PixelsToPDF
dangerzone/conversion/pixels_to_pdf.py:16: in <module>
    import fitz
E   ModuleNotFoundError: No module named 'fitz'

We should conditionally import the Qubes isolation provider, so that we don't always import the conversion module, and thus fitz. This will also help with any other Python dependency that exists in the container, but not in the host.
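A minimal sketch of the deferred-import idea, not the change that eventually landed. The function name and the string return values are hypothetical stand-ins for the real provider classes; the only point is that `fitz` is imported inside the Qubes branch rather than at module load time.

```python
import importlib


def get_isolation_provider(use_qubes: bool) -> str:
    """Pick a provider, importing heavy dependencies only when needed.

    Hypothetical helper: the strings stand in for provider objects.
    """
    if use_qubes:
        # Deferred import: only the Qubes path needs fitz (PyMuPDF), so a
        # host without it can still import this module and use containers.
        importlib.import_module("fitz")
        return "qubes"
    return "container"
```

With this shape, importing the module never touches `fitz`; a missing PyMuPDF only surfaces if the Qubes path is actually requested.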

deeplow · 2023-12-18T16:45:03Z

> We should conditionally import the Qubes isolation provider, so that we don't always import the conversion module, and thus fitz. This will also help with any other Python dependency that exists in the container, but not in the host.

@apyrgio not sure I follow your exact suggestion. Do you mean importing it closer to where the code is called?
Another option would be to conditionally import fitz.

apyrgio · 2023-12-18T21:37:27Z

> @apyrgio not sure I follow your exact suggestion. Do you mean importing it closer to where the code is called?
> Another option would be to conditionally import fitz.

I don't have a strong preference, since on-host conversion will ultimately solve this. Whatever requires the least amount of changes to revert would be my suggestion. And yes, my original thought was moving the import statement closer to where the Qubes isolation provider is used.

deeplow · 2023-12-19T17:47:12Z

> I don't have a strong preference, since on-host conversion will ultimately solve this. Whatever requires the least amount of changes to revert would be my suggestion. And yes, my original thought was moving the import statement closer to where the Qubes isolation provider is used.

Let's see if 085411f fixes it.

The converted document was larger in dimensions than the original one due to a mismatch in DPI settings. When converting documents to pixels we were setting the resolution to 150 DPI. Then, when converting back into a PDF, we were using 70 DPI. This difference resulted in an overall larger document in dimensions (though not necessarily in file size). Fixes #626
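The arithmetic behind the mismatch can be sketched as follows. The 8.5-inch page width is an assumed example value and the helper names are illustrative; only the 150 and 70 DPI figures come from the commit message.

```python
def rendered_pixels(length_in: float, dpi: int) -> int:
    """Number of pixels produced when rendering a length at a given DPI."""
    return round(length_in * dpi)


def physical_inches(pixels: int, dpi: int) -> float:
    """Physical length implied when those pixels are placed at a given DPI."""
    return pixels / dpi


# Assumed example: an 8.5-inch-wide page rendered at 150 DPI.
width_px = rendered_pixels(8.5, 150)

# Matching DPI on the way back reproduces the original width.
matched = physical_inches(width_px, 150)

# Using 70 DPI on the way back inflates the width by a factor of 150/70.
mismatched = physical_inches(width_px, 70)
```

Under these assumptions the page comes back roughly 2.14 times wider than it started, even though the pixel data is unchanged.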

We're intentionally bypassing PEP 668 [1], which prevents the installation of non-distro Python wheels alongside system packages in order to avoid distro-level incompatibilities. This is acceptable because our container image is a controlled environment (we only ship a version after rigorous testing). [1]: https://peps.python.org/pep-0668/

Ensure that the container image installs pymupdf (unavailable in the repos) with verified hashes. To do so, the pymupdf dependency is declared in a "container" group in `pyproject.toml`, which then gets exported into a requirements.txt, which is in turn used for hash verification when building the container. Because this required modifying the container image build scripts, they were all merged to avoid duplicate code. This was an overdue change anyway.
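To illustrate what hash verification against an exported requirements file amounts to, here is a conceptual sketch. It is not pip's implementation; the helper names and the sample requirement line are made up for the example.

```python
import hashlib
import re


def pinned_hashes(requirement_line: str) -> set[str]:
    """Extract the sha256 digests pinned on one requirements.txt line."""
    return set(re.findall(r"--hash=sha256:([0-9a-f]{64})", requirement_line))


def artifact_matches(data: bytes, requirement_line: str) -> bool:
    """Accept a downloaded artifact only if its digest is one of the pinned ones."""
    return hashlib.sha256(data).hexdigest() in pinned_hashes(requirement_line)


# Hypothetical artifact and requirement line, for illustration only.
wheel = b"pretend this is a wheel file"
line = f"pymupdf==1.23.8 --hash=sha256:{hashlib.sha256(wheel).hexdigest()}"
```

Any artifact whose digest is not listed on the line is rejected, which is the property the build relies on.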

Qubes does on-host pixels-to-PDF conversion, whereas the containers version doesn't. As a result, the containers version tried to load fitz, which isn't installed there, merely to check whether it should run the Qubes version. The error looked something like this:

ImportError while loading conftest '/home/user/dangerzone/tests/conftest.py'.
tests/__init__.py:8: in <module>
    from dangerzone.document import SAFE_EXTENSION
dangerzone/__init__.py:16: in <module>
    from .gui import gui_main as main
dangerzone/gui/__init__.py:28: in <module>
    from ..isolation_provider.qubes import Qubes, is_qubes_native_conversion
dangerzone/isolation_provider/qubes.py:15: in <module>
    from ..conversion.pixels_to_pdf import PixelsToPDF
dangerzone/conversion/pixels_to_pdf.py:16: in <module>
    import fitz
E   ModuleNotFoundError: No module named 'fitz'

For context see discussion in [1]. [1]: #622 (comment)

Some tests [1] led to the conclusion that ocr_compression has the same effect on the file (performance- and size-wise) as deflating images when saving it. However, having both methods active adds a bit of extra time. For this reason we're disabling the image deflation (default option). [1]: #622 (comment)

PyMuPDF replaced the need for almost all dependencies, which this commit now removes. We are also removing tesseract-ocr as a dependency since (to our surprise) PyMuPDF ships directly with tesseract binaries [1]. However, now that tesseract-ocr is not available directly as a binary tool, the `test_ocr.py` needed to be changed. Fixes #658 [1]: #658 (comment)

Add the following functionality to the build image script:

1. Let the user choose the container runtime. On some systems, both Docker and Podman may be available, so the user needs a way to pick one.
2. Let the user choose whether to save the image. For non-production builds, we may want to simply build the container image, without the time penalty of compression.
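The two options could be exposed roughly as below. This is a sketch only: the flag names, the default runtime, and the parser layout are assumptions and may not match the actual build-image.py.

```python
import argparse


def parse_args(argv=None):
    """Parse build options (hypothetical flag names, for illustration)."""
    parser = argparse.ArgumentParser(description="Build the container image")
    parser.add_argument(
        "--runtime",
        choices=["docker", "podman"],
        default="podman",  # assumed default, not taken from the source
        help="container runtime used for the build",
    )
    parser.add_argument(
        "--no-save",
        action="store_true",
        help="build only; skip saving (and compressing) the image",
    )
    return parser.parse_args(argv)
```

For example, `parse_args(["--runtime", "docker", "--no-save"])` would select Docker and skip the save step.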

deeplow · 2024-01-03T16:03:37Z

Rebased on top of main to get that lint fix and cherry-picked commits made by @apyrgio to fix the security scanning issues.

apyrgio

This is amazing work. I'm stoked for this integration! Feel free to merge.

deeplow requested a review from apyrgio November 16, 2023 16:54

deeplow commented Nov 21, 2023

dangerzone/conversion/pixels_to_pdf.py (resolved)

This was referenced Nov 24, 2023

On-host pixels to PDF conversion #625

Closed

Containers Page Streaming based on PyMuPDF #627

Merged

apyrgio reviewed Nov 30, 2023

Dockerfile (outdated, resolved)

Dockerfile (resolved)

apyrgio reviewed Nov 30, 2023

dangerzone/conversion/doc_to_pixels.py (outdated, resolved)

dangerzone/conversion/doc_to_pixels.py (resolved)

dangerzone/conversion/doc_to_pixels.py (outdated, resolved)

dangerzone/conversion/doc_to_pixels.py (outdated, resolved)

apyrgio reviewed Nov 30, 2023

pyproject.toml (outdated, resolved)

apyrgio reviewed Dec 4, 2023

dangerzone/conversion/doc_to_pixels.py (resolved)

deeplow added a commit that referenced this pull request Dec 5, 2023

Remove all server-side timeouts from doc to pixels

f3c4d06

We're now using client-side timeouts, so the server-side ones are not needed. Implemented following the suggestion from @apyrgio [1]. [1]: #622 (comment)
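As a generic illustration of a client-side timeout (not Dangerzone's actual implementation), the caller can bound how long it waits for the conversion process instead of the process policing itself. The helper name is hypothetical.

```python
import subprocess
import sys


def run_with_timeout(cmd: list[str], timeout_s: float) -> bool:
    """Return True if cmd finished in time, False if the client gave up on it."""
    try:
        subprocess.run(cmd, timeout=timeout_s, check=False)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child process before re-raising,
        # so nothing is left running on the other side.
        return False


# Example: a quick command finishes, a slow one is cut off by the caller.
quick = run_with_timeout([sys.executable, "-c", "pass"], 30)
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 0.5)
```

The design point is that the timeout policy lives entirely with the caller, so the process being run needs no timeout logic of its own.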

apyrgio mentioned this pull request Dec 5, 2023

Defense in Depth - Traceless Sanitization #633

Open

deeplow added a commit that referenced this pull request Dec 5, 2023

FIXUP Remove side-effects from pymupdf alpine install

1aebf39

Implemented per suggestion [1]. [1]: #622 (comment)

apyrgio mentioned this pull request Dec 5, 2023

Sandbox all document processing in gVisor #590

Merged

deeplow mentioned this pull request Dec 12, 2023

Solve most congestions issues in /tmp (client & Server) #619

Closed

deeplow force-pushed the 616-main-pymupdf branch from 46c9f38 to fecf022 Compare December 15, 2023 15:48

apyrgio mentioned this pull request Dec 18, 2023

PyMuPDF integration #658

Closed

apyrgio reviewed Dec 18, 2023

install/common/build-image.py (outdated, resolved)

deeplow added a commit that referenced this pull request Dec 19, 2023

FIXUP: explicitly use pymupdf pages generator

b09bb18

See discussion in [1]. [1]: #622 (comment)

deeplow force-pushed the 616-main-pymupdf branch 4 times, most recently from 0525059 to a936545 Compare December 19, 2023 18:54

deeplow force-pushed the 616-main-pymupdf branch 2 times, most recently from f80c90b to 456bf0c Compare December 19, 2023 19:03

deeplow and others added 18 commits January 3, 2024 12:58

Add PyMuPDF to dev env in Qubes

a3a6488

Since PyMuPDF is now used in Pixels to PDF we needed to add it to the qubes development environment.

Replace 'convert' with PyMuPDF for images

e5dbe25

PyMuPDF can also convert images of the types we already support so we don't need ImageMagick's 'convert'.

Remove all server-side timeouts from doc to pixels

b75417b

We're now using client-side timeouts, so the server-side ones are not needed. Implemented following the suggestion from @apyrgio [1]. [1]: #622 (comment)

Bump pymupdf version 1.23.7

2b08291

The build was failing due to missing kernel libraries. Adding the linux-headers dependency solves the issue.

Bump pymupdf to 1.23.8

1cd87f7

Multi-stage Dockerfile build

e0b0926

Breaks down the container build into multiple stages in order to speed up build times. Building PyMuPDF was taking too long, and this way it can be cached. The original version was made by @apyrgio.

Remove pre-pymupdf exceptions and detect pymupdf ones

80db7bb

Add poetry as CI container build dependency

773fcfa

Because the new build-image.py uses `poetry export`, we need to explicitly install poetry in the CI before building the container image.

FIXUP Revert "Disable image compression when saving PDF"

e253127

This reverts commit f074db0.

Compress per page when not using OCR

f1d90c6

Make the compression happen per page when OCR is not enabled [1]. [1]: #622 (comment)

ci: Use Docker for building images, instead of Podman

7e21d5e

deeplow force-pushed the 616-main-pymupdf branch from 153493e to 7e21d5e Compare January 3, 2024 15:58

apyrgio approved these changes Jan 3, 2024

deeplow force-pushed the 616-main-pymupdf branch from a45189e to c8bea70 Compare January 4, 2024 09:31

Replace MIT license with AGPLv3

f27296c

License change required due to the inclusion of the AGPL-licensed PyMuPDF. This library greatly benefited Dangerzone in many aspects detailed in [1]. Fixes #658 [1]: #658

deeplow force-pushed the 616-main-pymupdf branch from c8bea70 to f27296c Compare January 4, 2024 09:58

deeplow merged commit f27296c into main Jan 4, 2024
12 of 13 checks passed

deeplow added a commit that referenced this pull request Jan 4, 2024

Compress per page when not using OCR

ce616ea

Make the compression happen per page when OCR is not enabled [1]. [1]: #622 (comment)

eloquence added this to the 0.6.0 milestone Jan 11, 2024

deeplow mentioned this pull request Jan 25, 2024

Removing Timeouts #687

Closed
