Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use PyMuPDF to solve most congestions issues in /tmp (client & Server) #622

Merged
merged 23 commits into from
Jan 4, 2024

Conversation

deeplow
Copy link
Contributor

@deeplow deeplow commented Nov 16, 2023

Alternative to #619 but with PyMuPDF. This implements changes in the doc to pixels as well as the pixels to pdf conversion.

Missing

  • assess the impact of PyMuPDF (performance, security, PDF rendering impact) (see results in PyMuPDF integration #658)
  • change Dangerzone license due to PyMuPDF being AGPL

@deeplow deeplow requested a review from apyrgio November 16, 2023 16:54
Dockerfile Outdated Show resolved Hide resolved
Dockerfile Show resolved Hide resolved
dangerzone/conversion/doc_to_pixels.py Outdated Show resolved Hide resolved
dangerzone/conversion/doc_to_pixels.py Show resolved Hide resolved
dangerzone/conversion/doc_to_pixels.py Outdated Show resolved Hide resolved
dangerzone/conversion/doc_to_pixels.py Outdated Show resolved Hide resolved
dangerzone/conversion/pixels_to_pdf.py Show resolved Hide resolved
dangerzone/conversion/pixels_to_pdf.py Outdated Show resolved Hide resolved
dangerzone/conversion/pixels_to_pdf.py Outdated Show resolved Hide resolved
dangerzone/conversion/pixels_to_pdf.py Outdated Show resolved Hide resolved
dangerzone/conversion/pixels_to_pdf.py Outdated Show resolved Hide resolved
dangerzone/conversion/pixels_to_pdf.py Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
@apyrgio
Copy link
Contributor

apyrgio commented Dec 4, 2023

While running the tests, I stumbled on another issue:

ImportError while loading conftest '/home/user/dangerzone/tests/conftest.py'.
tests/__init__.py:8: in <module>
    from dangerzone.document import SAFE_EXTENSION
dangerzone/__init__.py:16: in <module>
    from .gui import gui_main as main
dangerzone/gui/__init__.py:28: in <module>
    from ..isolation_provider.qubes import Qubes, is_qubes_native_conversion
dangerzone/isolation_provider/qubes.py:15: in <module>
    from ..conversion.pixels_to_pdf import PixelsToPDF
dangerzone/conversion/pixels_to_pdf.py:16: in <module>
    import fitz
E   ModuleNotFoundError: No module named 'fitz'

We should conditionally import the Qubes isolation provider, so that we don't always import the conversion module, and thus fitz. This will also help with any other Python dependency that exists in the container, but not in the host.

deeplow added a commit that referenced this pull request Dec 5, 2023
Now we're using client-side timeouts so the server side-ones are not
needed. Implemented following the suggestion from @apyrgio [1].

[1]: #622 (comment)
deeplow added a commit that referenced this pull request Dec 5, 2023
@apyrgio apyrgio mentioned this pull request Dec 18, 2023
@deeplow
Copy link
Contributor Author

deeplow commented Dec 18, 2023

We should conditionally import the Qubes isolation provider, so that we don't always import the conversion module, and thus fitz. This will also help with any other Python dependency that exists in the container, but not in the host.

@apyrgio not sure I follow your exact suggestion. Do you mean importing it closer to where the code is called?
Another option would be conditionally import fitz.

@apyrgio
Copy link
Contributor

apyrgio commented Dec 18, 2023

@apyrgio not sure I follow your exact suggestion. Do you mean importing it closer to where the code is called?
Another option would be conditionally import fitz.

I don't have a strong preference, since on-host conversion will ultimately solve this. Whatever requires the least amount of changes to revert, would be my suggestion. And yes, , my original thought was moving the import statement closer to where the Qubes isolation provider is used.

deeplow added a commit that referenced this pull request Dec 19, 2023
deeplow added a commit that referenced this pull request Dec 19, 2023
Solves issues like these:

	ImportError while loading conftest '/home/user/dangerzone/tests/conftest.py'.
	tests/__init__.py:8: in <module>
	    from dangerzone.document import SAFE_EXTENSION
	dangerzone/__init__.py:16: in <module>
	    from .gui import gui_main as main
	dangerzone/gui/__init__.py:28: in <module>
	    from ..isolation_provider.qubes import Qubes, is_qubes_native_conversion
	dangerzone/isolation_provider/qubes.py:15: in <module>
	    from ..conversion.pixels_to_pdf import PixelsToPDF
	dangerzone/conversion/pixels_to_pdf.py:16: in <module>
	    import fitz
	E   ModuleNotFoundError: No module named 'fitz'

For context see discussion in [1].

[1]: #622 (comment)
@deeplow
Copy link
Contributor Author

deeplow commented Dec 19, 2023

I don't have a strong preference, since on-host conversion will ultimately solve this. Whatever requires the least amount of changes to revert, would be my suggestion. And yes, , my original thought was moving the import statement closer to where the Qubes isolation provider is used.

Let's see if 085411f fixes it.

@deeplow deeplow force-pushed the 616-main-pymupdf branch 4 times, most recently from 0525059 to a936545 Compare December 19, 2023 18:54
deeplow added a commit that referenced this pull request Dec 19, 2023
Solves issues like these:

	ImportError while loading conftest '/home/user/dangerzone/tests/conftest.py'.
	tests/__init__.py:8: in <module>
	    from dangerzone.document import SAFE_EXTENSION
	dangerzone/__init__.py:16: in <module>
	    from .gui import gui_main as main
	dangerzone/gui/__init__.py:28: in <module>
	    from ..isolation_provider.qubes import Qubes, is_qubes_native_conversion
	dangerzone/isolation_provider/qubes.py:15: in <module>
	    from ..conversion.pixels_to_pdf import PixelsToPDF
	dangerzone/conversion/pixels_to_pdf.py:16: in <module>
	    import fitz
	E   ModuleNotFoundError: No module named 'fitz'

For context see discussion in [1].

[1]: #622 (comment)
@deeplow deeplow force-pushed the 616-main-pymupdf branch 2 times, most recently from f80c90b to 456bf0c Compare December 19, 2023 19:03
deeplow and others added 18 commits January 3, 2024 12:58
Since PyMuPDF is now used in Pixels to PDF we needed to add it to the
qubes development environment.
PyMuPDF can also convert images of the types we already support so we
don't need ImageMagick's 'convert'.
The original document was larger in dimensions than the original one due
to a mismatch in DPI settings. When converting documents to pixels we
were setting the DPI to 150 pixels per inch. Then when converting back
into a PDF we were using 70 DPI. This difference would result in an
overall larger document in dimensions (though not necessarily in file
size).

Fixes #626
Now we're using client-side timeouts so the server side-ones are not
needed. Implemented following the suggestion from @apyrgio [1].

[1]: #622 (comment)
We're intentionally bypassing PEP 668 [1], which prevents the
installation of non-distro python wheels alongside system packages to
avoid incompatibilities at distro-level.

We are intentionally bypassing this since our container image is a
controlled environment (we only ship a version after rigorous testing).

[1]: https://peps.python.org/pep-0668/
Ensure that when the container image is installing pymupdf (unavailable
in the repos) with verified hashes. To do so, it has the pymupdf
dependency declared in a "container" group in `pyproject.toml`, which
then gets exported into a requirements.txt, which is then used for
hash-verification when building the container.

Because this required modifying the container image build scripts, they
were all merged to avoid duplicate code. This was an overdue change
anyways.
The build was failing due to a missing kernel libraries. Adding the
linux-headers dependency solves the issue.
Breaks down the container build into multiple stages in order to speed
up build times. Building PyMuPDF was taking too long and this way it can
be cached.

The original version was made by @apyrgio
Due to the new build-image.py, which now uses `poetry export` we need to
explicitly install poetry in the CI before building the container image.
Qubes does on-host pixels-to-pdf whereas the containers version doesn't.
This leads to an issue where on the containers version it tries to load
fitz, which isn't installed there, just because it's trying to check if
it should run the Qubes version.

The error it was showing was something like this:

    ImportError while loading conftest '/home/user/dangerzone/tests/conftest.py'.
        tests/__init__.py:8: in <module>
            from dangerzone.document import SAFE_EXTENSION
        dangerzone/__init__.py:16: in <module>
            from .gui import gui_main as main
        dangerzone/gui/__init__.py:28: in <module>
            from ..isolation_provider.qubes import Qubes, is_qubes_native_conversion
        dangerzone/isolation_provider/qubes.py:15: in <module>
            from ..conversion.pixels_to_pdf import PixelsToPDF
        dangerzone/conversion/pixels_to_pdf.py:16: in <module>
            import fitz
        E   ModuleNotFoundError: No module named 'fitz'

For context see discussion in [1].

[1]: #622 (comment)
Some tests [1] lead to the conclusion that ocr_compression does the same
to the file (performance and size-wise) to the file as deflating images
when saving the file. However, both methods active do add a bit of extra
time. For this reason we're disabling the image deflation (default
option).

[1]: #622 (comment)
PyMuPDF replaced the need for almost all dependencies, which this commit
now removes.

We are also removing tesseract-ocr as a dependency since
(to our surprise) PyMuPDF ships directly with tesseract binaries [1].
However, now that tesseract-ocr is not available directly as a binary
tool, the `test_ocr.py` needed to be changed.

Fixes #658

[1]: #658 (comment)
Make the compression happen per page when OCR is not enabled [1].

[1]: #622 (comment)
Add the following functionality to the build image script:

1. Let the user choose the container runtime of their choice. In some
   systems, both Docker and Podman may be available, so we need to let
   the user choose which runtime they want.
2. Let users choose if they want to save the image. For non-production
   builds, we may want to simply build the container image, without
   the time penalty of compression.
@deeplow
Copy link
Contributor Author

deeplow commented Jan 3, 2024

Rebased on top of main to get that lint fix and cherry-picked commits made by @apyrgio to fix the security scanning issues.

Copy link
Contributor

@apyrgio apyrgio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is amazing work. I'm stoked for this integration! Feel free to merge.

License change required due to the inclusion of the AGPL-licensed
PyMuPDF. This library greatly benefited Dangerzone in many aspects
detailed in [1].

Fixes #658

[1]: #658
@deeplow deeplow merged commit f27296c into main Jan 4, 2024
12 of 13 checks passed
deeplow added a commit that referenced this pull request Jan 4, 2024
Make the compression happen per page when OCR is not enabled [1].

[1]: #622 (comment)
deeplow added a commit that referenced this pull request Jan 9, 2024
Some tests [1] lead to the conclusion that ocr_compression does the same
to the file (performance and size-wise) to the file as deflating images
when saving the file. However, both methods active do add a bit of extra
time. For this reason we're disabling the image deflation (default
option).

[1]: #622 (comment)
deeplow added a commit that referenced this pull request Jan 9, 2024
Some tests [1] lead to the conclusion that ocr_compression does the same
to the file (performance and size-wise) to the file as deflating images
when saving the file. However, both methods active do add a bit of extra
time. For this reason we're disabling the image deflation (default
option).

[1]: #622 (comment)
@eloquence eloquence added this to the 0.6.0 milestone Jan 11, 2024
@deeplow deeplow mentioned this pull request Jan 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants