-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use PyMuPDF to solve most congestions issues in /tmp (client & Server) #622
Conversation
While running the tests, I stumbled on another issue:
We should conditionally import the Qubes isolation provider, so that we don't always import the |
Now we're using client-side timeouts so the server side-ones are not needed. Implemented following the suggestion from @apyrgio [1]. [1]: #622 (comment)
Implemented per suggestion [1]. [1]: #622 (comment)
46c9f38
to
fecf022
Compare
@apyrgio not sure I follow your exact suggestion. Do you mean importing it closer to where the code is called? |
I don't have a strong preference, since on-host conversion will ultimately solve this. Whatever requires the least amount of changes to revert, would be my suggestion. And yes, , my original thought was moving the import statement closer to where the Qubes isolation provider is used. |
See discussion in [1]. [1]: #622 (comment)
Solves issues like these: ImportError while loading conftest '/home/user/dangerzone/tests/conftest.py'. tests/__init__.py:8: in <module> from dangerzone.document import SAFE_EXTENSION dangerzone/__init__.py:16: in <module> from .gui import gui_main as main dangerzone/gui/__init__.py:28: in <module> from ..isolation_provider.qubes import Qubes, is_qubes_native_conversion dangerzone/isolation_provider/qubes.py:15: in <module> from ..conversion.pixels_to_pdf import PixelsToPDF dangerzone/conversion/pixels_to_pdf.py:16: in <module> import fitz E ModuleNotFoundError: No module named 'fitz' For context see discussion in [1]. [1]: #622 (comment)
Let's see if 085411f fixes it. |
0525059
to
a936545
Compare
Solves issues like these: ImportError while loading conftest '/home/user/dangerzone/tests/conftest.py'. tests/__init__.py:8: in <module> from dangerzone.document import SAFE_EXTENSION dangerzone/__init__.py:16: in <module> from .gui import gui_main as main dangerzone/gui/__init__.py:28: in <module> from ..isolation_provider.qubes import Qubes, is_qubes_native_conversion dangerzone/isolation_provider/qubes.py:15: in <module> from ..conversion.pixels_to_pdf import PixelsToPDF dangerzone/conversion/pixels_to_pdf.py:16: in <module> import fitz E ModuleNotFoundError: No module named 'fitz' For context see discussion in [1]. [1]: #622 (comment)
f80c90b
to
456bf0c
Compare
Since PyMuPDF is now used in Pixels to PDF we needed to add it to the qubes development environment.
PyMuPDF can also convert images of the types we already support so we don't need ImageMagick's 'convert'.
The original document was larger in dimensions than the original one due to a mismatch in DPI settings. When converting documents to pixels we were setting the DPI to 150 pixels per inch. Then when converting back into a PDF we were using 70 DPI. This difference would result in an overall larger document in dimensions (though not necessarily in file size). Fixes #626
Now we're using client-side timeouts so the server side-ones are not needed. Implemented following the suggestion from @apyrgio [1]. [1]: #622 (comment)
We're intentionally bypassing PEP 668 [1], which prevents the installation of non-distro python wheels alongside system packages to avoid incompatibilities at distro-level. We are intentionally bypassing this since our container image is a controlled environment (we only ship a version after rigorous testing). [1]: https://peps.python.org/pep-0668/
Ensure that when the container image is installing pymupdf (unavailable in the repos) with verified hashes. To do so, it has the pymupdf dependency declared in a "container" group in `pyproject.toml`, which then gets exported into a requirements.txt, which is then used for hash-verification when building the container. Because this required modifying the container image build scripts, they were all merged to avoid duplicate code. This was an overdue change anyways.
The build was failing due to a missing kernel libraries. Adding the linux-headers dependency solves the issue.
Breaks down the container build into multiple stages in order to speed up build times. Building PyMuPDF was taking too long and this way it can be cached. The original version was made by @apyrgio
Due to the new build-image.py, which now uses `poetry export` we need to explicitly install poetry in the CI before building the container image.
Qubes does on-host pixels-to-pdf whereas the containers version doesn't. This leads to an issue where on the containers version it tries to load fitz, which isn't installed there, just because it's trying to check if it should run the Qubes version. The error it was showing was something like this: ImportError while loading conftest '/home/user/dangerzone/tests/conftest.py'. tests/__init__.py:8: in <module> from dangerzone.document import SAFE_EXTENSION dangerzone/__init__.py:16: in <module> from .gui import gui_main as main dangerzone/gui/__init__.py:28: in <module> from ..isolation_provider.qubes import Qubes, is_qubes_native_conversion dangerzone/isolation_provider/qubes.py:15: in <module> from ..conversion.pixels_to_pdf import PixelsToPDF dangerzone/conversion/pixels_to_pdf.py:16: in <module> import fitz E ModuleNotFoundError: No module named 'fitz' For context see discussion in [1]. [1]: #622 (comment)
Some tests [1] lead to the conclusion that ocr_compression does the same to the file (performance and size-wise) to the file as deflating images when saving the file. However, both methods active do add a bit of extra time. For this reason we're disabling the image deflation (default option). [1]: #622 (comment)
PyMuPDF replaced the need for almost all dependencies, which this commit now removes. We are also removing tesseract-ocr as a dependency since (to our surprise) PyMuPDF ships directly with tesseract binaries [1]. However, now that tesseract-ocr is not available directly as a binary tool, the `test_ocr.py` needed to be changed. Fixes #658 [1]: #658 (comment)
This reverts commit f074db0.
Make the compression happen per page when OCR is not enabled [1]. [1]: #622 (comment)
Add the following functionality to the build image script: 1. Let the user choose the container runtime of their choice. In some systems, both Docker and Podman may be available, so we need to let the user choose which runtime they want. 2. Let users choose if they want to save the image. For non-production builds, we may want to simply build the container image, without the time penalty of compression.
153493e
to
7e21d5e
Compare
Rebased on top of main to get that lint fix and cherry-picked commits made by @apyrgio to fix the security scanning issues. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is amazing work. I'm stoked for this integration! Feel free to merge.
a45189e
to
c8bea70
Compare
c8bea70
to
f27296c
Compare
Make the compression happen per page when OCR is not enabled [1]. [1]: #622 (comment)
Some tests [1] lead to the conclusion that ocr_compression does the same to the file (performance and size-wise) to the file as deflating images when saving the file. However, both methods active do add a bit of extra time. For this reason we're disabling the image deflation (default option). [1]: #622 (comment)
Some tests [1] lead to the conclusion that ocr_compression does the same to the file (performance and size-wise) to the file as deflating images when saving the file. However, both methods active do add a bit of extra time. For this reason we're disabling the image deflation (default option). [1]: #622 (comment)
Alternative to #619 but with PyMuPDF. This implements changes in the doc to pixels as well as the pixels to pdf conversion.
Missing