Solve most congestion issues in /tmp (client & server) #619
Conversation
PDFtoPPM was producing RGB files faster than they were getting consumed. Since the RGB files were only removed after they were sent, /tmp on the server was getting clogged. This solution consists of processing and sending images in chunks of 50 pages. It is slightly inefficient, since it can't process and send data simultaneously; that will be solved in a future commit. Fixes #574
Previously, the batched PDFtoPPM conversion would stop after converting each batch in order to send the data. By instead sending the data in the following loop iteration, we can send one batch at the "same time" as the next batch is being converted (see the sketch below).
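A minimal sketch of this batch-and-overlap idea (not the actual PR code; the 50-page batch size, the send_page() hook, and the output paths are illustrative assumptions):

import subprocess

BATCH_SIZE = 50  # assumed batch size, matching the commit message above

def convert_and_send(pdf_path, num_pages, out_root, send_page):
    prev = None
    for first in range(1, num_pages + 1, BATCH_SIZE):
        last = min(first + BATCH_SIZE - 1, num_pages)
        # Start rendering this batch without blocking; -f/-l limit pdftoppm
        # to the requested page range.
        proc = subprocess.Popen(
            ["pdftoppm", "-r", "150", "-f", str(first), "-l", str(last),
             pdf_path, out_root]
        )
        if prev is not None:
            # Ship the previous batch while the current one is rendering.
            for page in range(prev[0], prev[1] + 1):
                send_page(page)
        proc.wait()
        prev = (first, last)
    if prev is not None:
        for page in range(prev[0], prev[1] + 1):
            send_page(page)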
Storing all the RGB files on the host was leading to a fast-filling `/tmp`. This solution converts the RGB files to PNGs (which are compressed), saving valuable space in the process. The conversion is done with the Pillow (PIL) module, without the need for any external dependencies. Fixes #526
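As a rough sketch of that Pillow conversion (assuming the page-<N>.rgb/.width/.height layout used in this PR; the helper name is made up), the raw RGB buffer can be prefixed with a P6 PPM header so Pillow can decode it and write a compressed PNG:

import io
from PIL import Image

def rgb_to_png(rgb_path, width, height, png_path):
    with open(rgb_path, "rb") as f:
        untrusted_pixels = f.read()
    # P6 header: binary RGB, given dimensions, 8 bits per channel.
    ppm_header = f"P6\n{width} {height}\n255\n".encode()
    ppm_data = io.BytesIO(ppm_header + untrusted_pixels)
    # Pillow decodes the PPM and re-encodes it as a (compressed) PNG.
    Image.open(ppm_data).save(png_path, "PNG")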
An overdue dead-code removal was finally done, combined with merging part of the size (and width/height) checks shared by the container and Qubes isolation providers.
For large (1200+ page) PDFs the PDFunite command would fail on some systems (e.g. Qubes), because it would be called with 1024+ files and hit the open-file limit (`ulimit -n`). This solution splits the merging into batches, accumulating the results in a single PDF and then merging it with the next batch.
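A sketch of that accumulating merge (the batch size and file handling are assumptions, not the exact PR code):

import shutil
import subprocess

def batch_iterator(num_pages, batch_size=50):
    # Yield (first_page, last_page) ranges of at most batch_size pages.
    for first in range(1, num_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, num_pages)

def merge_in_batches(page_pdf_paths, accumulator):
    num_pages = len(page_pdf_paths)
    for first, last in batch_iterator(num_pages):
        batch = page_pdf_paths[first - 1 : last]
        if first == 1:
            # First batch: merge directly into the accumulator PDF.
            subprocess.run(["pdfunite", *batch, accumulator], check=True)
        else:
            # Later batches: merge the accumulator with the next batch, so
            # pdfunite never has more than ~batch_size + 1 files open.
            tmp = accumulator + ".tmp"
            subprocess.run(["pdfunite", accumulator, *batch, tmp], check=True)
            shutil.move(tmp, accumulator)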
num_pages = len(glob.glob(f"{pixel_dir}/page-*.rgb"))
self.verify_received_pixel_files(pixel_dir, num_pages)

for page in range(1, num_pages + 1):
    filename_base = f"{pixel_dir}/page-{page}"
    rgb_filename = f"{filename_base}.rgb"
    width_filename = f"{filename_base}.width"
    height_filename = f"{filename_base}.height"
    with open(width_filename) as f:
        width = int(f.read().strip())
    with open(height_filename) as f:
        height = int(f.read().strip())
    with open(rgb_filename, "rb") as rgb_f:
        untrusted_pixels = rgb_f.read()
    self.convert_pixels_to_png(
        str(pixel_dir), page, width, height, rgb_data=untrusted_pixels
    )

    os.remove(rgb_filename)
    os.remove(width_filename)
    os.remove(height_filename)
All of this should be moved to after checking if there was a non-zero exit code.
Just to make sure why this code is necessary in the first place:
- The current containers version, which does not stream the data to the host yet, has to write all the pixels to the temporary directory. This is the worst-case scenario, and converting them to PNG after the fact does not help much.
- The streaming containers version (post-#443, "Containers: have progress streamed instead of via mounted volumes (and deprecate doc_to_pixels_qubes_wrapper.py)") will piggyback on the Qubes code that converts the RGB images to PNG on the fly, which has already been added in this PR.
Still, the code seems to be necessary because it makes pixels-to-pdf accept only PNG images, not RGB, thereby simplifying its interface. Is this your rationale as well?
Still, the code seems to be necessary because it makes pixels-to-pdf accept only PNG images, not RGB, thereby simplifying its interface. Is this your rationale as well?
Yes, that's the logic here. And it's not worth leaving the pixels-to-pdf code more complex to make #412 easier.
I haven't finished reviewing the code, but I was thinking that we can dramatically improve the server-side batching part with PyMuPDF. Consider this snippet:

import sys, fitz  # import the bindings

fname = sys.argv[1]  # get filename from command line
doc = fitz.open(fname)  # open document
for page in doc:  # iterate through the pages
    pix = page.get_pixmap(dpi=150)  # render page to an RGB image
    rgb_buf = pix.samples_mv  # get a memory view over the pixel data
    with open("mu-%i.rgb" % page.number, "wb+") as f:
        f.write(rgb_buf)
    print(f"Page {page.number}: W: {pix.width}, H: {pix.height}")

This snippet renders each page to raw RGB pixels and reports its width and height.

For the container case, we can simply iterate all the pages and write the data to the filesystem. For the Qubes case though, we can write the data immediately to stdout, without buffering it locally. This means that we don't need any batch logic, and Unix pipes become the means of congestion control. Another improvement this snippet brings is that it can produce RGB files roughly 2x faster than the current pdftoppm-based conversion.

Alas, there's a caveat: PyMuPDF is an AGPL project, and Artifex takes its GPL violations seriously. So, what do we have to concede here to make our code faster, simpler, and more secure? My understanding is that we need to license the code under the AGPL as well. Thoughts on this?

EDIT: Alternatively, we can use the liberally licensed
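For the Qubes case, a hedged sketch of writing straight to stdout (the width/height framing below is an illustrative assumption, not the actual Dangerzone wire protocol):

import sys

import fitz  # PyMuPDF

doc = fitz.open(sys.argv[1])
out = sys.stdout.buffer
for page in doc:
    pix = page.get_pixmap(dpi=150)  # render the page to raw RGB pixels
    # Emit a simple "width height" header followed by the pixel data. The
    # Unix pipe blocks when the reader falls behind, which acts as the
    # congestion control described above; no temporary files are needed.
    out.write(f"{pix.width} {pix.height}\n".encode())
    out.write(pix.samples)
    out.flush()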
# return True, True
try:
    Image.open(ppm_data).save(png_path, "PNG")
I had tackled something similar in my 2023-10-test-enospc branch. You don't need to go through the PPM -> PNG conversion, nor create a BytesIO object. You can convert it from the existing buffer with:

image = Image.frombuffer("RGB", (width, height), untrusted_pixels, "raw", "RGB", 0, 1)
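For illustration, convert_pixels_to_png (whose call site appears in the diff above) could then look roughly like this, assuming the same arguments:

from PIL import Image

def convert_pixels_to_png(pixel_dir, page, width, height, rgb_data):
    png_path = f"{pixel_dir}/page-{page}.png"
    # Wrap the raw RGB buffer directly; no PPM header or BytesIO needed.
    image = Image.frombuffer("RGB", (width, height), rgb_data, "raw", "RGB", 0, 1)
    image.save(png_path, "PNG")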
Ah, nice! I thought there would be some golden nugget like this, but I had difficulties finding that branch of yours. Thanks! I'll try to add it.
Thanks for the feedback. That's certainly an interesting proposal. I did consider some of those PDF parsing libraries, but ultimately thought they wouldn't add much benefit due to their disadvantages.

Still, the speed increase and Python-native bindings are certainly appealing. Regarding speed, the current performance seems fast enough to be acceptable, and there are slower steps in the pipeline (e.g. OCR), so I don't think it's too critical at the moment. I am unsure if the benefits outweigh the risks. We could have a sync discussion if you have the availability.
As you see, I have proposed some radical detours from the batching implementation, which use native Python libraries instead. I think at the end of the day, this will lead to a simpler, faster, and more secure implementation, which is worth doing now that we are at the start of the release cycle. I'd like to hear your opinion as well though.
    ),
    timeout=timeout,
)
for first_page, last_page in batch_iterator(num_pages):
Question: what do we gain by batching the pdfunite operations?

At this point, we already have all the pages as PDF files on disk, and that is the main space hog, right? I get that with a single pdfunite call we need double the space, but that's a requirement for the conversion step as well, which takes the original PDF and creates a compressed one, and they both co-exist on disk.
I have another suggestion here: using PyPDF2 instead, and performing the compression online:

from pypdf import PdfReader, PdfWriter

reader = PdfReader("example.pdf")
writer = PdfWriter()

for page in reader.pages:
    writer.add_page(page)

for page in writer.pages:
    # ⚠️ This has to be done on the writer, not the reader!
    page.compress_content_streams()  # This is CPU intensive!

with open("out.pdf", "wb") as f:
    writer.write(f)

(snippet taken from the docs)

With PyPDF2, we can open each PDF, compress it, and immediately create the output PDF, without pdfunite and ps2pdf. We can even delete the underlying PDFs as we process them. The good thing about PyPDF2 is that it's available in the Fedora repos as well, so it will work on Qubes.
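A rough sketch of that idea (paths and helper name are illustrative, untested): append each per-page PDF to a single writer, delete the inputs as they are consumed, and compress before writing out:

import os

from pypdf import PdfReader, PdfWriter

def merge_and_compress(page_pdf_paths, out_path):
    writer = PdfWriter()
    for path in page_pdf_paths:
        reader = PdfReader(path)
        for page in reader.pages:
            writer.add_page(page)
        os.remove(path)  # drop the single-page PDF once its pages are added
    for page in writer.pages:
        # Per the pypdf docs, compression must be done on the writer's pages.
        page.compress_content_streams()  # CPU intensive
    with open(out_path, "wb") as f:
        writer.write(f)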
Question: what do we gain by batching the pdfunite operations?
It's in the commit message, so it's not size-related but rather ulimit-related. We could change the ulimit, but since I had already done the batch logic, I figured it could be reused here.
Oh snap. I didn't expect that Qubes had such a low ulimit (nor that pdfunite opens the files at the same time, jeez...).
I have now stalled progress on this PR, since I'll be evaluating how well PyMuPDF works for us, especially regarding the need to change the code license (or at least part of it).
Closing, potentially in favor of #622
In more resource-limited systems (such as Qubes OS), converting 1000+ page documents would not be possible due to several bottlenecks.
This PR addresses several of these bottlenecks (see commits for details).
Fixes #616 #625 #574
Regressions introduced
The containers version is slightly slower, stopping at the end of each batch for a brief moment. I don't know yet what's causing the wait, but this will be solved with #443, once containers stream pages.
Remaining bottlenecks
Currently, converting a 1299-page document in Qubes still fails. This is due to the compression step (the reason why this is still a draft). The compression creates additional files in /tmp which fill it up.