Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding tests for rect extractions #36

Merged
merged 7 commits into from
Feb 21, 2024
Merged

Conversation

kreuzberger
Copy link
Contributor

Added tests for rect extraction for sphinx-simplepdf / weasyprint generated pdf.
Tests checks for textbox extraction from codeblocks, admonitions and tables.

The tests for table did not work as expected. Instead of extracting colored table cells as rect, the 3 table row shown with alternating colors is extracted as whole.
Attached is a picture from visual debug.

The tests now works, asuming the "wrong" number of rects (i would expect 7). See attached file.
All other extractions work like expected.
libpdf_rect_from_table

@kreuzberger
Copy link
Contributor Author

kreuzberger commented Jan 24, 2024

The failing test has nothing to do with the test implemenation, the test is missing a required executable!
This now explains why i had to patch /etc/ImageMagic Policies on my ubuntu machine.!

This seems to be a feature of the visual debug. A hint in the doc would help.

@ubmarco
Copy link
Member

ubmarco commented Feb 3, 2024

Thanks a lot for your PR, I really appreciate new tests for the library.

I cannot push to your branch as the fork is created on your organization, not on your personal account.
So I added a commit to your branch and created a new PR from it to see the changes in CI #37.
The ruff linting is non-voting for now, but I want to enable it over time for more and more files. Once a file is touched I will add it to the files in the tox.ini lint environment.

@@ -31,6 +31,11 @@
# test PDFs from official python documentation
PDF_PYTHON_LOGGING = os.path.join(os.path.dirname(__file__), "pdf", "howto-logging.pdf")

# test PDF for rect extraction generateby by sphinx-simplepdf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# test PDF for rect extraction generateby by sphinx-simplepdf
# test PDF for rect extraction generated by sphinx-simplepdf

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not urgent, but why not fix it when spotted in a review

tests/test_rects.py Outdated Show resolved Hide resolved
Comment on lines 222 to 223
visual_debug_output_dir=tmpdir.join("visual_debug_dir"),
visual_split_elements=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these 2 should not be needed if you set visual_debug=False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but just want to change visual_debug to True without chanching the later ones.
Could be set by using a variable?

vs_debug = False

visual_debug = vs_debug
visual_split_elements = not vs_debug
visual_debug_output = tmpdir.join("visual_debug_dir") if vs_debug else ""

Or just leave it 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still don't understand this. Why would you provide fields that have no meaning in this context?
image
The libpdf run will not populate this directory nor split elements, so why passing the params. Maybe I misunderstand your use case here. 🤔

@ubmarco
Copy link
Member

ubmarco commented Feb 3, 2024

PR #37 fails as expected. I propose to cherry-pick my commits into your branch and fix the mentioned issues from my review.

@ubmarco
Copy link
Member

ubmarco commented Feb 3, 2024

For the rect count in your PDF, this is how the PDF is made, the header row actually has 3 rectangles while the row is just one. If you zoom in extremely, you also see it, e.g. here in Firefox's pdf.js based reader:
image
If you look closely, you see bright vertical lines in the header, but not in the row.

@kreuzberger kreuzberger requested a review from ubmarco February 7, 2024 15:32
Copy link
Member

@ubmarco ubmarco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

almost there

@@ -31,6 +31,11 @@
# test PDFs from official python documentation
PDF_PYTHON_LOGGING = os.path.join(os.path.dirname(__file__), "pdf", "howto-logging.pdf")

# test PDF for rect extraction generateby by sphinx-simplepdf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not urgent, but why not fix it when spotted in a review

Comment on lines 222 to 223
visual_debug_output_dir=tmpdir.join("visual_debug_dir"),
visual_split_elements=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still don't understand this. Why would you provide fields that have no meaning in this context?
image
The libpdf run will not populate this directory nor split elements, so why passing the params. Maybe I misunderstand your use case here. 🤔

@kreuzberger
Copy link
Contributor Author

almost there
If i set visual_debug to True, the debugging is configured. This is my main intention.

@ubmarco ubmarco merged commit 29fe5a3 into useblocks:master Feb 21, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants