Text is incorrectly extracted from highlights #200

Open
dstillman opened this issue Dec 12, 2024 · 6 comments

@dstillman
Member

dstillman added this to the 1.0 milestone Dec 17, 2024
@Dima-Android
Collaborator

The issue is reproducible with the PDF document provided by the user, and it is also present in PSPDFKit's Catalog app.
We reported it to PSPDFKit and sent them the PDF and a video recording.

@Dima-Android
Collaborator

PSPDFKit's reply was as follows:

"Upon looking into the text highlighting behavior you reported, I've found that this issue appears to be specific to your PDF document. I've tested the same document in other PDF viewers, including Adobe, and observed the same highlighting behavior.

I'll continue to investigate this further and will update you if we find any additional insights that could help improve the highlighting accuracy for your document."

@dstillman
Member Author

I'm a bit confused, because copying the text of that paragraph works fine, in Zotero and (more or less) in all other readers I tested. Why is highlighting different?

@Dima-Android
Collaborator

A new update from PSPDFKit:

After investigating the text selection and highlighting behavior in your PDF, we've determined that this is actually related to how the text positions are reported within the PDF file itself. The observed offset in text selection is a direct result of the PDF's internal structure and content positioning. This behavior is consistent with Adobe's PDF viewer on desktop as well, which exhibits the same text selection characteristics.

Rest assured that we continuously work to improve our text selection algorithms, even through addressing these specific edge cases. Therefore, I have raised your request with our Product Team as a feature request. While we cannot guarantee implementation or provide specific timelines, please rest assured that we carefully consider all suggestions we receive.

@dstillman
Member Author

That doesn't really address my question, though. Text selection is fine. Highlighting is not. I believe that it's a problem with the PDF, but is there a reason text selection works but highlighting doesn't?

@dstillman
Member Author

dstillman commented Jan 11, 2025

And I'm not seeing the same problems with highlighting in Acrobat Reader (on desktop or iOS), so I'm not sure what they're referring to there. Are they looking at the sample extracted text from the forums thread, with all the duplicated text? That's what we're referring to.
