ligature question / annotation preservation when converting to html #861

DookTibs · 2021-01-22T16:54:22Z

Hi - I'm new to PyMuPDF and having some trouble that I think is related to ligatures.

What I'm trying to do:

- convert a pdf to html
- search for and mark up certain words and phrases within particular sentences in the generated html

But certain pdfs are giving me difficulty. For instance, see this tiny screenshot of a sample pdf I'm working with:

In this sentence, the words "polyfluoroalkyl" and "fluoroether" have what I think is called a ligature - the "fl" portion does not appear to be two separate characters (for instance, if I use my mouse to highlight letters in those words, the "f" and the "l" cannot be highlighted independently).

I found an issue where you discuss something related:
#745

and so I'm explicitly passing in flags=fitz.TEXT_PRESERVE_WHITESPACE+fitz.TEXT_PRESERVE_IMAGES in my call to page.getText, so as to not preserve ligatures. The converted html that PyMuPDF produces for this sentence then looks like this (added some extra whitespace between spans to make it more readable [edit - and attaching as an image as I can't get the html to not render in this comment]):

The "fl" is getting wrapped in a span with a slightly different font. And when viewing this html in either a text editor or a web browser, the "f" and "l" are now separate characters.

So the generated html looks great - basically indistinguishable from the source pdf. But if I wanted to, for example, search for "polyfluoroalkyl", it's complicated by that extra span thrown in there. (For this specific example of course I could get around it, but I need to handle many such cases so I'm trying to figure out a general solution.) Is there a way to get words like "polyfluoroalkyl" to not be split over multiple spans in the generated text?

(And similarly the word "BACKGROUND" at the start of the sentence is wrapped in two separate spans, to account for the larger "B" at the start of the word. That's going to also cause me potential issues when searching)

What PyMuPDF is doing here makes a lot of sense to me, but I'm wondering if there's a general approach that I'm missing that might be better for my particular goal.

Or as an alternate approach, I briefly experimented with annotating the pdf instead of in the html...so using page.getTextWords(), getting the Rect from the match, and addHighlightAnnot on that. That worked well and I could find words like polyfluoroalkyl, but then when converting to html those annotations were not carried through. Is there a way to preserve pdf annotations when converting to another format?
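The word-level approach described above can be sketched roughly as follows. This is only an illustration, not PyMuPDF's own code: it assumes the word tuples have the layout `(x0, y0, x1, y1, word, block_no, line_no, word_no)` that `page.getTextWords()` returns, `find_word_rects` is a hypothetical helper name, and the sample tuples are made up so the snippet runs without a PDF.

```python
import string


def find_word_rects(words, target):
    """Return the (x0, y0, x1, y1) boxes of all words equal to `target`.

    Comparison ignores case and any punctuation attached to the word ends.
    """
    wanted = target.lower()
    hits = []
    for x0, y0, x1, y1, text, *_ in words:
        if text.strip(string.punctuation).lower() == wanted:
            hits.append((x0, y0, x1, y1))
    return hits


# Made-up tuples standing in for the output of page.getTextWords().
sample_words = [
    (72.0, 100.0, 98.0, 112.0, "Per-", 0, 0, 0),
    (101.0, 100.0, 120.0, 112.0, "and", 0, 0, 1),
    (123.0, 100.0, 205.0, 112.0, "polyfluoroalkyl", 0, 0, 2),
    (208.0, 100.0, 270.0, 112.0, "substances,", 0, 0, 3),
]

print(find_word_rects(sample_words, "polyfluoroalkyl"))
```

With a real page, each returned box would be what gets wrapped in a Rect and handed to `addHighlightAnnot`, as described in the paragraph above. Because the match works on whole words, it is not affected by how the word is split into spans.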

JorjMcKie · 2021-01-22T17:58:53Z

JorjMcKie
Jan 22, 2021
Maintainer

Uff - a long question!
Preliminary:

(X)HTML and XML output directly uses underlying MuPDF features which PyMuPDF just wraps. For this reason, there is no way to influence what is happening there - except for the flags, which you already found out.
All other text output options are genuine PyMuPDF code (using only the most basic MuPDF access methods).
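As a rough illustration of what the other text output options make possible: the "dict" output groups text into blocks, lines and spans, so span texts can be joined per line before searching. The sketch below is an assumption-laden example, not a recommended implementation: `line_texts` and `find_phrase` are hypothetical helper names, and the sample dictionary is hand-built in the shape of `page.getText("dict")` (called `get_text` in newer releases) so the code runs without a PDF.

```python
def line_texts(page_dict):
    """Yield (text, bbox) for every text line, joining all spans of the line."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # type 0 = text block; skip image blocks
            continue
        for line in block.get("lines", []):
            text = "".join(span["text"] for span in line["spans"])
            yield text, tuple(line["bbox"])


def find_phrase(page_dict, phrase):
    """Return (line_text, line_bbox) for each line containing `phrase`."""
    needle = phrase.lower()
    return [(t, b) for t, b in line_texts(page_dict) if needle in t.lower()]


# Hand-built sample mimicking page.getText("dict"); all values are made up.
sample_page = {
    "blocks": [
        {"type": 1, "bbox": (0, 0, 50, 50)},  # an image block, ignored
        {
            "type": 0,
            "lines": [
                {
                    "bbox": (72.0, 100.0, 400.0, 114.0),
                    "spans": [
                        {"text": "B", "size": 12.0},
                        {"text": "ACKGROUND: Per- and poly", "size": 9.0},
                        {"text": "fl", "size": 9.0},
                        {"text": "uoroalkyl substances", "size": 9.0},
                    ],
                }
            ],
        },
    ]
}

print(find_phrase(sample_page, "polyfluoroalkyl"))
```

This only finds phrases that sit on a single line; a phrase wrapped or hyphenated across two lines would need the line texts to be joined as well.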

As per the text marker annotations:

- No, to my knowledge there is no similar thing in HTML documents.
- But you can "permanently" highlight / underline / ... stuff as well, using genuine PyMuPDF features:
  - e.g. use page.drawRect(...), page.drawLine(...), etc., potentially with option overlay=False, or some opacity.
  - this should then be visible in the HTML output
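A minimal sketch of the "permanent highlight" idea, under some assumptions: `draw_permanent_highlights` is a hypothetical helper, the boxes are plain `(x0, y0, x1, y1)` tuples, and the method is looked up as `draw_rect` first (the name in newer releases) with `drawRect` as the fallback. Whether the drawn rectangles actually survive into the HTML output is not verified here.

```python
def draw_permanent_highlights(page, boxes, color=(1, 1, 0)):
    """Draw a filled rectangle behind the page content for every box.

    `page` is expected to be a PyMuPDF page; `boxes` are (x0, y0, x1, y1)
    tuples. Returns the number of rectangles drawn.
    """
    # Newer PyMuPDF versions use draw_rect, older ones drawRect.
    draw = getattr(page, "draw_rect", None) or getattr(page, "drawRect")
    count = 0
    for box in boxes:
        # overlay=False puts the rectangle underneath the existing text.
        draw(box, color=color, fill=color, overlay=False)
        count += 1
    return count
```

Usage would look like `draw_permanent_highlights(page, [(123, 100, 205, 112)])` on an opened page, followed by saving the document. If the page has an opaque background, a rectangle drawn with overlay=False may be hidden behind it, in which case drawing on top with some opacity, as mentioned above, is the alternative.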

1 reply

DookTibs Jan 22, 2021
Author

Thanks for the suggestions and explanation, much appreciated. I've been experimenting with the "permanent highlight" option but I've had pretty different results...using page.drawRect and page.newShape I'm able to get "stuff" drawn onto the pdf, but it does not appear in the generated html. Just as a sanity check, I am able to use page.insertText and page.insertImage and I see those modifications in both the pdf and the generated html.

Based on your comment it certainly sounds like drawRect should persist into the generated html, but it just isn't for me. I'll see if I can pull together a barebones example showing what I mean.
