Extracting comments, highlights according to the colors #820

dummifiedme · 2021-01-10T16:00:48Z

dummifiedme
Jan 10, 2021

I had raised an issue #819. Would like to continue the discussion here, as @JorjMcKie suggested :)

So the idea is to extract the comments, highlights or any other annotations (or the text under them) from a document.

I am thinking of creating atomized notes directly from a PDF. I am aiming to create a system for myself where I can read a PDF, annotate in the PDF itself, take short notes or maybe questions (for Active Recall), and then, when I am done reading, extract them all into a file. But since I am interested in atomic notes, I need to have them in separate files, all linked from a single file.

The reason for having atomized notes is twofold: I get to use the same fact/point in multiple places (I am using Obsidian), and I can have them inserted directly into Anki.

Currently, with the help of @JorjMcKie, I am able to understand a few things about comments, which I had planned to use as the main tool for capturing my notes (irrespective of the device I am on). But I think it can be expanded to produce more richly formatted notes by using coloured annotations and even the highlights (the text under them, as I came to understand :) )

So, here are the next questions I have:

How to extract the text under a highlight?
How to identify which colour that annotation is?
Side question (unrelated to PyMuPDF): I don't know much about regex, but I know what it can be used for. In my case, I wanted to grab the first line of a comment's text as the title of the note file I intend to create for each annotation. So, if someone can help me with it, I will be thankful.

Let's have a discussion on how I can proceed, or whether it's even a good idea to go for this kind of system.

JorjMcKie · 2021-01-10T16:09:59Z

JorjMcKie
Jan 10, 2021
Maintainer

To continue on the text extraction topic: we had been touching on Popup annotations.

They do not occur in the page.annots() iterator. Instead, they are an optional extra property of several other annotation types.
If an annot has a popup, then annot.has_popup is true, and popup-related annot properties have a meaning (an own xref, for example).
A popup annot has no text of its own. If a user entered text in the popup, then that text is reflected in the parent annot as annot.info["content"]. The only popup-specific info units are xref, rect and the parent's property is_open. There are also the popup properties author and date, which I have not yet bothered to reflect.

0 replies

JorjMcKie · 2021-01-10T16:12:35Z

JorjMcKie
Jan 10, 2021
Maintainer

So, @dummifiedme - you can check the popup existence for every annotation type (via has_popup), except PDF_ANNOT_TEXT (which is its own popup if you like), and if true take annot.info["content"] as the text to extract.
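
The rule above can be sketched as a small helper. This is only an illustration under assumptions: `comment_of` is a hypothetical name, and the function relies on nothing more than the annotation exposing `type` (a pair such as `(8, "Highlight")`), `has_popup` and `info`, as described in this thread. The `SimpleNamespace` objects are stand-ins for real annotations so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def comment_of(annot):
    """Return the user-entered comment of an annotation, or "" if there is none."""
    content = annot.info.get("content", "") or ""
    is_text_annot = annot.type[1] == "Text"  # a Text annot is "its own popup"
    if is_text_annot or annot.has_popup:
        return content
    return ""


# Stand-ins for real annotations (assumption: same attribute names as above).
note = SimpleNamespace(type=(0, "Text"), has_popup=False,
                       info={"content": "Why?\nCheck chapter 2"})
marked = SimpleNamespace(type=(8, "Highlight"), has_popup=True,
                         info={"content": "key definition"})
bare = SimpleNamespace(type=(8, "Highlight"), has_popup=False,
                       info={"content": ""})

for annot in (note, marked, bare):
    print(repr(comment_of(annot)))
```

With real documents, the same function would presumably be called inside the `for annot in page.annots():` loop.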

2 replies

dummifiedme Jan 10, 2021
Author

Yes tried and implemented it in my code. I used it in an if condition and it worked well :)
I am not finalizing things at this moment though.

dummifiedme Jan 10, 2021
Author

Now, should I grab code from the discussion on highlight extraction? Or is there a direct function for it too now?

JorjMcKie · 2021-01-10T17:01:00Z

JorjMcKie
Jan 10, 2021
Maintainer

Let me help out:
For annot types PDF_ANNOT_HIGHLIGHT, PDF_ANNOT_UNDERLINE, PDF_ANNOT_SQUIGGLY, PDF_ANNOT_STRIKE_OUT ("text marker annotations"), you most appropriately should extract the text of the PDF page that is being marked - not any of the properties of the annotations.
Use the annotation's rectangle annot.rect to extract page text: text = page.getTextbox(annot.rect).

1 reply

dummifiedme Jan 10, 2021
Author

Yes, I went through this example here.

I tried to copy out snippets and update the code you had previously provided.

import fitz
import os
import re

def make_text(words):
    """Return textstring output of getText("words").
    Word items are sorted for reading sequence left to right,
    top to bottom.
    """
    line_dict = {}  # key: vertical coordinate, value: list of words
    words.sort(key=lambda w: w[0])  # sort by horizontal coordinate
    for w in words:  # fill the line dictionary
        y1 = round(w[3], 1)  # bottom of a word: don't be too picky!
        word = w[4]  # the text of the word
        line = line_dict.get(y1, [])  # read current line content
        line.append(word)  # append new word
        line_dict[y1] = line  # write back to dict
    lines = list(line_dict.items())
    lines.sort()  # sort vertically
    return "\n".join([" ".join(line[1]) for line in lines])


filelist = os.listdir()  # a folder where the PDFs live
textfile = open(
    "collected-comments.md", "w"
)  # a simple text output for the extracted comments
for filename in filelist:  # loop thru the PDFs
    if not filename.endswith(".pdf"):
        continue  # exclude files that are no PDFs
    doc = fitz.open(filename)
    textfile.write("Comments in PDF '%s'.\n" % doc.name)
    for page in doc:  # loop thru pages of current PDF
        words = page.getText("words")
        textfile.write("Comments on page %i:\n" % page.number)
        for annot in page.annots(
            types=[fitz.PDF_ANNOT_HIGHLIGHT]
        ):  # loop thru highlight annots
            rect = annot.rect
            print(rect)
            mywords = [w for w in words if fitz.Rect(w[:4]).intersects(rect)]
            print(mywords)
            textfile.write(make_text(mywords))
            # if annot.has_popup:
            # #text2=_extract_annot(annot, )
            #     text = annot.info ["content"]  # extract the text
            #     textfile.write("\n\"%s\"\n" %text)
            # other annot information could be extracted
            
    doc.close()
    
textfile.close()

I get no output if I use "fully contained" option mywords = [w for w in words if fitz.Rect(w[:4]) in rect].
While with "intersection", I get an output which also includes texts from non-highlighted words in the line above.

Comments in PDF 'test.pdf'.
Comments on page 0:
is a test PDF document.
Adobe Acrobat Reader installedcan read this,

The test file is here:
test.pdf

JorjMcKie · 2021-01-10T17:29:44Z

JorjMcKie
Jan 10, 2021
Maintainer

You could have used page.getTextbox() as I wrote. But anyway:
Whether a word counts as contained or not depends on the character bbox height you assume.
You have the option to assume either (1) the full line height or (2) a shrunken height, which delivers a height equal to the font size.

The first is the default. To set the second, use fitz.TOOLS.set_small_glyph_heights(True) before extracting text. This should solve your problem.
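
A different way to look at the same problem, not taken from this thread, is to keep words by how much of their box lies inside the highlight rectangle, instead of requiring "intersects" or "fully contained". The sketch below is a plain-Python illustration under assumptions: `overlap_ratio` and `words_under` are hypothetical helpers, the 0.5 threshold is arbitrary, and the coordinates are made up. Word items follow the `(x0, y0, x1, y1, text, ...)` layout of `getText("words")`.

```python
def overlap_ratio(word_box, rect):
    """Fraction of the word box area that lies inside rect."""
    x0, y0, x1, y1 = word_box
    rx0, ry0, rx1, ry1 = rect
    width = min(x1, rx1) - max(x0, rx0)
    height = min(y1, ry1) - max(y0, ry0)
    if width <= 0 or height <= 0:
        return 0.0  # the boxes do not overlap at all
    area = (x1 - x0) * (y1 - y0)
    return (width * height) / area if area > 0 else 0.0


def words_under(words, rect, min_ratio=0.5):
    """Keep the words that are mostly covered by rect."""
    return [w for w in words if overlap_ratio(w[:4], rect) >= min_ratio]


# Made-up example: the highlight is less tall than the word boxes.
highlight = (100, 102, 300, 114)
words = [
    (100, 86, 140, 104, "above"),    # line above, dips 2 points into the highlight
    (100, 100, 150, 116, "marked"),  # taller than the highlight, so not "contained"
    (155, 100, 200, 116, "text"),
    (320, 100, 360, 116, "outside"),
]

print([w[4] for w in words_under(words, highlight)])  # ['marked', 'text']
```

Here "above" would be picked up by a pure intersection test, while "marked" and "text" would be dropped by a strict containment test; the ratio keeps only the two marked words.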

3 replies

dummifiedme Jan 10, 2021
Author

I tried using page.getTextbox() as well, it isn't able to capture anything.

EDIT: Why does it fail? I tried a few times
EDIT: I tried something stupid as well. Tried to multiply rect by 5. Still not capturing anything.

dummifiedme Jan 10, 2021
Author

The first is the default. To set the second, use fitz.TOOLS.set_small_glyph_heights(True) before extracting text. This should solve your problem.

Yes, it did solve the problem. Thanks!!

How can I decide on the sequence it prints first?

dummifiedme Jan 10, 2021
Author

PS. I do feel, I am asking a lot, to an extent of being spoon fed. I am happy to be learning this tool, but do stop me if you think I should try before asking. :)

JorjMcKie · 2021-01-10T18:59:03Z

JorjMcKie
Jan 10, 2021
Maintainer

How can I decide on the sequence it prints first?

Best memorize the coordinates where you found the text and store them together with the text in an intermediate list. When through with the page, sort that list to your liking and write the text to its destination.

5 replies

JorjMcKie Jan 10, 2021
Maintainer

The reason for this complication is that everything is extracted from a document in the sequence in which it was stored (by the document creator). Because of locale differences, you never know what a "natural" reading sequence would be.
Our Western top-left to bottom-right order cannot be imposed on some Asian locales, or think of Arabic top-right to bottom-left.

So one has no choice other than sorting stuff as is desired / adequate.

dummifiedme Jan 10, 2021
Author

If I want to use right to left and top to bottom by default, is there a way to direct the program to sort all the annots that way? Or do we have to store them first, then sort them, and then print them to a file?

JorjMcKie Jan 10, 2021
Maintainer

The latter I am afraid. You know how things may go: you store a few annotations, later, when having more information, another annot must be stored at the top of the page ...
So yes, to be sure use an intermediate list with enough position information and sort the stuff when everything has been extracted.
After all, sorting in Python is among the best of all (scripting) languages, in terms of both performance and flexibility.
Suppose you have stored the extracted text as a list of triples mylist = [(x, y, text), ...]; then your sort could be a single statement:

mylist.sort(key=lambda item: (item[1], -item[0]))

This sorts by ascending vertical coordinate, then by descending horizontal coordinate: because of the negative sign in front of the x values, they will be sorted from right to left.

dummifiedme Jan 10, 2021
Author

This seems to go beyond me. I think I should first try to learn python a bit more, then some best practices, and then dig into anything. For the basic level implementation, I have something at the moment. I wanted to go a bit more customized, but I don't see that happening with the current level of my understanding of programming in general 😞

Thanks for all the help anyway :)

JorjMcKie Jan 10, 2021
Maintainer

This seems to go beyond me.

I left out a few steps, sorry.

No matter what type of annot you dealt with, you also get the coordinates of the rectangle inside which the text was sitting.
Choose one of the 4 rect corners to represent the rect position, e.g. the bottom-left one. Call its coordinates (x, y).
Append the triple (x, y, text) to the intermediate list that you later want to sort.
When done with extracting from that page, sort mylist - as you requested, first by vertical (ascending), then by horizontal (descending) coordinate.
The sort method of lists accepts a function under the argument key, which has to return a sort key. Only one key is allowed, so we hand over the tuple (y, -x) as the key and sort in ascending order - you follow? Descending x is equivalent to ascending -x.
We use Python's lambda expression feature to define our sort-key function: in quasi-mathematical formulation, the function would be f(x, y, text) = (y, -x). This is what you saw: sort(key=lambda item: (item[1], -item[0])). Doing so saves us from defining an extra Python function via

def f(item):
    return (item[1], -item[0])
# and then
mylist.sort(key=f)

no magic at all.
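
Put together, the steps above might look like the following sketch. It uses made-up rectangles as plain `(x0, y0, x1, y1)` tuples instead of real annotation rects, so the names `found` and `ordered` and all coordinates are assumptions for illustration only. Since y grows downwards on a page, the bottom-left corner is `(x0, y1)`.

```python
# (rect, text) pairs in the order in which they were found on the page
found = [
    ((300, 500, 400, 512), "third"),
    ((300, 100, 400, 112), "first"),
    ((50, 100, 150, 112), "second"),
]

mylist = []
for rect, text in found:
    x, y = rect[0], rect[3]  # bottom-left corner represents the rect position
    mylist.append((x, y, text))

# top to bottom (ascending y), then right to left (descending x)
mylist.sort(key=lambda item: (item[1], -item[0]))

ordered = [text for _, _, text in mylist]
print(ordered)  # ['first', 'second', 'third']
```

Switching to left-to-right order would only require dropping the minus sign in the key.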

dummifiedme · 2021-01-10T22:03:08Z

dummifiedme
Jan 10, 2021
Author

One quick question.
I had added some images inside the "TEXT" type annotations, are they accessible by any chance? The code breaks whenever it hits an image, I think. How can I include those images?

7 replies

JorjMcKie Jan 11, 2021
Maintainer

please send me the PDF if possible.

I don't know how it does it, but it does.

I haven't been quite precise in my post: annotations usually consist of more than one PDF object. Among the objects that depend on the annotation are the so-called "appearance" object(s) (to be found under key /AP; there may be more than one, depending on the "selected" state of the annot).
The PDF creator generally is free to do anything it likes inside such an /AP - it also could cause an image to be displayed, among a myriad of other things.
But that would be outside the standard, and be unsupported by MuPDF, and also PyMuPDF. If that was done, then this image would not be extractable ... well at least not by the current features.

dummifiedme Jan 11, 2021
Author

Ah, I see. Then maybe it's doing the same. I created a sample file to show you the image inside the note.
Sample File

JorjMcKie Jan 11, 2021
Maintainer

As suspected: it is a highlight annotation with an /AP that displays a Form XObject (similar to an image).
This is outside the standard and could only be extracted with special code.

dummifiedme Jan 11, 2021
Author

So, I should just stop this practice of adding images into an annotation. And rather have another way to have images in the notes, if at all :)

JorjMcKie Jan 11, 2021
Maintainer

Yes, you are right. Of course you could write an extension for PyMuPDF to cope with this type of situation ... but that would have to be written in C, at least for the better part 🙄.

dummifiedme · 2021-01-11T10:43:25Z

dummifiedme
Jan 11, 2021
Author

I guess if I just drop the image somewhere in the PDF, it can be picked up. Also, if I draw a rectangle around a region in the PDF, could the image under it be extracted, and even the text under it (if any, not OCR)?

@JorjMcKie

2 replies

JorjMcKie Jan 11, 2021
Maintainer

If you mean a rectangle annotation by "drawing a rectangle": yes to both.
It all depends on your code / script: if you want to handle encountered "Square" annotations this way, then nothing would keep you from doing so.

dummifiedme Jan 11, 2021
Author

Okay. I will keep trying to add features to my basic script

JorjMcKie · 2021-01-11T14:36:20Z

JorjMcKie
Jan 11, 2021
Maintainer

As for your regex point:
As far as I have understood your intentions, you do not need it. There is no problem identifying the first line in some text. Just use text.splitlines() and take the first item of the resulting list.
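
A minimal sketch of that suggestion, with a hypothetical helper name `split_title` and a made-up comment text; an empty comment is handled separately because `splitlines()` then returns an empty list:

```python
def split_title(comment):
    """Split a comment into (first line, remaining text)."""
    lines = comment.strip().splitlines()
    if not lines:
        return "", ""  # empty comment: no title, no body
    title = lines[0].strip()
    body = "\n".join(lines[1:]).strip()
    return title, body


title, body = split_title("What is an annotation?\nA PDF object.\nSee page 3.")
print(title)  # What is an annotation?
print(body)
```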

My personal position towards regex is more of the type: (1) avoid using it, (2) if you think you absolutely need it: think again. Or, as the Python documentation words it:
"In short, before turning to the re module, consider whether your problem can be solved with a faster and simpler string method."

1 reply

dummifiedme Jan 11, 2021
Author

Yes. I saw those functions recently, and they can be used to do what I need. Thanks for clarifying it for me.

I do have some other ideas as well about what else can be done right there in the PDF itself. For example, I can directly grab the content in a certain format and use it as I like in my notes. Some annotations could be my personal views or notes, which aren't supposed to go into my study section; some notes might be questions, some could be quotes, and others could be examples. Similarly for many other things.

I don't yet know if it's really a good thing to do, but for me right now, it's really great if I can read my book without having to leave it. I stay immersed, PDFs are easy to sync, notes can't get lost (unless through stupidity), and when I reread, my notes are there!

With the questions or the examples, I can automate it to go into my Anki system as well (the sync is already there). I use Obsidian, which has the linking plus a place to refine and compile my notes.

Right now I have a working system, all thanks to you, that creates a markdown file with all the notes in a single file. I did try to have all of them separately, but it created some problems, plus it was really unfruitful (not the idea, but the way I have taken notes in PDFs till now).

The main problem was having a question mark in the first line. Since the first line was supposed to be the file name for that note, the question mark throws an error. I can maybe do something about it, but only after learning about some of the functions that Python has. As I said, I need to explore.

Plus, question words like 'what', 'when', 'how' etc. could be left out of the title altogether (they make the title quite lengthy).
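
Both title problems can be handled with plain string methods. The following is a rough sketch under assumptions: `title_to_filename` is a hypothetical helper, the removed characters are the ones Windows does not allow in file names, and the list of question words beyond 'what', 'when' and 'how' is a guess at what "etc." might cover.

```python
# Characters that Windows rejects in file names (includes the question mark).
FORBIDDEN = '<>:"/\\|?*'
# Assumption: which words count as "question words" is a matter of taste.
QUESTION_WORDS = {"what", "when", "how", "why", "where", "who", "which"}


def title_to_filename(title, extension=".md"):
    """Turn the first line of a note into a file name."""
    cleaned = "".join(ch for ch in title if ch not in FORBIDDEN)
    words = [w for w in cleaned.split() if w.lower() not in QUESTION_WORDS]
    name = " ".join(words).strip(" .")
    return (name or "untitled") + extension


print(title_to_filename("How does spaced repetition work?"))
print(title_to_filename("???"))
```

Dropping the question words shortens the title but can leave fragments such as "does spaced repetition work", so the word list may need tuning.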

dummifiedme · 2021-01-12T11:40:13Z

dummifiedme
Jan 12, 2021
Author

With help from @JorjMcKie and #318, I now have a way to extract comments (text annotations) and highlighted text from a PDF.

I still would like to implement two more things:

1. Colour based organisation
2. Capture images under a "SQUARE"

As for point 1, as @JorjMcKie explained, I can see the color property of an annotation (it gives both stroke and fill colour), but I don't yet know how to classify the colours into categories (such as red, orange, yellow, blue etc). Not sure, but I think I can define a colour using maybe a dict? Still, I would like to have a range, such that light red to dark red would be "red" and, similarly, violet to dark blue would be "blue". How can I do that?

For point 2, I can see the type of all the annotations, but don't yet know how to capture the image under a rect bounded by it. Extracting the text under a box is understood (#318), but what about images? What if I just want to capture a screenshot of anything that is inside the "Square" or any other shape?

Also, can we capture the "ink" type annotations in image form? If point 2 is satisfied, we can even draw a square around our ink annots and get them inserted in the note! Seems awesome to me :p

7 replies

dummifiedme Jan 12, 2021
Author

Look more closely at the documentation (😉): there is a parameter that lets you exclude annotations from being included in the rendered page pixmap. So you can create this sophisticated pixmap: pix = page.getPixmap(clip=annot.rect, annots=False), which will make an image of only the page content under the annot and without the annot itself.

Yes, I did have a look at it. But since I am a student, for me it's better to keep the annotated stuff. But yes, that is a super useful feature when planning a neat set-up of notes. If it comes to a better version of my code at some point, I can maybe use it and provide the user with an option :p (Dreams, hah!)

@JorjMcKie, since you are here (awake I mean :)), I had a question.
How can I improve the resolution or the quality of the images I get from getPixmap()? I see that there is a function, but I don't know whether resolution is equivalent to quality, or how it relates to the size (dimensions) of the image.

Also, this might be a bit off topic (for PyMuPDF): if I intend to use the images inside a markdown file, which format should I choose? I would like to have a good quality picture at a reasonable size which doesn't distort too much depending on the zoom level (I mean, it would be nice to be able to zoom a little, not while storing, but after it's stored).

EDIT:
I read through the documentation and found a way to improve the resolution using "matrix" in getPixmap(). I still don't know how to tell whether my image really needs an improvement. Some boxes might be small and some might be big.

Should I use the area under the rect to decide on the improvement, or is there already some sort of convention for tackling this (in PyMuPDF or otherwise)?

JorjMcKie Jan 12, 2021
Maintainer

Annotation colors have 3 components: red, green, blue given as a triple (r, g, b). All three are floats in range 0<=float<=1.
White is (1,1,1) and black is (0,0,0). If all 3 are equal, some shade of gray is the result.
How to classify an arbitrary given triple ... I have no advice for that, except the obvious: the closer e.g. r is to 1 the redder, etc.
There is a "color database" in PyMuPDF with about 500 named color variations, however in integer version (R, G, B), where each item is an integer between 0 and 255 (transition between this and the floats version is simply multiplication / division by 255 and rounding appropriately). See fitz.utils.getColorList() or fitz.utils.getColorInfoList():

>>> fitz.utils.getColorList()[:20]
['ALICEBLUE', 'ANTIQUEWHITE', 'ANTIQUEWHITE1', 'ANTIQUEWHITE2', 'ANTIQUEWHITE3', 'ANTIQUEWHITE4', 'AQUAMARINE', 'AQUAMARINE1', 'AQUAMARINE2', 'AQUAMARINE3', 'AQUAMARINE4', 'AZURE', 'AZURE1', 'AZURE2', 'AZURE3', 'AZURE4', 'BEIGE', 'BISQUE', 'BISQUE1', 'BISQUE2']
>>> fitz.utils.getColorInfoList()[:20]
[('ALICEBLUE', 240, 248, 255), ('ANTIQUEWHITE', 250, 235, 215), ('ANTIQUEWHITE1', 255, 239, 219), ('ANTIQUEWHITE2', 238, 223, 204), ('ANTIQUEWHITE3', 205, 192, 176), ('ANTIQUEWHITE4', 139, 131, 120), ('AQUAMARINE', 127, 255, 212), ('AQUAMARINE1', 127, 255, 212), ('AQUAMARINE2', 118, 238, 198), ('AQUAMARINE3', 102, 205, 170), ('AQUAMARINE4', 69, 139, 116), ('AZURE', 240, 255, 255), ('AZURE1', 240, 255, 255), ('AZURE2', 224, 238, 238), ('AZURE3', 193, 205, 205), ('AZURE4', 131, 139, 139), ('BEIGE', 245, 245, 220), ('BISQUE', 255, 228, 196), ('BISQUE1', 255, 228, 196), ('BISQUE2', 238, 213, 183)]
>>>

JorjMcKie Jan 12, 2021
Maintainer

How can I improve the resolution or the quality of the images I get from the getPixmap().

There is the Pixmap parameter matrix. A fitz.Matrix is a mathematical 3 x 3 matrix providing a map between coordinate systems. It can be used to "zoom" into a page's pixmap like this: mat = fitz.Matrix(2, 2) defines a matrix that zooms by a factor of 2 in both dimensions x and y. The resulting pixmap pix = page.getPixmap(matrix=mat, ...) hence has 4 times as many pixels and is correspondingly more precise.
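
The relation between zoom, pixel dimensions and resolution can be worked out with a little arithmetic. PDF page coordinates are measured in points, 72 to the inch, so a zoom factor of 1 yields one pixel per point, i.e. 72 dpi. The sketch below is plain Python with hypothetical helper names (`zoom_for_dpi`, `pixel_size`, `zoom_for_min_width`); the cap of 4.0 and the sample sizes are arbitrary assumptions, and real pixmap sizes may differ by a pixel because of rounding.

```python
POINTS_PER_INCH = 72  # PDF page coordinates are measured in points


def zoom_for_dpi(dpi):
    """Zoom factor that renders a page at the given resolution."""
    return dpi / POINTS_PER_INCH


def pixel_size(width_pt, height_pt, zoom):
    """Approximate pixel dimensions of a rendered area."""
    return round(width_pt * zoom), round(height_pt * zoom)


def zoom_for_min_width(width_pt, min_width_px, max_zoom=4.0):
    """Pick a zoom so that small boxes still reach a minimum pixel width."""
    wanted = min_width_px / width_pt
    return max(1.0, min(max_zoom, wanted))


print(pixel_size(200, 100, 1))       # (200, 100)
print(pixel_size(200, 100, 2))       # (400, 200): 4 times as many pixels
print(zoom_for_dpi(144))             # 2.0
print(zoom_for_min_width(400, 800))  # 2.0
```

The chosen zoom would then be passed on as fitz.Matrix(zoom, zoom). A higher zoom gives more pixels for the same area of the page; whether the image looks larger or just sharper depends on the size at which the viewer displays it.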

dummifiedme Jan 12, 2021
Author

How can I improve the resolution or the quality of the images I get from the getPixmap().

There is the Pixmap parameter matrix. A fitz.Matrix is a mathematical 3 x 3 matrix providing a map between coordinate systems. It can be used to "zoom" into a page's pixmap like this: mat = fitz.Matrix(2, 2) defines a matrix that zooms by a factor of 2 in both dimensions x and y. The resulting pixmap pix = page.getPixmap(matrix=mat, ...) hence has 4 times as many pixels and is correspondingly more precise.

Yes, I implemented it. Thanks 👍🏼
I still am confused between 'dimension' and the 'resolution'. If I set a matrix, it would zoom or shrink the image. I would like to have better resolution at the same size of the image (the dimensions - height and width). As far as I understand, both the things are different or am I wrong?

I know it's maybe a general question, but if you could help, it would be great :)

dummifiedme Jan 12, 2021
Author

Annotation colors have 3 components: red, green, blue given as a triple (r, g, b). All three are floats in range 0<=float<=1.
White is (1,1,1) and black is (0,0,0). If all 3 are equal, some shade of gray is the result.
How to classify an arbitrary given triple ... I have no advice for that, except the obvious: the closer e.g. r is to 1 the redder, etc.
There is a "color database" in PyMuPDF with about 500 named color variations, however in integer version (R, G, B), where each item is an integer between 0 and 255 (transition between this and the floats version is simply multiplication / division by 255 and rounding appropriately). See fitz.utils.getColorList() or fitz.utils.getColorInfoList():

Got it. I had read the documentation about the colors. Your reply clears many doubts I had. Thanks.

I don't really mean, arbitrary grouping. Thing is, if I am on a device say iPad, I have a colour picker, there are shades of each colour. While if I change system, say on a PC, the colour might be a different shade of the same colour (the human accuracy will vary).

Though, I will try to find how that can be dealt with. The functionality that PyMuPDF already provides is amazing at the moment!
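
One possible way to deal with varying shades, not something PyMuPDF provides, is to compare hue instead of exact RGB values: converting the (r, g, b) floats to HSV with the standard library's `colorsys` puts light and dark variants of a colour at roughly the same hue angle. The sketch below is an assumption-laden illustration: `color_family` is a hypothetical helper, and the hue boundaries and the saturation threshold are arbitrary choices (picked so that violet through dark blue counts as "blue", as wished for earlier in this thread).

```python
import colorsys

# Upper hue bound in degrees and family name; boundaries are assumptions.
HUE_BANDS = [
    (15, "red"), (45, "orange"), (70, "yellow"), (170, "green"),
    (290, "blue"), (330, "purple"), (360, "red"),
]


def color_family(rgb, gray_saturation=0.15):
    """Map an (r, g, b) triple of floats in 0..1 to a coarse colour name."""
    r, g, b = rgb
    hue, saturation, value = colorsys.rgb_to_hsv(r, g, b)
    if saturation < gray_saturation:  # hardly any colour: black, white or gray
        if value < 0.2:
            return "black"
        return "white" if value > 0.8 else "gray"
    degrees = hue * 360
    for upper, name in HUE_BANDS:
        if degrees < upper:
            return name
    return "red"  # hue wrapped around to 360


for rgb in [(1, 0.6, 0.6), (0.6, 0, 0), (0.5, 0, 1), (0, 0, 0.5), (1, 1, 0)]:
    print(rgb, color_family(rgb))
```

Light red and dark red both come out as "red", and violet and dark blue both as "blue"; the bands would need adjusting to the palette of the annotation apps actually used.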

Thanks for this beautiful tool!
