Obtaining end page number for each bookmark in ToC #764

andrei-volkau · 2020-12-12T10:51:11Z

PyMuPDF provides a great .getToC() method, which returns the level, title, and start page number of each bookmark. However, I have not managed to come up with a reliable algorithm for obtaining the last page number of each bookmark in the ToC. I would appreciate a hint if possible.

Consider the following document as an example: test_doc.pdf

Chapter "1: Foundations" starts on page 11. The end page should be 32 since it is the last page in the chapter.

Another example is the "1.1 What is Law?" bookmark. The end page for it should be 5 since it is the last page before section "1.2 Roman law" begins.

Answered by JorjMcKie

JorjMcKie · 2020-12-12T12:11:48Z

Maintainer

I think I understand.
The one thing that makes a "reliable" algorithm a bit complex is that the items in the TOC need not point to page numbers in an ascending (or at least non-descending) sequence. In other words, toc[i][2] <= toc[i + 1][2] cannot be assumed to be true, although it is probable.
But how about this snippet:

>>> import fitz  # PyMuPDF
>>> doc = fitz.open("test_doc.pdf")  # the example document from the question
>>> toc = doc.getToC()  # list of [level, title, page] items; named get_toc() in newer PyMuPDF releases
>>> item = [i for i in toc if i[1].startswith("1.1 ")][0]  # find item whose end page is desired
>>> level = item[0]  # its level
>>> pno = item[2]  # its page number
>>> toclist = [i for i in toc if i[0] <= level and i[2] >= pno]  # list of bookmark candidates
>>> toclist.sort(key=lambda i: i[2])  # sort by page number to be sure
>>> idx = toclist.index(item)  # our item is part of that list
>>> toclist[idx + 1]  # its page number - 1 is the desired one
[2, '1.2 Roman Law', 15]

If item number idx + 1 exists, its page number is not lower than that of item, and that page number minus 1 is the desired end page. If it does not exist, take the last page of the document.
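The snippet above can be wrapped into a small helper that also covers the "no following item" case. This is only a sketch under stated assumptions, not part of PyMuPDF: the function name end_page, the sample TOC (the start page of "1.1", the second chapter, and the page count are all invented), and the max() guard for a following item that starts on the same page are my own additions.

```python
def end_page(toc, item, page_count):
    """Return the assumed last page (1-based) of the TOC entry `item`.

    `toc` is a list of [level, title, page, ...] entries, `page_count` is the
    number of pages in the document. Hypothetical helper, not a PyMuPDF API.
    """
    level, _, pno = item[:3]
    # Candidates: entries on the same or a higher level starting on or after our page.
    candidates = [i for i in toc if i[0] <= level and i[2] >= pno]
    # list.sort is stable, so entries on the same page keep their TOC order.
    candidates.sort(key=lambda i: i[2])
    idx = candidates.index(item)
    if idx + 1 < len(candidates):
        # Assumption: the following entry starts on a new page.
        # The max() guard keeps the result from dropping below the start page.
        return max(pno, candidates[idx + 1][2] - 1)
    return page_count  # no following entry: the item runs to the last page


# Invented sample data in the [level, title, page] shape.
toc = [
    [1, "1: Foundations", 11],
    [2, "1.1 What is Law?", 11],
    [2, "1.2 Roman Law", 15],
    [1, "2: Another chapter", 33],
]
page_count = 40

for entry in toc:
    print(entry[1], "->", end_page(toc, entry, page_count))
```

With a real document, toc would come from the TOC method and page_count from the document's page count. The helper only reads the first three fields of each entry, so a longer entry format should work as well.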

Of course, there are complications:

- One cannot be sure that TOC items always start on a new page: in the example above, we cannot know whether the answer really is 14 or 15.
- What else can go wrong?

1 reply

andrei-volkau Dec 17, 2020
Author

Many thanks for the code sample! It works as expected. My code was failing because a PDF might contain bookmarks that are not sorted by page number. You sort them to be sure, which solves the problem.
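To illustrate why the ordering matters, here is a small sketch that compares a lookup in plain TOC order with the sorted lookup from the answer. The naive variant is only a guess at what such code might look like (the original failing code is not shown in this thread), and the TOC data, function names, and page count are invented.

```python
# Invented TOC whose entries are not sorted by page number.
toc = [
    [1, "A", 1],
    [1, "C", 20],
    [1, "B", 10],
]
page_count = 30


def naive_end(toc, idx, page_count):
    """End page taken from the next same-or-higher-level entry in TOC order."""
    level = toc[idx][0]
    for nxt in toc[idx + 1:]:
        if nxt[0] <= level:
            return nxt[2] - 1
    return page_count


def sorted_end(toc, idx, page_count):
    """End page taken from the next candidate after sorting by page number."""
    item = toc[idx]
    candidates = sorted(
        (i for i in toc if i[0] <= item[0] and i[2] >= item[2]),
        key=lambda i: i[2],
    )
    pos = candidates.index(item)
    if pos + 1 < len(candidates):
        return candidates[pos + 1][2] - 1
    return page_count


print(naive_end(toc, 0, page_count))   # 19: misses "B", which starts on page 10
print(sorted_end(toc, 0, page_count))  # 9
print(naive_end(toc, 1, page_count))   # 9: lower than the start page of "C"
print(sorted_end(toc, 1, page_count))  # 30
```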
