open pdf warning #1153

zyc130130 · 2021-07-16T08:36:47Z

zyc130130
Jul 16, 2021

Hi,
I am trying to open a PDF and get its table of contents, but the screen shows the message "mupdf: invalid key in dict". What is happening?
By the way, the result of "get_toc" is correct.

Code used:
import fitz  # PyMuPDF

with open(file_name, "rb") as fr:
    pdf_bytes = fr.read()  # read the whole file into memory
doc = fitz.open(stream=pdf_bytes, filetype="pdf")
print(doc.get_toc())

screen output:
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict

The PDF is attached.
876912f8d1749473cefb41681c51fdbf.pdf

Looking forward to your reply.

JorjMcKie · 2021-07-16T11:26:48Z

JorjMcKie
Jul 16, 2021
Maintainer

Very simple answer: the PDF is damaged in various places.
Try to repair it, e.g. via `mutool clean -gggz file.pdf`, or simply live with the error messages.
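
For anyone who wants to apply the repair suggested above from a script, here is a minimal sketch. It assumes the `mutool` binary is installed and on the PATH. The helper names `clean_command` and `repair_pdf` and the `_clean` output suffix are hypothetical choices for illustration, not part of MuPDF or PyMuPDF.

```python
import subprocess
from pathlib import Path


def clean_command(src, dst):
    """Build the argument list for `mutool clean -gggz src dst`."""
    # -ggg: garbage-collect unused objects, compact the xref table,
    #       and merge duplicate objects
    # -z:   deflate uncompressed streams
    return ["mutool", "clean", "-gggz", str(src), str(dst)]


def repair_pdf(src):
    """Write a cleaned copy next to `src` and return its path."""
    src = Path(src)
    dst = src.with_name(src.stem + "_clean" + src.suffix)  # hypothetical naming
    # check=True raises CalledProcessError if mutool exits with an error
    subprocess.run(clean_command(src, dst), check=True)
    return dst
```

Writing to a separate output file keeps the damaged original untouched, so the two can be compared if the cleaned copy still triggers messages.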

0 replies

zyc130130 · 2021-07-19T02:33:12Z

zyc130130
Jul 19, 2021
Author

> Very simple answer: the PDF is damaged in various places.
> Try to repair it, e.g. via `mutool clean -gggz file.pdf`, or simply live with the error messages.

Thanks for your reply.
The PDF is broken, which is sad to hear.
I am now trying to get the table of contents and the index that some PDFs contain. I have used methods like "doc.get_toc", but I know that not every PDF has bookmarks, and even where they exist, bookmarks are not exactly equivalent to the table of contents. So I use image classification to process page images and locate the table of contents and index pages. Do you have any suggestions for this requirement, or a better way?

1 reply

JorjMcKie Jul 19, 2021
Maintainer

I don't understand: Bookmark that is not part of a table of contents? What is that?
And how can you locate a TOC via "image classification technology" that is not accessible via the PDF itself?
Are we talking about the same things at all?

Maybe you want to create a TOC?
Or insert links?

zyc130130 · 2021-07-19T08:49:10Z

zyc130130
Jul 19, 2021
Author

Yes, I want to create a TOC from the pages titled "contents". I want to find the pages titled "contents" or "index" and parse their text, to get a tree structure of the contents text like the bookmarks returned by doc.get_toc. I use image classification to determine which page numbers are the "contents" or "index" pages. What I mean is that the bookmarks are sometimes only a part of the "contents" page: for example, the bookmarks have only level 1, but the PDF's "contents" page has levels 2 and 3. I know the bookmarks were created by someone, who may not have added level 2.
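
To make the goal concrete, here is a rough sketch of turning the text lines of a "contents" page into entries shaped like the `[level, title, page]` items that `doc.get_toc()` returns. It assumes each entry sits on one line, optionally starts with a section number such as "1.2", and ends with a page number, possibly after dot leaders. The `ENTRY` pattern and `parse_contents_lines` are hypothetical names, and real contents pages will need more cases (multiple columns, wrapped titles, roman numerals).

```python
import re

# Optional section number, title, optional dot leaders, trailing page number.
ENTRY = re.compile(
    r"^\s*(?:(?P<num>\d+(?:\.\d+)*)\.?\s+)?"  # e.g. "1.2" or "3."
    r"(?P<title>.*?\S)"                       # shortest title that fits
    r"[\s.]+"                                 # spaces and/or dot leaders
    r"(?P<page>\d+)\s*$"                      # printed page number
)


def parse_contents_lines(lines):
    """Return [level, title, page] for every line that looks like an entry."""
    toc = []
    for line in lines:
        m = ENTRY.match(line)
        if not m:
            continue  # e.g. the "Contents" heading itself
        num = m.group("num")
        # Depth of the section number gives the level; unnumbered lines are level 1.
        level = num.count(".") + 1 if num else 1
        title = f"{num} {m.group('title')}" if num else m.group("title")
        toc.append([level, title, int(m.group("page"))])
    return toc


sample = [
    "Contents",
    "1 Introduction ........ 3",
    "1.1 Background . . . . 4",
    "1.1.1 Scope........5",
    "Index 120",
]
entries = parse_contents_lines(sample)
```

Note that the numbers parsed here are the printed page numbers, which often differ from the PDF's own page numbering by an offset (cover, preface, and so on), so they would still need to be mapped before being used as bookmarks.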

1 reply

JorjMcKie Jul 19, 2021
Maintainer

Ok. This is not an easy task, because you must extract and interpret text.
And a lot of detail problems may occur here: "Contents" pages may be more than one, you have to ignore header lines and footer lines, etc.
If you in addition want to add more detailed bookmarks (like level 2+), things are getting even more complicated.
It will probably be better to scan the text of all document pages and try to find text headers yourself ... in many cases they have a different font or are bold, etc. or are prefixed with paragraph and section numbers like "3.4.2 Section ...", and so on.
Good luck!
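
A minimal sketch of the header-scanning idea described above, under stated assumptions: it works on a dictionary shaped like the one `page.get_text("dict")` returns in PyMuPDF (blocks, lines, spans with "text", "size" and "flags"), treats the most common font size on the page as body text, and flags a line as a heading when it is clearly larger or entirely bold. The function name `find_headings` and the 0.5 pt threshold are hypothetical choices, and the heuristic will miss or misjudge many real layouts.

```python
import re
from collections import Counter

BOLD = 1 << 4  # bit 4 of a span's "flags" marks bold text in PyMuPDF
NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")  # e.g. "3.4.2 Section ..."


def find_headings(page_dict, page_number):
    """Return [level, title, page] for lines that look like headings."""
    lines = [
        line["spans"]
        for block in page_dict.get("blocks", [])
        for line in block.get("lines", [])  # image blocks have no "lines"
        if line["spans"]
    ]
    # Weight each font size by the number of characters set in it.
    sizes = Counter()
    for spans in lines:
        for s in spans:
            sizes[round(s["size"], 1)] += len(s["text"])
    if not sizes:
        return []
    body_size = sizes.most_common(1)[0][0]

    headings = []
    for spans in lines:
        text = "".join(s["text"] for s in spans).strip()
        if not text:
            continue
        larger = all(s["size"] > body_size + 0.5 for s in spans)
        bold = all(s["flags"] & BOLD for s in spans)
        if not (larger or bold):
            continue
        m = NUMBERED.match(text)
        # "3.4.2" has depth 3; unnumbered headings default to level 1.
        level = m.group(1).count(".") + 1 if m else 1
        headings.append([level, text, page_number])
    return headings


# Possible use with PyMuPDF (page.number is 0-based, TOC pages are 1-based):
#   for page in doc:
#       toc += find_headings(page.get_text("dict"), page.number + 1)
```

Estimating the body size per page is a simplification; computing it once over the whole document would be more stable for pages that contain only a title or a figure.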

zyc130130 · 2021-07-19T10:57:15Z

zyc130130
Jul 19, 2021
Author

Yes, as you said, this is a complicated and hard task. There are many conditions to consider, such as the number of columns. Using image classification and PyMuPDF, I can extract some tables of contents, but only for part of the PDF files, because PDF tables of contents are so diverse, and some detail problems make the result inaccurate. I have tried OCR, which helped to extend the range of PDF files I can extract from, but it was too slow and may also produce problems such as incorrect text. Thanks again for your reply. Best wishes to you!

0 replies
