Very simple answer: the PDF is damaged in various places.
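If the repeated console lines are a nuisance, one option is to silence them and inspect the stored messages instead. This is a minimal sketch, assuming the "mupdf: ..." lines are MuPDF diagnostics that PyMuPDF prints to stderr and also keeps in its warnings store. The helper names `summarize_messages` and `toc_without_console_noise` are hypothetical, chosen only for this illustration.

```python
from collections import Counter


def summarize_messages(text):
    """Collapse repeated diagnostic lines into (message, count) pairs."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return Counter(lines).most_common()


def toc_without_console_noise(pdf_bytes):
    """Open a PDF from bytes, return its TOC and a summary of MuPDF messages."""
    # PyMuPDF is imported here so that summarize_messages works without it.
    import fitz

    # Assumption: this switch stops the messages from being printed to stderr.
    fitz.TOOLS.mupdf_display_errors(False)
    # Clear messages left over from earlier operations.
    fitz.TOOLS.reset_mupdf_warnings()

    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    toc = doc.get_toc()
    # mupdf_warnings() returns the stored messages as one string.
    return toc, summarize_messages(fitz.TOOLS.mupdf_warnings())
```

Silencing the output does not repair the file; it only moves the messages from the screen into a value you can log or ignore.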
Thanks for your reply.
Yes, I want to create a TOC from the pages titled "Contents".
I want to find the pages titled "Contents" or "Index" and parse their text, to obtain a tree structure of the contents text similar to the bookmarks returned by doc.get_toc. I use image classification to determine which page numbers are the "Contents" or "Index" pages.
I mean that the bookmarks sometimes cover only part of the "Contents" page: for example, the bookmarks have only level 1, while the PDF's "Contents" page also has levels 2 and 3. I know the bookmarks were created by somebody, who may not have written level 2.
------------------ Original message ------------------
From: "pymupdf/PyMuPDF"
Sent: Monday, 19 July 2021, 4:02 PM
Subject: Re: [pymupdf/PyMuPDF] open pdf warning (#1153)
I don't understand: Bookmark that is not part of a table of contents? What is that?
And how can you locate a TOC via "image classification technology" that is not accessible via the PDF itself?
Are we talking about the same things at all?
maybe you want to create a TOC?
or insert links?
Yes, as you said, this is a complicated and hard task. There are many conditions to consider, such as the number of columns.
Using image classification and PyMuPDF, I can extract some tables of contents, but only for part of my PDF files because their tables of contents are so diverse, and some detail problems make the results inaccurate.
I have tried OCR. It increased the share of PDF files I can extract from, but it was too slow and can also introduce problems such as incorrect text.
Thanks again for your reply.
Best wishes to you!
------------------ Original message ------------------
From: "pymupdf/PyMuPDF"
Sent: Monday, 19 July 2021, 6:16 PM
Subject: Re: [pymupdf/PyMuPDF] open pdf warning (#1153)
Ok. This is not an easy task, because you must extract and interpret text.
And a lot of detail problems may occur here: "Contents" pages may be more than one, you have to ignore header lines and footer lines, etc.
If you in addition want to add more detailed bookmarks (like level 2+), things are getting even more complicated.
It will probably be better to scan the text of all document pages and try to find text headers yourself ... in many cases they have a different font or are bold, etc. or are prefixed with paragraph and section numbers like "3.4.2 Section ...", and so on.
Good luck!
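The interpretation step described above can be sketched in a few lines. This is only an illustration under strong assumptions: each entry on the "Contents" page sits on a single line, starts with a section number such as "3.4.2", and ends with a page number. The function name `parse_contents_lines` and the sample lines are hypothetical. The result uses the same [level, title, page] shape as the entries returned by doc.get_toc().

```python
import re

# Assumed entry layout: "<section number> <title> <dot leaders or spaces> <page>".
ENTRY = re.compile(
    r"^\s*(?P<num>\d+(?:\.\d+)*)\.?\s+(?P<title>.+?)[\s.]*(?P<page>\d+)\s*$"
)


def parse_contents_lines(lines):
    """Turn text lines of a "Contents" page into [level, title, page] entries."""
    toc = []
    for line in lines:
        m = ENTRY.match(line)
        if m is None:
            # Headers, footers and anything else without a section number.
            continue
        # "3" is level 1, "3.4" is level 2, "3.4.2" is level 3.
        level = m.group("num").count(".") + 1
        toc.append([level, m.group("title").strip(), int(m.group("page"))])
    return toc


sample = [
    "Contents",
    "1 Introduction ........ 1",
    "1.1 Background .... 3",
    "3.4.2 Section name 17",
    "iii",
]
print(parse_contents_lines(sample))
```

Multi-column layouts, entries wrapped over two lines and unnumbered headings are not handled here, and the printed page numbers may differ from the PDF's physical page numbers.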
Hi,
I am trying to open a PDF and get its table of contents, but the screen shows the message "mupdf: invalid key in dict". What is happening?
By the way, the result of "get_toc" is correct.
Code used:
import fitz  # PyMuPDF

# Read the file into memory and open it from the byte stream.
with open(file_name, "rb") as fr:
    data = fr.read()
doc = fitz.open(stream=data, filetype="pdf")
print(doc.get_toc())
screen output:
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
mupdf: invalid key in dict
The PDF is attached.
876912f8d1749473cefb41681c51fdbf.pdf
Looking forward to your reply.