
Using setMetadata() and setToC()

Jorj X. McKie edited this page Jul 25, 2016 · 4 revisions

These methods change the meta information of a PDF document (PDF only). Like the previously introduced method select(), they are methods of the Document class. Both also support the incremental save technique.

Maintaining Meta Data

For every MuPDF-supported document type, doc.metadata is a Python dictionary with keys format, author, creator, producer, creationDate, modDate, subject, title, encryption and keywords. This information may or may not (completely) exist for any given document.

All of these values can be changed, except format and encryption.

All you have to do is prepare a Python dictionary m with some or all of the above key-value pairs and invoke doc.setMetadata(m).

If you provide an empty dictionary m = {}, all fields will be cleared and set to the value none.

Any of the above keys not contained in this dictionary will also receive a value of none.

If you want to clear meta data for data protection / data security reasons, please make sure you save your PDF to a new file using save option garbage. This makes sure the old information is physically removed from the file (incremental save does not do that).
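
The clearing procedure described above might be sketched as follows. This is a minimal illustration, not the library's own code: the helper name clear_metadata and the value garbage=3 are assumptions made for this example, so check the save() documentation of your PyMuPDF version for the garbage levels it supports.

```python
def clear_metadata(doc, new_path, garbage=3):
    """Clear all metadata of an open PDF and save it to a NEW file.

    `doc` is expected to be an open PyMuPDF Document.
    garbage=3 is an assumed level chosen for illustration only.
    """
    doc.setMetadata({})                  # empty dict: every field is cleared
    doc.save(new_path, garbage=garbage)  # full (non-incremental) save to a new file


# Possible usage (file names are hypothetical):
# import fitz
# doc = fitz.open("input.pdf")
# clear_metadata(doc, "cleaned.pdf")
```

Saving to a new file rather than incrementally is the point here: an incremental save only appends changes, so the old values would remain in the file.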

If you want to change only selected values of a PDF, take a modified doc.metadata and directly use it as a parameter. Obviously, PDF format and encryption cannot be changed. If these keys are present in m, they will be silently ignored.
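
Changing only selected values could look like the following sketch. The helper name change_metadata is hypothetical; the only PyMuPDF features it relies on are doc.metadata and doc.setMetadata() as described above.

```python
def change_metadata(doc, **changes):
    """Update selected metadata fields of an open PDF, keeping all others."""
    m = dict(doc.metadata)   # work on a copy of the current metadata
    m.update(changes)        # overwrite only the requested keys
    doc.setMetadata(m)       # "format" and "encryption" in m are ignored
    return m


# Possible usage (file names are hypothetical):
# import fitz
# doc = fitz.open("input.pdf")
# change_metadata(doc, title="New title", author="Jane Doe")
# doc.save("output.pdf")
```

Starting from the existing dictionary matters because any key missing from the argument would otherwise be reset to none.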

Except for the date keys, any unicode value is acceptable. See section PDF String Handling.

Maintaining Bookmarks

Bookmarks or outlines form a quite complex forward-backward chained set of objects in PDFs. Together they are known as the table of contents (TOC).

A TOC structure as found in books is much simpler: it just contains a list of lines with titles, page references and hierarchy levels. The relationship between such lines is established only implicitly by the order in which they occur.

Maintaining a complete TOC (instead of single, separate bookmark items) is therefore exactly what we have decided to implement in PyMuPDF. Changing anything in a TOC means changing the complete TOC. A TOC will be inserted, changed or deleted as one single item with this function. We believe that this approach satisfies both practical requirements and the need for intuitive handling:

  • everyone knows what TOCs in books are and how to use them
  • hierarchy relations between lines in a TOC can simply be expressed by the entry's hierarchy level
  • forward / backward relationships between entries are established implicitly by the sequence in which they occur

In addition, the existing method doc.getToC() already presents all bookmark items of a document in exactly the way described above. Maintaining the TOC of a PDF can therefore be done in the following simple steps:

  1. toc = doc.getToC(simple = True or False)
  2. Modify toc as required ...
  3. doc.setToC(toc)

In step 3, behind the scenes, a new outline chain will be created using toc to completely replace the old one. If you wish to delete an existing TOC, you can also set toc = [].

If you wish to give a PDF a completely new TOC, provide a list of lists like toc = [[lvl1, title1, page1], [lvl2, title2, page2], ...].
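
As an illustration of the "modify toc" step, the sketch below shifts page references in a TOC list, for example after pages were inserted into the document. The helper shift_pages and the sample entries are made up for this example; only the [lvl, title, page] layout comes from the description above.

```python
def shift_pages(toc, offset, from_page=1):
    """Return a new TOC in which page numbers >= from_page are shifted by offset."""
    new_toc = []
    for entry in toc:
        lvl, title, page = entry[:3]
        if page >= from_page:
            page += offset
        # keep any additional items an entry may carry beyond the first three
        new_toc.append([lvl, title, page] + list(entry[3:]))
    return new_toc


# A small hand-written TOC: [level, title, page]
toc = [
    [1, "Introduction", 1],
    [1, "Chapter 1", 3],
    [2, "Section 1.1", 4],
]
new_toc = shift_pages(toc, 2, from_page=3)

# With an open PDF document `doc`, the three steps would be:
# toc = doc.getToC()
# new_toc = shift_pages(toc, 2, from_page=3)
# doc.setToC(new_toc)
```

The function returns a new list instead of editing toc in place, so the original TOC stays available if the change needs to be discarded.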

As with meta data above, title entries may be provided using the full unicode character set (see following section).

Example program PDFoutline.py implements all of the above using the wxPython GUI.

PDF String Handling

Outside document content text, PDF supports two character encodings, namely PDFDocEncoding and Unicode (see appendix D of the Adobe manual). Both are now fully implemented in PyMuPDF for use in methods setMetadata() and setToC() in the following way (applies to the above mentioned metadata fields and the TOC title entries):

  • if an entry contains only ASCII characters (ord(c) <= 127), it will be used unchanged / as is;
  • else, if no character has ord(c) > 255, every character with 127 < ord(c) <= 255 will be replaced by the string \nnn, where nnn is the octal representation of ord(c); the resulting string will be used;
  • else, i.e. if the string contains any character with ord(c) > 255, the complete string is encoded using UTF-16BE and prefixed with 0xfeff; this result, converted to its hexadecimal representation, will be used.
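
The three rules above can be approximated in pure Python as follows. This is only a sketch of the described logic, not PyMuPDF's actual implementation; the function name pdf_string is hypothetical, the delimiters that enclose a string inside the PDF file are left out, and lowercase hex digits are simply what Python's bytes.hex() produces.

```python
def pdf_string(s):
    """Encode a unicode string following the three rules described above."""
    # Rule 1: pure ASCII is used unchanged
    if all(ord(c) <= 127 for c in s):
        return s
    # Rule 2: nothing above 255, so replace each character in 128..255 by \nnn (octal)
    if all(ord(c) <= 255 for c in s):
        return "".join(c if ord(c) <= 127 else "\\%03o" % ord(c) for c in s)
    # Rule 3: encode the complete string as UTF-16BE, prefix 0xfeff, convert to hex
    return (b"\xfe\xff" + s.encode("utf-16-be")).hex()


ascii_title = pdf_string("Title")    # unchanged
latin_title = pdf_string("Müller")   # "ü" (ord 252) becomes \374
wide_title = pdf_string("日本")       # hex of the UTF-16BE bytes with 0xfeff prefix
```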

Differences and similarities of string handling between Python 2 and Python 3 are covered in the following way:

  • Any argument that is not already unicode will be decoded using UTF-8.
  • If it is bytes or bytearray, it will be converted to unicode (Python 2 and Python 3).
  • A str in Python 2 will become unicode; a unicode (Python 2) and a str (Python 3) will remain unaffected (i.e. stay unicode).
  • The resulting str / unicode will then be treated as described above.
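
For Python 3 only, the conversion described in this list might look like the sketch below. The helper to_unicode is a hypothetical name used for illustration; the Python 2 str case is not covered.

```python
def to_unicode(value):
    """Return a str for str, bytes or bytearray input (Python 3 only)."""
    if isinstance(value, (bytes, bytearray)):
        # bytes-like input is decoded with UTF-8
        return bytes(value).decode("utf-8")
    # a Python 3 str is already unicode and is returned unchanged
    return value


from_bytes = to_unicode(b"M\xc3\xbcller")         # UTF-8 bytes of "Müller"
from_bytearray = to_unicode(bytearray(b"Title"))
from_str = to_unicode("日本")
```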

All of the above results in considerable flexibility: metadata and title fields can be provided as str, unicode, bytes or bytearray objects.
