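Low-Level Interfaces#

Numerous methods are available to access and manipulate PDF files at a fairly low level. Admittedly, a clear distinction between "low-level" and "normal" functionality is not always possible and is partly a matter of personal taste.

Functionality once considered low-level is sometimes later reassessed as part of the normal interface. This happened to the Tools class in v1.14.0: it is now listed in the Classes chapter.

Which chapter documents a given feature is purely a matter of how the documentation is organized. Everything is available, always through the same interface.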

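How to Iterate through the `xref` Table#

A PDF's xref table is a list of all objects defined in the file. This table may easily contain many thousands of entries: the Adobe PDF References manual, for example, has 127,000 objects. Table entry "0" is reserved and must not be touched. The following script loops through the xref table and prints each object's definition: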

>>> xreflen = doc.xref_length()  # length of objects table
>>> for xref in range(1, xreflen):  # skip item 0!
...     print("")
...     print("object %i (stream: %s)" % (xref, doc.xref_is_stream(xref)))
...     print(doc.xref_object(xref, compressed=False))

This produces the following output:

object 1 (stream: False)
<<
    /ModDate (D:20170314122233-04'00')
    /PXCViewerInfo (PDF-XChange Viewer;2.5.312.1;Feb  9 2015;12:00:06;D:20170314122233-04'00')
>>

object 2 (stream: False)
<<
    /Type /Catalog
    /Pages 3 0 R
>>

object 3 (stream: False)
<<
    /Kids [ 4 0 R 5 0 R ]
    /Type /Pages
    /Count 2
>>

object 4 (stream: False)
<<
    /Type /Page
    /Annots [ 6 0 R ]
    /Parent 3 0 R
    /Contents 7 0 R
    /MediaBox [ 0 0 595 842 ]
    /Resources 8 0 R
>>
...
object 7 (stream: True)
<<
    /Length 494
    /Filter /FlateDecode
>>
...

A PDF object definition is an ordinary ASCII string.
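As a small variation on the loop above, the following sketch tallies how many objects carry a stream. The helper name `tally_streams` is made up for illustration. It assumes `doc` is an open PDF document and only calls `xref_length()` and `xref_is_stream()`, both used above.

```python
def tally_streams(doc):
    """Count stream and non-stream objects of an open PDF document."""
    streams = plain = 0
    for xref in range(1, doc.xref_length()):  # entry 0 is reserved, skip it
        if doc.xref_is_stream(xref):
            streams += 1
        else:
            plain += 1
    return streams, plain
```

For an open document, `tally_streams(doc)` returns a `(streams, plain)` pair.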

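How to Handle Object Streams#

Some object types contain additional data apart from their object definition. Examples are images, fonts, embedded files, and commands describing the appearance of a page.

Objects of these types are called "stream objects". PyMuPDF allows reading an object's stream via the method Document.xref_stream(), with the object's xref as the argument. A modified version of a stream can be written back using Document.update_stream().

Assume the following snippet wants to read all streams of a PDF, for whatever reason: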

>>> xreflen = doc.xref_length() # number of objects in file
>>> for xref in range(1, xreflen): # skip item 0!
...     if stream := doc.xref_stream(xref):
...         # stream is a bytes object here (xref_stream() gives None for non-stream objects)
...         # do something with it, e.g. just write it back:
...         doc.update_stream(xref, stream)

Document.xref_stream() automatically returns the stream decompressed, as a bytes object, and Document.update_stream() automatically compresses it if beneficial.
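Because streams come back decompressed, their sizes are easy to compare. The sketch below lists the largest streams in a file. The name `largest_streams` and its `top` parameter are illustrative assumptions, and the code relies only on `xref_length()` and `xref_stream()` as described above.

```python
def largest_streams(doc, top=5):
    """Return up to `top` (xref, size) pairs, largest decompressed stream first."""
    sizes = []
    for xref in range(1, doc.xref_length()):  # entry 0 is reserved, skip it
        stream = doc.xref_stream(xref)  # bytes, or None for non-stream objects
        if stream is not None:
            sizes.append((xref, len(stream)))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top]
```

The sizes refer to decompressed data, so they can differ considerably from the /Length values stored in the file.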

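How to Handle Page Contents#

A PDF page can have zero or more contents objects. These are stream objects describing what appears on a page, where, and how (such as text and images). They are written in a special mini-language described, for example, in the chapter "Appendix A - Operator Summary" on page 643 of the Adobe PDF References.

Every PDF reader application must be able to interpret the contents syntax to reproduce the intended appearance of the page.

If multiple contents objects are provided, they must be interpreted in the specified sequence, exactly as if they had been provided as one concatenated object.

There are good technical arguments for having multiple contents objects:

Adding new contents objects is much easier and faster than maintaining a single big one (which entails reading, decompressing, modifying, recompressing, and rewriting it for each change).
When working with incremental updates, a modified big contents object bloats the update delta and can thus easily negate the efficiency of incremental saves.

For example, PyMuPDF adds new, small contents objects in the methods Page.insert_image(), Page.show_pdf_page(), and the Shape methods.

However, there are also situations where a single contents object is beneficial: it is easier to interpret and compresses better than multiple smaller ones.

Here are two ways of combining multiple contents objects of a page: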

>>> # method 1: use the MuPDF clean function
>>> page.clean_contents()  # cleans and combines multiple Contents
>>> xref = page.get_contents()[0]  # only one /Contents now!
>>> cont = doc.xref_stream(xref)
>>> # this has also reformatted the PDF commands

>>> # method 2: extract concatenated contents
>>> cont = page.read_contents()
>>> # the /Contents source itself is unmodified

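The clean function Page.clean_contents() does much more than just join contents objects: it also corrects and optimizes the PDF operator syntax of the page and removes any inconsistencies with the page's object definition.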
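Before combining anything, it can help to see how many contents objects a page has and how large each one is. The following sketch does that with `Page.get_contents()` and `Document.xref_stream()` as used above. The helper name `contents_report` is hypothetical.

```python
def contents_report(doc, page):
    """Return (xref, size) for each contents stream of `page`, in page order."""
    report = []
    for xref in page.get_contents():  # list of contents xrefs, possibly empty
        stream = doc.xref_stream(xref) or b""  # treat a missing stream as empty
        report.append((xref, len(stream)))
    return report
```

A page with a long list of tiny entries is a plausible candidate for merging, while a single entry means there is nothing to combine.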

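How to Access the PDF Catalog#

This is the central ("root") object of a PDF. It serves as the starting point for reaching other important objects, and it also contains some global options for the PDF: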

>>> import pymupdf
>>> doc = pymupdf.open("PyMuPDF.pdf")
>>> cat = doc.pdf_catalog()  # get xref of the /Catalog
>>> print(doc.xref_object(cat))  # print object definition
<<
    /Type/Catalog                 % object type
    /Pages 3593 0 R               % points to page tree
    /OpenAction 225 0 R           % action to perform on open
    /Names 3832 0 R               % points to global names tree
    /PageMode /UseOutlines        % initially show the TOC
    /PageLabels<</Nums[0<</S/D>>2<</S/r>>8<</S/D>>]>> % labels given to pages
    /Outlines 3835 0 R            % points to outline tree
>>

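Note

Indentation, line breaks, and comments are inserted here for clarification only and will not normally appear. For more information on the PDF catalog, see section 7.7.2 on page 71 of the Adobe PDF References.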
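Individual catalog entries can also be read without printing the whole object, by combining `Document.pdf_catalog()` with `Document.xref_get_key()` (covered in the section on reading and updating PDF objects). This is a minimal sketch, and the wrapper name `catalog_entry` is an assumption made for illustration.

```python
def catalog_entry(doc, key):
    """Return the (type, value) pair stored under `key` in the PDF catalog."""
    cat = doc.pdf_catalog()  # xref of the /Catalog object
    return doc.xref_get_key(cat, key)
```

For the catalog shown above, `catalog_entry(doc, "PageMode")` would be expected to report a name value of `/UseOutlines`.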

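How to Access the PDF File Trailer#

The trailer of a PDF file is a dictionary located near the end of the file. It contains special objects and pointers to other important information. See Adobe PDF References, page 42. Here is an overview: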

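Key	Type	Value
Size	int	Number of entries in the cross-reference table (highest object number + 1).
Prev	int	Offset of the previous `xref` section (indicates incremental updates).
Root	dictionary	(Indirect) pointer to the catalog. See the previous section.
Encrypt	dictionary	Pointer to the encryption object (encrypted files only).
Info	dictionary	(Indirect) pointer to the information (metadata) object.
ID	array	File identifier consisting of two byte strings.
XRefStm	int	Offset of a cross-reference stream. See Adobe PDF References, page 49.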

Access this information in PyMuPDF via Document.pdf_trailer() or, equivalently, via Document.xref_object() using -1 instead of a valid xref number.

>>> import pymupdf
>>> doc = pymupdf.open("PyMuPDF.pdf")
>>> print(doc.xref_object(-1))  # or: print(doc.pdf_trailer())
<<
/Type /XRef
/Index [ 0 8263 ]
/Size 8263
/W [ 1 3 1 ]
/Root 8260 0 R
/Info 8261 0 R
/ID [ <4339B9CEE46C2CD28A79EBDDD67CC9B3> <4339B9CEE46C2CD28A79EBDDD67CC9B3> ]
/Length 19883
/Filter /FlateDecode
>>
>>>
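The trailer can also be read key by key rather than as one string. The sketch below assumes that `Document.xref_get_keys()` accepts -1 for the trailer in the same way `Document.xref_get_key()` does elsewhere on this page. The helper name `trailer_entries` is made up.

```python
def trailer_entries(doc):
    """Return the PDF trailer as a {key: (type, value)} dictionary."""
    # xref -1 addresses the trailer instead of a numbered object
    return {key: doc.xref_get_key(-1, key) for key in doc.xref_get_keys(-1)}
```

Following the table above, `"Prev" in trailer_entries(doc)` hints that the file contains incremental updates, and the "Root" entry holds the reference to the catalog.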

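How to Access XML Metadata#

A PDF may contain XML metadata in addition to the standard metadata format. In fact, most PDF viewer or modification software adds this type of information when saving the PDF (Adobe, Nitro PDF, PDF-XChange, etc.).

PyMuPDF cannot interpret or change this information directly, because it contains no XML features. However, XML metadata is stored as a stream object, so it can be read, modified with appropriate software, and written back.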

>>> xmlmetadata = doc.get_xml_metadata()
>>> print(xmlmetadata)
<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="3.1-702">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
...
omitted data
...
<?xpacket end="w"?>

Using an XML package, the XML data can be interpreted and/or modified and then stored back. The following also works if the PDF previously had no XML metadata:

>>> # write back modified XML metadata:
>>> doc.set_xml_metadata(xmlmetadata)
>>>
>>> # XML metadata can be deleted like this:
>>> doc.del_xml_metadata()
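As one possible "XML package", Python's standard `xml.etree.ElementTree` can read the string returned by `Document.get_xml_metadata()`. The sketch below extracts a Dublin Core title. The helper `xmp_title` and the `SAMPLE` packet are hypothetical, and the code assumes the metadata parses as well-formed XML with the title stored under `dc:title/rdf:Alt/rdf:li`.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for RDF and Dublin Core elements.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def xmp_title(xml_text):
    """Return the first dc:title text of an XMP packet, or None if absent."""
    if not xml_text:
        return None  # the document has no XML metadata
    root = ET.fromstring(xml_text)
    node = root.find(".//dc:title/rdf:Alt/rdf:li", NS)
    return node.text if node is not None else None

# Hypothetical sample packet, far smaller than real-world XMP data.
SAMPLE = (
    '<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    '<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:title><rdf:Alt><rdf:li xml:lang="x-default">Demo</rdf:li></rdf:Alt></dc:title>'
    "</rdf:Description></rdf:RDF></x:xmpmeta>"
)
print(xmp_title(SAMPLE))  # prints: Demo
```

With an open document, the same helper could be called as `xmp_title(doc.get_xml_metadata())`.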

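How to Extend PDF Metadata#

The attribute Document.metadata is designed to work the same way for all supported document types: it is a Python dictionary with a fixed set of key-value pairs. Correspondingly, Document.set_metadata() only accepts standard keys.

However, PDFs may contain items that are not accessible this way. There may also be reasons to store additional information, such as copyright notices. Here is a way to handle arbitrary metadata items using PyMuPDF low-level functions.

As an example, look at the standard metadata output of some PDF: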

# ---------------------
# standard metadata
# ---------------------
from pprint import pprint

pprint(doc.metadata)
{'author': 'PRINCE',
 'creationDate': "D:2010102417034406'-30'",
 'creator': 'PrimoPDF http://www.primopdf.com/',
 'encryption': None,
 'format': 'PDF 1.4',
 'keywords': '',
 'modDate': "D:20200725062431-04'00'",
 'producer': 'macOS Version 10.15.6 (Build 19G71a) Quartz PDFContext, '
             'AppendMode 1.1',
 'subject': '',
 'title': 'Full page fax print',
 'trapped': ''}

Use the following code to see all items stored in the metadata object:

# ----------------------------------
# metadata including private items
# ----------------------------------
metadata = {}  # make my own metadata dict
what, value = doc.xref_get_key(-1, "Info")  # /Info key in the trailer
if what != "xref":
    pass  # PDF has no metadata
else:
    xref = int(value.split()[0])  # value looks like "N 0 R"; N is the metadata xref
    for key in doc.xref_get_keys(xref):
        metadata[key] = doc.xref_get_key(xref, key)[1]
pprint(metadata)
{'Author': 'PRINCE',
 'CreationDate': "D:2010102417034406'-30'",
 'Creator': 'PrimoPDF http://www.primopdf.com/',
 'ModDate': "D:20200725062431-04'00'",
 'PXCViewerInfo': 'PDF-XChange Viewer;2.5.312.1;Feb  9 '
                 "2015;12:00:06;D:20200725062431-04'00'",
 'Producer': 'macOS Version 10.15.6 (Build 19G71a) Quartz PDFContext, '
             'AppendMode 1.1',
 'Title': 'Full page fax print'}
# ---------------------------------------------------------------
# note the additional 'PXCViewerInfo' key - ignored in standard!
# ---------------------------------------------------------------
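Values of type "xref" are returned as reference strings of the form "N G R": object number, generation number, and the letter R. The helper below is a sketch of a stricter way to take such a string apart. The name `parse_indirect_ref` is an illustrative assumption, not a PyMuPDF function.

```python
import re

# object number, generation number, then the literal letter R
_REF = re.compile(r"^\s*(\d+)\s+(\d+)\s+R\s*$")

def parse_indirect_ref(value):
    """Split a reference string such as '8261 0 R' into (xref, generation)."""
    match = _REF.match(value)
    if match is None:
        raise ValueError(f"not an indirect reference: {value!r}")
    return int(match.group(1)), int(match.group(2))
```

For example, `parse_indirect_ref("8261 0 R")` gives `(8261, 0)`, while anything that is not a reference raises `ValueError` instead of silently producing a wrong number.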

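Conversely, you can also store private metadata items in a PDF. It is your responsibility to make sure that these items conform to the PDF specification. In particular, they must be (unicode) strings. Consult section 14.3 (page 548) of the Adobe PDF References for details and caveats: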

what, value = doc.xref_get_key(-1, "Info")  # /Info key in the trailer
if what != "xref":
    raise ValueError("PDF has no metadata")
xref = int(value.split()[0])  # value looks like "N 0 R"; N is the metadata xref
# add some private information
doc.xref_set_key(xref, "mykey", pymupdf.get_pdf_str("北京 is Beijing"))
#
# after executing the previous code snippet, we will see this:
pprint(metadata)
{'Author': 'PRINCE',
 'CreationDate': "D:2010102417034406'-30'",
 'Creator': 'PrimoPDF http://www.primopdf.com/',
 'ModDate': "D:20200725062431-04'00'",
 'PXCViewerInfo': 'PDF-XChange Viewer;2.5.312.1;Feb  9 '
                  "2015;12:00:06;D:20200725062431-04'00'",
 'Producer': 'macOS Version 10.15.6 (Build 19G71a) Quartz PDFContext, '
             'AppendMode 1.1',
 'Title': 'Full page fax print',
 'mykey': '北京 is Beijing'}

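To delete selected keys, use doc.xref_set_key(xref, "mykey", "null"). As explained in the next section, the string "null" is the PDF equivalent of Python's None. A key with that value is treated as not specified and is physically removed during garbage collection.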

How to Read and Update PDF Objects#

There are also granular, elegant ways to access and manipulate selected PDF dictionary keys.

Document.xref_get_keys() returns the PDF keys of the object at xref:

In [1]: import pymupdf
In [2]: doc = pymupdf.open("pymupdf.pdf")
In [3]: page = doc[0]
In [4]: from pprint import pprint
In [5]: pprint(doc.xref_get_keys(page.xref))
('Type', 'Contents', 'Resources', 'MediaBox', 'Parent')

Compare with the full object definition:

In [6]: print(doc.xref_object(page.xref))
<<
  /Type /Page
  /Contents 1297 0 R
  /Resources 1296 0 R
  /MediaBox [ 0 0 612 792 ]
  /Parent 1301 0 R
>>

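Single keys can also be accessed directly via Document.xref_get_key(). The value is always a string, accompanied by type information that helps with interpreting it: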
```
In [7]: doc.xref_get_key(page.xref, "MediaBox")
Out[7]: ('array', '[0 0 612 792]')
```

Here is a full listing of the above page keys:

In [9]: for key in doc.xref_get_keys(page.xref):
...:        print("%s = %s" % (key, doc.xref_get_key(page.xref, key)))
...:
Type = ('name', '/Page')
Contents = ('xref', '1297 0 R')
Resources = ('xref', '1296 0 R')
MediaBox = ('array', '[0 0 612 792]')
Parent = ('xref', '1301 0 R')

Querying an undefined key returns ('null', 'null'): the PDF object type null corresponds to None in Python. The booleans true and false are handled similarly.
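Since every value arrives as a string plus a type label, a small converter can map the simple cases to Python values. This is only a sketch: `to_python` is a made-up name, and the labels "bool", "int", and "float" are assumed here in addition to the "null", "name", "xref", and "array" labels visible in the listings above.

```python
def to_python(kind, value):
    """Convert a (type, value) pair from xref_get_key() to a Python value."""
    if kind == "null":
        return None
    if kind == "bool":  # assumed label for PDF true / false
        return value == "true"
    if kind == "int":  # assumed label for integers
        return int(value)
    if kind == "float":  # assumed label for real numbers
        return float(value)
    if kind == "name":
        return value[1:] if value.startswith("/") else value  # drop leading slash
    return value  # arrays, references and anything else stay as text
```

It could be used as `to_python(*doc.xref_get_key(page.xref, "Rotate"))`, which would yield `None` for a page without a rotation entry.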

Let us add a new key to the page definition that sets its rotation to 90 degrees (in practice, the method Page.set_rotation() exists for exactly this purpose):

In [11]: doc.xref_get_key(page.xref, "Rotate")  # no rotation set:
Out[11]: ('null', 'null')
In [12]: doc.xref_set_key(page.xref, "Rotate", "90")  # insert a new key
In [13]: print(doc.xref_object(page.xref))  # confirm success
<<
  /Type /Page
  /Contents 1297 0 R
  /Resources 1296 0 R
  /MediaBox [ 0 0 612 792 ]
  /Parent 1301 0 R
  /Rotate 90
>>

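This method can also be used to remove a key from the xref dictionary by setting its value to null. The following removes the rotation specification from the page: doc.xref_set_key(page.xref, "Rotate", "null"). Similarly, to remove all links, annotations, and fields from a page, use doc.xref_set_key(page.xref, "Annots", "null"). Because Annots is by definition an array, setting an empty array with doc.xref_set_key(page.xref, "Annots", "[]") does the same job in this case.

PDF dictionaries can be hierarchically nested. In the following page object definition, both Font and XObject are subdictionaries of Resources: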

In [15]: print(doc.xref_object(page.xref))
<<
  /Type /Page
  /Contents 1297 0 R
  /Resources <<
    /XObject <<
      /Im1 1291 0 R
    >>
    /Font <<
      /F39 1299 0 R
      /F40 1300 0 R
    >>
  >>
  /MediaBox [ 0 0 612 792 ]
  /Parent 1301 0 R
  /Rotate 90
>>

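This situation is supported by the methods Document.xref_set_key() and Document.xref_get_key(): use a path-like notation to point at the required key. For example, to retrieve the value of key Im1 above, specify the complete chain of dictionaries "above" it in the key argument: "Resources/XObject/Im1":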
```
In [16]: doc.xref_get_key(page.xref, "Resources/XObject/Im1")
Out[16]: ('xref', '1291 0 R')
```

The path notation can also be used to set a value directly. Use the following to let Im1 point to a different object:

In [17]: doc.xref_set_key(page.xref, "Resources/XObject/Im1", "9999 0 R")
In [18]: print(doc.xref_object(page.xref))  # confirm success:
<<
  /Type /Page
  /Contents 1297 0 R
  /Resources <<
    /XObject <<
      /Im1 9999 0 R
    >>
    /Font <<
      /F39 1299 0 R
      /F40 1300 0 R
    >>
  >>
  /MediaBox [ 0 0 612 792 ]
  /Parent 1301 0 R
  /Rotate 90
>>

Be aware that no semantic checks whatsoever take place here: if the PDF has no xref 9999, this will not be detected at this point.
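Because the call itself performs no semantic checks, a caller may want to add a basic one. The sketch below only verifies that the target number lies inside the xref table before writing the reference. It does not verify that the target object is meaningful. The name `set_reference_checked` is hypothetical, and generation number 0 is assumed.

```python
def set_reference_checked(doc, xref, key, target):
    """Point `key` of object `xref` at object `target` if `target` is in range."""
    if not 1 <= target < doc.xref_length():
        raise ValueError(f"xref {target} is outside the xref table")
    doc.xref_set_key(xref, key, f"{target} 0 R")  # generation 0 assumed
```

With this guard, an attempt to point Im1 at a non-existent xref such as 9999 raises `ValueError` before the PDF is modified.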

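If a key does not exist, setting its value creates it. Moreover, if any intermediate keys do not exist either, they are also created as necessary. The following creates an array D several levels below the existing dictionary A. The intermediate dictionaries B and C are created automatically: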

In [5]: print(doc.xref_object(xref))  # some existing PDF object:
<<
  /A <<
  >>
>>
In [6]: # the following will create 'B', 'C' and 'D'
In [7]: doc.xref_set_key(xref, "A/B/C/D", "[1 2 3 4]")
In [8]: print(doc.xref_object(xref))  # check out what happened:
<<
  /A <<
    /B <<
      /C <<
        /D [ 1 2 3 4 ]
      >>
    >>
  >>
>>

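When setting key values, MuPDF performs basic PDF syntax checking. For example, new keys can only be created below a dictionary. The following tries to create a new string item E below the previously created array D: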

In [9]: # 'D' is an array, no dictionary!
In [10]: doc.xref_set_key(xref, "A/B/C/D/E", "(hello)")
mupdf: not a dict (array)
--- ... ---
RuntimeError: not a dict (array)

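It is also not possible to create a key if some higher-level key is an "indirect" object, that is, an xref. In other words, xrefs can only be modified directly, not implicitly via other objects referencing them: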

In [13]: # the following object points to an xref
In [14]: print(doc.xref_object(4))
<<
  /E 3 0 R
>>
In [15]: # 'E' is an indirect object and cannot be modified here!
In [16]: doc.xref_set_key(4, "E/F", "90")
mupdf: path to 'F' has indirects
--- ... ---
RuntimeError: path to 'F' has indirects
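For reading, a path that crosses an indirect object can be resolved one level at a time: whenever a component turns out to be a reference, the lookup continues inside the referenced object. The following is a sketch under stated assumptions. `resolve_key` is a made-up helper, the type label "dict" for inline dictionaries is assumed, and nothing here says whether `xref_get_key()` itself would follow references inside a path.

```python
def resolve_key(doc, xref, path):
    """Look up `path` starting at `xref`, hopping across indirect references."""
    names = path.split("/")
    prefix = ""  # path walked so far inside the current object
    for position, name in enumerate(names):
        key = f"{prefix}/{name}" if prefix else name
        kind, value = doc.xref_get_key(xref, key)
        if position == len(names) - 1:
            return kind, value  # reached the requested key
        if kind == "xref":
            xref, prefix = int(value.split()[0]), ""  # continue in referenced object
        elif kind == "dict":
            prefix = key  # continue one level deeper in the same object
        else:
            break  # cannot descend into a leaf value or a missing key
    return "null", "null"
```

Once the object that really holds the key is known, it can be modified directly by calling `xref_set_key()` with that object's own xref.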

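Caution

These are expert functions! There is no validation of whether valid PDF objects, xrefs, etc. are specified. As with other low-level methods, there is a risk of rendering the PDF, or parts of it, unusable.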

