Start making all APIs use iterator protocol instead of bespoke methods/classes ad infinitum (#11)

* feat!: always in-memory parser and use iterator protocol (mostly)
* fix: avoid error if x was a tuple for some reason
* test: fix tests
* fix: minor tweaks
* ci: benchmark
* chore: ruff
* ci: make benchmark a separate job
* ci: make benchmark a separate workflow
* ci: report ccoverage
* refactor!: make lines/revlines behave the same way
* refactor!: remove the utterly useless PDFResourceManager
* chore: ruff
* fix: tolerate mangled PDF headers
* refactor!: nexttoken redundant for lexer
* refactor!: PDFEliminate PDFExtra PDFCharacters PDFEverwhere PDFWe PDFHave PDFNamespaces PDFAfter PDFAll
* refactor!: there can be only one (parser)
* refactor!: page indices (0-based), PDFRemove PDFMore PDFPrefixes
* docs: describe the desired API
* fix: seek 0 in iter
* feat: iterator-based layout API
* chore: ruff it up
* fix(tests): test layout against pdfminer.six
* fix: error consistent with pdfminer
* fix: ensure xobjects actually work
* fix: validate against pdfminer
* fix: STRICT breaks things
* fix(test): extra-dependencies

dhdaines · Nov 1, 2024 · 9c0e217
1 parent 6f80c3d
Showing 26 changed files with 1,977 additions and 2,567 deletions.
diff --git a/.github/workflows/benchmarks.yml b/.github/workflows/benchmarks.yml
@@ -0,0 +1,21 @@
+name: Benchmark
+on:
+  push:
+    branches: [ "main" ]
+  pull_request:
+    branches: [ "main" ]
+
+jobs:
+  benchmark:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.10"
+      - name: Install Hatch
+        uses: pypa/hatch@install
+      - name: Run benchmarks
+        run: |
+          hatch run bench:all
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -1,4 +1,4 @@
-name: Run all tests
+name: Test
 on:
   push:
     branches: [ "main" ]
@@ -17,4 +17,4 @@ jobs:
       - name: Install Hatch
         uses: pypa/hatch@install
       - name: Run tests
-        run: hatch test
+        run: hatch test --cover
diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-# PLAYA Ain't a LAYout Analyzer 🏖️
+# **P**LAYA ain't a **LAY**out **A**nalyzer 🏖️
 
 ## About
 
@@ -28,7 +28,110 @@ Notably this does *not* include the largely undocumented heuristic
 to understand due to a Java-damaged API based on deeply nested class
 hierarchies, and because layout analysis is best done
 probabilistically/visually.  Also, pdfplumber does its own, much
-nicer, layout analysis.
+nicer, layout analysis.  Also, if you just want to extract text from a
+PDF, there are a lot of better and faster tools and libraries out
+there, see [benchmarks]() for a summary (TL;DR pypdfium2 is probably
+what you want, but pdfplumber does a nice job of converting PDF to
+ASCII art).
+
+## Usage
+
+Do you want to get stuff out of a PDF?  You have come to the right
+place!  Let's open up a PDF and see what's in it:
+
+```python
+pdf = playa.open("my_awesome_document.pdf")
+raw_byte_stream = pdf.buffer
+a_bunch_of_tokens = list(pdf.tokens)
+a_bunch_of_objects = list(pdf)
+a_particular_indirect_object = pdf[42]
+```
+
+The raw PDF tokens and objects are probably not terribly useful to
+you, but you might find them interesting.
+
+It probably has some pages.  How many?  What are their numbers/labels?
+(they could be things like "xviii", "a", or "42", for instance)
+
+```python
+npages = len(pdf.pages)
+page_numbers = [page.label for page in pdf.pages]
+```
+
+What's in the table of contents?
+
+```python
+for entry in pdf.outlines:
+    ...
+```
+
+If you are lucky it has a "logical structure tree".  The elements here
+might even be referenced from the table of contents!  (or, they might
+not... with PDF you never know)
+
+```python
+structure = pdf.structtree
+for element in structure:
+    for child in element:
+        ...
+```
+
+Now perhaps we want to look at a specific page.  Okay!
+```python
+page = pdf.pages[0]        # they are numbered from 0
+page = pdf.pages["xviii"]  # but you can get them by label
+page = pdf.pages["42"]  # or "logical" page number (also a label)
+a_few_content_streams = list(page.contents)
+raw_bytes = b"".join(stream.buffer for stream in page.contents)
+```
+
+This page probably has text, graphics, etc, etc, in it.  Remember that
+**P**LAYA ain't a **LAY**out **A**nalyzer!  You can either look at the
+stream of tokens or mysterious PDF objects:
+```python
+for token in page.tokens:
+    ...
+for obj in page:
+    ...
+```
+
+Or you can access individual characters, lines, curves, and rectangles
+(if you wanted to, for instance, do layout analysis):
+```python
+for item in page.layout:
+    ...
+```
+
+Do we make you spelunk in a dank class hierarchy to know what these
+items are?  No, we do not! They are just NamedTuples with a very
+helpful field *telling* you what they are, as a string.
+
+In particular you can also extract all these items into a dataframe
+using the library of your choosing (I like [Polars]()) and I dunno do
+some Artifishul Intelligents or something with them:
+```python
+```
+
+Or just write them to a CSV file:
+```python
+```
+
+Note again that PLAYA doesn't guarantee that these characters come at
+you in anything other than the order they occur in the file (but it
+does guarantee that).  It does, however, put them in (hopefully) the
+right absolute positions on the page, and keep track of the clipping
+path and the graphics state, so yeah, you *could* "render" them like
+`pdfminer.six` pretended to do.
+
+Certain PDF tools and/or authors are notorious for using "whiteout"
+(set the color to the background color) or "scissors" (the clipping
+path) to hide arbitrary text that maybe *you* don't want to see
+either. PLAYA gives you some rudimentary tools to detect this:
+```python
+```
+
+For everything else, there's pdfplumber, pypdfium2, pikepdf, pypdf,
+borb, pydyf, etc, etc, etc.
 
 ## Acknowledgement
 
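Note on the README hunk above: it describes `pdf.pages` as something you can take the length of, iterate over, and index either by 0-based position or by page label. A minimal sketch of how such a container could behave, assuming hypothetical `Page` and `PageList` classes (these are illustrations, not PLAYA's actual implementation, which this hunk does not show):

```python
from typing import Iterator, List, NamedTuple, Union


class Page(NamedTuple):
    """Hypothetical page record; PLAYA's real page class is not shown here."""

    index: int  # 0-based position in the document
    label: str  # page label such as "xviii" or "42"


class PageList:
    """Hypothetical container that accepts an int index or a str label."""

    def __init__(self, labels: List[str]) -> None:
        self._pages = [Page(i, label) for i, label in enumerate(labels)]
        # If two pages share a label, the later one wins in this sketch.
        self._by_label = {page.label: page for page in self._pages}

    def __len__(self) -> int:
        return len(self._pages)

    def __iter__(self) -> Iterator[Page]:
        return iter(self._pages)

    def __getitem__(self, key: Union[int, str]) -> Page:
        # Integers are positions; strings are labels.
        if isinstance(key, int):
            return self._pages[key]
        return self._by_label[key]


pages = PageList(["xviii", "xix", "1", "42"])
first_page = pages[0]   # lookup by 0-based index
labelled = pages["42"]  # lookup by label, which need not match the index
```

Keeping the label a string even when it looks like a number is what lets `pages[42]` and `pages["42"]` mean different things.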

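Note on the README hunk above: layout items are described as NamedTuples carrying a string field that says what they are. A rough sketch of why that is convenient, assuming a hypothetical `Item` type whose field names (`object_type`, `x0`, `text`, and so on) are invented for illustration and may not match PLAYA's:

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class Item(NamedTuple):
    """Hypothetical layout item; the field names are assumptions."""

    object_type: str  # e.g. "char", "line", "rect"
    x0: float
    y0: float
    x1: float
    y1: float
    text: Optional[str] = None


items: List[Item] = [
    Item("char", 72.0, 700.0, 78.0, 712.0, "P"),
    Item("char", 78.0, 700.0, 84.0, 712.0, "L"),
    Item("rect", 70.0, 690.0, 200.0, 720.0),
]

# No isinstance() checks against a class hierarchy: filter on the string tag.
chars = "".join(
    item.text for item in items if item.object_type == "char" and item.text
)
counts = Counter(item.object_type for item in items)

# NamedTuples convert to plain dicts, a common input for tabular tools.
rows = [item._asdict() for item in items]
```

Because every item has the same flat shape, the same records can feed a filter, a counter, or a table without any per-class handling.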
diff --git a/playa/__init__.py b/playa/__init__.py
@@ -10,7 +10,7 @@
 from os import PathLike
 from typing import Union
 
-from playa.pdfdocument import PDFDocument
+from playa.document import PDFDocument
 
 __version__ = "0.0.1"
 

diff --git a/playa/cmapdb.py b/playa/cmapdb.py
@@ -32,8 +32,8 @@
 )
 
 from playa.encodingdb import name2unicode
-from playa.exceptions import PSEOF, PDFException, PDFTypeError, PSSyntaxError
-from playa.psparser import KWD, PSKeyword, PSLiteral, PSStackParser, literal_name
+from playa.exceptions import PDFException, PDFTypeError, PSSyntaxError
+from playa.parser import KWD, Parser, PSKeyword, PSLiteral, literal_name
 from playa.utils import choplist, nunpack
 
 log = logging.getLogger(__name__)
@@ -275,7 +275,7 @@ def get_unicode_map(cls, name: str, vertical: bool = False) -> UnicodeMap:
         return cls._umap_cache[name][vertical]
 
 
-class CMapParser(PSStackParser[PSKeyword]):
+class CMapParser(Parser[PSKeyword]):
     def __init__(self, cmap: CMapBase, data: bytes) -> None:
         super().__init__(data)
         self.cmap = cmap
@@ -284,10 +284,7 @@ def __init__(self, cmap: CMapBase, data: bytes) -> None:
         self._warnings: Set[str] = set()
 
     def run(self) -> None:
-        try:
-            self.nextobject()
-        except PSEOF:
-            pass
+        next(self, None)
 
     KEYWORD_BEGINCMAP = KWD(b"begincmap")
     KEYWORD_ENDCMAP = KWD(b"endcmap")
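
Note on the `run()` change above: the two-argument form of the built-in `next()` returns a default value instead of raising `StopIteration` when the iterator is exhausted, which is why the `try`/`except PSEOF` block can go away once the parser is an iterator. A minimal sketch of the pattern, using a hypothetical `ToyParser` (not PLAYA's `Parser`, whose internals are not shown in this diff):

```python
from typing import Iterator, List


class ToyParser:
    """Hypothetical stand-in for an iterator-based parser."""

    def __init__(self, data: bytes) -> None:
        # Whitespace-separated tokens are enough for this illustration.
        self._tokens: List[bytes] = data.split()
        self._pos = 0

    def __iter__(self) -> Iterator[bytes]:
        return self

    def __next__(self) -> bytes:
        # Signal end of input through the iterator protocol rather than
        # through a bespoke EOF exception.
        if self._pos >= len(self._tokens):
            raise StopIteration
        token = self._tokens[self._pos]
        self._pos += 1
        return token


parser = ToyParser(b"begincmap endcmap")
first = next(parser, None)   # first token
second = next(parser, None)  # second token
third = next(parser, None)   # None: exhausted, and no exception to catch
```

The same object also works directly in a `for` loop or with `list()`, so callers do not need a separate "get next object" method.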

diff --git a/playa/pdfcolor.py b/playa/color.py
@@ -1,7 +1,7 @@
 import collections
 from typing import Dict
 
-from playa.psparser import LIT
+from playa.parser import LIT
 
 LITERAL_DEVICE_GRAY = LIT("DeviceGray")
 LITERAL_DEVICE_RGB = LIT("DeviceRGB")